Tracking slipped NPE, but in Crashlytics v11.15.0

Few months since last Tracking slipped NPE and here we go again:

same user reported;
same scenario (code that handles a lot of NPE in try/catch);
this time with Crashlytics.

Time for dissection…

TLTR (spoiler)

Crashlytics installs signals in background thread, so Signal.install() doesn’t preserve proper chain;
Due to ^^^^ it will report NPE inside try/catch as crash (and will not report crash after it);
Its Mach exception handler is not protected against simultaneous events, and will crash if this happens;
Its Signal handler is not protected against simultaneous events, will mess signal handler, and crash.

Environment

Well known and simple scenario:

Runnable NPE = () -> {
    try {
        ((String) null).equals("hello");
    } catch (NullPointerException ignored) {}
};
Thread t1 = new Thread(NPE);
Thread t2 = new Thread(NPE);
t1.start();
t2.start();

And well know setup code:

Signals.installSignals(() -> {
    FIRApp.configure();
}, true);

Case 1: Signal handler

Once Signals.installSignals is called with preservePorts = true, all Mach handlers installed inside lambda are rolled back (sometimes, but details bellow). In this case Signal handler code will handle NPE, and its code available in FIRCLSSignal.c.
It does:

FIRCLSSignalHandler - handler itself
–> FIRCLSSignalSafeRemoveHandlers – replaces signal handlers with SIG_DFL;
–> FIRCLSSignalRecordSignal – saves crash;
–> FIRCLSSignalSafeInstallPreexistingHandlers:
–> —> again calls FIRCLSSignalSafeRemoveHandlers
–> —> restores previous signal handlers;

What wrong here:

there is no multi-thread lock on handling signals, can happen simultaneous;
non-atomic and messy restore to previous signal chain: SIG_DFL first, then previous;

And there is a crash in case of simultaneous NPE events, log looks as bellow:

FIRCLSSignalHandler 11
FIRCLSSignalHandler 11
FIRCLSSignalSafeRemoveHandlers 11 
FIRCLSSignalSafeRemoveHandlers 11 
FIRCLSSignalRecordSignal
FIRCLSSignalRecordSignal
FIRCLSSignalSafeInstallPreexistingHandlers 11
<crash>

and crash is inside FIRCLSSignalRecordSignal as it is surprised by simultaneous nature of event and just kills the app (!!!!):

static void FIRCLSSignalRecordSignal(int savedErrno, siginfo_t *info, void *uapVoid) {
  if (FIRCLSContextMarkAndCheckIfCrashed()) {
    FIRCLSSDKLog("Error: aborting signal handler because crash has already occurred");
    exit(1);       /// <<<<<< CRASH HERE 
    return;
  }
}  

Case 2: Mach handler

Once Signals.installSignals is called with preservePorts = false Crashlytics will handle everything in Mach exception handler that has higher priority than signals FIRCLSMachException.c.
Due to nature of Mach message handling it is handled in one thread one-by-one and simultaneous mess as in case of FIRCLSSignalHandler but again there is no protection against simultaneous event:

FIRCLSMachExceptionServer - runs message receive loop;
FIRCLSMachExceptionDispatchMessage - handles received message;
–> FIRCLSMachExceptionUnregister - unregisters from next messages;
–> FIRCLSSignalSafeInstallPreexistingHandlers - restores signals (same non-atomic as above);
–> FIRCLSMachExceptionRecord - records crash.

Whats wrong here:

simultaneous event will produce multiple messages that will be sent to server;
server process these in single thread so no run condition will occur;
by handling first message it unsubscribes from future messages due events;
(and not expect anymore messages)
as simultaneous happened before unsubscription it will be delivered to server.
server will process unexpected second event and crash.

Crash happens in similar code due unexpected conditions (and it just kills the app):

static bool FIRCLSMachExceptionRecord(FIRCLSMachExceptionReadContext* context,
                                      MachExceptionMessage* message) {
  ...
  if (FIRCLSContextMarkAndCheckIfCrashed()) {
    FIRCLSSDKLog("Error: aborting mach exception handler because crash has already occurred\n");
    exit(1);       /// <<<<<< CRASH HERE
    return false;
  }
  ...
}

Case 3: Why at all it handles JAVAs NPE as RoboVM should be the one ?

Following setup should restore RoboVMs power:

Signals.installSignals(() -> {
    FIRApp.configure();
}, true);

But it doesn’t and randomly Crashlytics rule signals/mach. This happens due the way it installs handlers: FIRCLSContext.m

FBLPromise* FIRCLSContextInitialize(FIRCLSContextInitData* initData,
                                    FIRCLSFileManager* fileManager) {
   ...
 if (!_firclsContext.readonly->debuggerAttached) {
#if CLS_SIGNAL_SUPPORTED
    dispatch_group_async(group, queue, ^{
      FIRCLSSignalInitialize(&_firclsContext.readonly->signal);
    });
#endif

#if CLS_MACH_EXCEPTION_SUPPORTED
    dispatch_group_async(group, queue, ^{
      FIRCLSMachExceptionInit(&_firclsContext.readonly->machException);
    });
#endif

    dispatch_group_async(group, queue, ^{
      FIRCLSExceptionInitialize(&_firclsContext.readonly->exception,
                                &_firclsContext.writable->exception);
    });
  } else {
    FIRCLSSDKLog("Debugger present - not installing handlers\n");
  }   
}

Everything is async, and probably will happen when Signals.installSignals exits and will override RoboVM handlers…

Bottom line and fix:

Crashlytics is kind of strange, have reported few issues there (without any hope):

Currently, it can be workaround with simple delay:

Signals.installSignals(() -> {
    FIRApp.configure();
    // FIXME: dirty workaround -- wait till Crashlytics install handler async 
    try { Thread.sleep(200); } catch (Exception e) {}
}, true);

Few cents about my commits