Improve Swift concurrency safety #1376

nan-li · 2024-02-22T18:58:31Z

Description

One Line Summary

Add locking and synchronization to Swift codebase to prevent crashes in OneSignalOSCore and OneSignalUser.

Details

Motivation

SDK consumers have reported production crashes, and this PR aims to address the crashes that have been reported.

Scope

Adds locks to simple dictionary access in concrete models (Property & Identity).
Add synchronized to Background Tasks dictionary access which is in Obj-C.
Add dispatch queue to operation repo and property executors to serialize access to shared data.

Implementation Details

This PR can be read commit by commit.

While Objective-C has the @synchronized directive, the Swift language lacks built-in synchronization features. Some modern concurrency features have been added to Swift but are not available until iOS 13.

Most of the changes started with me creating crashes by setting off multiple threads of execution, then testing again after making the fix. In addition to seeing the crashes stop, I also checked the logs to confirm that the behaviors themselves look correct.

This PR uses 2 methods to add concurrency safety.

Locks

Most lock options are unfair, which I believe does not matter for the most part except in rare scenarios such as the app developer calling addTag and removeTag with the same value on separate threads concurrently. Unfair locks allow the OS to be more performant by reducing context switching and prioritizing threads of higher priority.
I chose to use this for smaller components like models that have just an array / dict to synchronize access to.
In this PR, they are added to the OSPropertiesModel and OSIdentityModel to manage access to their dictionaries.

Private Serial Dispatch Queues

I chose to use private serial dispatch queues for the bigger components like executors and operation repo.
The code paths are more complex in these components, the accesses to shared data feel more like bigger tasks than single enqueues and dequeues, and the Operation Repo is already flushing on a global dispatch queue anyway.
Only 1 task will run at a time and keeps in FIFO order.
In concrete executors, it is possible for the executor to be flushing while a client response is received that modifies the request array.
Additionally for the property executor, there are some code paths that enqueue and update request but does not go through the operation repo, such as updating session count at the start of a new session.
In the Operation Repo, the entire flush process runs in a single block from reading deltas, creating them into requests, to executors flushing their own queues.
These are all scenarios where multiple threads can access and manipulate shared data.

Future Work

There are additional places that may be candidates for better concurrency safety but by their nature or how they are accessed, they are shielded by way of the concurrency changes in this PR that protect them at a another layer. This PR should encompass the vast majority of crashes due to concurrency.
Getting unit tests to work. Reproducing crashes was more flaky than in the example app, even with the exact same code. It could be there are more things going on in the example app that eat resources or more threads. Or, it is also likely that the unit tests complete before all background threads are completed.

Testing

Unit testing

Added 1 unit test to reproduce an Operation Repo concurrent flushing crash, but it is very flaky and crashes only on occasion.
But it can be used as a model of how to try to reproduce crashes.
Most testing has been done manually through the Example App (see below).

Manual testing

Example App on physical iPhone 13 with iOS 17.2
The process I followed are:

Try to reproduce a crash in a particular component by executing multiple threads.
For example, to try to reproduce a crash for an OSIdentityModel:

// 1. Setup the model, store, and listener

let identityModel = OSIdentityModel(aliases: [:], changeNotifier: OSEventProducer())
let store = OSModelStore<OSIdentityModel>(changeSubscription: OSEventProducer(), storeKey: OS_IDENTITY_MODEL_STORE_KEY).registerAsUserObserver()
let listener = OSIdentityModelStoreListener(store: store)

listener.start()

store.add(id: OS_IDENTITY_MODEL_KEY, model: identityModel, hydrating: false)

// 2. Read and write aliases, read and write "special" aliases `onesignalId` and `externalId`
for num in 0...99 {
    DispatchQueue.global().async {
        identityModel.addAliases([ OS_ONESIGNAL_ID: "value" ])
        identityModel.addAliases([ "alias\(num)": "value" ])
        identityModel.addAliases([ "alias\(num)": "value" ])
        identityModel.addAliases([ "alias\(num)": "value" ])
        identityModel.externalId
        identityModel.onesignalId
        identityModel.addAliases([ OS_ONESIGNAL_ID: "value" ])
        identityModel.addAliases([ "alias\(num)": "value" ])
        identityModel.addAliases([ "alias\(num)": "value" ])
        identityModel.removeAliases([OS_ONESIGNAL_ID])
        identityModel.externalId
        identityModel.onesignalId
        identityModel.addAliases([ "alias\(num)": "value" ])
        identityModel.addAliases([ "alias\(num)": "value" ])
    }
}

Another example, to try to reproduce the Operation Repo crashing:

// 1. Enqueue 10 Deltas to the Operation Repo
for num in 0...9 {
    OneSignalUserManagerImpl.sharedInstance.addTag(key: "tag\(num)", value: "value")
}

// 2. Flush the delta queue from 10 multiple threads concurrently
for _ in 0...9 {
    DispatchQueue.global().async {
        OSOperationRepo.sharedInstance.addFlushDeltaQueueToDispatchQueue()
    }
}

// 3. Add tags and flush at random intervals to confirm behavior is still as expected
for _ in 0...99 {
    DispatchQueue.global().asyncAfter(deadline: .now() + .milliseconds(.random(in: 0...4000))) {
        OSOperationRepo.sharedInstance.addFlushDeltaQueueToDispatchQueue()
    }
}
for num in 0...99 {
    DispatchQueue.global().asyncAfter(deadline: .now() + .milliseconds(.random(in: 0...4000))) {
        OneSignalUserManagerImpl.sharedInstance.addTag(key: "tag\(num)", value: "value")
    }
}

Make a change in that component
Try to reproduce the crash again and confirm it no longer crashes
Additionally, it is not sufficient to just "not crash", but also check that behavior is as expected when there are multiple threads of execution.
Repeat for other components.

Affected code checklist

Checklist

Overview

I have filled out all REQUIRED sections above
PR does one thing
Any Public API changes are explained in the PR details and conform to existing APIs

Testing

I have included test coverage for these changes, or explained why they are not needed
All automated tests pass, or I explained why that is not possible
I have personally tested this on my device, or explained why that is not possible

Final pass

Code is as readable as possible.
I have reviewed this PR myself, ensuring it meets each checklist item

This change is

* Add lock around appending to the queue * Add lock around entire logic of flushing the delta queue, including enqueuing and flushing in the executors

There was a report of a crash related to this dictionary and while I have not been able to reproduce by hammering many threads to call methods manipulating the tasks dictionary, access to them should be synchronized.

I was able to create crashes by hammering threads to add tags concurrently. Adding the lock prevented crashes, and behavior seems as expected. Unfortunately, the crashes were very flaky when trying to reproduce in tests, they would crash in much less than 50% of runs. Producing the crashes in the example app was much more consistent. Hence why there is no unit test to reproduce this.

I was able to create crashes by hammering threads to add aliases and get aliases concurrently. Adding the lock prevented crashes, and behavior seems as expected. I also had to lock the getters for onesignalId and externalId Unfortunately, the crashes were very flaky when trying to reproduce in tests, they would crash in much less than 50% of runs. Producing the crashes in the example app was much more consistent. Hence why there is no unit test to reproduce this.

* Reproduced crashes by creating multiple async threads that added and removed tags concurrently. * Added a private dispatch queue to synchronize access to the delta queue and request queue. * Crashes no longer happened after this change. * It is possible for the executor to be flushing while a client response is received and modify the request queue. * Additionally, there are some code paths that enqueue and update request but does not go through the operation repo, such as updating session count at the start of a new session.

* In a previous commit, an unfair lock was used to synchronize access to the delta queue and synchronize flushing behavior * A dispatch queue seems more appropriate for the Operation Repo to use considering it already polls and flushes on a global queue. * Without the lock or dispatch queue, I reproduced crashes by creating multiple async threads that either added tags or called to flush the operation repo. * With the dispatch queue, those crashes do no happen and behavior seems as expected.

nan-li changed the title ~~WIP Fix swift concurrency issues~~ [WIP] 🚧 Fix swift concurrency issues Feb 22, 2024

nan-li force-pushed the fix_swift_concurrency_issues branch 2 times, most recently from 1d2cdea to 09272b4 Compare February 22, 2024 20:12

nan-li added 7 commits March 4, 2024 09:27

Add a unit test to reproduce OpRepo crash

a554e30

Add a Lock class

d42c494

Add lock around accessing OpRepo delta queue

96b1ad7

* Add lock around appending to the queue * Add lock around entire logic of flushing the delta queue, including enqueuing and flushing in the executors

Synchronize the background tasks dictionary

93435a6

There was a report of a crash related to this dictionary and while I have not been able to reproduce by hammering many threads to call methods manipulating the tasks dictionary, access to them should be synchronized.

nan-li force-pushed the fix_swift_concurrency_issues branch from b4dc632 to 981aba6 Compare March 4, 2024 18:24

nan-li force-pushed the fix_swift_concurrency_issues branch from 981aba6 to c45c081 Compare March 4, 2024 19:15

nan-li changed the title ~~[WIP] 🚧 Fix swift concurrency issues~~ Improve Swift concurrency safety Mar 4, 2024

nan-li requested review from emawby, jkasten2, jinliu9508, shepherd-l and jennantilla March 4, 2024 20:41

nan-li assigned emawby Mar 4, 2024

jkasten2 approved these changes Mar 4, 2024

View reviewed changes

nan-li merged commit 9fba0fc into main Mar 5, 2024
4 checks passed

nan-li deleted the fix_swift_concurrency_issues branch March 5, 2024 17:23

nan-li mentioned this pull request Mar 6, 2024

Release 5.1.2 OneSignal/OneSignal-Flutter-SDK#854

Merged

nan-li mentioned this pull request Jun 25, 2024

[Bug] Add Dispatch Queues to all executors #1454

Merged

18 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve Swift concurrency safety #1376

Improve Swift concurrency safety #1376

nan-li commented Feb 22, 2024 •

edited

Loading

Improve Swift concurrency safety #1376

Improve Swift concurrency safety #1376

Conversation

nan-li commented Feb 22, 2024 • edited Loading

Description

One Line Summary

Details

Motivation

Scope

Implementation Details

Future Work

Testing

Unit testing

Manual testing

Affected code checklist

Checklist

Overview

Testing

Final pass

nan-li commented Feb 22, 2024 •

edited

Loading