Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Swift concurrency safety #1376

Merged
merged 8 commits into from
Mar 5, 2024
Merged

Improve Swift concurrency safety #1376

merged 8 commits into from
Mar 5, 2024

Conversation

nan-li
Copy link
Contributor

@nan-li nan-li commented Feb 22, 2024

Description

One Line Summary

Add locking and synchronization to Swift codebase to prevent crashes in OneSignalOSCore and OneSignalUser.

Details

Motivation

SDK consumers have reported production crashes, and this PR aims to address the crashes that have been reported.

Scope

  • Adds locks to simple dictionary access in concrete models (Property & Identity).
  • Add synchronized to Background Tasks dictionary access which is in Obj-C.
  • Add dispatch queue to operation repo and property executors to serialize access to shared data.

Implementation Details

This PR can be read commit by commit.

While Objective-C has the @synchronized directive, the Swift language lacks built-in synchronization features. Some modern concurrency features have been added to Swift but are not available until iOS 13.

Most of the changes started with me creating crashes by setting off multiple threads of execution, then testing again after making the fix. In addition to seeing the crashes stop, I also checked the logs to confirm that the behaviors themselves look correct.

This PR uses 2 methods to add concurrency safety.

Locks

  • Most lock options are unfair, which I believe does not matter for the most part except in rare scenarios such as the app developer calling addTag and removeTag with the same value on separate threads concurrently. Unfair locks allow the OS to be more performant by reducing context switching and prioritizing threads of higher priority.
  • I chose to use this for smaller components like models that have just an array / dict to synchronize access to.
  • In this PR, they are added to the OSPropertiesModel and OSIdentityModel to manage access to their dictionaries.

Private Serial Dispatch Queues

  • I chose to use private serial dispatch queues for the bigger components like executors and operation repo.
  • The code paths are more complex in these components, the accesses to shared data feel more like bigger tasks than single enqueues and dequeues, and the Operation Repo is already flushing on a global dispatch queue anyway.
  • Only 1 task will run at a time and keeps in FIFO order.
  • In concrete executors, it is possible for the executor to be flushing while a client response is received that modifies the request array.
  • Additionally for the property executor, there are some code paths that enqueue and update request but does not go through the operation repo, such as updating session count at the start of a new session.
  • In the Operation Repo, the entire flush process runs in a single block from reading deltas, creating them into requests, to executors flushing their own queues.
  • These are all scenarios where multiple threads can access and manipulate shared data.

Future Work

  • There are additional places that may be candidates for better concurrency safety but by their nature or how they are accessed, they are shielded by way of the concurrency changes in this PR that protect them at a another layer. This PR should encompass the vast majority of crashes due to concurrency.

  • Getting unit tests to work. Reproducing crashes was more flaky than in the example app, even with the exact same code. It could be there are more things going on in the example app that eat resources or more threads. Or, it is also likely that the unit tests complete before all background threads are completed.

Testing

Unit testing

  • Added 1 unit test to reproduce an Operation Repo concurrent flushing crash, but it is very flaky and crashes only on occasion.
  • But it can be used as a model of how to try to reproduce crashes.
  • Most testing has been done manually through the Example App (see below).

Manual testing

Example App on physical iPhone 13 with iOS 17.2
The process I followed are:

  1. Try to reproduce a crash in a particular component by executing multiple threads.
    For example, to try to reproduce a crash for an OSIdentityModel:
// 1. Setup the model, store, and listener

let identityModel = OSIdentityModel(aliases: [:], changeNotifier: OSEventProducer())
let store = OSModelStore<OSIdentityModel>(changeSubscription: OSEventProducer(), storeKey: OS_IDENTITY_MODEL_STORE_KEY).registerAsUserObserver()
let listener = OSIdentityModelStoreListener(store: store)

listener.start()

store.add(id: OS_IDENTITY_MODEL_KEY, model: identityModel, hydrating: false)

// 2. Read and write aliases, read and write "special" aliases `onesignalId` and `externalId`
for num in 0...99 {
    DispatchQueue.global().async {
        identityModel.addAliases([ OS_ONESIGNAL_ID: "value" ])
        identityModel.addAliases([ "alias\(num)": "value" ])
        identityModel.addAliases([ "alias\(num)": "value" ])
        identityModel.addAliases([ "alias\(num)": "value" ])
        identityModel.externalId
        identityModel.onesignalId
        identityModel.addAliases([ OS_ONESIGNAL_ID: "value" ])
        identityModel.addAliases([ "alias\(num)": "value" ])
        identityModel.addAliases([ "alias\(num)": "value" ])
        identityModel.removeAliases([OS_ONESIGNAL_ID])
        identityModel.externalId
        identityModel.onesignalId
        identityModel.addAliases([ "alias\(num)": "value" ])
        identityModel.addAliases([ "alias\(num)": "value" ])
    }
}

Another example, to try to reproduce the Operation Repo crashing:

// 1. Enqueue 10 Deltas to the Operation Repo
for num in 0...9 {
    OneSignalUserManagerImpl.sharedInstance.addTag(key: "tag\(num)", value: "value")
}

// 2. Flush the delta queue from 10 multiple threads concurrently
for _ in 0...9 {
    DispatchQueue.global().async {
        OSOperationRepo.sharedInstance.addFlushDeltaQueueToDispatchQueue()
    }
}

// 3. Add tags and flush at random intervals to confirm behavior is still as expected
for _ in 0...99 {
    DispatchQueue.global().asyncAfter(deadline: .now() + .milliseconds(.random(in: 0...4000))) {
        OSOperationRepo.sharedInstance.addFlushDeltaQueueToDispatchQueue()
    }
}
for num in 0...99 {
    DispatchQueue.global().asyncAfter(deadline: .now() + .milliseconds(.random(in: 0...4000))) {
        OneSignalUserManagerImpl.sharedInstance.addTag(key: "tag\(num)", value: "value")
    }
}
  1. Make a change in that component
  2. Try to reproduce the crash again and confirm it no longer crashes
  3. Additionally, it is not sufficient to just "not crash", but also check that behavior is as expected when there are multiple threads of execution.
  4. Repeat for other components.

Affected code checklist

  • Notifications
    • Display
    • Open
    • Push Processing
    • Confirm Deliveries
  • Outcomes
  • Sessions
  • In-App Messaging
  • REST API requests
  • Public API changes

Checklist

Overview

  • I have filled out all REQUIRED sections above
  • PR does one thing
  • Any Public API changes are explained in the PR details and conform to existing APIs

Testing

  • I have included test coverage for these changes, or explained why they are not needed
  • All automated tests pass, or I explained why that is not possible
  • I have personally tested this on my device, or explained why that is not possible

Final pass

  • Code is as readable as possible.
  • I have reviewed this PR myself, ensuring it meets each checklist item

This change is Reviewable

@nan-li nan-li changed the title WIP Fix swift concurrency issues [WIP] 🚧 Fix swift concurrency issues Feb 22, 2024
@nan-li nan-li force-pushed the fix_swift_concurrency_issues branch 2 times, most recently from 1d2cdea to 09272b4 Compare February 22, 2024 20:12
* Add lock around appending to the queue
* Add lock around entire logic of flushing the delta queue, including enqueuing and flushing in the executors
There was a report of a crash related to this dictionary and while I have not been able to reproduce by hammering many threads to call methods manipulating the tasks dictionary, access to them should be synchronized.
I was able to create crashes by hammering threads to add tags concurrently. Adding the lock prevented crashes, and behavior seems as expected.
Unfortunately, the crashes were very flaky when trying to reproduce in tests, they would crash in much less than 50% of runs. Producing the crashes in the example app was much more consistent. Hence why there is no unit test to reproduce this.
I was able to create crashes by hammering threads to add aliases and get aliases concurrently. Adding the lock prevented crashes, and behavior seems as expected. I also had to lock the getters for onesignalId and externalId

Unfortunately, the crashes were very flaky when trying to reproduce in tests, they would crash in much less than 50% of runs. Producing the crashes in the example app was much more consistent. Hence why there is no unit test to reproduce this.
* Reproduced crashes by creating multiple async threads that added and removed tags concurrently.
* Added a private dispatch queue to synchronize access to the delta queue and request queue.
* Crashes no longer happened after this change.
* It is possible for the executor to be flushing while a client response is received and modify the request queue.
* Additionally, there are some code paths that enqueue and update request but does not go through the operation repo, such as updating session count at the start of a new session.
@nan-li nan-li force-pushed the fix_swift_concurrency_issues branch from b4dc632 to 981aba6 Compare March 4, 2024 18:24
* In a previous commit, an unfair lock was used to synchronize access to the delta queue and synchronize flushing behavior
* A dispatch queue seems more appropriate for the Operation Repo to use considering it already polls and flushes on a global queue.
* Without the lock or dispatch queue, I reproduced crashes by creating multiple async threads that either added tags or called to flush the operation repo.
* With the dispatch queue, those crashes do no happen and behavior seems as expected.
@nan-li nan-li force-pushed the fix_swift_concurrency_issues branch from 981aba6 to c45c081 Compare March 4, 2024 19:15
@nan-li nan-li changed the title [WIP] 🚧 Fix swift concurrency issues Improve Swift concurrency safety Mar 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants