Remove data race caused by doing sample on rum thread #1177

cltnschlosser · 2023-02-24T19:58:00Z

What and why?

Seeing this crash after upgrading from 1.12.1 to 1.15.0:

Crashed: com.datadoghq.rum-monitor
0  MyApp                          0x24da04 VitalCPUReader.readVitalData() + 29 (VitalCPUReader.swift:29)
1  MyApp                          0x24db04 protocol witness for SamplingBasedVitalReader.readVitalData() in conformance VitalCPUReader + 4366867204 (<compiler-generated>:4366867204)
2  MyApp                          0x24df7c VitalInfoSampler.takeSample() + 86 (VitalInfoSampler.swift:86)
3  MyApp                          0x20982c specialized VitalInfoSampler.init(cpuReader:memoryReader:refreshRateReader:frequency:maximumRefreshRate:) + 4366587948 (<compiler-generated>:4366587948)
4  MyApp                          0x224808 specialized RUMViewScope.init(isInitialView:parent:dependencies:identity:path:name:attributes:customTimings:startTime:serverTimeOffset:) + 4366698504 (<compiler-generated>:4366698504)
5  MyApp                          0x2083dc RUMSessionScope.startView(on:context:) + 4366582748 (RUMViewScope.swift:4366582748)
6  MyApp                          0x207a10 RUMSessionScope.process(command:context:writer:) + 144 (RUMSessionScope.swift:144)
7  MyApp                          0x18e758 RUMApplicationScope.process(command:context:writer:) + 4366083928 (<compiler-generated>:4366083928)
8  MyApp                          0x1fa8e8 closure #1 in closure #1 in RUMMonitor.process(command:) + 597 (RUMMonitor.swift:597)
9  MyApp                          0x134de0 thunk for @callee_guaranteed () -> () + 4365716960 (<compiler-generated>:4365716960)
10 MyApp                          0x134e00 thunk for @escaping @callee_guaranteed () -> () + 4365716992 (<compiler-generated>:4365716992)
11 libdispatch.dylib              0x647c8 _dispatch_client_callout + 16
12 libdispatch.dylib              0x46b54 _dispatch_lane_barrier_sync_invoke_and_complete + 52
13 MyApp                          0x1fa7bc closure #1 in RUMMonitor.process(command:) + 4366526396 (<compiler-generated>:4366526396)
14 MyApp                          0x1fdc7c partial apply for closure #1 in RUMMonitor.process(command:) + 4366539900 (<compiler-generated>:4366539900)
15 MyApp                          0x1fe540 closure #1 in RUMMonitor.process(command:)partial apply + 4366542144
16 MyApp                          0x12f904 closure #1 in DatadogCoreFeatureScope.eventWriteContext(bypassConsent:forceNewBatch:_:) + 486 (DatadogCore.swift:486)
17 MyApp                          0x12a030 closure #1 in DatadogContextProvider.read(block:) + 105 (DatadogContextProvider.swift:105)
18 MyApp                          0x392b8 thunk for @escaping @callee_guaranteed () -> () + 4364686008 (<compiler-generated>:4364686008)
19 libdispatch.dylib              0x63850 _dispatch_call_block_and_release + 24
20 libdispatch.dylib              0x647c8 _dispatch_client_callout + 16
21 libdispatch.dylib              0x3f854 _dispatch_lane_serial_drain$VARIANT$armv81 + 604
22 libdispatch.dylib              0x402e4 _dispatch_lane_invoke$VARIANT$armv81 + 380
23 libdispatch.dylib              0x4a000 _dispatch_workloop_worker_thread + 612
24 libsystem_pthread.dylib        0x1b50 _pthread_wqthread + 284
25 libsystem_pthread.dylib        0x167c start_wqthread + 8

My initial suspicion was integer overflow with the UInt32s being used here (natural_t), but I tried to reproduce that and I think it would result in a slightly different stacktrace. So then I noticed this comment in VitalCPUReader:

    // TODO: RUMM-1276 appWillResignActive&appDidBecomeActive are called in main thread
    // IF readVitalData() is called from non-main threads, they must be synchronized

And I realized that it's crashing on the RUM thread (com.datadoghq.rum-monitor), not on the main thread.
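To illustrate what the TODO above is asking for, here is a minimal sketch of a reader whose state is guarded by a lock, so the main-thread lifecycle callbacks and a read from another thread cannot interleave. The class name, method signatures, and the idea of passing ticks in as a parameter are all assumptions made for this example; this is not the SDK's VitalCPUReader.

```swift
import Foundation

/// Hypothetical stand-in for a CPU tick reader (not the SDK's implementation).
/// All mutable state is guarded by one lock.
final class SynchronizedTickReader {
    private let lock = NSLock()
    private var ticksWhenResigningActive: UInt64? = nil
    private var totalInactiveTicks: UInt64 = 0

    /// Expected to be called on the main thread when the app goes inactive.
    func appWillResignActive(currentTicks: UInt64) {
        lock.lock(); defer { lock.unlock() }
        ticksWhenResigningActive = currentTicks
    }

    /// Expected to be called on the main thread when the app is active again.
    func appDidBecomeActive(currentTicks: UInt64) {
        lock.lock(); defer { lock.unlock() }
        if let start = ticksWhenResigningActive, currentTicks >= start {
            totalInactiveTicks += currentTicks - start
        }
        ticksWhenResigningActive = nil
    }

    /// Safe to call from any thread, e.g. a background monitor queue.
    func readActiveTicks(currentTicks: UInt64) -> UInt64 {
        lock.lock(); defer { lock.unlock() }
        let start = ticksWhenResigningActive ?? currentTicks
        let ongoingInactive = currentTicks >= start ? currentTicks - start : 0
        let inactive = totalInactiveTicks + ongoingInactive
        return currentTicks >= inactive ? currentTicks - inactive : 0
    }
}
```

The other option, which is what this PR is about, is to make sure every access happens on the same thread so no lock is needed.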

How?

Remove the initial sample call that was happening on the RUM thread.
This passes the current unit tests.

Alternatively, this call can be moved to RunLoop.main.perform {}, but that caused a test failure (could just be a test issue), so I went with this approach for now.
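For reference, the alternative could look roughly like the sketch below: the first sample is handed to the main run loop instead of being taken inline on whatever thread runs the initializer. The class, the `read` closure, and the injectable `scheduleOnMain` parameter are hypothetical names for this example only; the injection is there so a test can run the deferred block synchronously instead of spinning a run loop.

```swift
import Foundation

/// Hypothetical sampler; names and shape are illustrative only.
final class InitialSampleSketch {
    typealias Scheduler = (@escaping () -> Void) -> Void

    private(set) var samples: [Double] = []
    private let read: () -> Double

    init(
        read: @escaping () -> Double,
        // Assumption: by default, defer work to the main run loop.
        scheduleOnMain: Scheduler = { work in RunLoop.main.perform(work) }
    ) {
        self.read = read
        // The first sample is not taken inline on the caller's thread;
        // it is handed to the scheduler instead.
        scheduleOnMain { [weak self] in
            self?.takeSample()
        }
    }

    func takeSample() {
        samples.append(read())
    }
}
```

With the default scheduler, a test has to let the main run loop turn before asserting on `samples`, which may be the kind of test failure mentioned above.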

Review checklist

Feature or bugfix MUST have appropriate tests (unit, integration) - Existing tests pass
Make sure each commit and the PR mention the Issue number or JIRA reference
Add CHANGELOG entry for user facing changes

Custom CI job configuration (optional)

Run unit tests
Run integration tests
Run smoke tests

cltnschlosser · 2023-02-24T19:59:02Z

CC @ncreated and @maxep you were both helpful before :)

maciejburda · 2023-02-28T18:27:45Z

@cltnschlosser Thanks for contributing and explaining the issue so well!

We decided to follow the other approach you suggested and update the unit tests afterwards. We'll release this as a hotfix release as soon as possible. I'll keep the opened issue posted with the updates.

maciejburda · 2023-02-28T18:29:17Z

@cltnschlosser Thanks for contributing and explaining the issue so well!

We decided to follow the other approach you suggested, and update the unit tests afterwards. This will ensure we start gathering the vitals as soon as we instantiate the object. We'll release this as a hotfix as soon as possible.

cltnschlosser · 2023-02-28T21:00:40Z

Looks like #1181 is actually crashing due to the overflow, so you'll want to fix both issues. (Maybe the memory issue is causing the overflow, not sure.)

cltnschlosser · 2023-02-28T22:23:59Z

@maciejburda I updated this diff to take the initial sample, fixed the broken test, and added a new test as well. I also changed the internals of VitalCPUReader to use UInt64 instead of natural_t (UInt32), although it's a bit unclear if that will even help. According to the stacktrace in #1181 (which is more detailed than anything I have), it looks like this happens on line 24. If that's true, it would be this line underflowing, not overflowing:

    let ongoingInactiveTicks = ticks - (utilizedTicksWhenResigningActive ?? ticks)

i.e. utilizedTicksWhenResigningActive > ticks. I can't find great documentation on the CPU ticks stuff in Darwin, but maybe that value just overflows and starts over at 0?

EDIT: It could also just be the memory corruption / data race issue causing utilizedTicksWhenResigningActive to have a bad value. In that case the other changes here would fix the overflow (underflow) issue as well.
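To make the underflow concrete: Swift's `-` on unsigned integers traps at runtime when the result would be negative, so a baseline larger than the current tick count crashes regardless of whether the type is UInt32 or UInt64. A minimal sketch of a non-trapping version, using the standard library's `subtractingReportingOverflow`, is below. The function name and parameters are hypothetical and this is not what the diff does.

```swift
/// Hypothetical helper, not the SDK's code: ticks spent inactive since the
/// app resigned active, clamped to 0 instead of trapping on underflow.
func ongoingInactiveTicks(current: UInt64, whenResigningActive: UInt64?) -> UInt64 {
    let baseline = whenResigningActive ?? current
    // Plain `current - baseline` would trap at runtime if baseline > current.
    let (delta, didUnderflow) = current.subtractingReportingOverflow(baseline)
    return didUnderflow ? 0 : delta
}
```

Clamping like this only hides a bad baseline. If the bad value comes from the data race, the threading fix is still the real fix.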

maciejburda

I was about to jump on this issue, but I can see you've got it all right!
Thanks a ton for the great contribution (again!).

I feel like it's more likely that we are dealing with memory corruption / data race.
I'll make sure it's merged and released asap.

cltnschlosser · 2023-03-01T14:51:56Z

Unable to reproduce this locally:

/Users/vagrant/git/Tests/DatadogTests/Datadog/RUM/RUMVitals/VitalInfoSamplerTests.swift:38 - XCTAssertGreaterThan failed: ("1") is not greater than ("1")
/Users/vagrant/git/Tests/DatadogTests/Datadog/RUM/RUMVitals/VitalInfoSamplerTests.swift:40 - XCTAssertGreaterThan failed: ("1") is not greater than ("1")

When I run it locally the value is 6. Not sure what's happening, if this is 1 then other tests should be failing too.

maciejburda · 2023-03-01T14:56:40Z

Same, passes both locally and when using local CI CLI. Maybe we can try increasing the wait time? 🤔

I can take a closer look later today.

maciejburda · 2023-03-01T16:29:17Z

Made some experiments and this change seems to do the trick:
8948351
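The commit itself is not shown in this thread, so the following is only a sketch of one common way to take wall-clock timing out of a sampler test: the periodic trigger is injected, and the test fires it by hand instead of waiting for a real timer. `ManualTicker` and `CountingSampler` are hypothetical names and are not taken from the SDK or from that commit.

```swift
/// Hypothetical test double standing in for a repeating timer.
final class ManualTicker {
    private var block: (() -> Void)?

    /// Stores the block a real timer would call on every tick.
    func schedule(_ block: @escaping () -> Void) {
        self.block = block
    }

    /// Simulates one timer tick.
    func fire() {
        block?()
    }
}

/// Hypothetical sampler that takes one sample up front and one per tick.
final class CountingSampler {
    private(set) var sampleCount = 0

    init(ticker: ManualTicker) {
        takeSample() // initial sample
        ticker.schedule { [weak self] in
            self?.takeSample()
        }
    }

    func takeSample() {
        sampleCount += 1
    }
}
```

A test built this way asserts on an exact sample count after a known number of `fire()` calls, so it does not depend on how fast the CI machine runs.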

ncreated

Looks good 👍, but please make sure the test is not flaky before merging it.

Tests/DatadogTests/Datadog/RUM/RUMVitals/VitalInfoSamplerTests.swift

Remove data race caused by doing sample on rum thread

d4cfd61

cltnschlosser requested a review from a team as a code owner February 24, 2023 19:58

maciejburda mentioned this pull request Feb 28, 2023

Crash in VitalCPUReader.swift: arithmetic overflow #1181

Closed

Take initial sample. Also uses UInt64 for keeping track of background ticks.

95e0234

maciejburda approved these changes Mar 1, 2023

maciejburda mentioned this pull request Mar 1, 2023

PR 1177 independent build tests #1185

Closed

Change way of triggering timer's completion block

24382be

ncreated approved these changes Mar 1, 2023

maciejburda merged commit d71b629 into DataDog:develop Mar 1, 2023

maciejburda mentioned this pull request Mar 1, 2023

Release 1.16.0 #1186

Merged

3 tasks

cltnschlosser mentioned this pull request Mar 17, 2023

EXC_BAD_ACCESS: thunk for @escaping @callee_guaranteed @Sendable () -> () --- [__NSCFTimer fire] #1213

Closed

maciejburda mentioned this pull request Mar 23, 2023

Change initial sample collecting #1216

Merged

6 tasks

ncreated mentioned this pull request Mar 31, 2023

Dogfood recent changes #1232

Merged

6 tasks
