-
-
Notifications
You must be signed in to change notification settings - Fork 334
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ci: fix thread sanitizer job #3303
Conversation
Current status: "unexpected exit, crash, or test timeout" when run under TSAN in the following test cases
The message appears after a TSAN data race warning in each of them, along with this message (not sure if it's a red herring): |
Right now I'm thinking we can merge this, and after merging main into #3290 et al, see if any of those failures resolve, then we can skip the remaining failures and open subsequent issues to investigate/fix them. What do you think @brustolin ? |
I changed #3290 to merge into this branch and reran all the unit tests against it, so let's see if any ANRTracker-based tsan warnings pop up there. |
Codecov Report
@@ Coverage Diff @@
## main #3303 +/- ##
=============================================
+ Coverage 89.276% 89.347% +0.071%
=============================================
Files 500 500
Lines 54739 54862 +123
Branches 19644 19682 +38
=============================================
+ Hits 48869 49018 +149
+ Misses 5002 4972 -30
- Partials 868 872 +4
... and 14 files with indirect coverage changes Continue to review full report in Codecov by Sentry.
|
The Thread Sanitizer CI job is still failing. @armcknight, what's the state of the PR? |
Performance metrics 🚀
|
Revision | Plain | With Sentry | Diff |
---|---|---|---|
a176fc4 | 1250.29 ms | 1257.88 ms | 7.59 ms |
257c2a9 | 1231.45 ms | 1252.12 ms | 20.67 ms |
6943de0 | 1237.67 ms | 1247.12 ms | 9.45 ms |
7419285 | 1209.53 ms | 1244.72 ms | 35.19 ms |
fd6a31c | 1204.73 ms | 1222.34 ms | 17.61 ms |
98cca71 | 1199.08 ms | 1227.36 ms | 28.28 ms |
9cc7e7c | 1228.90 ms | 1237.96 ms | 9.06 ms |
1db04d8 | 1250.20 ms | 1258.12 ms | 7.92 ms |
afd1a08 | 1207.78 ms | 1223.44 ms | 15.66 ms |
d8eb419 | 1221.91 ms | 1253.62 ms | 31.71 ms |
App size
Revision | Plain | With Sentry | Diff |
---|---|---|---|
a176fc4 | 22.84 KiB | 403.24 KiB | 380.39 KiB |
257c2a9 | 20.76 KiB | 401.36 KiB | 380.60 KiB |
6943de0 | 20.76 KiB | 393.33 KiB | 372.57 KiB |
7419285 | 20.76 KiB | 432.99 KiB | 412.22 KiB |
fd6a31c | 20.76 KiB | 436.50 KiB | 415.74 KiB |
98cca71 | 22.85 KiB | 411.14 KiB | 388.29 KiB |
9cc7e7c | 22.84 KiB | 403.13 KiB | 380.29 KiB |
1db04d8 | 20.76 KiB | 435.50 KiB | 414.74 KiB |
afd1a08 | 22.84 KiB | 402.57 KiB | 379.72 KiB |
d8eb419 | 22.85 KiB | 408.87 KiB | 386.02 KiB |
Previous results on branch: armcknight/ci/3200-fix-thread-sanitizer
Startup times
Revision | Plain | With Sentry | Diff |
---|---|---|---|
346c775 | 1242.31 ms | 1256.77 ms | 14.46 ms |
c58eb2f | 1235.42 ms | 1256.14 ms | 20.72 ms |
802d0eb | 1250.78 ms | 1257.14 ms | 6.36 ms |
b503810 | 1196.48 ms | 1226.68 ms | 30.20 ms |
c773f09 | 1222.34 ms | 1252.36 ms | 30.02 ms |
90da6b1 | 1230.67 ms | 1240.33 ms | 9.65 ms |
c43eb20 | 1228.26 ms | 1251.85 ms | 23.59 ms |
de0b274 | 1226.80 ms | 1250.34 ms | 23.54 ms |
ce5be3f | 1233.71 ms | 1264.20 ms | 30.49 ms |
f2bc3f1 | 1192.73 ms | 1224.49 ms | 31.76 ms |
App size
Revision | Plain | With Sentry | Diff |
---|---|---|---|
346c775 | 22.85 KiB | 413.08 KiB | 390.23 KiB |
c58eb2f | 22.85 KiB | 412.62 KiB | 389.77 KiB |
802d0eb | 22.85 KiB | 411.50 KiB | 388.65 KiB |
b503810 | 22.85 KiB | 413.00 KiB | 390.15 KiB |
c773f09 | 22.85 KiB | 411.50 KiB | 388.65 KiB |
90da6b1 | 22.85 KiB | 412.07 KiB | 389.22 KiB |
c43eb20 | 22.85 KiB | 412.07 KiB | 389.22 KiB |
de0b274 | 22.85 KiB | 412.58 KiB | 389.74 KiB |
ce5be3f | 22.85 KiB | 411.80 KiB | 388.95 KiB |
f2bc3f1 | 22.85 KiB | 411.49 KiB | 388.64 KiB |
As of the latest run in CI, it doesn't produce any TSAN warnings, but is again taking long enough to time out the job :/ Looking at the logs, I still see the following failures:
|
I've got it narrowed down to one failing test, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this PR could be split up into several PRs. Many changes make sense even if we don't enable the thread sanitizer in CI again. You don't have to do this as the PR is still manageable, but if it gets bigger, I would rather split it up.
@@ -60,6 +60,12 @@ + (instancetype)sharedInstance | |||
|
|||
+ (void)reset | |||
{ | |||
#if !TARGET_OS_WATCH |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this could solve some flaky reachability tests. IMO, this should be in an extra PR, so we can merge it right away. I don't want to steal your credit, so I leave it up to you if you want to open a PR or not. Otherwise, I can do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that's exactly what it was for. In general I did plan to split up some of the fixes in here... I've just been trying to plow through anything that could possibly be causing problems in the TSAN tests for now. Easier to test the combinations together, then split them apart for discussion/historical purposes afterwards!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #3356
// We accept the tradeoff that the SDK might not be fully initialized directly after | ||
// initializing it on a background thread because scheduling the init synchronously on the main | ||
// thread could lead to deadlocks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
m
: I think that comment should stay directly above [SentryThreadWrapper onMainThread:^{
let filteredFrames = await self.firstFrame() | ||
waitForAsyncToRun.fulfill() | ||
XCTAssertGreaterThanOrEqual(filteredFrames, 3, "The Stacktrace must include the async callers.") | ||
} | ||
wait(for: [waitForAsyncToRun], timeout: 1) | ||
|
||
wait(for: [waitForAsyncToRun], timeout: 10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
m
: Increasing this to 10 looks a bit scary cause it can slow down our test suite. What made you increase this value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking at the test logs, I saw it taking 5-6 seconds between the log statements that define the boundary completing the async task. no idea why though 🙁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can either special case it for tsan only, or skip these completely, but that seems unwise given what they're supposed to test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe tsan doesn't work well with backtrace_async
. I would rather special case them for tsan.
So, this finally got a green result on the tsan job, but I'm sure it's going to be really flaky moving forward. I think we could probably merge these changes, although I'm going to separate out some unrelated changes first. If it continues to cause problems, we should consider making TSAN jobs non-failable,, and instead only report TSAN warnings and test failures as comments/annotations on PRs instead. WDYT @philipphofmann ? |
372aa8d
to
d99c31e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, this finally got a green result on the tsan job, but I'm sure it's going to be really flaky moving forward. I think we could probably merge these changes, although I'm going to separate out some unrelated changes first.
I rather not enable tsan if it's flaky cause we won't trust it and ignore it most likely.
If it continues to cause problems, we should consider making TSAN jobs non-failable,, and instead only report TSAN warnings and test failures as comments/annotations on PRs instead. WDYT @philipphofmann ?
I don't think making them non-failable is a good idea cause I fear that we are going to ignore them if they are flaky. It could be a first step though, so your efforts here are not lost, if you think getting them stable is hard to achieve.
It's either that or we decide to throw them out, which seems unwise. It was finding actual issues. And, as I noted above, someone has been skipping tests anyways, as there is a suppression file with some test cases listed to skip. It's a matter of how much effort we're willing to put into getting them stable and keeping them there. I think it'll have better signal to noise ratio if it visibly comments on PRs that there is a warning on a given line of code, vs just a failing CI job that requires us to navigate through GitHub's clunky UI for checks and rerun it over and over again. Because what I'm seeing now are mostly flaky test failures, not failures specifically due to TSAN warnings like the text files I attached to this PR description. I think that's better in any case because even if it fails, it's still a pain to track down what the TSAN failure actually was in said clunky UI, it's like 7-8 clicks. It'd be much better to have it bubble up in a comment where it's immediately visible. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then, let's go with GH PR comments and see how it goes 👍. CI is complaining a bit. @armcknight, I think it's best to merge this PR after making TSAN non-failable, and add the GH PR comments in an extra PR.
|
||
// observed the async task taking a long time to finish if TSAN is attached | ||
var timeout = 1 | ||
if threadSanitizerIsPresent() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, when CI is green and TSAN is non-failable.
Issue to track work for reporting TSAN warnings to PRs: #3358 |
One of the UI tests started failing with no obviously related code changes. Still investigating. |
@armcknight, that could be because of #3372. Merging main back into your branch could solve the problem. |
Every log message needs to acquire a lock to evaluate whether the logger should log it. #3303 added this logic to fix a data race the thread sanitizer job found. As the SDK only calls the configure method when starting, we don't need to put a synchronized keyword around the code that evaluates the log level. This PR replaces the synchronized keyword by ignoring the thread sanitizer.
Every log message needs to acquire a lock to evaluate whether the logger should log it. #3303 added this logic to fix a data race the thread sanitizer job found. As the SDK only calls the configure method when starting, we don't need to put a synchronized keyword around the code that evaluates the log level. This PR replaces the synchronized keyword by ignoring the thread sanitizer.
Every log message needs to acquire a lock to evaluate whether the logger should log it. #3303 added this logic to fix a data race the thread sanitizer job found. As the SDK only calls the configure method when starting, we don't need to put a synchronized keyword around the code that evaluates the log level. This PR replaces the synchronized keyword by ignoring the thread sanitizer.
Every log message needs to acquire a lock to evaluate whether the logger should log it. #3303 added this logic to fix a data race the thread sanitizer job found. As the SDK only calls the configure method when starting, we don't need to put a synchronized keyword around the code that evaluates the log level. This PR replaces the synchronized keyword by ignoring the thread sanitizer.
Every log message needs to acquire a lock to evaluate whether the logger should log it. getsentry#3303 added this logic to fix a data race the thread sanitizer job found. As the SDK only calls the configure method when starting, we don't need to put a synchronized keyword around the code that evaluates the log level. This PR replaces the synchronized keyword by ignoring the thread sanitizer.
Every log message needs to acquire a lock to evaluate whether the logger should log it. getsentry#3303 added this logic to fix a data race the thread sanitizer job found. As the SDK only calls the configure method when starting, we don't need to put a synchronized keyword around the code that evaluates the log level. This PR replaces the synchronized keyword by ignoring the thread sanitizer.
fixes #3200 and #3069
Fixed the races that were detected locally. Here are the outputs from each:
data race in ANR tracker.txt
data race in SentryNetworkTracker.isCaptureFailedRequestsEnabled.txt
data race in SentrySpan.isFinished.txt
data race in SentryThreadInspectorTests.txt
data race in TestThread (1).txt
data race in TestThread (2).txt
TODO
#skip-changelog