Flaky test: test_subgraph_cache_control hangs #3213
Comments
I've filed a Rhai issue to help track this:
A hanging test is unlikely to be caused by a Rhai error; there are many other possible causes. If it were a Rhai issue due to hash collisions, the test would fail with "function not found" errors, not hang.
Yep. I don't know what I was thinking. I don't think this is a Rhai error, so I'll investigate other possibilities for now. I'll close the issue on the Rhai repo.
Although there's definitely something odd which is related to
Usually it fails by hanging, and I strongly suspect the cause in those instances is a deadlock, but I can't narrow that down at the moment.
Yes, this looks like our old error. Can you make sure you turn on the feature flag in your CI tests?
Bumping the priority of this ticket, as the test fails so often that it's affecting our velocity.
Ok. After a reasonable amount of digging I may have found the problem... I finally managed to reproduce the hang on my laptop. I had the same symptoms as we saw in CI:
I fired up lldb and saw this interesting stack:
Note: the code is doing an atomic load in
Let's look at that source:
That immediately looks suspicious. Shouldn't that be a compare and swap? If not, what's to stop another thread from being scheduled and changing the value of previous before we examine it? (answer: nothing). Ok, we have at least a smoking gun. Let's see if we can figure out a minimal modification and then test to see if it resolves the problem. Here's my suggested fix for the "compare and swap" problem:
Note: We could probably do a more comprehensive fix here, involving thread yielding and reworking the logic, or perhaps just use an off-the-shelf spinlock, but this is a minimal functional change. I tested it by creating a test branch of the router which used my GitHub code and then executing:
All tests passed and I had 0 hangs. Previously, I was unable to get the CI test suite to run more than twice without hanging. I'm reasonably satisfied at this point that, even if this isn't a 100% comprehensive fix for the hanging problem, it at least improves the situation. I'll raise a PR against Rhai and reference this issue.
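For anyone following along, the difference between a separate load-then-store and a compare-and-swap can be sketched in isolation. This is not the Rhai source or the actual patch; the function names (`try_claim_racy`, `try_claim_cas`, `count_cas_winners`) are hypothetical and only illustrate the general technique using `std::sync::atomic`.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Racy (hypothetical illustration): the check and the write are two separate
/// atomic operations, so another thread can be scheduled between them and
/// both threads can observe `false` and believe they own the flag.
pub fn try_claim_racy(flag: &AtomicBool) -> bool {
    let previous = flag.load(Ordering::SeqCst);
    if !previous {
        flag.store(true, Ordering::SeqCst);
        return true;
    }
    false
}

/// Compare-and-swap (hypothetical illustration): the flag is set to `true`
/// only if it is still `false`, as one indivisible step, so at most one
/// caller can succeed.
pub fn try_claim_cas(flag: &AtomicBool) -> bool {
    flag.compare_exchange(false, true, Ordering::SeqCst, Ordering::SeqCst)
        .is_ok()
}

/// Spawns `threads` threads that each try to claim the same flag once via
/// compare-and-swap, and returns how many of them succeeded.
pub fn count_cas_winners(threads: usize) -> usize {
    let flag = Arc::new(AtomicBool::new(false));
    let winners = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let flag = Arc::clone(&flag);
            let winners = Arc::clone(&winners);
            thread::spawn(move || {
                if try_claim_cas(&flag) {
                    winners.fetch_add(1, Ordering::SeqCst);
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    winners.load(Ordering::SeqCst)
}
```

With the compare-and-swap version, `count_cas_winners` always reports exactly one winner regardless of scheduling. Swapping in `try_claim_racy` would occasionally let more than one thread through, which is the kind of intermittent behaviour that is hard to reproduce on demand.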
Great find!!! I've always suspected something was amiss there, but I'm not enough of an expert in this area to really fix it. Thanks for digging into it!
This will be fixed when
Thank you for catching this. Data races are very hard to catch and reproduce.
Thanks for the quick turnaround, @schungx! Much appreciated!
The tests for one of our Rhai examples have been regularly hanging in the CI builds for the last couple of months. Investigation uncovered a race condition within Rhai itself. This update brings in the fixed version of Rhai and should eliminate the hanging problem. fixes: #3213