Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky test: test_subgraph_cache_control hangs #3213

Closed
BrynCooke opened this issue Jun 5, 2023 · 11 comments · Fixed by #3273
Closed

Flaky test: test_subgraph_cache_control hangs #3213

BrynCooke opened this issue Jun 5, 2023 · 11 comments · Fixed by #3273
Assignees

Comments

@BrynCooke
Copy link
Contributor

running 3 tests
test tests::test_subgraph_cache_control has been running for over 60 seconds
test tests::test_subgraph_cache_control_missing has been running for over 60 seconds
test tests::test_subgraph_cache_control_public has been running for over 60 seconds
@BrynCooke BrynCooke changed the title Flaky test test_subgraph_cache_control Flaky test: test_subgraph_cache_control hangs Jun 5, 2023
@garypen
Copy link
Contributor

garypen commented Jun 5, 2023

I've filed a rhai issue to help track this:

rhaiscript/rhai#723

@schungx
Copy link

schungx commented Jun 5, 2023

A test hanging is unlikely due to a Rhai error. It can be due to a multitude of reasons.

If it is a Rhai issue due to hash collisions, then it would fail with "function not found" errors. Not hung.

@garypen
Copy link
Contributor

garypen commented Jun 5, 2023

A test hanging is unlikely due to a Rhai error. It can be due to a multitude of reasons.

If it is a Rhai issue due to hash collisions, then it would fail with "function not found" errors. Not hung.

Yep. I don't know what I was thinking. I don't think this is a rhai error, so I'll investigate other possibilities for now. I'll close down the issue on the Rhai repo.

@garypen
Copy link
Contributor

garypen commented Jun 5, 2023

Although there's definitely something odd which is related to Rhai here. I have been running the test manually in our CI environment and I did get the test to fail with:

  left: `[Error { message: "rhai execution error: 'Unknown property 'cache_control_uncacheable' - a getter is not registered for type 'apollo_router::context::Context' (line 47, position 40)\nin closure call'", locations: [], path: None, extensions: {} }]`,

Usually, it will fail by hanging and I strongly suspect the cause in those instances is a deadlock, but I can't narrow that down at the moment.

@schungx
Copy link

schungx commented Jun 6, 2023

Yes, this looks like our old error.

Can you make sure you turn on the feature flag in your CI tests.

@BrynCooke
Copy link
Contributor Author

Bumping the priority of this ticket as the test fails so often it's affecting our velocity.

@garypen
Copy link
Contributor

garypen commented Jun 14, 2023

Ok. After a reasonable amount of digging I may have found the problem...

I finally managed to reproduce the hang on my laptop. I had the same symptoms as we saw in CI:

  • cache control tests hanging
  • CPUs pinned at X * 100% user for each test which was hanging (1 <= X <= 3)

I fired up lldb and saw this interesting stack:

* thread #2, name = 'tests::test_subgraph_cache_control_public'
  * frame #0: 0x0000000106450994 cache_control-1bd591db4cc13056`core::sync::atomic::atomic_load::hdac91bb402bfb70f(dst=0x0000000108b77c68, order=SeqCst) at atomic.rs:3158:9
    frame #1: 0x0000000103c595bc cache_control-1bd591db4cc13056`core::sync::atomic::AtomicUsize::load::h191fa74c675af5bb(self=0x0000000108b77c68, order=SeqCst) at atomic.rs:2264:26
    frame #2: 0x0000000103f40ffc cache_control-1bd591db4cc13056`rhai::config::hashing::HokmaLock::write::hf6d6c85803f5f3e9(self=0x0000000108b77c68) at hashing.rs:53:28
    frame #3: 0x0000000103f411dc cache_control-1bd591db4cc13056`rhai::config::hashing::SusLock$LT$T$GT$::get::he0a900449aa741c8(self=0x0000000108b777c8) at hashing.rs:140:25
    frame #4: 0x0000000103f414a4 cache_control-1bd591db4cc13056`rhai::config::hashing::SusLock$LT$T$GT$::get_or_init::he4a96b22194cf16d(self=0x0000000108b777c8, f={closure_env#0} @ 0x000000016e1515f7) at hashing.rs:163:9
    frame #5: 0x0000000103c09bfc cache_control-1bd591db4cc13056`rhai::eval::chaining::_$LT$impl$u20$rhai..engine..Engine$GT$::call_indexer_set::h2f6728bd994cde39 [inlined] rhai::eval::chaining::hash_idx::h118100b858a2bed7 at chaining.rs:27:6
    frame #6: 0x0000000103c09bf0 cache_control-1bd591db4cc13056`rhai::eval::chaining::_$LT$impl$u20$rhai..engine..Engine$GT$::call_indexer_set::h2f6728bd994cde39(self=0x000000015062a410, global=0x000000016e15d920, caches=0x000000016e15d9a8, target=0x000000016e155270, idx=0x000000016e152630, new_val=0x000000016e1526f0, is_ref_mut=true, pos=(line = 55, pos = 23)) at chaining.rs:93:20

Note: the code is doing an atomic load in rhai::config::hashing::HokmaLock::write

Let's look at that source:


    pub fn write(&'static self) -> WhenTheHokmaSuppression {
        loop {
            let previous = self.lock.load(Ordering::SeqCst);
            self.lock.store(1, Ordering::SeqCst);

            if previous != 1 {
                return WhenTheHokmaSuppression {
                    hokma: self,
                    state: previous,
                };
            }
        }
    }

That immediately looks suspicious. Shouldn't that be a compare and swap? If not, what's to stop another thread from being scheduled and changing the value of previous before we examine it? (answer: nothing).

Ok, we have at least a smoking gun. Let's see if we can figure out a minimal modification and then test to see if it resolves the problem. Here's my suggested fix for the "compare and swap" problem:

    pub fn write(&'static self) -> WhenTheHokmaSuppression {
        loop {
            // We are only interested in error results
            if let Err(previous) = self
                .lock
                .compare_exchange(1, 1, Ordering::SeqCst, Ordering::SeqCst)
            {
                // If we failed, previous cannot be 1
                return WhenTheHokmaSuppression {
                    hokma: self,
                    state: previous,
                };
            }
        }
    }

Note: We could probably do a more comprehensive fix here involving thread yielding and re-working the logic or perhaps just using an off the shelf spinlock, but anyway this is a minimally functional change.

I tested this by creating a test branch of the router which used my github code and then executing:

  • multiple runs of the rhai test suite (10s of times)
  • multiple runs of the failing test on my laptop
  • 5 runs of our complete test suite in our CI environment

All tests passed and I had 0 hangs. Previously, I was unable to get the CI test suite to run more than twice without hanging.

I'm reasonably satisfied at this point that if this isn't a 100% comprehensive fix to the hanging problem it at least improves the situation. I'll raise a PR against Rhai and reference this issue.

@schungx
Copy link

schungx commented Jun 15, 2023

Great find!!! I've always suspected something is amiss there but I'm not that much of an expert in this area to really fix it.

Thanks for digging into it!

@garypen garypen self-assigned this Jun 15, 2023
@garypen
Copy link
Contributor

garypen commented Jun 15, 2023

This will be fixed when rhai:1.15.0 is available.

@schungx
Copy link

schungx commented Jun 15, 2023

Thank you for catching this. Data races are very hard to catch and reproduce.

@abernix
Copy link
Member

abernix commented Jun 16, 2023

Thanks for the turn around, @schungx! Much appreciated!

garypen added a commit that referenced this issue Jun 18, 2023
One of our Rhai example has been regularly hanging in the CI builds for
the last couple of months. Investigation uncovered a race condition
within Rhai itself. This update brings in the fixed version of Rhai and
should eliminate the hanging problem.

fixes: #3213
garypen added a commit that referenced this issue Jun 19, 2023
One of our Rhai examples' tests have been regularly hanging in the CI
builds for the last couple of months. Investigation uncovered a race
condition within Rhai itself. This update brings in the fixed version of
Rhai and should eliminate the hanging problem.

fixes: #3213
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants