Flaky test: test_subgraph_cache_control hangs #3213

BrynCooke · 2023-06-05T07:50:37Z

running 3 tests
test tests::test_subgraph_cache_control has been running for over 60 seconds
test tests::test_subgraph_cache_control_missing has been running for over 60 seconds
test tests::test_subgraph_cache_control_public has been running for over 60 seconds

The text was updated successfully, but these errors were encountered:

garypen · 2023-06-05T09:32:25Z

I've filed a rhai issue to help track this:

rhaiscript/rhai#723

schungx · 2023-06-05T09:46:26Z

A test hanging is unlikely due to a Rhai error. It can be due to a multitude of reasons.

If it is a Rhai issue due to hash collisions, then it would fail with "function not found" errors. Not hung.

garypen · 2023-06-05T12:58:19Z

A test hanging is unlikely due to a Rhai error. It can be due to a multitude of reasons.

If it is a Rhai issue due to hash collisions, then it would fail with "function not found" errors. Not hung.

Yep. I don't know what I was thinking. I don't think this is a rhai error, so I'll investigate other possibilities for now. I'll close down the issue on the Rhai repo.

garypen · 2023-06-05T15:08:28Z

Although there's definitely something odd which is related to Rhai here. I have been running the test manually in our CI environment and I did get the test to fail with:

  left: `[Error { message: "rhai execution error: 'Unknown property 'cache_control_uncacheable' - a getter is not registered for type 'apollo_router::context::Context' (line 47, position 40)\nin closure call'", locations: [], path: None, extensions: {} }]`,

Usually, it will fail by hanging and I strongly suspect the cause in those instances is a deadlock, but I can't narrow that down at the moment.

schungx · 2023-06-06T01:36:16Z

Yes, this looks like our old error.

Can you make sure you turn on the feature flag in your CI tests.

BrynCooke · 2023-06-08T11:37:14Z

Bumping the priority of this ticket as the test fails so often it's affecting our velocity.

garypen · 2023-06-14T18:15:40Z

Ok. After a reasonable amount of digging I may have found the problem...

I finally managed to reproduce the hang on my laptop. I had the same symptoms as we saw in CI:

cache control tests hanging
CPUs pinned at X * 100% user for each test which was hanging (1 <= X <= 3)

I fired up lldb and saw this interesting stack:

* thread #2, name = 'tests::test_subgraph_cache_control_public'
  * frame #0: 0x0000000106450994 cache_control-1bd591db4cc13056`core::sync::atomic::atomic_load::hdac91bb402bfb70f(dst=0x0000000108b77c68, order=SeqCst) at atomic.rs:3158:9
    frame #1: 0x0000000103c595bc cache_control-1bd591db4cc13056`core::sync::atomic::AtomicUsize::load::h191fa74c675af5bb(self=0x0000000108b77c68, order=SeqCst) at atomic.rs:2264:26
    frame #2: 0x0000000103f40ffc cache_control-1bd591db4cc13056`rhai::config::hashing::HokmaLock::write::hf6d6c85803f5f3e9(self=0x0000000108b77c68) at hashing.rs:53:28
    frame #3: 0x0000000103f411dc cache_control-1bd591db4cc13056`rhai::config::hashing::SusLock$LT$T$GT$::get::he0a900449aa741c8(self=0x0000000108b777c8) at hashing.rs:140:25
    frame #4: 0x0000000103f414a4 cache_control-1bd591db4cc13056`rhai::config::hashing::SusLock$LT$T$GT$::get_or_init::he4a96b22194cf16d(self=0x0000000108b777c8, f={closure_env#0} @ 0x000000016e1515f7) at hashing.rs:163:9
    frame #5: 0x0000000103c09bfc cache_control-1bd591db4cc13056`rhai::eval::chaining::_$LT$impl$u20$rhai..engine..Engine$GT$::call_indexer_set::h2f6728bd994cde39 [inlined] rhai::eval::chaining::hash_idx::h118100b858a2bed7 at chaining.rs:27:6
    frame #6: 0x0000000103c09bf0 cache_control-1bd591db4cc13056`rhai::eval::chaining::_$LT$impl$u20$rhai..engine..Engine$GT$::call_indexer_set::h2f6728bd994cde39(self=0x000000015062a410, global=0x000000016e15d920, caches=0x000000016e15d9a8, target=0x000000016e155270, idx=0x000000016e152630, new_val=0x000000016e1526f0, is_ref_mut=true, pos=(line = 55, pos = 23)) at chaining.rs:93:20

Note: the code is doing an atomic load in rhai::config::hashing::HokmaLock::write

Let's look at that source:


    pub fn write(&'static self) -> WhenTheHokmaSuppression {
        loop {
            let previous = self.lock.load(Ordering::SeqCst);
            self.lock.store(1, Ordering::SeqCst);

            if previous != 1 {
                return WhenTheHokmaSuppression {
                    hokma: self,
                    state: previous,
                };
            }
        }
    }

That immediately looks suspicious. Shouldn't that be a compare and swap? If not, what's to stop another thread from being scheduled and changing the value of previous before we examine it? (answer: nothing).

Ok, we have at least a smoking gun. Let's see if we can figure out a minimal modification and then test to see if it resolves the problem. Here's my suggested fix for the "compare and swap" problem:

    pub fn write(&'static self) -> WhenTheHokmaSuppression {
        loop {
            // We are only interested in error results
            if let Err(previous) = self
                .lock
                .compare_exchange(1, 1, Ordering::SeqCst, Ordering::SeqCst)
            {
                // If we failed, previous cannot be 1
                return WhenTheHokmaSuppression {
                    hokma: self,
                    state: previous,
                };
            }
        }
    }

Note: We could probably do a more comprehensive fix here involving thread yielding and re-working the logic or perhaps just using an off the shelf spinlock, but anyway this is a minimally functional change.

I tested this by creating a test branch of the router which used my github code and then executing:

multiple runs of the rhai test suite (10s of times)
multiple runs of the failing test on my laptop
5 runs of our complete test suite in our CI environment

All tests passed and I had 0 hangs. Previously, I was unable to get the CI test suite to run more than twice without hanging.

I'm reasonably satisfied at this point that if this isn't a 100% comprehensive fix to the hanging problem it at least improves the situation. I'll raise a PR against Rhai and reference this issue.

schungx · 2023-06-15T00:26:52Z

Great find!!! I've always suspected something is amiss there but I'm not that much of an expert in this area to really fix it.

Thanks for digging into it!

garypen · 2023-06-15T14:52:29Z

This will be fixed when rhai:1.15.0 is available.

schungx · 2023-06-15T16:13:04Z

Thank you for catching this. Data races are very hard to catch and reproduce.

abernix · 2023-06-16T12:54:23Z

Thanks for the turn around, @schungx! Much appreciated!

One of our Rhai example has been regularly hanging in the CI builds for the last couple of months. Investigation uncovered a race condition within Rhai itself. This update brings in the fixed version of Rhai and should eliminate the hanging problem. fixes: #3213

One of our Rhai examples' tests have been regularly hanging in the CI builds for the last couple of months. Investigation uncovered a race condition within Rhai itself. This update brings in the fixed version of Rhai and should eliminate the hanging problem. fixes: #3213

BrynCooke changed the title ~~Flaky test test_subgraph_cache_control~~ Flaky test: test_subgraph_cache_control hangs Jun 5, 2023

garypen mentioned this issue Jun 14, 2023

change HokmaLock::write() to use atomic compare_exchange rhaiscript/rhai#725

Merged

garypen self-assigned this Jun 15, 2023

garypen mentioned this issue Jun 18, 2023

update Rhai to 1.15.0 to fix issue with hanging example test #3273

Merged

garypen closed this as completed in #3273 Jun 19, 2023

o0Ignition0o mentioned this issue Jun 20, 2023

prep release: v1.21.0 #3280

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Flaky test: test_subgraph_cache_control hangs #3213

Flaky test: test_subgraph_cache_control hangs #3213

BrynCooke commented Jun 5, 2023

garypen commented Jun 5, 2023

schungx commented Jun 5, 2023

garypen commented Jun 5, 2023

garypen commented Jun 5, 2023

schungx commented Jun 6, 2023

BrynCooke commented Jun 8, 2023

garypen commented Jun 14, 2023

schungx commented Jun 15, 2023

garypen commented Jun 15, 2023

schungx commented Jun 15, 2023

abernix commented Jun 16, 2023

Flaky test: test_subgraph_cache_control hangs #3213

Flaky test: test_subgraph_cache_control hangs #3213

Comments

BrynCooke commented Jun 5, 2023

garypen commented Jun 5, 2023

schungx commented Jun 5, 2023

garypen commented Jun 5, 2023

garypen commented Jun 5, 2023

schungx commented Jun 6, 2023

BrynCooke commented Jun 8, 2023

garypen commented Jun 14, 2023

schungx commented Jun 15, 2023

garypen commented Jun 15, 2023

schungx commented Jun 15, 2023

abernix commented Jun 16, 2023