I've observed that shutdown doesn't work if layermaps have been poisoned. This was with `cargo build_testing` and intentional poisoning; the tests I then ran started to fail because there were no free ports.
Initial idea: we should have a way to poison all of the locks in a timeline and tenant, then make sure in a test that these tenants can still be `/ignore`d and `/load`ed, so that we gain more confidence that we are in fact able to use `assert!` in code. This could be an HTTP endpoint similar to the always-panic one which will be added in #3475.
Initially we should just ensure that poisoned locks are tolerated during shutdown, then on tenant ignore (which could work already).
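To illustrate what "tolerating a poisoned lock during shutdown" could look like, here is a minimal sketch using only the standard library. `LayerMap`, `poison` and `shutdown` are hypothetical stand-ins, not the pageserver's actual types or functions, and the real code may well use a different lock type. The sketch poisons a `Mutex` by panicking while the guard is held, then takes the guard back out of the `PoisonError` so cleanup can still proceed.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, Mutex, PoisonError};
use std::thread;

// Hypothetical stand-in for a timeline's layer map; not the real pageserver type.
#[derive(Default)]
struct LayerMap {
    layers: BTreeMap<u64, String>,
}

/// Poison the lock on purpose by panicking in another thread while holding the guard.
fn poison(map: &Arc<Mutex<LayerMap>>) {
    let map = Arc::clone(map);
    let result = thread::spawn(move || {
        let _guard = map.lock().unwrap();
        panic!("intentional poisoning");
    })
    .join();
    // The spawned thread panicked, so join() reports an error.
    assert!(result.is_err());
}

/// Shutdown path that tolerates poison: instead of unwrapping (and panicking
/// again), recover the guard from the PoisonError and continue with cleanup.
/// Returns the number of layers that were dropped.
fn shutdown(map: &Mutex<LayerMap>) -> usize {
    let mut guard = map.lock().unwrap_or_else(PoisonError::into_inner);
    let dropped = guard.layers.len();
    guard.layers.clear();
    dropped
}

fn main() {
    let map = Arc::new(Mutex::new(LayerMap::default()));
    map.lock().unwrap().layers.insert(1, "layer-1".to_string());

    poison(&map);
    assert!(map.is_poisoned());

    let dropped = shutdown(&map);
    println!("dropped {dropped} layers despite the poisoned lock");
}
```

Recovering the guard this way says nothing about whether the protected data is still consistent. It only keeps shutdown from getting stuck or panicking a second time, which is the narrow goal stated above.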
Alternatively, panicking/poisoning could lead to automatic tenant/timeline brokenness, but I'm unsure whether that state transition is supported (broken tenants appear only during startup).
I don't think the poisoning is the issue. We should have a way to gracefully tear down when a panic happens, because there's no knowing whether any or all details of tenants/timelines with poisoned internals are okay to continue with. For example, see #3869 (comment): if an assertion is hit, then our metrics would be off, and those might get used for billing (they might not be used right now). I also have a PR for fixing the metrics because they are just wrong, but that PR, now a draft, predates the linked discussion (#3775).
#6373 (comment) shows how we get stuck when the layer flush loop dies (for whatever reason). However, it seems that we have lost the feature where poisoning shuts down the flush task. So I'll just create a new issue and close this one.