Enable shutdown even in presence of poisoned timelines or tenants #3621

koivunej · 2023-02-16T10:03:07Z

I've observed that shutdown doesn't work if layer maps have been poisoned. I saw this with cargo build_testing and intentional poisoning: tests that ran afterwards started to fail because there were no free ports.

The initial idea was: we should have a way to poison all of the locks in a timeline and tenant, then make sure in a test that these tenants can still be /ignore'd and /load'ed, so that we gain more confidence that we can in fact use assert! in code. This could be an HTTP endpoint similar to the always-panic one which will be added in #3475.

Initially we should just ensure that poisoned locks are tolerated during shutdown, then on tenant ignore (which could work already).
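As a rough illustration of "tolerating poisoned locks during shutdown", here is a minimal std-only sketch. `LayerMap`, `poison`, `shutdown` and `lock_ignoring_poison` are hypothetical names made up for this example, not pageserver code. The only real API relied on is `PoisonError::into_inner`, which hands back the guard of a poisoned `std::sync::Mutex`.

```rust
use std::sync::{Arc, Mutex, MutexGuard, PoisonError};
use std::thread;

/// Hypothetical stand-in for per-timeline state guarded by a lock.
struct LayerMap {
    open_layers: Vec<String>,
}

/// Lock the mutex, recovering the guard even if a previous holder panicked.
fn lock_ignoring_poison<T>(m: &Mutex<T>) -> MutexGuard<'_, T> {
    m.lock().unwrap_or_else(PoisonError::into_inner)
}

/// Poison the lock on purpose by panicking in another thread while holding the guard.
fn poison(m: &Arc<Mutex<LayerMap>>) {
    let m = Arc::clone(m);
    // join() returns Err because the thread panicked; that is expected here.
    let _ = thread::spawn(move || {
        let _guard = m.lock().unwrap();
        panic!("intentional poisoning");
    })
    .join();
}

/// Shutdown path (assumption: it only needs to drain state, not trust it).
/// Returns how many open layers were dropped; never panics on a poisoned lock.
fn shutdown(m: &Mutex<LayerMap>) -> usize {
    let mut guard = lock_ignoring_poison(m);
    let dropped = guard.open_layers.len();
    guard.open_layers.clear();
    dropped
}
```

A test along the lines described above would call `poison` first and then check that `shutdown` still returns. Whether the recovered state is safe to use for anything beyond teardown is a separate question.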


koivunej · 2023-02-16T11:44:08Z

Alternatively, panicking/poisoning could lead to automatic tenant/timeline brokenness, but I'm unsure whether that state transition is supported (broken tenants appear only during startup).
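To make the "panic leads to a broken state" idea concrete, here is a small sketch using `std::panic::catch_unwind`. The `Tenant` and `TenantState` types and the `run_guarded` helper are hypothetical and deliberately simplified; this is not the pageserver's actual state machine, and it assumes the binary is built with unwinding panics (not `panic = "abort"`).

```rust
use std::panic::{self, AssertUnwindSafe};
use std::sync::{Mutex, PoisonError};

/// Hypothetical, simplified tenant states for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum TenantState {
    Active,
    Broken { reason: String },
}

struct Tenant {
    state: Mutex<TenantState>,
}

impl Tenant {
    fn new() -> Self {
        Tenant { state: Mutex::new(TenantState::Active) }
    }

    fn current_state(&self) -> TenantState {
        // Reading the state must work even if the lock itself got poisoned.
        self.state.lock().unwrap_or_else(PoisonError::into_inner).clone()
    }

    /// Run an operation; if it panics, move the tenant to Broken instead of
    /// leaving it Active with possibly inconsistent internals.
    fn run_guarded<F: FnOnce()>(&self, op: F) {
        if panic::catch_unwind(AssertUnwindSafe(op)).is_err() {
            let mut state = self.state.lock().unwrap_or_else(PoisonError::into_inner);
            *state = TenantState::Broken { reason: "operation panicked".to_string() };
        }
    }
}
```

For example, `tenant.run_guarded(|| panic!("assertion hit"))` leaves the tenant in `Broken`, while a closure that returns normally leaves it `Active`.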

LizardWizzard · 2023-02-22T19:34:34Z

It may be a good idea to try parking_lot once again to avoid poisoning.

koivunej · 2023-03-27T14:07:16Z

I don't think the poisoning is the issue. We should have a way to tear down gracefully when a panic happens, because there's no knowing whether any or all details of tenants/timelines with poisoned internals are okay to continue with. For example, see #3869 (comment): if an assertion is hit, then our metrics would be off, and those might get used for billing (they might not be used right now). Also, I have a PR that fixes the metrics because they are just wrong, but that now-draft predates the linked discussion (#3775).

koivunej · 2024-01-22T10:16:45Z

#6373 (comment) shows that when the layer flush loop dies (for whatever reason), we get stuck. However, it seems that we lost the feature of poisoning shutting down the flush task, so I'll just create a new issue and close this.
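One way to avoid waiters getting stuck when a background loop dies is to have each waiter hold the receiving end of a reply channel, so that the loop's death drops the sender and wakes the waiter with an error. The sketch below shows this with `std::sync::mpsc` and a plain thread. `FlushRequest`, `spawn_flush_loop` and `flush_and_wait` are hypothetical names, and the real flush loop is an async task, so treat this as an illustration of the idea rather than the fix tracked in the follow-up issue.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread;

/// Hypothetical flush request: the waiter supplies a channel for the reply.
struct FlushRequest {
    /// Simulate the loop dying while handling this request.
    fail: bool,
    done: Sender<()>,
}

/// Spawn a stand-in flush loop and return the handle used to submit requests.
fn spawn_flush_loop() -> Sender<FlushRequest> {
    let (tx, rx) = mpsc::channel::<FlushRequest>();
    thread::spawn(move || {
        for req in rx {
            if req.fail {
                // Unwinding drops `req.done`, which wakes the waiter with an error.
                panic!("flush loop died");
            }
            let _ = req.done.send(());
        }
    });
    tx
}

/// Request a flush and wait for it. Returns Err instead of hanging
/// if the loop is gone or dies before replying.
fn flush_and_wait(tx: &Sender<FlushRequest>, fail: bool) -> Result<(), &'static str> {
    let (done_tx, done_rx) = mpsc::channel();
    tx.send(FlushRequest { fail, done: done_tx })
        .map_err(|_| "flush loop is not running")?;
    done_rx
        .recv()
        .map_err(|_| "flush loop died before completing the flush")
}
```

The design choice here is that no explicit "loop died" notification is needed: dropping the reply sender during unwinding is enough to propagate the failure to every waiter that has a request in flight.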

koivunej added the c/storage/pageserver (Component: storage: pageserver) and a/tech_debt (Area: related to tech debt) labels on Feb 20, 2023

koivunej closed this as completed Jan 22, 2024

koivunej mentioned this issue Jan 22, 2024

Timeline: death of flush loop should propagate to waiters #6424

Open
