Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable shutdown even in presence of poisoned timelines or tenants #3621

Closed
koivunej opened this issue Feb 16, 2023 · 4 comments
Closed

Enable shutdown even in presence of poisoned timelines or tenants #3621

koivunej opened this issue Feb 16, 2023 · 4 comments
Labels
a/tech_debt Area: related to tech debt c/storage/pageserver Component: storage: pageserver

Comments

@koivunej
Copy link
Member

koivunej commented Feb 16, 2023

I've observed that shutdown doesn't work if layermaps have been poisoned; this was with cargo build_testing and intentional poisoning, then running tests which started to fail because of no free ports.

Initial idea was: We should have a way to poison all of the locks in timeline and tenant, then make sure in a test that these tenants remain /ignore and /load'ble, so we will gain more confidence that we are in fact able to assert! in code. This could be an http endpoint similar to always panic which will be added in #3475.

Initially we should just ensure that poisoned locks are tolerated during shutdown, then on tenant ignore (which could work already).

@koivunej
Copy link
Member Author

Alternatively panicking/poisoning could lead to automatic tenant/timeline brokeness, but unsure if that state transition is supported (broken tenants appear only during startup).

@koivunej koivunej added c/storage/pageserver Component: storage: pageserver a/tech_debt Area: related to tech debt labels Feb 20, 2023
@LizardWizzard
Copy link
Contributor

It may be a good idea to try parking-lot once again to avoid poisoning

@koivunej
Copy link
Member Author

I don't think the poisoning is the issue. We should have a way to gracefully teardown when a panic happens, because there's no knowing if any or all details of tenants/timelines with poisoned internals are okay to continue with. For example: #3869 (comment) -- if an assertion is hit, then our metrics would be off and those might get used for billing (might not be used right now). Also I have a PR regarding fixing the metrics because they are just wrong, but that now-draft predates the linked discussion (#3775).

@koivunej
Copy link
Member Author

#6373 (comment) shows how when layer flush loop dies (for whatever reason), we get stuck. However, it seems that we lost the feature of poisoning shutting down flush task. So I'll just create new issue and close this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a/tech_debt Area: related to tech debt c/storage/pageserver Component: storage: pageserver
Projects
None yet
Development

No branches or pull requests

2 participants