Pageserver crash when tenant detach is in progress can lead to detach never making progress (possible inconsistency on startup) #4284
Labels: c/storage/pageserver (Component: storage: pageserver), t/bug (Issue Type: Bug), triaged (bugs that were already triaged)
Separated from #2238, and #2238 (comment) in particular.
The sequence:
The console calls delete, the pageserver starts to delete files, then crashes and restarts. Because the file deletion order is not specified and there is no special detach marker that would prevent the pageserver from loading this tenant during startup, we can end up with a half-broken tenant that won't be deleted by a subsequent retry from the console.
We'll have a mark file for tenant deletion, as described in the RFC. See https://github.com/zenithdb/zenith/blob/4158e24e60d294e0f039395ea95dd87f8ab317d9/docs/rfcs/022-pageserver-delete-from-s3.md#L76
Tracking this as a separate issue so that we don't forget about it and so that we create separate tests for this case.
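A minimal sketch of the mark-file idea, under stated assumptions: the marker name `detach.mark`, the functions `detach_tenant_dir` and `startup_action`, and the `StartupAction` enum are all hypothetical and are not taken from the pageserver code or the RFC. The point it illustrates is the ordering: the marker is persisted before any data is removed and is itself removed last, so a restart can tell an interrupted detach apart from a healthy tenant.

```rust
use std::fs;
use std::io;
use std::path::Path;

// Hypothetical marker name; the real name and location are up to the implementation.
const DETACH_MARK: &str = "detach.mark";

#[derive(Debug, PartialEq)]
enum StartupAction {
    /// No marker found: load the tenant as usual.
    Load,
    /// Marker found: do not load, finish the interrupted detach instead.
    ResumeDetach,
}

/// Decide what to do with a tenant directory found during startup.
fn startup_action(tenant_dir: &Path) -> StartupAction {
    if tenant_dir.join(DETACH_MARK).exists() {
        StartupAction::ResumeDetach
    } else {
        StartupAction::Load
    }
}

/// Start or resume a detach. Safe to call again after a crash at any point.
fn detach_tenant_dir(tenant_dir: &Path) -> io::Result<()> {
    let mark = tenant_dir.join(DETACH_MARK);
    if !mark.exists() {
        // Persist the marker before touching any data.
        fs::File::create(&mark)?.sync_all()?;
        if cfg!(unix) {
            // On Unix, fsync the directory so the new entry survives a crash.
            fs::File::open(tenant_dir)?.sync_all()?;
        }
    }
    // Remove everything except the marker, in whatever order read_dir yields.
    for entry in fs::read_dir(tenant_dir)? {
        let entry = entry?;
        let path = entry.path();
        if path == mark {
            continue;
        }
        if entry.file_type()?.is_dir() {
            fs::remove_dir_all(&path)?;
        } else {
            fs::remove_file(&path)?;
        }
    }
    // The marker goes last, followed by the now-empty directory.
    fs::remove_file(&mark)?;
    fs::remove_dir(tenant_dir)
}
```

With this ordering, a crash between the two final removals leaves an empty directory without a marker, which a real implementation would still need to treat as garbage rather than as a tenant.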
After #4855 this is only relevant for detach. Timeline delete and tenant delete now remove files in a specific order, so interrupted operations can be safely resumed.
There is one more possible glitch: we have an ignored mark for tenants, and combined with the unspecified order of deletion in fs::remove_dir_all, it is possible that the mark will be removed first, then the pageserver crashes, and after that the tenant leaves the ignored state and appears as active. See also #4326
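One way to avoid that ordering problem, sketched under assumptions: instead of a single fs::remove_dir_all call over the tenant directory, remove every entry except the mark first, and remove the mark only at the end. The helper names `remove_all_except` and `remove_dir_mark_last` and the `keep` parameter are hypothetical and do not come from the pageserver code.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Phase 1: delete every entry in `dir` except the one named `keep`.
/// If the process crashes during this phase, the kept mark is still on disk.
fn remove_all_except(dir: &Path, keep: &str) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        if entry.file_name().to_str() == Some(keep) {
            continue;
        }
        if entry.file_type()?.is_dir() {
            fs::remove_dir_all(entry.path())?;
        } else {
            fs::remove_file(entry.path())?;
        }
    }
    Ok(())
}

/// Phase 1 followed by phase 2: the mark is removed last, then the empty directory.
fn remove_dir_mark_last(dir: &Path, keep: &str) -> io::Result<()> {
    remove_all_except(dir, keep)?;
    let mark = dir.join(keep);
    if mark.exists() {
        fs::remove_file(&mark)?;
    }
    fs::remove_dir(dir)
}
```

The sketch deliberately leaves out fsync calls; a real implementation would also need to make each phase durable before starting the next one.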