Corrupted Metadata when Catalog Fails During Commit #2317
Comments
@RussellSpitzer, will you submit a PR for this? I'll help review. cc @pvary @rymurr @shardulm94 @rdblue too
Yeah, I can start digging into this soon, but I am only confident enough to handle the HMS use case here. I would suggest that all the other catalog creators do a similar check to see if they have the same issue.
Thanks @RussellSpitzer for filing this. We hit this issue at LinkedIn recently with HiveCatalog and were investigating it. Your analysis lines up with what we observed as well. cc: @omalley
We are using Spark to read MySQL data and overwrite an Iceberg table (via HadoopCatalog). When the read from MySQL takes too long and times out, the data files are cleaned up while the new-metadata.json file still exists. I'm not sure if it's the same as this issue, but it looks similar to @RussellSpitzer's analysis.
As a Hive user I would hate it if I had to handle Type 2 errors. Is this rare enough to leave unhandled? Just brainstorming here.
When running into a Type 2 error, I think our retry logic would need to change. When retrying after a Type 2 failure, we shouldn't clean up the files prematurely; instead, we should first attempt to reconnect to the catalog to double-check whether the earlier operation succeeded.
What do you think?
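To illustrate the reconnect-and-verify idea, here is a minimal sketch (the class, the method names, and the `Optional`-based catalog reader are hypothetical helpers, not existing Iceberg APIs):

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical helper: given a way to read the table's current metadata pointer
// from the catalog, decide whether an ambiguous (Type 2) commit actually went
// through before doing any cleanup.
public class AmbiguousCommitResolver {

  public enum Outcome { SUCCESS, FAILURE, UNKNOWN }

  /**
   * @param currentLocationReader reads the current metadata location from the catalog;
   *        returns Optional.empty() while the catalog is still unreachable
   * @param newMetadataLocation   the metadata file this client tried to commit
   */
  public static Outcome resolve(Supplier<Optional<String>> currentLocationReader,
                                String newMetadataLocation,
                                int maxChecks,
                                long backoffMillis) throws InterruptedException {
    for (int attempt = 0; attempt < maxChecks; attempt++) {
      Optional<String> current = currentLocationReader.get();
      if (current.isPresent()) {
        // The catalog is reachable again: the pointer tells us definitively what happened.
        return newMetadataLocation.equals(current.get()) ? Outcome.SUCCESS : Outcome.FAILURE;
      }
      // Catalog still unreachable: wait and check again.
      Thread.sleep(backoffMillis * (attempt + 1));
    }
    // Still unknown after all checks: report the failure but do NOT delete any files.
    return Outcome.UNKNOWN;
  }
}
```

The important branch is `UNKNOWN`: if the catalog never becomes reachable again, the commit is surfaced as a failure but nothing is deleted.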
Re Hive: @marton-bod's logic sounds correct to me, including the case of multiple committers. I guess it's a matter of accurately identifying Type 2 vs Type 1.
Re other catalogs: Nessie is susceptible to this issue as well. I think the fix is straightforward: for certain exceptions, don't delete metadata. Retrying the commit is safe, as it will just create an empty commit. Polluting the commit log is annoying, but at least the data is safe. I will raise a PR for Nessie soon to fix this. Nessie avoids the branching behaviour described by @RussellSpitzer because it forces a linear history, so it is safe with multiple committers too.
@pvary If we do not handle them, we basically corrupt the Iceberg table when they happen and require manual restoration. I agree with @marton-bod: we can do any number of retries and reconnects, but eventually we have to give up, and at that point we must not clean up.
Is @marton-bod's assumption correct that, if we give up and the commit was not successful, the dangling files will eventually be cleaned up by the cleaner?
Eventually the end user will need to run "remove orphan files" or something similar for the case of a Type 2 failure that is an actual failure and not just a network issue.
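For reference, a sketch of what that manual cleanup might look like with Iceberg's Spark actions API (entry points differ between Iceberg versions, and an active SparkSession is assumed; `table` is an already loaded Iceberg `Table`):

```java
import java.util.concurrent.TimeUnit;

import org.apache.iceberg.Table;
import org.apache.iceberg.actions.DeleteOrphanFiles;
import org.apache.iceberg.spark.actions.SparkActions;

public class OrphanFileCleanup {

  // Remove files that are no longer referenced by any table metadata.
  public static void removeOrphans(Table table) {
    long olderThan = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(3);
    DeleteOrphanFiles.Result result = SparkActions.get()
        .deleteOrphanFiles(table)
        // Only touch files older than a few days to avoid racing in-flight commits.
        .olderThan(olderThan)
        .execute();
    System.out.println("Removed orphan files: " + result.orphanFileLocations());
  }
}
```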
Thanks for the info! The number of these manual interventions could be minimized with @marton-bod's suggestion: after a Type 2 failure, check the table status and run cleanup automatically if the connection becomes available again and the commit was not successful. This solution looks good to me.
Posted a WIP ^ for feedback while I work on tests |
According to apache/iceberg#2317, it is better not to delete files after a commit to the metastore has failed.
Original issue description:
The current logic for doing a commit is, at a high level, as follows:
1. Write out the new files for the commit, including the new metadata file (new-metadata.json).
2. Prepare the catalog update (for the HMS, this involves locking the table and reading its current metadata location).
3. Swap the table's metadata pointer in the catalog from the old metadata.json to new-metadata.json.
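To make the steps concrete, here is a rough sketch of the flow; this is not the real HiveTableOperations code, and `CatalogPointer`, `readPointer`, and `compareAndSwapPointer` are stand-ins for the catalog interaction:

```java
// Simplified sketch of the commit flow described above.
public class CommitFlowSketch {

  interface CatalogPointer {
    String readPointer();                                            // current metadata location
    boolean compareAndSwapPointer(String expected, String updated);  // atomic pointer swap in the catalog
  }

  static void commit(CatalogPointer catalog, String newMetadataLocation) {
    // Step 1 has already happened: new-metadata.json (and new data files) are written out.
    // Step 2: read the pointer the swap must be based on.
    String currentLocation = catalog.readPointer();
    // Step 3: swap the pointer. This is the call whose failure mode matters here:
    // a lost acknowledgment leaves the client unsure whether the swap happened.
    boolean swapped = catalog.compareAndSwapPointer(currentLocation, newMetadataLocation);
    if (!swapped) {
      throw new IllegalStateException("Pointer changed concurrently; the new files can safely be cleaned up");
    }
  }
}
```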
If we fail during step 3, we will always attempt to clean up the files we created in step 1. This is a problem when step 3 (the swap) has actually succeeded server side but the client has not received an acknowledgment. This leads to a state where the HMS points at new-metadata.json while the client has deleted that file (and the files it references).
Future clients which are able to contact the HMS will see the new-metadata.json location, attempt to read it, and fail.
What's worse, if there are multiple clients attempting to work with this table, there is a window of time in which a second client can read new-metadata.json before it is removed and build a new-new-metadata.json based on it. The new-new-metadata.json will then contain references to files which are in the process of being removed by the first client.
To avoid this we need to handle failures in step 3 of table commits slightly differently. Basically, we need to group failures into two categories:
Type 1 failures can be cleaned up: we know the commit did not succeed, so the files we have generated are more or less useless.
Type 2 failures must still be reported as failures, but they cannot be retried and cannot be cleaned up. We do not know whether the catalog's metadata pointer now references our new metadata.json or not, so we cannot resolve the situation until communication with the HMS is restored. Since we are usually talking about the client's perspective here, I believe the right thing to do is to fail and let the user know they need to manually check and clean up files if necessary.
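One possible shape for this split, sketched rather than proposed as the concrete fix (`CommitFailedException` is Iceberg's existing exception for a commit that was definitively rejected; the helper method and the `Runnable` cleanup hook are hypothetical):

```java
import org.apache.iceberg.exceptions.CommitFailedException;

public class CommitFailureHandling {

  static void handleCommitFailure(RuntimeException failure,
                                  Runnable deleteNewMetadataAndDataFiles) {
    if (failure instanceof CommitFailedException) {
      // Type 1: we know the swap did not happen, so the newly written files are useless.
      deleteNewMetadataAndDataFiles.run();
      throw failure;
    }
    // Type 2: lost connection or timeout; the swap may or may not have happened.
    // Do NOT delete anything; surface the error and leave cleanup to the user
    // (or to a later check once the catalog is reachable again).
    throw failure;
  }
}
```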
I haven't checked other catalog implementations to see if they have similar vulnerabilities, so we should probably be checking those as well. The HMS code in question is here.
CC: @aokolnychyi , @karuppayya , @raptond