
Corrupted Metadata when Catalog Fails During Commit #2317

Closed
RussellSpitzer opened this issue Mar 10, 2021 · 12 comments

Comments

@RussellSpitzer
Member

The current logic for doing a commit is, at a high level, as follows:

  1. Write all files, including a new-metadata.json for the operation
  2. Acquire a lock
  3. Swap the pointer from old-metadata.json to new-metadata.json
  4. Release the lock
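
A minimal Python sketch of the flow above, including the problematic clean-up-on-any-failure step; all class and method names here are hypothetical, not Iceberg's actual API:

```python
# Hypothetical sketch of the commit flow; names are illustrative,
# not Iceberg's real API.
class FakeCatalog:
    """Records what the client did so the failure mode is visible."""
    def __init__(self, swap_error=None):
        self.swap_error = swap_error  # exception raised during the swap
        self.deleted = False
        self.locked = False

    def write_files(self, metadata):          # step 1
        pass

    def acquire_lock(self, table):            # step 2
        self.locked = True

    def swap_pointer(self, table, metadata):  # step 3
        if self.swap_error is not None:
            raise self.swap_error

    def release_lock(self, table):            # step 4
        self.locked = False

    def delete_files(self, metadata):
        self.deleted = True


def commit(catalog, table, new_metadata):
    catalog.write_files(new_metadata)              # 1. write new-metadata.json
    catalog.acquire_lock(table)                    # 2. acquire the lock
    try:
        catalog.swap_pointer(table, new_metadata)  # 3. swap the pointer
    except Exception:
        # Problem: this cleanup runs even when the swap succeeded
        # server-side and only the acknowledgment was lost in transit.
        catalog.delete_files(new_metadata)
        raise
    finally:
        catalog.release_lock(table)                # 4. release the lock
```

With this shape, a swap that raises, say, a timeout (which may mean the swap actually succeeded server-side) still triggers `delete_files`, which is exactly the corruption described next.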

If we fail during step 3, we will always attempt to clean up the files we created in step 1. This is a problem when step 3 (the swap) has succeeded server-side but the client has not received an acknowledgment. This leads to a state where:

  1. The catalog is pointing to new-metadata.json
  2. Our client is actively removing old-metadata.json and all files which were added for the operation

Future clients which are able to contact the HMS will see the new-metadata.json location, attempt to read it, and fail.

What's worse is that if there are multiple clients attempting to work with this table, there is a window of time in which a second client can read new-metadata.json before it is removed and build a new-new-metadata.json based on it. The new-new-metadata.json will then reference files which are in the process of being removed by the first client.

To avoid this we need to handle failures in stage 3 of table commits slightly differently. Essentially, we need to group failures into two categories:

  1. Failures reported by the catalog, indicating that it could not perform the operation for some reason
  2. Failures where the client has lost contact for some reason (basically everything else)

Type 1 failures can be cleaned up: we know the commit did not succeed, and the files we have generated are more or less useless.

Type 2 failures must still be reported as failures, but cannot be retried and cannot be cleaned up. We do not know whether the catalog pointer was swapped to our new metadata.json or not, and thus we cannot resolve the situation until communication with the HMS is restored. Since we are usually talking about a client here, I believe the right thing to do is to fail and let the user know they need to manually check and clean up files if necessary.
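
A small Python sketch of this split; the exception names are hypothetical and are not claims about Iceberg's real class hierarchy:

```python
class CommitFailedError(Exception):
    """Type 1: the catalog explicitly rejected the commit."""

class CommitStateUnknownError(Exception):
    """Type 2: contact was lost; the commit may or may not have landed."""

def handle_commit_failure(exc, cleanup):
    """Decide whether it is safe to clean up after a failed swap."""
    if isinstance(exc, CommitFailedError):
        cleanup()  # Type 1: the commit definitely did not happen
        raise exc
    # Type 2: anything else means the outcome is unknown, so the files
    # must be left in place and the user told to check manually.
    raise CommitStateUnknownError(
        "commit outcome unknown; inspect the table and clean up manually"
    ) from exc
```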

I haven't checked other catalog implementations to see if they have similar vulnerabilities, so we should probably check those as well. The HMS code in question is here

CC: @aokolnychyi , @karuppayya , @raptond

@aokolnychyi
Contributor

aokolnychyi commented Mar 10, 2021

@RussellSpitzer, will you submit a PR for this? I'll help review.

cc @pvary @rymurr @shardulm94 @rdblue too

@RussellSpitzer
Member Author

Yeah, I can start digging into this soon, but I'm only confident enough to handle the HMS use case here. I would suggest that all the other catalog creators do a similar check to see if they have the same issue.

@shardulm94
Contributor

shardulm94 commented Mar 11, 2021

Thanks @RussellSpitzer for filing this. We hit this issue at LinkedIn recently with HiveCatalog and were investigating it. Your analysis lines up with what we observed as well. cc: @omalley

@zhangdove
Contributor

We are using the Spark compute engine to read MySQL data and overwrite an Iceberg table (using HadoopCatalog). When the timeout against MySQL is too long, the data files are cleaned up while the new-metadata.json file still exists.

I'm not sure if it's the same as this issue, but it looks a bit like @RussellSpitzer's analysis.

@pvary
Contributor

pvary commented Mar 11, 2021

As a Hive user, I would hate having to handle Type 2 errors. Is this rare enough to leave unhandled?

Just brainstorming:

  • If we want to avoid this situation, we can acquire a shared lock for getting the snapshot as well. This would mean that during the swap operation, readers would wait until the write operation has either finished successfully or failed. Since ideally both the swap and getting the snapshot are very short operations, this should not cause too much trouble. OTOH, if there are plenty of reads, we could have trouble getting the exclusive lock for writes.
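
For illustration, a minimal Python sketch of such a shared/exclusive lock, assuming an in-process implementation; a real catalog would need the equivalent at the metastore level:

```python
import threading

class PointerLock:
    """Readers take a shared lock to fetch the current snapshot pointer;
    the committer takes an exclusive lock for the swap, so a read can
    never observe a half-finished swap."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_shared(self):
        with self._cond:
            while self._writer:          # wait out an in-flight swap
                self._cond.wait()
            self._readers += 1

    def release_shared(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()      # a waiting writer may proceed

    def acquire_exclusive(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()        # wait for readers to drain
            self._writer = True

    def release_exclusive(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

As noted above, the trade-off is that a steady stream of readers can starve the writer waiting for the exclusive lock.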

@marton-bod
Collaborator

When running into a Type 2 error, I think our retry logic would need to change. When retrying after a Type 2, we shouldn't clean up the files prematurely; instead, we should first attempt to reconnect to the catalog to double-check whether the earlier operation succeeded.

  • If we get an answer and our snapshot is in the history of the table, then we're essentially done and there's no need to do anything.
  • If we get an answer and our snapshot is not in the table history, we proceed with the file cleanup and the retry.
  • If we cannot get hold of the catalog persistently, then we give up the retry but still don't do any file cleanup. That ensures that if the operation did succeed, we're not messing things up. If it was unsuccessful, that could leave some dangling files temporarily, but those should be cleaned up by the Cleaner eventually, IIUC.
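
The reconnect-and-check idea above could be sketched in Python roughly as follows; all catalog method names here are hypothetical:

```python
import time

def resolve_unknown_commit(catalog, table, snapshot_id, cleanup,
                           attempts=3, backoff_s=1.0):
    """Try to learn the outcome of a commit whose acknowledgment was lost.

    Returns "committed", "failed", or "unknown". The catalog API used
    here is illustrative only.
    """
    for attempt in range(attempts):
        try:
            history = catalog.snapshot_history(table)
        except ConnectionError:
            time.sleep(backoff_s * (2 ** attempt))  # simple backoff
            continue
        if snapshot_id in history:
            return "committed"   # our snapshot landed: nothing to do
        cleanup()                # the commit definitely failed
        return "failed"
    # Catalog persistently unreachable: give up, but leave files alone.
    return "unknown"
```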

What do you think?

@rymurr
Contributor

rymurr commented Mar 11, 2021

Re Hive: @marton-bod's logic sounds correct to me, including the case of multiple committers. I guess it's a matter of accurately identifying type 2 vs type 1.

Re other catalogs: Nessie is susceptible to this issue as well. I think the fix is straightforward: on certain exceptions, don't delete metadata. Retrying the commit is safe, as it will create an empty commit. Polluting the commit log is annoying, but at least the data is safe. Will raise a PR for Nessie soon to fix this. Nessie avoids the branching behaviour described by @RussellSpitzer thanks to its forced linear history, so it's safe with multiple committers too.

@RussellSpitzer
Member Author

@pvary If we do not handle them, we basically corrupt the Iceberg table when they happen and require manual restoration.

I agree with @marton-bod: we can do any number of retries and reconnects, but eventually we have to give up, and at that point we must not clean up.

@pvary
Contributor

pvary commented Mar 11, 2021

I agree with @marton-bod: we can do any number of retries and reconnects, but eventually we have to give up, and at that point we must not clean up.

Is @marton-bod's assumption correct, that if we give up and the commit was not successful, then the dangling files will eventually be cleaned up by the Cleaner?

@RussellSpitzer
Member Author

Eventually the end user will need to run "remove orphan files" or something similar in the case of a Type 2 failure that is actually a failure and not just a network issue.

@pvary
Contributor

pvary commented Mar 11, 2021

Eventually the end user will need to run "remove orphan files" or something similar in the case of a Type 2 failure that is actually a failure and not just a network issue.

Thanks for the info!
So the tools are there, but manual intervention is needed.

The number of these manual interventions could be minimized with @marton-bod's suggestion: check the table status after a Type 2 failure and run cleanup automatically if the connection is available again and the commit was not successful. This solution looks good to me.

@RussellSpitzer
Member Author

Posted a WIP ^ for feedback while I work on tests.

coolderli pushed a commit to coolderli/iceberg that referenced this issue Apr 26, 2021
stevenzwu pushed a commit to stevenzwu/iceberg that referenced this issue Jul 28, 2021
homar added a commit to homar/trino that referenced this issue May 2, 2022
According to apache/iceberg#2317
it is better not to delete files after commit to metastore failed.
findepi pushed a commit to trinodb/trino that referenced this issue May 6, 2022
According to apache/iceberg#2317
it is better not to delete files after commit to metastore failed.

7 participants