
Corrupted Metadata when Catalog Fails During Commit #2317

Closed
RussellSpitzer opened this issue Mar 10, 2021 · 12 comments

Comments

@RussellSpitzer
Member

The current logic for doing a commit is, at a high level, as follows:

  1. Write all files, including a new-metadata.json for the operation
  2. Acquire a lock
  3. Swap the pointer from old-metadata.json to new-metadata.json
  4. Release the lock
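
A minimal Python sketch of the flow above, including the problematic clean-up-on-any-failure step; all class and method names here are hypothetical, not Iceberg's actual API:

```python
# Hypothetical sketch of the commit flow; names are illustrative,
# not Iceberg's real API.
class FakeCatalog:
    """Records what the client did so the failure mode is visible."""
    def __init__(self, swap_error=None):
        self.swap_error = swap_error  # exception raised during the swap
        self.deleted = False
        self.locked = False

    def write_files(self, metadata):          # step 1
        pass

    def acquire_lock(self, table):            # step 2
        self.locked = True

    def swap_pointer(self, table, metadata):  # step 3
        if self.swap_error is not None:
            raise self.swap_error

    def release_lock(self, table):            # step 4
        self.locked = False

    def delete_files(self, metadata):
        self.deleted = True


def commit(catalog, table, new_metadata):
    catalog.write_files(new_metadata)              # 1. write new-metadata.json
    catalog.acquire_lock(table)                    # 2. acquire the lock
    try:
        catalog.swap_pointer(table, new_metadata)  # 3. swap the pointer
    except Exception:
        # Problem: this cleanup runs even when the swap succeeded
        # server-side and only the acknowledgment was lost in transit.
        catalog.delete_files(new_metadata)
        raise
    finally:
        catalog.release_lock(table)                # 4. release the lock
```

With this shape, a swap that raises, say, a timeout (which may mean the swap actually succeeded server-side) still triggers `delete_files`, which is exactly the corruption described next.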

If we fail during step 3, we will always attempt to clean up the files we created in step 1. This is a problem when step 3 (the swap) has succeeded server-side but the client has not received an acknowledgment. This leads to a state where:

  1. The catalog is pointing to new-metadata.json
  2. Our client is actively removing old-metadata.json and all files which were added for the operation

Future clients which are able to contact the HMS will see the new-metadata.json location, attempt to read it, and fail.

What's worse is that if there are multiple clients attempting to work with this table, there is a window of time in which a second client can read new-metadata.json before it is removed and build a new-new-metadata.json based on it. The new-new-metadata.json will then reference files which are in the process of being removed by the first client.

To avoid this we need to handle failures in stage 3 of table commits slightly differently. Essentially, we need to group failures into two categories:

  1. Failures reported by the catalog, indicating that it could not perform the operation for some reason
  2. Failures where the client has lost contact for some reason (basically everything else)

Type 1 failures can be cleaned up: we know the commit did not succeed, and the files we have generated are more or less useless.

Type 2 failures must still be reported as failures, but cannot be retried and cannot be cleaned up. We do not know whether the catalog pointer was swapped to our new metadata.json or not, and thus we cannot resolve the situation until communication with the HMS is restored. Since we are usually talking about a client here, I believe the right thing to do is to fail and let the user know they need to manually check and clean up files if necessary.
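
A small Python sketch of this split; the exception names are hypothetical and are not claims about Iceberg's real class hierarchy:

```python
class CommitFailedError(Exception):
    """Type 1: the catalog explicitly rejected the commit."""

class CommitStateUnknownError(Exception):
    """Type 2: contact was lost; the commit may or may not have landed."""

def handle_commit_failure(exc, cleanup):
    """Decide whether it is safe to clean up after a failed swap."""
    if isinstance(exc, CommitFailedError):
        cleanup()  # Type 1: the commit definitely did not happen
        raise exc
    # Type 2: anything else means the outcome is unknown, so the files
    # must be left in place and the user told to check manually.
    raise CommitStateUnknownError(
        "commit outcome unknown; inspect the table and clean up manually"
    ) from exc
```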

I haven't checked other catalog implementations to see if they have similar vulnerabilities, so we should probably check those as well. The HMS code in question is here

CC: @aokolnychyi , @karuppayya , @raptond

@aokolnychyi
Contributor

aokolnychyi commented Mar 10, 2021

@RussellSpitzer, will you submit a PR for this? I'll help review.

cc @pvary @rymurr @shardulm94 @rdblue too

@RussellSpitzer
Member Author

Yeah, I can start digging into this soon, but I'm only confident enough to handle the HMS use case here. I would suggest that all the other catalog creators do a similar check to see if they have the same issue.

@shardulm94
Contributor

shardulm94 commented Mar 11, 2021

Thanks @RussellSpitzer for filing this. We hit this issue at LinkedIn recently with HiveCatalog and were investigating it. Your analysis lines up with what we observed as well. cc: @omalley

@zhangdove
Contributor

We are using the Spark compute engine to read MySQL data and overwrite an Iceberg table (using HadoopCatalog). When the timeout against MySQL is too long, the data files are cleaned up while the new-metadata.json file still exists.

I'm not sure if it's the same as this issue, but it looks a bit like @RussellSpitzer's analysis.

@pvary
Contributor

pvary commented Mar 11, 2021

As a Hive user, I would hate having to handle Type 2 errors. Is this rare enough to leave unhandled?

Just brainstorming:

  • If we want to avoid this situation, we can acquire a shared lock for getting the snapshot as well. This would mean that during the swap operation, readers would wait until the write operation has either finished successfully or failed. Since ideally both the swap and getting the snapshot are very short operations, this should not cause too much trouble. OTOH, if there are plenty of reads, we could have trouble getting the exclusive lock for writes.
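
For illustration, a minimal Python sketch of such a shared/exclusive lock, assuming an in-process implementation; a real catalog would need the equivalent at the metastore level:

```python
import threading

class PointerLock:
    """Readers take a shared lock to fetch the current snapshot pointer;
    the committer takes an exclusive lock for the swap, so a read can
    never observe a half-finished swap."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_shared(self):
        with self._cond:
            while self._writer:          # wait out an in-flight swap
                self._cond.wait()
            self._readers += 1

    def release_shared(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()      # a waiting writer may proceed

    def acquire_exclusive(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()        # wait for readers to drain
            self._writer = True

    def release_exclusive(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

As noted above, the trade-off is that a steady stream of readers can starve the writer waiting for the exclusive lock.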

@marton-bod
Collaborator

When running into a Type 2 error, I think our retry logic would need to change. When retrying after a Type 2, we shouldn't clean up the files prematurely; instead, we should first attempt to reconnect to the catalog to double-check whether the earlier operation succeeded.

  • If we get an answer and our snapshot is in the history of the table, then we're essentially done and there's no need to do anything.
  • If we get an answer and our snapshot is not in the table history, we proceed with the file cleanup and the retry.
  • If we cannot get hold of the catalog persistently, then we give up the retry but still don't do any file cleanup. That ensures that if the operation did succeed, we're not messing things up. If it was unsuccessful, that could leave some dangling files temporarily, but those should be cleaned up by the Cleaner eventually, IIUC.
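
The reconnect-and-check idea above could be sketched in Python roughly as follows; all catalog method names here are hypothetical:

```python
import time

def resolve_unknown_commit(catalog, table, snapshot_id, cleanup,
                           attempts=3, backoff_s=1.0):
    """Try to learn the outcome of a commit whose acknowledgment was lost.

    Returns "committed", "failed", or "unknown". The catalog API used
    here is illustrative only.
    """
    for attempt in range(attempts):
        try:
            history = catalog.snapshot_history(table)
        except ConnectionError:
            time.sleep(backoff_s * (2 ** attempt))  # simple backoff
            continue
        if snapshot_id in history:
            return "committed"   # our snapshot landed: nothing to do
        cleanup()                # the commit definitely failed
        return "failed"
    # Catalog persistently unreachable: give up, but leave files alone.
    return "unknown"
```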

What do you think?

@rymurr
Contributor

rymurr commented Mar 11, 2021

Re Hive: @marton-bod's logic sounds correct to me, including the case of multiple committers. I guess it's a matter of accurately identifying type 2 vs type 1.

Re other catalogs: Nessie is susceptible to this issue as well. I think the fix is straightforward: on certain exceptions, don't delete metadata. Retrying the commit is safe, as it will create an empty commit. Polluting the commit log is annoying, but at least the data is safe. Will raise a PR for Nessie soon to fix this. Nessie avoids the branching behaviour described by @RussellSpitzer thanks to its forced linear history, so it's safe with multiple committers too.

@RussellSpitzer
Member Author

@pvary If we do not handle them, we basically corrupt the Iceberg table when they happen and require manual restoration.

I agree with @marton-bod: we can do any number of retries and reconnects, but eventually we have to give up, and at that point we must not clean up.

@pvary
Contributor

pvary commented Mar 11, 2021

I agree with @marton-bod: we can do any number of retries and reconnects, but eventually we have to give up, and at that point we must not clean up.

Is @marton-bod's assumption correct, that if we give up and the commit was not successful, then the dangling files will eventually be cleaned up by the Cleaner?

@RussellSpitzer
Member Author

Eventually the end user will need to run "remove orphan files" or something similar in the case of a Type 2 failure that is actually a failure and not just a network issue.

@pvary
Contributor

pvary commented Mar 11, 2021

Eventually the end user will need to run "remove orphan files" or something similar in the case of a Type 2 failure that is actually a failure and not just a network issue.

Thanks for the info!
So the tools are there, but manual intervention is needed.

The number of these manual interventions could be minimized with @marton-bod's suggestion: check the table status after a Type 2 failure and run cleanup automatically if the connection is available again and the commit was not successful. This solution looks good to me.

@RussellSpitzer
Member Author

Posted a WIP ^ for feedback while I work on tests.

coolderli pushed a commit to coolderli/iceberg that referenced this issue Apr 26, 2021
stevenzwu pushed a commit to stevenzwu/iceberg that referenced this issue Jul 28, 2021
homar added a commit to homar/trino that referenced this issue May 2, 2022
According to apache/iceberg#2317
it is better not to delete files after commit to metastore failed.
findepi pushed a commit to trinodb/trino that referenced this issue May 6, 2022
According to apache/iceberg#2317
it is better not to delete files after commit to metastore failed.

7 participants