Multipart uploads can lead to loss of metadata #1145

Open
pengisgood opened this issue Feb 12, 2015 · 10 comments
Labels
feature-request A feature should be added or improved. p2 This is a standard priority issue s3copy-extra-data s3sync s3

Comments

@pengisgood

Scenario 1:

`aws s3 sync s3://files/ s3://files-backup --recursive --profile <user_name>`
  • Syncing files between two buckets under the same account and in the same region loses some files' metadata, for example keys like x-amz-meta-json.

Scenario 2:

`aws s3 sync s3://files-dev/ s3://files-prod --recursive --profile <user_name>`
  • Syncing files between two buckets that belong to two different accounts in the same region (for example, one account for development and the other for production) also loses some files' metadata, for keys like x-amz-meta-json.

    Note: the files that lose metadata are the same in both scenarios above.

    Does anyone have the same issue?

@kyleknap
Contributor

This behavior is known. My guess is that the files losing their metadata are the ones being multipart copied. When a file is multipart copied, its metadata is not carried over automatically the way it is for a non-multipart copy. The problem is that transferring all of the information exactly in a multipart copy would require roughly 4 to 5 more calls on the object (such as HeadObject and GetObjectACL) to gather everything needed for an exact copy. Do you happen to know the sizes of the objects that lost their original metadata?
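
As a rough sketch of what those extra calls might look like with boto3 (the Python SDK), the function below reads the source object's headers with HeadObject and passes them explicitly when it starts the multipart copy. The function name, the part size, and the subset of headers carried over are assumptions for illustration, not the CLI's actual implementation. ACLs, tags, and encryption settings would need further calls that are omitted here.

```python
# Headers returned by HeadObject that CreateMultipartUpload also accepts.
# Illustrative subset, not exhaustive.
CARRIED_HEADERS = (
    "ContentType", "ContentEncoding", "ContentDisposition",
    "ContentLanguage", "CacheControl",
)

PART_SIZE = 64 * 1024 * 1024  # assumed part size; S3 requires >= 5 MiB except for the last part


def multipart_copy_with_metadata(client, src_bucket, src_key, dst_bucket, dst_key,
                                 part_size=PART_SIZE):
    """Copy a non-empty object part by part, re-sending its metadata explicitly.

    `client` is expected to behave like boto3.client("s3").
    Returns the number of parts copied.
    """
    head = client.head_object(Bucket=src_bucket, Key=src_key)
    size = head["ContentLength"]

    extra = {name: head[name] for name in CARRIED_HEADERS if name in head}
    extra["Metadata"] = head.get("Metadata", {})  # the x-amz-meta-* pairs

    upload = client.create_multipart_upload(Bucket=dst_bucket, Key=dst_key, **extra)
    upload_id = upload["UploadId"]

    parts = []
    try:
        for number, start in enumerate(range(0, size, part_size), start=1):
            end = min(start + part_size, size) - 1  # byte ranges are inclusive
            result = client.upload_part_copy(
                Bucket=dst_bucket, Key=dst_key, UploadId=upload_id,
                PartNumber=number,
                CopySource={"Bucket": src_bucket, "Key": src_key},
                CopySourceRange=f"bytes={start}-{end}",
            )
            parts.append({"PartNumber": number,
                          "ETag": result["CopyPartResult"]["ETag"]})
        client.complete_multipart_upload(
            Bucket=dst_bucket, Key=dst_key, UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        # Avoid leaving a dangling upload behind if any step fails.
        client.abort_multipart_upload(Bucket=dst_bucket, Key=dst_key, UploadId=upload_id)
        raise
    return len(parts)
```

Passing the client in as a parameter keeps the sketch easy to exercise against a stub before pointing it at real buckets.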

If that is the case, there is currently one workaround, which is to avoid multipart copies. Take a look at this pull request: #1122. In the config file you can set the size threshold at which multipart copies begin. If you set the threshold higher than the size of your largest file, but less than 5GB, you should be better off.
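
To make the threshold rule concrete, here is a small sketch that picks a value from a list of object sizes. The helper name is hypothetical, and it assumes that "5GB" means 5 GiB and that a transfer switches to multipart once the object size reaches the threshold.

```python
GIB = 1024 ** 3
MAX_SINGLE_COPY = 5 * GIB  # assumed byte value of the 5GB ceiling mentioned above


def suggest_multipart_threshold(object_sizes):
    """Return a threshold in bytes just above the largest object, or None.

    None means the workaround cannot cover every object, because the
    threshold would have to reach or exceed the single-copy ceiling.
    """
    threshold = max(object_sizes, default=0) + 1
    if threshold >= MAX_SINGLE_COPY:
        return None
    return threshold
```

For example, `suggest_multipart_threshold([10, 200, 50])` returns `201`. The result is a byte count that could then be written to the `s3.multipart_threshold` setting.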

@makmanalp

Not to be sour or ungrateful but I just sank a few hours into figuring this out, combined with #319.

I had issues with file metadata appearing and disappearing in s3 sync with no discernible pattern (initially I thought that s3 had some sort of unintentional "memory" of the metadata of files being deleted and recreated), and then I finally found #1145. The workaround of `aws configure set s3.multipart_threshold $MAX_SIZE --profile $PROFILENAME` worked.

This is the kind of known issue that should be in screaming red letters all over the s3 sync documentation. I just blew a few hours on it, which in the grand scheme of things is not that bad, because it could have been that I synced thousands of production data files over thinking it worked fine, and then deleted the originals. Silent data corruption errors are not cool, and I realize fixing something might not be simple on your end, but then in the interim please let your users know.

@kyleknap - you were right, it was the multipart thing.
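
One way to catch this kind of silent loss before deleting any originals is to compare object headers on both sides after a sync. The sketch below assumes a boto3-style S3 client passed in as `client`; the function name and the list of compared fields are illustrative choices, not an existing tool.

```python
COMPARED_FIELDS = ("Metadata", "ContentType", "ContentEncoding", "CacheControl")


def find_metadata_mismatches(client, src_bucket, dst_bucket, keys):
    """Return {key: {field: (source_value, dest_value)}} for fields that differ.

    Issues one HeadObject call per bucket for each key.
    """
    mismatches = {}
    for key in keys:
        src = client.head_object(Bucket=src_bucket, Key=key)
        dst = client.head_object(Bucket=dst_bucket, Key=key)
        diff = {
            field: (src.get(field), dst.get(field))
            for field in COMPARED_FIELDS
            if src.get(field) != dst.get(field)
        }
        if diff:
            mismatches[key] = diff
    return mismatches
```

A spot check like this is only meaningful if the sampled keys include objects above the multipart threshold, since smaller objects keep their metadata.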

@recastrodiaz

Another side effect of aws s3 sync using multipart uploads is that ObjectCreatedByPut events are no longer sent to AWS Lambda, thus Lambda functions relying on this trigger won't work for files bigger than 8MB.

@makmanalp's workaround seems to get around this issue too:

`aws configure set s3.multipart_threshold 128MB`
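
On the Lambda side, an alternative to changing the threshold is to subscribe the function to all object-created events, since S3 reports a finished multipart upload as `ObjectCreated:CompleteMultipartUpload` rather than `ObjectCreated:Put`. A minimal sketch follows, assuming the standard S3 notification record layout; the function names are hypothetical.

```python
from urllib.parse import unquote_plus


def created_objects(event):
    """Yield (bucket, key, event_name) for every object-created record."""
    for record in event.get("Records", []):
        name = record.get("eventName", "")
        if not name.startswith("ObjectCreated:"):
            continue  # skip deletions and other event types
        s3 = record["s3"]
        # Object keys arrive URL-encoded in notification payloads.
        yield s3["bucket"]["name"], unquote_plus(s3["object"]["key"]), name


def handler(event, context):
    # Hypothetical Lambda entry point; replace the body with real processing.
    return [f"{bucket}/{key} ({name})"
            for bucket, key, name in created_objects(event)]
```

Matching on the `ObjectCreated:` prefix means the handler treats Put, Post, Copy, and CompleteMultipartUpload records the same way, so the upload method no longer decides whether the function runs.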

@alecbz

alecbz commented Dec 18, 2017

@makmanalp +1000, this is a fucking insane, massive bug. It'd be one thing if sync just dropped the content-encoding altogether, but the way things are currently, one might think "huh, I should double-check that sync actually preserves content encoding" *try it on a few files* "okay, looks good, let's do it on everything".

@pauldraper

pauldraper commented Jan 31, 2018

Not to minimize the insane, massive bug, but this originates as a limitation of S3. aws/aws-sdk-java#367

Unfortunately S3 does not support the x-amz-metadata-directive header on InitiateMultipartTransfer or CopyPart requests. I've raised this to the service team and will come back on this issue when I hear back from them.

@gribbet I contacted the S3 service team and they are aware of the inconsistency - it's possible that they'll fix it in a future version of the service. However given there is a workaround there are higher priority issues to resolve.

"Given there is a workaround" is perhaps generous in the case of aws s3 sync/cp, and it could be argued that given S3's current inabilities, the CLI should choose a different default. Or it could implement the partial workaround itself.

This limitation appears in the CLI documentation, though it is somewhat buried considering the severity of the issue.

@alecbz

alecbz commented Jan 31, 2018

"it could be argued that given S3's current inabilities, the CLI should choose a different default."

Yeah, definitely, it seems much saner to default to never using multi-part copies for objects with a content-encoding or other metadata that will be dropped by it. (Or even to always drop the metadata, maybe with a flag to preserve it).
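
A sketch of what such a default could look like as a per-object decision is shown below. Everything here is hypothetical: the function, the strategy labels, and the list of headers treated as at risk are illustrative, and the input is assumed to be a HeadObject-style dict, which costs one extra request per object.

```python
MAX_SINGLE_COPY = 5 * 1024 ** 3  # assumed 5 GiB ceiling for a single-request copy

# ContentType is left out on purpose: S3 reports one for nearly every object.
AT_RISK_FIELDS = ("ContentEncoding", "CacheControl",
                  "ContentDisposition", "ContentLanguage")


def choose_copy_strategy(head, threshold):
    """Pick a copy strategy label from a HeadObject-style dict."""
    size = head["ContentLength"]
    has_extra = bool(head.get("Metadata")) or any(f in head for f in AT_RISK_FIELDS)
    if size < threshold:
        return "single"      # below the threshold, nothing changes
    if not has_extra:
        return "multipart"   # nothing to lose, keep the faster path
    if size < MAX_SINGLE_COPY:
        return "single"      # give up multipart to keep the metadata
    return "multipart-with-explicit-metadata"
```

The last branch covers objects too large for a single-request copy, where the metadata would have to be re-sent explicitly when the multipart copy is started.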

@ASayre
Contributor

ASayre commented Feb 6, 2018

Good Morning!

We're closing this issue here on GitHub, as part of our migration to UserVoice for feature requests involving the AWS CLI.

This will let us get the most important features to you, by making it easier to search for and show support for the features you care the most about, without diluting the conversation with bug reports.

As a quick UserVoice primer (if not already familiar): after an idea is posted, people can vote on the ideas, and the product team will be responding directly to the most popular suggestions.

We’ve imported existing feature requests from GitHub - Search for this issue there!

And don't worry, this issue will still exist on GitHub for posterity's sake. As it’s a text-only import of the original post into UserVoice, we’ll still be keeping in mind the comments and discussion that already exist here on the GitHub issue.

GitHub will remain the channel for reporting bugs.

Once again, this issue can now be found by searching for the title on: https://aws.uservoice.com/forums/598381-aws-command-line-interface

-The AWS SDKs & Tools Team

This entry can specifically be found on UserVoice at: https://aws.uservoice.com/forums/598381-aws-command-line-interface/suggestions/33168427-does-aws-s3-sync-will-loss-metadata-sometimes

@ASayre ASayre closed this as completed Feb 6, 2018
@jamesls
Member

jamesls commented Apr 6, 2018

Based on community feedback, we have decided to return feature requests to GitHub issues.

@zachigene

Hi,
We are uploading very large files (>20GB), so we must use multipart upload. Is there a plan for when this inconsistency will be fixed? It is an old issue, but for some reason it is being ignored.

@petrgrishin

😱

@tim-finnigan tim-finnigan changed the title from "Does aws s3 sync will loss metadata sometimes?" to "Multipart uploads can lead to loss of metadata" Nov 2, 2022
@tim-finnigan tim-finnigan added the p2 This is a standard priority issue label Nov 2, 2022