Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preserve POSIX metadata #24

Merged
merged 8 commits into from
Oct 23, 2024
Merged

Preserve POSIX metadata #24

merged 8 commits into from
Oct 23, 2024

Conversation

alokito
Copy link

@alokito alokito commented Aug 30, 2024

Implements Issue #23

When creating tar archives, copy POSIX metadata from S3 headers into TAR headers.
When extracting tar archives, set S3 headers using POSIX metadata from TAR headers.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@yanko7
Copy link
Contributor

yanko7 commented Aug 30, 2024

Hello @alokito, thank you so much for your contribution! This looks great. Let me test it out and I'll merge it if all goes well.

@alokito
Copy link
Author

alokito commented Aug 30, 2024

Thanks @yanko7 ! Two things

  1. I'm a novice Go programmer, so I'd be grateful for any feedback no matter how nit-picky or small. Please do not hesitate to comment if there are any "code smells" or things that could be more idiomatically Go.
  2. I have tested all three code paths (small files, large files, in memory) but only with a small test case. I am likely to test this against some multi-terabyte archives in the near future; we may want to hold off on merging until it passes the "real world" test.

In light of #2 I'll mark this as "draft" for now.

@alokito alokito marked this pull request as draft August 30, 2024 20:17
mem_concat.go Outdated
@@ -25,7 +26,7 @@ func buildInMemoryConcat(ctx context.Context, client *s3.Client, objectList []*S
}

if estimatedSize < fileSizeMin {
data, err := tarGroup(ctx, client, objectList)
data, err := tarGroup(ctx, client, objectList, opts.PreservePOSIXMetadata)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be beneficial to pass the whole opts *S3TarS3Options object to tarGroup in case we need to pass more options in the future, that way the function type doesn't keep changing/growing.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you wish.

@@ -252,6 +288,18 @@ func concatObjAndHeader(ctx context.Context, svc *s3.Client, objectList []*S3Obj
return results, nil
}

func fetchS3ObjectHead(ctx context.Context, svc *s3.Client, nextObject *S3Obj) *s3.HeadObjectOutput {
Debugf(ctx, "fetching head for %s/%s", *&nextObject.Bucket, *nextObject.Key)
head, err := svc.HeadObject(ctx, &s3.HeadObjectInput{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So one thing to consider as well is that every call to HeadObject counts as a GET request. This will essentially double the pricing to build the object. For example if you're trying to create an in-memory tarball with 1,000 Amazon S3 objects, you will need 1,000 GET requests to pull the data, then 1,000 requests to fetch the Head.

Copy link
Author

@alokito alokito Sep 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I have implemented the suggestion for in-memory so we do not call this anymore in that code path, but in other cases where we only call ListObjects I don't think this is avoidable. This is why I decided to add the --preserve-posix-metadata command line argument, since there's no need to do these checks unless you are using FSX or Storage Gateway. Let me know if you have other suggestions.

mem_concat.go Outdated
@@ -240,6 +241,10 @@ func tarGroup(ctx context.Context, client *s3.Client, objectList []*S3Obj) ([]by
AccessTime: *o.LastModified,
Format: tarFormat,
}
if preservePOSIXMetadata {
head := fetchS3ObjectHead(ctx, client, o)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The GetObject API call returns a Metadata dictionary as well. It might be worth modifying the downloadS3Data function so it also returns the header data that you need. Have you checked to see if it contains what is needed to presesrve the POSIXMetadata?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I have implemented this.

@alokito alokito marked this pull request as ready for review October 5, 2024 14:37
@alokito
Copy link
Author

alokito commented Oct 5, 2024

@yanko7 I have run several validation tests in different scenarios, and I feel this is ready to be merged now. There were two additional changes since before:

  • Although it is not documented, ctime is also populated by AWS FSX so I added code to preserve that
  • When creating the TAR from parts, s3tar created a folder named output.tar/parts/ as well as a final file named output.tar. This confused FSX since the unix filesystem cannot have both a folder and file with the same name. I addressed this by changing the parts folder to output.tar.parts/, I hope this is acceptable.

@yanko7 yanko7 merged commit 69266e4 into awslabs:main Oct 23, 2024
1 check failed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants