Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support upload directory in s3 transfer manager #2743

Merged
merged 10 commits into from
Oct 14, 2021

Conversation

zoewangg
Copy link
Contributor

@zoewangg zoewangg commented Sep 27, 2021

Description

#37
Support upload directory in s3 transfer manager
Sample code

UploadDirectory uploadDirectory =
            tm.uploadDirectory(UploadDirectoryRequest.builder()
                                                     .sourceDirectory(directory)
                                                     .bucket("bucket")
                                                     .overrideConfiguration(o -> o.recursive(false).followSymbolicLinks(true))
                                                     .build());

CompletedUploadDirectory completedUploadDirectory = uploadDirectory.completionFuture().get(5, TimeUnit.SECONDS);

assertThat(completedUploadDirectory.failedUploads()).isEmpty();
assertThat(completedUploadDirectory.successfulObjects()).hasSize(2).containsOnly(completedUpload, completedUpload2);

Users can configure upload settings on the request UploadDirectoryRequest#overrideConfiguration or on the transfer manager itself. Client-level configuration takes precedence over request-level config

        tm = S3TransferManager.builder()
                              .s3ClientConfiguration(b -> b.uploadDirectoryConfiguration(u -> u.followSymbolicLinks(true))
                              .build();

Testing

Added functional tests against different file systems as well as integ tests.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

Checklist

  • I have read the CONTRIBUTING document
  • Local run of mvn install succeeds
  • My code follows the code style of this project
  • My change requires a change to the Javadoc documentation
  • I have updated the Javadoc documentation accordingly
  • I have read the README document
  • I have added tests to cover my changes
  • All new and existing tests passed
  • A short description of the change has been added to the CHANGELOG
  • My change is to implement 1.11 parity feature and I have updated LaunchChangelog

License

  • I confirm that this pull request can be released under the Apache 2 license

Comment on lines 157 to 160
PutObjectRequest putObjectRequest = PutObjectRequest.builder()
.bucket(uploadDirectoryRequest.bucket())
.key(key)
.build();
Copy link
Contributor Author

@zoewangg zoewangg Sep 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, we are just passing bucket and key and I'm thinking about introducing a transfer extension (similar to the core execution interceptor) for this to allow users to modify other parameters in PutObjectRequest.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They could do this with an ExecutionInterceptor on the underlying S3 client, once we support configuring the underlying S3 client, right?

I suppose a putObjectRequestSupplier(Supplier<PutObjectRequest>) on the UploadDirectoryRequest would be more friendly, especially if they want request-level parameter configuration. +1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They could do this with an ExecutionInterceptor on the underlying S3 client, once we support configuring the underlying S3 client, right?

Yes, that's right.

I suppose a putObjectRequestSupplier(Supplier) on the UploadDirectoryRequest would be more friendly, especially if they want request-level parameter configuration. +1

+1. It would be much easier than what I proposed. I suppose we can use Consumer<PutObjectRequest.Builder> to make it clear that this is a modification.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that offering a way to hook into the resulting PutObjectRequest is crucial.

There are a few other use cases I can think of:

  1. I want to be able to supply my own Path-filters. For example, I may want to ensure I exclude sensitive file extensions. E.g., a Predicate<Path>.
  2. I want to be able to customize a key name. For example, I may want to force everything to lowercase. E.g., a configurable Function<Path,String>, or a UnaryOperator<String>.
  3. I want to be able to transform file contents before they are uploaded. For example, I may want to compress everything before uploading.

Not all of these are needed now, but it would be good to make sure we have some type of idea of the path forward here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice summary! I'll create a backlog item for this specific feature because it seems out of scope of this PR, and is additive.

*/
@SdkPublicApi
@SdkPreviewApi
public final class UploadDirectoryConfiguration implements ToCopyableBuilder<UploadDirectoryConfiguration.Builder,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are transfer-manager-level retries handled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have any transfer-manager-level retries for now. We can definitely add it in the future, something like automatically retrying failed files.

Comment on lines 157 to 160
PutObjectRequest putObjectRequest = PutObjectRequest.builder()
.bucket(uploadDirectoryRequest.bucket())
.key(key)
.build();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They could do this with an ExecutionInterceptor on the underlying S3 client, once we support configuring the underlying S3 client, right?

I suppose a putObjectRequestSupplier(Supplier<PutObjectRequest>) on the UploadDirectoryRequest would be more friendly, especially if they want request-level parameter configuration. +1

@aws aws deleted a comment from millems Sep 28, 2021
@@ -42,6 +42,7 @@
private final Double targetThroughputInGbps;
private final Integer maxConcurrency;
private final ClientAsyncConfiguration asyncConfiguration;
private final UploadDirectoryConfiguration uploadDirectoryConfiguration;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since TransferManager is a construct that sits on top of S3Client, I do feel like this configuration belongs better at the TransferManager-level.

Currently we only have UploadDirectoryConfiguration, once we have a similar Download configuration, how would we wish to handle any overlap in both configurations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, agreed. We will allow customers to pass s3 client in the future and will likely make s3ClientConfiguration and s3Client mutually exclusive, which would make configuring uploadDirectoryConfiguration awkward if we keep it inside s3ClientConfiguration.

I'll create a TransferManagerConfiguration and move the options there for now. There are other things we need to consider, e.g., do we want to have separate upload directory and download directory configuration? Separating them is more flexible, but will increase the verbosity and affect the usability.

Do you think this is a blocker to this PR or is it fine to address this when we have the API surface area review(I'll still move the UploadDirectoryConfiguration to TransferManagerConfiguration)?

new TransferConfigurationOption<>("UploadDirectoryFileVisitOption", Boolean.class);


private static final int DEFAULT_UPLOAD_DIRECTORY_MAX_DEPTH = Integer.MAX_VALUE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider a more conservative default to protect against cycles? 2 billion is a pretty large default, but I'm not sure how to settle on the right value. Maybe we could try to estimate the maximum possible depth of a typical file system.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Added a TODO for this. We will revisit all default settings altogether.

Comment on lines 157 to 160
PutObjectRequest putObjectRequest = PutObjectRequest.builder()
.bucket(uploadDirectoryRequest.bucket())
.key(key)
.build();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that offering a way to hook into the resulting PutObjectRequest is crucial.

There are a few other use cases I can think of:

  1. I want to be able to supply my own Path-filters. For example, I may want to ensure I exclude sensitive file extensions. E.g., a Predicate<Path>.
  2. I want to be able to customize a key name. For example, I may want to force everything to lowercase. E.g., a configurable Function<Path,String>, or a UnaryOperator<String>.
  3. I want to be able to transform file contents before they are uploaded. For example, I may want to compress everything before uploading.

Not all of these are needed now, but it would be good to make sure we have some type of idea of the path forward here.

@zoewangg zoewangg force-pushed the zoewang/tmUploadDirectory branch from 3be1e08 to 0228ea5 Compare October 5, 2021 17:02
Comment on lines 84 to 85
CompletableFuture.runAsync(() -> doUploadDirectory(returnFuture, uploadDirectoryRequest),
transferConfiguration.option(TransferConfigurationOption.EXECUTOR));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CompletableFuture.runAsync and ignoring the result scares me. If doUploadDirectory fails without completing the returnFuture, we'll never complete the returnFuture.

Should we just do the delegating to the executor ourselves and add the exception handling, or should we rely on the result of CompletableFuture.runAsync instead of managing our own returnFuture?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I can add exception handling as well as try-catch logic in doUploadDirectory. I'm not sure how else we can implement cancellation logic w/o managing our own returnFuture though.

*/
@SdkPublicApi
@SdkPreviewApi
public interface CompletedUploadDirectory extends CompletedTransfer {
public final class CompletedUploadDirectory implements CompletedTransfer {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this implement CompletedFileTransfer instead?

@@ -20,19 +20,15 @@
import software.amazon.awssdk.annotations.SdkPublicApi;

/**
* A failed single file upload transfer.
* Represents a completed file transfer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we specify if this is for concrete files only, or also directories? Both are represented via File/Path, so it's not immediately obvious to me.

import software.amazon.awssdk.annotations.SdkPublicApi;

/**
* Represents a failed single file transfer in a multi-file transfer operation such as
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we just have a common interface for any "file based transfer"? That would include:

  1. Requests to upload a single file
  2. Requests to upload a directory of files
  3. The resulting individual single file transfers from \2

They all point to a Path. I'm not sure if we need to distinguish at the interface-level beyond that?

}).collect(Collectors.toList());
entries.forEach(path -> {
CompletableFuture<CompletedUpload> future =
uploadSingleFile(uploadDirectoryRequest, failedUploads, path, phaser);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job extracting helper methods like uploadSingleFile (and in other places). It definitely helps readability.

// Replace "\" (Windows FS) with "/"
return relativePathName.replace('\\', '/');
String separator = fileSystem.getSeparator();
return relativePathName.replace(separator, delimiter);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe repeated calls to replace(..) can be very expensive (enough to show up on a profiler in a non-trivial way). Can we save this as a Pattern?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Glad you brought it up! 😄 🤝 (microbenchmark enthusiasts handshaking). In Java 9+, String.replace actually doesn't use Pattern under the hood anymore, and it is much faster compared with String.replaceAll which uses pattern matching. Per this blog post, in Java 11, String.replace performs much better than String.replaceAll. So I think we should optimize for Java 11. Ideally, we should prefer String.replace(char, char), but this does not work for our case.

I think we can add optimization for the default case where the delimiter is not provided here by using String.replace(char, char).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. I was not aware of some of these optimizations. Will help do away with some of the boilerplate needed. But since the SDK depends on Java 8, I think it's reasonable to say we should target performance at the Java 8 level. We can have conditional version checks at runtime if really needed, but we should at least default to supporting Java 8. I also suspect that a large number of users are still running on 8.

Comment on lines 89 to 93
.handle((ignore, t) -> {
// should never execute this
returnFuture.completeExceptionally(t);
return ignore;
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we treat doUploadDirectory(..) as entirely synchronous, we can eliminate its inner try/catch block, and rely on this logic to propagate exceptions.

Similar suggestion for the non-exceptional return case. E.g., if possible, make doUploadDirectory(..) return a value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the try-catch block in doUploadDirectory. Not sure about the benefit of making doUploadDirectory returning a value since we are managing our own returnFuture

// Replace "\" (Windows FS) with "/"
return relativePathName.replace('\\', '/');
String separator = fileSystem.getSeparator();
return relativePathName.replace(separator, delimiter);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. I was not aware of some of these optimizations. Will help do away with some of the boilerplate needed. But since the SDK depends on Java 8, I think it's reasonable to say we should target performance at the Java 8 level. We can have conditional version checks at runtime if really needed, but we should at least default to supporting Java 8. I also suspect that a large number of users are still running on 8.

@@ -120,6 +120,7 @@ public static Builder builder() {
return DefaultBuilder.class;
}

// TODO: consider consolidating maxDepth and recursive
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we agree on this change, I think we should considering doing it now (as part of the initial directory release), to minimize breaking changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern is that we might change it again before GA. I'd prefer to do it once when we reach to the point where APIs are close be finalized.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood, but the more I think about it, the more I think we have a pretty strong precedent to say that the boolean isn’t needed. Namely, Files.walk(..) does not accept a boolean either, which we are ultimately delegating to.

https://docs.oracle.com/javase/8/docs/api/java/nio/file/Files.html#walk-java.nio.file.Path-int-java.nio.file.FileVisitOption...-

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm still on the fence between this and getting rid of maxDepth because imo most customers probably only care about recursive or not recursive (this is also how AWS CLI s3 cp is designed I think), but I don't have data. It seems the only use-case for maxDepth is to prevent OOM.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Maybe we could just use recursive(false) as a convenience alias for maxDepth(1)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't be against maxDepth being gone. Feels like it's mostly just for dealing with cycles and symlinks being enabled.

@@ -120,6 +120,7 @@ public static Builder builder() {
return DefaultBuilder.class;
}

// TODO: consider consolidating maxDepth and recursive
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't be against maxDepth being gone. Feels like it's mostly just for dealing with cycles and symlinks being enabled.

@sonarqubecloud
Copy link

SonarCloud Quality Gate failed.    Quality Gate failed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 15 Code Smells

77.5% 77.5% Coverage
0.0% 0.0% Duplication

@zoewangg zoewangg merged commit 4a6f5c4 into master Oct 14, 2021
@zoewangg zoewangg deleted the zoewang/tmUploadDirectory branch October 14, 2021 16:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants