S3 streaming with unknown content size #139
Comments
I'd like to vote for automatic reconnect on download, so you'd have something that looks like a BufferedInputStream. In one mode, each buffer-sized chunk would be a ranged GET that goes straight into the byte array. That way, if the client is slow to process the stream, it will not get any connection reset errors. With a good buffer size (maybe even an array of buffers loaded in parallel) it should be very fast while remaining stable. If there is a connection read error in the short time it takes to copy the range into the byte buffer, I would want it to use the client's configured retry configuration to retry.
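For illustration, a minimal sketch of that idea against the v2 S3Client. The class name RangedS3InputStream is made up for this example and is not part of the SDK; real code would also need read(byte[], int, int), close(), and better bounds handling.

```java
import java.io.IOException;
import java.io.InputStream;
import software.amazon.awssdk.core.ResponseInputStream;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;
import software.amazon.awssdk.services.s3.model.S3Exception;

// Illustrative only: fetches the object in ranged chunks, so a slow consumer
// never holds a single long-lived connection open.
class RangedS3InputStream extends InputStream {
    private final S3Client s3;
    private final String bucket;
    private final String key;
    private final int chunkSize;
    private byte[] buffer = new byte[0];
    private int pos = 0;
    private long offset = 0;

    RangedS3InputStream(S3Client s3, String bucket, String key, int chunkSize) {
        this.s3 = s3;
        this.bucket = bucket;
        this.key = key;
        this.chunkSize = chunkSize;
    }

    @Override
    public int read() throws IOException {
        if (pos == buffer.length && !fillBuffer()) {
            return -1; // end of object
        }
        return buffer[pos++] & 0xFF;
    }

    private boolean fillBuffer() throws IOException {
        GetObjectRequest request = GetObjectRequest.builder()
                .bucket(bucket)
                .key(key)
                .range("bytes=" + offset + "-" + (offset + chunkSize - 1))
                .build();
        try (ResponseInputStream<GetObjectResponse> in = s3.getObject(request)) {
            // The SDK's configured retry policy applies to each individual ranged GET.
            buffer = in.readAllBytes();
        } catch (S3Exception e) {
            if (e.statusCode() == 416) { // range not satisfiable: we are past the end of the object
                return false;
            }
            throw e;
        }
        if (buffer.length == 0) {
            return false;
        }
        offset += buffer.length;
        pos = 0;
        return true;
    }
}
```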
Support for retrying with non-resettable InputStreams would be nice. See aws/aws-sdk-java#427
Additionally, it looks to me like TransferManager doesn't support multipart uploads of data with an unknown size. This makes it pretty useless for big data.
Any updates on that issue?
@laymain
I understand, thank you for the answer.
Tracking in #37
Adding a use case - the code below is based on what's currently available in TransferManager 2.17.210-PREVIEW... there is a 'fromPublisher' method that accepts Publisher<ByteBuffer> - so below is simple demo code and my environment details. Another use case, and more of a priority: uploading compressed content (or anything else) via a java.io.OutputStream, i.e. we create a Java GZIPOutputStream, write to it using ByteBuffer writes, and when we close the output stream we expect TransferManager to wrap things up regarding parts, content length, etc. Hoping for code like the sketch after this comment. Here is a demo of the native error we get (shown in the commented code) - it stems from the missing content length in the put object request. This works when using AsyncRequestBody.fromXXX where it is something other than a Publisher (File or String). So the goal would be to update the content length onClose(), and upload parts based on the S3ClientConfiguration settings.
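Not the original snippet from this comment, but a hedged sketch of what that hoped-for pattern could look like with the current v2 TransferManager, assuming the CRT-based async client and AsyncRequestBody.forBlockingOutputStream (a null content length means the size is unknown); the bucket and key are placeholders.

```java
import java.util.zip.GZIPOutputStream;
import software.amazon.awssdk.core.async.AsyncRequestBody;
import software.amazon.awssdk.core.async.BlockingOutputStreamAsyncRequestBody;
import software.amazon.awssdk.services.s3.S3AsyncClient;
import software.amazon.awssdk.transfer.s3.S3TransferManager;
import software.amazon.awssdk.transfer.s3.model.Upload;

public class StreamingGzipUpload {
    public static void main(String[] args) throws Exception {
        S3AsyncClient s3 = S3AsyncClient.crtBuilder().build();
        try (S3TransferManager tm = S3TransferManager.builder().s3Client(s3).build()) {
            // Passing null for contentLength tells the SDK the size is unknown;
            // the CRT client splits the stream into parts as bytes arrive.
            BlockingOutputStreamAsyncRequestBody body =
                    AsyncRequestBody.forBlockingOutputStream(null);

            // Start the upload before writing to the stream.
            Upload upload = tm.upload(u -> u
                    .putObjectRequest(p -> p.bucket("my-bucket").key("my-key.gz")) // placeholders
                    .requestBody(body));

            // Write compressed content; closing the stream signals the end of the data.
            try (GZIPOutputStream gzip = new GZIPOutputStream(body.outputStream())) {
                gzip.write("hello, streaming world".getBytes());
            }

            upload.completionFuture().join();
        }
    }
}
```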
Hey, check this out - awslabs/aws-c-s3#285. @yasminetalby / @zoewangg, can we maybe move up the priority on this, considering the action on the CRT native client? Thanks - also, I have some updated code; the stuff above was created with the S3 v2 beta code, and locally I have updated it for v2.20.51. Regards!
Hey @djchapm, yes! We are aware of this change and will bump the CRT Java binding version once it's available. There should be no additional change needed on the Java SDK side.
Looks like awslabs/aws-c-s3#285 is merged - @zoewangg, can you confirm which version of the SDK will contain this change?
Hi @djchapm, the change is not available in the Java binding yet, we'll let you know once it's ready.
@zoewangg The latest release of aws-crt-java@0.22.1 supports unknown content-length now.
Submitted a PR to bump the CRT version: #4065
Hey @zoewangg - it's working great! I haven't done any performance comparisons but will report. It definitely reduces a ton of code, though. So far I've only tried with a few MBs for relatively short durations; I'll want to keep things open for a few gigs and hours before we conclude anything. What do you think about adding the ability to update the metadata just before we trigger the completion of the output stream? I've tried digging into this and obtaining a handle to the metadata map, but it is converted to an unmodifiable map in the UploadRequest, and I'm not sure where in the process the metadata is actually uploaded. If it is at the end, though, do you think this would be a possibility? Our only alternative right now is to do a copy to update metadata, which in our experience can take from 1 to 20 minutes.
I have an implementation that works, but I can't figure out what is taking so long once I mark the stream complete. I have a TransferListener hooked up, and there is maybe only one more call to listener.bytesTransferred after I have closed, but then it seems to go into a black hole for about a minute before calling the transferComplete method. I don't see any way to forcefully tell the Upload/S3Client/TransferManager that we're done and to get it over with, and I can't really debug because it's all happening async in native land. Any thoughts? I'm only dealing with a 140 MB upload. For our solution's metrics this amounts to about 7,000 "messages" per second, whereas the old solution in AWS SDK v1 - where we manage the multiparts, keep track of the content size, and then update the metadata after the fact using a copy - achieves 224,000 messages per second in the same exact test. So this bottleneck at the end seems like a pretty big issue. :(
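For anyone trying to reproduce the timing, a minimal sketch (assuming the v2 TransferListener API) that timestamps the progress callbacks so the gap before transferComplete is visible:

```java
import software.amazon.awssdk.transfer.s3.progress.TransferListener;

// Logs each progress callback with a timestamp, so the delay between the last
// bytesTransferred call and transferComplete is easy to measure.
public class TimingListener implements TransferListener {
    @Override
    public void bytesTransferred(Context.BytesTransferred context) {
        System.out.println(System.currentTimeMillis() + " transferred so far: "
                + context.progressSnapshot().transferredBytes());
    }

    @Override
    public void transferComplete(Context.TransferComplete context) {
        System.out.println(System.currentTimeMillis() + " transfer complete");
    }
}
```

It can be attached when building the upload, via addTransferListener(new TimingListener()) on the UploadRequest builder.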
Having some random failures, etc., and some slowness in completing uploads. I uploaded a Java test project and created an issue: awslabs/aws-c-s3#317
Hi @djchapm, yeah, we will work with the CRT team to investigate this issue. To answer your previous question (sorry for the delay):
Just for my understanding, is there any reason you can't set the metadata in the original request? I think metadata is being sent in
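For reference, a minimal sketch of supplying user metadata with the initial request (the bucket, key, and metadata entry are placeholders); the catch for this thread's use case is that the values have to be known before the upload starts:

```java
import java.util.Map;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

// User-defined metadata set on the original request; S3 stores it with the object,
// but every value must be known before any bytes are uploaded.
PutObjectRequest request = PutObjectRequest.builder()
        .bucket("my-bucket")                      // placeholder
        .key("my-key")                            // placeholder
        .metadata(Map.of("record-count", "123"))  // placeholder metadata
        .build();
```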
Yeah - we're uploading metadata/indicators to our files based on what's been processed in the file, for disaster recovery scenarios in streaming. Those indicators are typically cumulative counts or 'last entry' markers in the object, which can't be calculated ahead of time.
My bad... it is open.
Hey @zoewangg - I can confirm that this is working. I've done several rounds of testing with streaming to an S3OutputStream of up to 17 GB in size. A potential drawback for some people is that the metadata for content size is not attached to the object in S3 - might be a good thing to add as a feature request. For anyone looking for example code, you can see a temp solution I posted on the related ticket awslabs/aws-c-s3#317. Thanks!
@djchapm glad that this is working for you. Marking this to auto-close soon. Please let us know if there are any further questions.
Is this issue supposed to be fixed already in the current AWS Java SDK (2.20.140)? In my tests today I still get an error
I'm still seeing this error on
It's unclear what exactly this would mean; some investigation is required. But for uploads I imagine we'd do some buffering of the content, and if it's greater than 5 MB we would do a multipart upload with retries of individual parts. For downloads this might be a special input stream returned to the caller that can reconnect on failure by doing a range GET to retrieve the rest of the bytes.
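On the upload side, a rough sketch of that buffering idea using the low-level v2 multipart APIs; retries and the single-PutObject fallback for small content are omitted, and the class and method names are made up for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.CompletedMultipartUpload;
import software.amazon.awssdk.services.s3.model.CompletedPart;
import software.amazon.awssdk.services.s3.model.UploadPartResponse;

public final class UnknownLengthUploader {

    private static final int PART_SIZE = 5 * 1024 * 1024; // S3 minimum part size (except the last part)

    // Reads the stream in 5 MiB chunks and uploads each chunk as a part,
    // so the total content length never has to be known up front.
    public static void upload(S3Client s3, String bucket, String key, InputStream in)
            throws IOException {
        String uploadId = s3.createMultipartUpload(b -> b.bucket(bucket).key(key)).uploadId();
        List<CompletedPart> parts = new ArrayList<>();
        byte[] buffer = new byte[PART_SIZE];
        int partNumber = 1;
        int read;
        while ((read = readFully(in, buffer)) > 0) {
            final int number = partNumber;
            UploadPartResponse response = s3.uploadPart(
                    b -> b.bucket(bucket).key(key).uploadId(uploadId).partNumber(number),
                    RequestBody.fromBytes(Arrays.copyOf(buffer, read)));
            parts.add(CompletedPart.builder().partNumber(number).eTag(response.eTag()).build());
            partNumber++;
        }
        s3.completeMultipartUpload(b -> b.bucket(bucket).key(key).uploadId(uploadId)
                .multipartUpload(CompletedMultipartUpload.builder().parts(parts).build()));
    }

    // Fills the buffer as far as possible; returns the number of bytes read (0 at end of stream).
    private static int readFully(InputStream in, byte[] buffer) throws IOException {
        int total = 0;
        while (total < buffer.length) {
            int n = in.read(buffer, total, buffer.length - total);
            if (n < 0) {
                break;
            }
            total += n;
        }
        return total;
    }
}
```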