Mutipart upload from InputStream (without supplying content-length) uses putObject(). #474

selaamobee · 2015-07-30T16:28:41Z

I am calling com.amazonaws.services.s3.transfer.TransferManager.upload(String, String, InputStream, ObjectMetadata) and expect it to upload the stream in chunks but the API tried to upload it in one chunk.

After digging in the code a bit what I discovered this is the problematic call order:
com.amazonaws.services.s3.transfer.internal.UploadCallable.call()
com.amazonaws.services.s3.transfer.internal.UploadCallable.isMultipartUpload()
com.amazonaws.services.s3.transfer.internal.TransferManagerUtils.shouldUseMultipartUpload(PutObjectRequest, TransferManagerConfiguration)
com.amazonaws.services.s3.transfer.internal.TransferManagerUtils.getContentLength(PutObjectRequest)

The getContentLength returns -1 if the input is a stream (and content-length wasn't supplied).
The shouldUseMultipartUpload returns true if contentLength > configuration.getMultipartUploadThreshold() and since -1 is not larger than that it doesn't use multi-part upload (and for me later fails because my stream is too big to buffer by the API).

The text was updated successfully, but these errors were encountered:

david-at-aws · 2015-07-30T17:39:15Z

This is the expected behavior if you don't provide a content-length, at least for the moment. If you do know the content length, providing it up front will allow the TransferManager to be significantly more efficient.

S3 requires a content-length header to be specified on each putObject/uploadPart request - if the length isn't known up front (and the stream doesn't support mark/reset), the SDK attempts to buffer the stream into memory to find out how large it is. This obviously only works for relatively small streams - for anything large enough that it can't be buffered in memory, you need to provide a content-length up front. The TransferManager assumes that if you haven't provided a content-length, the stream is small enough that doing a single-part upload will be more efficient than a multi-part upload.

Theoretically we could add a configuration option to force a multi-part upload even when the content-length is unknown, using the configured minimum part size and hoping for the best? You'd still have to buffer 5MB at a time to calculate the content-length for each individual part (more if you want to go over ~50GB total, since S3 allows a maximum of 10K parts per MPU), but at least you wouldn't have to buffer the entire thing if you legitimately don't know how large it's going to be.

I'll add this to our backlog (please chime in here if this would be useful to you so we can prioritize appropriately), or I'd be happy to take a look at a pull request.

selaamobee · 2015-07-31T17:20:02Z

Hello David and thank you for the quick and detailed response.

I am trying to write to S3 from a ZipInputStream and the ZipEntry.getSize() may return -1.

Buffering the file in memory is not an option since it is quite large and I get an OutOfMemoryException.

Writing it to disk is probably what I do but I would have preferred not to do that.

Seems to me like a valid and common use case.

david-at-aws · 2015-07-31T17:41:53Z

Yeah, buffering to disk is the best option in the short term. On the plus side, the TransferManager can upload multiple parts in parallel when you give it a file, as opposed to having to read/upload from the ZipInputStream serially - may even end up being faster.

alexmojaki · 2015-10-22T14:19:03Z

I've written a library that automatically chunks data from a stream into parts so that you can avoid both running out of memory and writing to disk: https://github.com/alexmojaki/s3-stream-upload

selaamobee · 2015-10-22T14:59:16Z

Thank you Alex, looks nice. My company probably won't give me time to return to this feature now and fix it, but if it will happen I'll try and use your library.

mikaelnousiainen · 2016-09-14T11:16:05Z

TransferManager definitely needs this feature. The size of the stream to be uploaded is often unknown, for example when the data is being generated by or queried from an external service/database.

shokeenAnil · 2017-02-16T10:04:54Z

@david-at-aws David, is there any ETA for this change?

jhaber · 2017-03-28T23:32:12Z

For anyone running into this issue while using AmazonS3#putObject(PutObjectRequest), we ended up overriding this method so that if it encounters an InputStream with unknown length, it will read a fixed amount of data into a buffer (1MB, for example). If it reaches the end of the stream before filling the buffer, then we can just upload the buffer. If we fill the buffer before exhausting the stream, then we write to a temporary file and upload that instead. The code is here if in case anyone wants to do something similar. (and once TransferManager uses NIO we'll probably switch to that)

ronreiter · 2018-03-30T06:27:19Z

@david-at-aws perhaps you could reconsider implementing support for unknown large file sizes using the transfermanager? it's an important feature.

eocron · 2018-10-09T07:20:12Z

Is there any progress in this direction? There is a lot of streams in a world which performs operations not knowing final length: some zip imlementations, serializations, encryptions, transfers, etc. This particular requirement not allow to make simple transition of stream between endpoints without buffering it somewhere.

This approach acceptable only for relatively small files, which can be sent like byte array to lower RPC even without using Streams. For streams this is unaccepatable.

jcogilvie · 2019-03-06T20:07:54Z

+1 for buffering to a temp file (or a byte array buffer) up to the multipart threshold. Not all streams have a length known up front; in fact, this is part of the reason to use a stream and not just a file, so it's important to account for this.

svobol13 · 2019-06-11T19:14:58Z

For our use case this huge dealbreaker. Within single execution of our task we generate couple of files each of size couple of hundreds MB. Since Lambda would be best fit for that thanks to this missing feature it is not the case. We cannot buffer (size wise) neither to disk nor to memory. We can either switch language (node supports streaming to S3 but the task runs much slower) or switch computation service (EC2 supports larger disk for buffering or huge memory but is more expensive and it is more work to operate it) or switch cloud provider (the other one supports atleast getting OutputStream in Java). I am deeply disappointed. Didn't expect such important feature will be missing.

alexmojaki · 2019-06-11T19:25:43Z

I hope this isn't considered spammy, but I'm gonna mention my library again (https://github.com/alexmojaki/s3-stream-upload) because (1) I think @svobol13 would find it very useful and (2) based on his comment, I think he and possibly others are not seeing my first comment lost in the middle of this thread.

BramDeCneudt · 2019-08-07T07:53:41Z

Is there any progress in this? I'll be using the third-party library that @alexmojaki suggested but this seems like an important feature missing in the official AWS library.

prameshbhattarai · 2020-09-29T10:13:17Z

I was using AWS MutlipartUpload (low-level API) for uploading InputStream without Content-Length, here is a link to my code snippet -> https://gist.github.com/prameshbhattarai/dbb19eb2518ab0554e6aeb36b92ee84a#file-s3multipartupload-java
https://medium.com/@pra4mesh/uploading-inputstream-to-aws-s3-using-multipart-upload-java-add81b57964e

fkoner · 2020-11-11T01:08:44Z

@alexmojaki

¿How could I use your library?
I have a ByteArrayInputStream (or something similar).

ObjectMapper objectMapper = new ObjectMapper();

Map<String, Object> obj = objectMapper.readValue(is, HashMap.class);

// Execute some transformation tasks

ByteArrayOutputStream baos = new ByteArrayOutputStream();
objectMapper.writeValue(baos, obj);
ByteArrayInputStream bais = new ByteArrayInputStream(baos.toByteArray());

// How to invoke StreamTransferManager ???

fkoner · 2020-11-11T02:16:58Z

@alexmojaki

Finally I got It working

      int numStreams = 4;
      final StreamTransferManager manager = new StreamTransferManager(bucketName, keyNameJSON, s3Client)
        .numStreams(numStreams)
        .numUploadThreads(numStreams)
        .queueCapacity(numStreams)
        .partSize(5);

      manager.customiseUploadPartRequest(new UploadPartRequest().withObjectMetadata(metadata));

      final List<MultiPartOutputStream> streams = manager.getMultiPartOutputStreams();

      final int UPLOAD_PART_SIZE = 1 * Constants.MB;

      ExecutorService pool = Executors.newFixedThreadPool(numStreams);
      for (int i = 0; i < numStreams; i++) {
        final int streamIndex = i;
        pool.submit(new Runnable() {
          public void run() {
            try {
              MultiPartOutputStream outputStream = streams.get(streamIndex);
              int nRead = 0;
              byte[] data = new byte[UPLOAD_PART_SIZE];
              while ((nRead = zipBais.read(data, 0, data.length)) != -1) {
                outputStream.write(data, 0, nRead);
              }

              // The stream must be closed once all the data has been written
              outputStream.close();
            } catch (Exception e) {
              e.printStackTrace();
              // Aborts all uploads
              manager.abort(e);
            }
          }
        });
      }

      pool.shutdown();

      boolean wait = true;

      while (wait) {
        try {
          wait = !pool.awaitTermination(2, TimeUnit.SECONDS);
          log.info("wait, {}", wait);
          if (wait) {
            log.info("Awaiting completion of bulk callback threads.");
          }
        } catch (InterruptedException e) {
          log.info("Interruped while awaiting completion of callback threads - trying again...");
        }
      }

      // Finishing off
      manager.complete();

Thanks in Advance
Paco

unsavorypeople · 2021-03-23T21:04:15Z

Another vote here for this ability to be built into the aws SDK. This is a huge hole with no good work arounds.

@alexmojaki thank you for your library, but we are hesitant to use this in a production system because we won't be able to get support for it if aws makes a breaking change in the SDK. Also, it looks like your library requires the V1 SDK, and we are already on V2.

Quite frankly, I'm a bit surprised that aws has not provided a solution to this basic functionality in the 5+ years this issue has been open.

sdonovanuk · 2021-04-22T19:29:42Z

C'mon AWS, get it done. The ability to upload from a stream of unknown length is especially useful, especially if you're trying to, for example, (a) stream the result of a database bulk read to S3, or (b) streaming a database backup to S3, etc.

rpiccoli · 2021-08-26T11:00:45Z

Is there any plan to implement this?

dkamakin · 2021-12-20T11:50:36Z

Is the work still in progress?

sdonovanuk · 2021-12-20T18:41:32Z

@dkamakin -- it can be emulated. Start with your InputStream. Initiate a multi-part upload, and set the part-size. Then, take blocks of your input stream to match the part-size, then upload the part using a ByteArrayInputStream. Repeat until done, then complete the multipart-upload. Remember to abort if anything goes wrong. It's actually not that much code. In my case, am working with a part-size of 5Mb, the smallest allowed -- as the largest file I'm trying to upload in this case is ~15Mb. Yeah, it does mean each thread you have uploading has a 5Mb buffer, but, not terrible.

alexmojaki · 2021-12-20T18:56:46Z

@sdonovanuk that's what https://github.com/alexmojaki/s3-stream-upload does, I don't think you should recommend that people do it themselves, it's not that easy.

sdonovanuk · 2021-12-20T19:52:41Z

@alexmojaki -- dunno. The simplest form of a a stream manager would just take: bucket, key, InputStream -- and nothing else, except maybe a buffer specification.

djchapm · 2022-02-23T19:19:10Z

@alexmojaki - great that you wrote code to work around a limitation of the AWS SDK - but please stop promoting as a solution - really needs to be part of the AWS SDK, supported by AWS which is preferable as things change. Might be better if you participate in the V2 AWS SDK on github and submit a PR. No way I can put a dependency on "com.github.alexmojaki" library from github. API would be much simplified if TransferManager could support streaming.

rpiccoli · 2022-02-23T20:08:54Z

@djchapm , this is a very negative statement for a successful OSS contribution (173 stars for a small module, as you said, from 'com.github.alexmojaki'). The effort to work around AWS SDK limitation was shared for free to the community, and it is an actual solution for at least 173 products around the world. I previously added dependency to 'com.github.alexmojaki' successfully, what never broke any compliance with security or licensing. I am sure the authors does not have any objection for their code to be added to AWS official SDK, that's why he uses MIT license.

djchapm · 2022-02-23T20:59:39Z

Hey @rpiccoli and @alexmojaki - not meant to offend at all. It is great work! Not sure how you took that so negatively. We have been doing something similar. I've been following this issue #474 for a couple years now - lately I saw that it's on the agenda to be included in AWS SDK v2 TransferManager which is awesome - it links to this issue and coming here I see more recent comments from alexmojaki about his contribution as a solution - which I've seen on many threads about this topic. I simply don't want it to detract from the promise of a more native and supported solution that takes advantage of the AWS async/bandwidth improvements in v2. Writing to S3 is currently our biggest performance bottleneck - by far. Having to manage parts and flush in mem buffers requires a bit of overhead that could be done better if we can get a little closer to the networking side of the aws connection. I am not sure why AWS doesn't pull the trigger on implementing this - this ticket is over 5 years old and so many people have this issue - like a commenter mentions above - "C'mon AWS, git it done.".

rpiccoli · 2022-02-23T21:38:15Z

Now this is a positive and constructive statement! Indeed I totally agree with you @djchapm: "c'mon AWS, get it done!". I would happily shift to non-blocking async client.

djchapm · 2022-06-16T15:49:25Z

Hey - still tracking this one - did a quick demo based on what I'm seeing in the v2 transfer manager preview and #2817 and #2814... Couldn't get it to work - same issue reported based on having content-size up front. Digging into the libraries - the fromPublisher ends up constructing a meta request that hardcodes content size to zero - so you'd think this would be intended for a publisher style interface. Here's my code and versions of everything for the test - if we could get this to work then I think most everyone's related feature request would be resolved.

Tagging @zoewangg and @Bennett-Lynch for any insights.

/**
 * Java 17
 * s3-transfer-manager-2.17.210-PREVIEW.jar
 * sdk-core-2.17.210.jar
 * reactor-core-3.4.13.jar
 * Native CLIv2 on MacOS monterey: 
 *    aws-cli/2.7.7 Python/3.9.13 Darwin/21.5.0 source/x86_64
 */
class AWSTransferManagerTest {

    public static void main (String[] args) {
        S3ClientConfiguration s3TransferConfig = S3ClientConfiguration.builder()
                .targetThroughputInGbps(20.0)
                .minimumPartSizeInBytes(1024L)
                .build();
        S3TransferManager transferManager = S3TransferManager.builder()
                .s3ClientConfiguration(s3TransferConfig)
                .build();
        Flux<ByteBuffer> flux = Flux.just("one", "two", "three")
                .map(val -> ByteBuffer.wrap(val.getBytes()));
        //verify flux:
        //flux.subscribe(System.out::println);

        Log.initLoggingToFile(Log.LogLevel.Trace, "s3.logs");
        //output in s3.logs:
        // [INFO] [2022-06-16T15:19:49Z] [00007000056fd000] [S3Client] - id=0x7fab24928d20 Initiating making of meta request
        // [ERROR] [2022-06-16T15:19:49Z] [00007000056fd000] [S3MetaRequest] - Could not create auto-ranged-put meta request; there is no Content-Length header present.
        // [ERROR] [2022-06-16T15:19:49Z] [00007000056fd000] [S3Client] - id=0x7fab24928d20: Could not create new meta request.
        // [WARN] [2022-06-16T15:19:50Z] [0000700005d0f000] [JavaCrtGeneral] - Not all native threads were successfully joined during gentle shutdown.  Memory may be leaked.

        Upload upload =
                transferManager.upload(UploadRequest.builder()
                        .putObjectRequest(b -> b.bucket("bucket").key("tmp/flux.bin"))
                        .requestBody(AsyncRequestBody.fromPublisher(flux))
                        .overrideConfiguration(b -> b.addListener(LoggingTransferListener.create()))
                        .build());
        CompletedUpload completedUpload = upload.completionFuture().join();
    }
}

Output in console:

Caused by: software.amazon.awssdk.crt.CrtRuntimeException: S3Client.aws_s3_client_make_meta_request: creating aws_s3_meta_request failed (aws_last_error: AWS_ERROR_INVALID_ARGUMENT(34), An invalid argument was passed to a function.) AWS_ERROR_INVALID_ARGUMENT(34)
	at software.amazon.awssdk.crt.s3.S3Client.s3ClientMakeMetaRequest(Native Method)
	at software.amazon.awssdk.crt.s3.S3Client.makeMetaRequest(S3Client.java:70)
	at software.amazon.awssdk.services.s3.internal.crt.S3CrtAsyncHttpClient.execute(S3CrtAsyncHttpClient.java:97)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage.doExecuteHttpRequest(MakeAsyncHttpRequestStage.java:175)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage.executeHttpRequest(MakeAsyncHttpRequestStage.java:147)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage.lambda$execute$1(MakeAsyncHttpRequestStage.java:99)
	at java.base/java.util.concurrent.CompletableFuture.uniAcceptNow(CompletableFuture.java:757)
	at java.base/java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:735)
	at java.base/java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:2182)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage.execute(MakeAsyncHttpRequestStage.java:95)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage.execute(MakeAsyncHttpRequestStage.java:60)
	at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncApiCallAttemptMetricCollectionStage.execute(AsyncApiCallAttemptMetricCollectionStage.java:55)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncApiCallAttemptMetricCollectionStage.execute(AsyncApiCallAttemptMetricCollectionStage.java:37)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.attemptExecute(AsyncRetryableStage.java:144)
	... 24 more

djchapm · 2022-06-21T19:43:25Z

There is hope for this yet in aws/aws-sdk-java-v2#139. I'll stop watching this one.

djchapm · 2023-07-21T15:05:20Z

@david-at-aws - this functionality is now present in aws sdk v2 - see above ticket. I think this can finally be closed/completed.

debora-ito · 2024-07-19T21:30:53Z

Java SDK 2.x TransferManager supports uploads with unknown content-length. Read more about it in the Java SDK 2.x Developer Guide, and open a new issue in the v2 repo if you have any issue or feedback.

Reference:

Announcing end-of-support for AWS SDK for Java v1.x effective December 31, 2025 - blog post

github-actions · 2024-07-19T21:31:10Z

This issue is now closed.

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.

david-at-aws added the feature-request A feature should be added or improved. label Jul 30, 2015

spfink mentioned this issue Sep 13, 2017

Transfer Manager aws/aws-sdk-java-v2#37

Closed

jonescc mentioned this issue Nov 1, 2017

S3 document mime type aodn/aws-wps#116

Merged

varunnvs92 mentioned this issue Sep 4, 2018

S3 upload stream with http throws stream mark and reset error #1748

Closed

debora-ito mentioned this issue Oct 3, 2020

"java.net.SocketException: Broken pipe (Write failed): error on file upload (reproducible + workaround) #2408

Closed

debora-ito mentioned this issue May 17, 2021

uploadFileList without directory in TransferManager #2572

Closed

masokan mentioned this issue May 31, 2021

Potential problem in uploading data using AsyncRequestBody.fromPublisher() aws/aws-sdk-java-v2#2493

Closed

djchapm mentioned this issue Jun 21, 2022

Support S3TransferManager upload using Flux<ByteBuffer> aws/aws-sdk-java-v2#2714

Closed

debora-ito closed this as not planned Won't fix, can't repro, duplicate, stale Jul 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Mutipart upload from InputStream (without supplying content-length) uses putObject(). #474

Mutipart upload from InputStream (without supplying content-length) uses putObject(). #474

selaamobee commented Jul 30, 2015

david-at-aws commented Jul 30, 2015

selaamobee commented Jul 31, 2015

david-at-aws commented Jul 31, 2015

alexmojaki commented Oct 22, 2015

selaamobee commented Oct 22, 2015

mikaelnousiainen commented Sep 14, 2016

shokeenAnil commented Feb 16, 2017

jhaber commented Mar 28, 2017

ronreiter commented Mar 30, 2018

eocron commented Oct 9, 2018 •

edited

Loading

jcogilvie commented Mar 6, 2019

svobol13 commented Jun 11, 2019 •

edited

Loading

alexmojaki commented Jun 11, 2019

BramDeCneudt commented Aug 7, 2019

prameshbhattarai commented Sep 29, 2020

fkoner commented Nov 11, 2020 •

edited

Loading

fkoner commented Nov 11, 2020

unsavorypeople commented Mar 23, 2021

sdonovanuk commented Apr 22, 2021 •

edited

Loading

rpiccoli commented Aug 26, 2021

dkamakin commented Dec 20, 2021

sdonovanuk commented Dec 20, 2021

alexmojaki commented Dec 20, 2021

sdonovanuk commented Dec 20, 2021

djchapm commented Feb 23, 2022

rpiccoli commented Feb 23, 2022

djchapm commented Feb 23, 2022

rpiccoli commented Feb 23, 2022

djchapm commented Jun 16, 2022

djchapm commented Jun 21, 2022

djchapm commented Jul 21, 2023

debora-ito commented Jul 19, 2024

github-actions bot commented Jul 19, 2024

Mutipart upload from InputStream (without supplying content-length) uses putObject(). #474

Mutipart upload from InputStream (without supplying content-length) uses putObject(). #474

Comments

selaamobee commented Jul 30, 2015

david-at-aws commented Jul 30, 2015

selaamobee commented Jul 31, 2015

david-at-aws commented Jul 31, 2015

alexmojaki commented Oct 22, 2015

selaamobee commented Oct 22, 2015

mikaelnousiainen commented Sep 14, 2016

shokeenAnil commented Feb 16, 2017

jhaber commented Mar 28, 2017

ronreiter commented Mar 30, 2018

eocron commented Oct 9, 2018 • edited Loading

jcogilvie commented Mar 6, 2019

svobol13 commented Jun 11, 2019 • edited Loading

alexmojaki commented Jun 11, 2019

BramDeCneudt commented Aug 7, 2019

prameshbhattarai commented Sep 29, 2020

fkoner commented Nov 11, 2020 • edited Loading

fkoner commented Nov 11, 2020

unsavorypeople commented Mar 23, 2021

sdonovanuk commented Apr 22, 2021 • edited Loading

rpiccoli commented Aug 26, 2021

dkamakin commented Dec 20, 2021

sdonovanuk commented Dec 20, 2021

alexmojaki commented Dec 20, 2021

sdonovanuk commented Dec 20, 2021

djchapm commented Feb 23, 2022

rpiccoli commented Feb 23, 2022

djchapm commented Feb 23, 2022

rpiccoli commented Feb 23, 2022

djchapm commented Jun 16, 2022

djchapm commented Jun 21, 2022

djchapm commented Jul 21, 2023

debora-ito commented Jul 19, 2024

Reference:

github-actions bot commented Jul 19, 2024

eocron commented Oct 9, 2018 •

edited

Loading

svobol13 commented Jun 11, 2019 •

edited

Loading

fkoner commented Nov 11, 2020 •

edited

Loading

sdonovanuk commented Apr 22, 2021 •

edited

Loading