S3 streaming with unknown content size #139
Comments
I'd like to vote for automatic reconnect on download, so you'd have something that looks like a BufferedInputStream. In one mode, each buffer-sized chunk would be a ranged GET that goes straight into the byte array. That way, if the client is slow to process the stream, it will not get any connection reset errors. With a good buffer size (maybe even an array of buffers loaded in parallel) it should be very fast while remaining stable. If there is a connection read error in the short time it takes to copy the range into the byte buffer, I would want it to use the client's configured retry configuration to retry.
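For illustration, a minimal sketch of that idea against the v2 S3Client. The class name RangedS3InputStream is made up for this example and is not part of the SDK; real code would also need read(byte[], int, int), close(), and better bounds handling.

```java
import java.io.IOException;
import java.io.InputStream;
import software.amazon.awssdk.core.ResponseInputStream;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;
import software.amazon.awssdk.services.s3.model.S3Exception;

// Illustrative only: fetches the object in ranged chunks, so a slow consumer
// never holds a single long-lived connection open.
class RangedS3InputStream extends InputStream {
    private final S3Client s3;
    private final String bucket;
    private final String key;
    private final int chunkSize;
    private byte[] buffer = new byte[0];
    private int pos = 0;
    private long offset = 0;

    RangedS3InputStream(S3Client s3, String bucket, String key, int chunkSize) {
        this.s3 = s3;
        this.bucket = bucket;
        this.key = key;
        this.chunkSize = chunkSize;
    }

    @Override
    public int read() throws IOException {
        if (pos == buffer.length && !fillBuffer()) {
            return -1; // end of object
        }
        return buffer[pos++] & 0xFF;
    }

    private boolean fillBuffer() throws IOException {
        GetObjectRequest request = GetObjectRequest.builder()
                .bucket(bucket)
                .key(key)
                .range("bytes=" + offset + "-" + (offset + chunkSize - 1))
                .build();
        try (ResponseInputStream<GetObjectResponse> in = s3.getObject(request)) {
            // The SDK's configured retry policy applies to each individual ranged GET.
            buffer = in.readAllBytes();
        } catch (S3Exception e) {
            if (e.statusCode() == 416) { // range not satisfiable: we are past the end of the object
                return false;
            }
            throw e;
        }
        if (buffer.length == 0) {
            return false;
        }
        offset += buffer.length;
        pos = 0;
        return true;
    }
}
```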
Support for retrying with non-resettable InputStreams would be nice. See aws/aws-sdk-java#427
Additionally, it looks to me like TransferManager doesn't support multipart uploads of data with an unknown size. This makes it pretty useless for big data.
Any updates on that issue?
@laymain
I understand, thank you for the answer.
Tracking in #37
Adding a use case - the code below is based on what's currently available in TransferManager 2.17.210-PREVIEW... there is a 'fromPublisher' method that accepts Publisher<ByteBuffer> - so below is simple demo code and my environment details. Another use case, and more of a priority: uploading compressed content (or anything else) via a java.io.OutputStream, i.e. we create a Java GZIPOutputStream, write to it using ByteBuffer writes, and when we close the output stream we expect TransferManager to wrap things up regarding parts, content length, etc. Hoping for code like the sketch after this comment. Here is a demo of the native error we get (shown in the commented code) - it stems from the missing content length in the put object request. This works when using AsyncRequestBody.fromXXX where it is something other than a Publisher (File or String). So the goal would be to update the content length onClose(), and upload parts based on the S3ClientConfiguration settings.
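Not the original snippet from this comment, but a hedged sketch of what that hoped-for pattern could look like with the current v2 TransferManager, assuming the CRT-based async client and AsyncRequestBody.forBlockingOutputStream (a null content length means the size is unknown); the bucket and key are placeholders.

```java
import java.util.zip.GZIPOutputStream;
import software.amazon.awssdk.core.async.AsyncRequestBody;
import software.amazon.awssdk.core.async.BlockingOutputStreamAsyncRequestBody;
import software.amazon.awssdk.services.s3.S3AsyncClient;
import software.amazon.awssdk.transfer.s3.S3TransferManager;
import software.amazon.awssdk.transfer.s3.model.Upload;

public class StreamingGzipUpload {
    public static void main(String[] args) throws Exception {
        S3AsyncClient s3 = S3AsyncClient.crtBuilder().build();
        try (S3TransferManager tm = S3TransferManager.builder().s3Client(s3).build()) {
            // Passing null for contentLength tells the SDK the size is unknown;
            // the CRT client splits the stream into parts as bytes arrive.
            BlockingOutputStreamAsyncRequestBody body =
                    AsyncRequestBody.forBlockingOutputStream(null);

            // Start the upload before writing to the stream.
            Upload upload = tm.upload(u -> u
                    .putObjectRequest(p -> p.bucket("my-bucket").key("my-key.gz")) // placeholders
                    .requestBody(body));

            // Write compressed content; closing the stream signals the end of the data.
            try (GZIPOutputStream gzip = new GZIPOutputStream(body.outputStream())) {
                gzip.write("hello, streaming world".getBytes());
            }

            upload.completionFuture().join();
        }
    }
}
```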
Hey, check this out - awslabs/aws-c-s3#285. @yasminetalby / @zoewangg, can we maybe move up the priority on this, considering the action on the CRT native client? Thanks - also, I have some updated code; the stuff above was created with the S3 v2 beta code, and locally I have updated it for v2.20.51. Regards!
Hey @djchapm, yes! We are aware of this change and will bump the CRT Java binding version once it's available. There should be no additional change needed on the Java SDK side.
Looks like awslabs/aws-c-s3#285 is merged - @zoewangg, can you confirm which version of the SDK will contain this change?
Hi @djchapm, the change is not available in the Java binding yet, we'll let you know once it's ready.
@zoewangg The latest release of aws-crt-java@0.22.1 supports unknown content-length now.
Submitted a PR to bump the CRT version: #4065
Hey @zoewangg - it's working great! I haven't done any performance comparisons but will report. It definitely reduces a ton of code, though. So far I've only tried with a few MBs for relatively short durations; I'll want to keep things open for a few gigs and hours before we conclude anything. What do you think about adding the ability to update the metadata just before we trigger the completion of the output stream? I've tried digging into this and obtaining a handle to the metadata map, but it is converted to an unmodifiable map in the UploadRequest, and I'm not sure where in the process the metadata is actually uploaded. If it is at the end, though, do you think this would be a possibility? Our only alternative right now is to do a copy to update metadata, which in our experience can take from 1 to 20 minutes.
I have an implementation that works, but I can't figure out what is taking so long once I mark the stream complete. I have a TransferListener hooked up, and there is maybe only one more call to listener.bytesTransferred after I have closed, but then it seems to go into a black hole for about a minute before calling the transferComplete method. I don't see any way to forcefully tell the Upload/S3Client/TransferManager that we're done and to get it over with, and I can't really debug because it's all happening async in native land. Any thoughts? I'm only dealing with a 140 MB upload. For our solution's metrics this amounts to about 7,000 "messages" per second, whereas the old solution in AWS SDK v1 - where we manage the multiparts, keep track of the content size, and then update the metadata after the fact using a copy - achieves 224,000 messages per second in the same exact test. So this bottleneck at the end seems like a pretty big issue. :(
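For anyone trying to reproduce the timing, a minimal sketch (assuming the v2 TransferListener API) that timestamps the progress callbacks so the gap before transferComplete is visible:

```java
import software.amazon.awssdk.transfer.s3.progress.TransferListener;

// Logs each progress callback with a timestamp, so the delay between the last
// bytesTransferred call and transferComplete is easy to measure.
public class TimingListener implements TransferListener {
    @Override
    public void bytesTransferred(Context.BytesTransferred context) {
        System.out.println(System.currentTimeMillis() + " transferred so far: "
                + context.progressSnapshot().transferredBytes());
    }

    @Override
    public void transferComplete(Context.TransferComplete context) {
        System.out.println(System.currentTimeMillis() + " transfer complete");
    }
}
```

It can be attached when building the upload, via addTransferListener(new TimingListener()) on the UploadRequest builder.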
Having some random failures, etc., and some slowness in completing uploads. I uploaded a Java test project and created an issue: awslabs/aws-c-s3#317
Hi @djchapm, yeah, we will work with the CRT team to investigate this issue. To answer your previous question (sorry for the delay):
Just for my understanding, is there any reason you can't set the metadata in the original request? I think metadata is being sent in
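For reference, a minimal sketch of supplying user metadata with the initial request (the bucket, key, and metadata entry are placeholders); the catch for this thread's use case is that the values have to be known before the upload starts:

```java
import java.util.Map;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

// User-defined metadata set on the original request; S3 stores it with the object,
// but every value must be known before any bytes are uploaded.
PutObjectRequest request = PutObjectRequest.builder()
        .bucket("my-bucket")                      // placeholder
        .key("my-key")                            // placeholder
        .metadata(Map.of("record-count", "123"))  // placeholder metadata
        .build();
```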
Yeah - we're uploading metadata/indicators to our files based on what's been processed in the file, for disaster recovery scenarios in streaming. Those indicators are typically cumulative counts or 'last entry' markers in the object, which can't be calculated ahead of time.
My bad... it is open.
Hey @zoewangg - I can confirm that this is working. I've done several rounds of testing with streaming to an S3OutputStream of up to 17 GB in size. A potential drawback for some people is that the metadata for content size is not attached to the object in S3 - might be a good thing to add as a feature request. For anyone looking for example code, you can see a temp solution I posted on the related ticket awslabs/aws-c-s3#317. Thanks!
@djchapm glad that this is working for you. Marking this to auto-close soon. Please let us know if there are any further questions.
Is this issue supposed to be fixed already in the current AWS Java SDK (2.20.140)? In my tests today I still get an error
I'm still seeing this error on
It's unclear what exactly this would mean; some investigation is required. But for uploads I imagine we'd do some buffering of the content, and if it's greater than 5 MB we would do a multipart upload with retries of individual parts. For downloads this might be a special input stream returned to the caller that can reconnect on failure by doing a range GET to retrieve the rest of the bytes.
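On the upload side, a rough sketch of that buffering idea using the low-level v2 multipart APIs; retries and the single-PutObject fallback for small content are omitted, and the class and method names are made up for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.CompletedMultipartUpload;
import software.amazon.awssdk.services.s3.model.CompletedPart;
import software.amazon.awssdk.services.s3.model.UploadPartResponse;

public final class UnknownLengthUploader {

    private static final int PART_SIZE = 5 * 1024 * 1024; // S3 minimum part size (except the last part)

    // Reads the stream in 5 MiB chunks and uploads each chunk as a part,
    // so the total content length never has to be known up front.
    public static void upload(S3Client s3, String bucket, String key, InputStream in)
            throws IOException {
        String uploadId = s3.createMultipartUpload(b -> b.bucket(bucket).key(key)).uploadId();
        List<CompletedPart> parts = new ArrayList<>();
        byte[] buffer = new byte[PART_SIZE];
        int partNumber = 1;
        int read;
        while ((read = readFully(in, buffer)) > 0) {
            final int number = partNumber;
            UploadPartResponse response = s3.uploadPart(
                    b -> b.bucket(bucket).key(key).uploadId(uploadId).partNumber(number),
                    RequestBody.fromBytes(Arrays.copyOf(buffer, read)));
            parts.add(CompletedPart.builder().partNumber(number).eTag(response.eTag()).build());
            partNumber++;
        }
        s3.completeMultipartUpload(b -> b.bucket(bucket).key(key).uploadId(uploadId)
                .multipartUpload(CompletedMultipartUpload.builder().parts(parts).build()));
    }

    // Fills the buffer as far as possible; returns the number of bytes read (0 at end of stream).
    private static int readFully(InputStream in, byte[] buffer) throws IOException {
        int total = 0;
        while (total < buffer.length) {
            int n = in.read(buffer, total, buffer.length - total);
            if (n < 0) {
                break;
            }
            total += n;
        }
        return total;
    }
}
```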