Added the ability to stream data using cp #903

Conversation
# Need to save the data to be able to check the etag for a stream
# because once the data is written to the stream there is no
# undoing it.
payload = write_to_file(None, etag, md5, file_chunks, True)
It's a little unclear to me here - is this actually reading in the entire contents of the file to be printed later?
Yes, it is, if the object is being streamed to standard out. This is needed because if you are writing an object to stdout while doing the md5 calculation, there is no way to erase the data already sent to stdout if there is an md5 error and the download needs to be retried. Therefore, I write to a buffer that is later written to stdout once I have ensured the md5 is correct. On the other hand, for a file I write to the file as I calculate the md5, because I can delete the file if the md5's do not match.
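To make the difference between the two paths concrete, here is a minimal sketch of the idea (the helper names, the `file_chunks` iterable, and the exact signatures are illustrative assumptions, not the actual changeset):

```python
import hashlib
import os
import sys


def download_to_stdout(file_chunks, expected_md5):
    # Sketch of the stdout path: buffer everything in memory so nothing
    # reaches stdout until the md5 has been verified, since stdout cannot
    # be rolled back on a retry.
    md5 = hashlib.md5()
    buffered = []
    for chunk in file_chunks:
        md5.update(chunk)
        buffered.append(chunk)
    if md5.hexdigest() != expected_md5:
        raise ValueError("md5 mismatch; nothing was written, safe to retry")
    sys.stdout.buffer.write(b"".join(buffered))


def download_to_file(file_chunks, expected_md5, filename):
    # Sketch of the file path: write while hashing, and delete the partial
    # file on a checksum mismatch so the transfer can simply be retried.
    md5 = hashlib.md5()
    with open(filename, "wb") as f:
        for chunk in file_chunks:
            md5.update(chunk)
            f.write(chunk)
    if md5.hexdigest() != expected_md5:
        os.remove(filename)
        raise ValueError("md5 mismatch; partial file removed, safe to retry")
```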
This is kind of concerning to me given the size of files people will put into S3. Have you considered using a temporary file? You could only use temp files if the download is large, and it would have the same behavior as a normal file except it is eventually written to stdout and removed from disk. What about writing a message out to stderr and returning a non-zero exit code (leaving retries up to the calling script if they want to use stdout)? Any other ideas you considered?
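For what it's worth, a rough sketch of the temp-file variant (purely illustrative; `SpooledTemporaryFile` and the 8 MB spool size are assumptions on my part, not part of this PR) could look like:

```python
import hashlib
import shutil
import sys
import tempfile


def download_to_stdout_via_tempfile(file_chunks, expected_md5,
                                    spool_max=8 * 1024 * 1024):
    # SpooledTemporaryFile keeps small downloads in memory and transparently
    # spills to disk once it grows past spool_max, so a large object is never
    # held fully in memory while the md5 is being verified.
    md5 = hashlib.md5()
    with tempfile.SpooledTemporaryFile(max_size=spool_max) as tmp:
        for chunk in file_chunks:
            md5.update(chunk)
            tmp.write(chunk)
        if md5.hexdigest() != expected_md5:
            # Leaving the with-block discards the temp file automatically.
            raise ValueError("md5 mismatch; safe to retry")
        tmp.seek(0)
        shutil.copyfileobj(tmp, sys.stdout.buffer)
```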
This is for a single download, and the cutoff for the multipart threshold is 8 MB, so there will be at most that much in memory for a non-multipart download, since you can only perform an operation on one file when streaming. The memory issue is more of a concern for multipart operations, which I will discuss at the bottom of the comment section. On a side note, I like the idea of temporary files.
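As a small illustration of that bound (the constant is the 8 MB cutoff mentioned above; the function and strategy names are made up for the example):

```python
MULTIPART_THRESHOLD = 8 * 1024 * 1024  # the 8 MB cutoff mentioned above


def choose_download_strategy(object_size):
    # Illustrative only: objects at or below the threshold are fetched in one
    # piece, so at most ~8 MB is ever buffered for a non-multipart streaming
    # download; anything larger goes through the multipart path, whose memory
    # use is bounded by its queues instead.
    if object_size <= MULTIPART_THRESHOLD:
        return 'single'
    return 'multipart'
```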
Overall I'd say this looks pretty good. My main concerns:
Yeah, that's a good idea. Here is a synopsis of all the different transfer scenarios with their worst-case memory usage (the arithmetic is spelled out in the short sketch after this comment):

Upload:
Maximum memory estimation: 8 MB (the maximum size of a non-multipart upload)

Multipart Upload:
Maximum memory estimation: 50 MB (5 MB chunks * 10 chunks in the queue at a time)

Download:
Maximum memory estimation: 8 MB (the maximum size of a non-multipart download)

Multipart Download:
Maximum memory estimation: 20 MB = 1 MB (size of the items in the queue, which are chunks of a thread's specified part) * 20 (the maximum size of the write queue)

Conclusion: Given that temporary files can save memory, are there any drawbacks I should be aware of? If not, I will probably convert everywhere I use a
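Spelling out that arithmetic (the chunk sizes and queue depths are the ones quoted in the estimates above, not values read out of the code):

```python
MB = 1024 * 1024

upload_max             = 8 * MB       # largest non-multipart upload
multipart_upload_max   = 5 * MB * 10  # 5 MB chunks * 10 chunks queued at a time
download_max           = 8 * MB       # largest non-multipart download
multipart_download_max = 1 * MB * 20  # 1 MB queue items * 20-deep write queue

assert multipart_upload_max == 50 * MB
assert multipart_download_max == 20 * MB
```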
Looks good.
This feature enables users to stream from stdin to s3 or from s3 to stdout. Streaming large files is both multithreaded and uses multipart transfers. The streaming feature is limited to single file ``cp`` commands.
This includes adding more tests, simplifying the code, and some PEP8 cleaning.
Added the ability to stream data using ``cp``.
This feature enables users to stream from stdin to s3 or from s3 to stdout. Streaming large files is both multithreaded and uses multipart transfers. The streaming feature is limited to single file ``cp`` commands. You can look at some of the documentation changes to see how to run the commands.

Here is a synopsis:

For uploading a stream from stdin to s3, use:

    aws s3 cp - s3://my-bucket/stream

For downloading an s3 object as a stdout stream, use:

    aws s3 cp s3://my-bucket/stream -

So for example, if I had the object ``s3://my-bucket/stream``, I could run this command:

    aws s3 cp s3://my-bucket/stream - | aws s3 cp - s3://my-bucket/new-stream

This command would download the object ``stream`` from the bucket ``my-bucket`` and write it to stdout. Then the data in stdout will be piped to stdin and uploaded from stdin to an object with the key ``new-stream`` in the s3 bucket ``my-bucket``.

cc @jamesls @danielgtaylor