
Added the ability to stream data using cp. #903

Merged 5 commits into aws:develop from kyleknap:streams on Sep 29, 2014

Conversation

@kyleknap
Contributor

kyleknap commented Sep 2, 2014

This feature enables users to stream from stdin to s3 or from s3 to stdout.
Streaming large files is both multithreaded and uses multipart transfers.
The streaming feature is limited to single file cp commands.

You can look at some of the documentation changes to see how to run the commands.
Here is a synopsis:
For uploading a stream from stdin to s3, use:
aws s3 cp - s3://my-bucket/stream

For downloading an s3 object as a stdout stream, use:
aws s3 cp s3://my-bucket/stream -

So for example, if I had the object s3://my-bucket/stream, I could run this command:
aws s3 cp s3://my-bucket/stream - | aws s3 cp - s3://my-bucket/new-stream

This command downloads the object stream from the bucket my-bucket and writes it to stdout. The data written to stdout is then piped to the second command's stdin and uploaded to an object with the key new-stream in the same bucket my-bucket.
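
For scripted use, the same stdin streaming can be driven from another program. Below is a minimal, hypothetical Python sketch; it assumes the AWS CLI with this feature is on PATH and reuses the example bucket s3://my-bucket from above.

import subprocess

# Hypothetical driver: generate data in-process and pipe it into
# `aws s3 cp - s3://...`, which reads the object body from stdin.
proc = subprocess.Popen(
    ["aws", "s3", "cp", "-", "s3://my-bucket/stream"],
    stdin=subprocess.PIPE,
)
for i in range(1000):
    proc.stdin.write(("line %d\n" % i).encode("utf-8"))
proc.stdin.close()
proc.wait()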

cc @jamesls @danielgtaylor

@coveralls

Coverage Status

Coverage increased (+0.02%) when pulling abce027 on kyleknap:streams into 2bdb58b on aws:develop.

# Need to save the data to be able to check the etag for a stream
# because once the data is written to the stream there is no
# undoing it.
payload = write_to_file(None, etag, md5, file_chunks, True)
Contributor

It's a little unclear to me here - is this actually reading in the entire contents of the file to be printed later?

Contributor Author


Yes, it is if the object is being streamed to standard out. This is needed because if you write an object out to stdout while doing the md5 calculation, there is no way to erase the data already sent to stdout if there is an md5 error and the download needs to be retried. Therefore, I write to a buffer that is later written to stdout once I have ensured the md5 is correct. On the other hand, for a file, I write to the file as I calculate the md5 because I can delete the file if the md5s do not match.
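
To illustrate the ordering described above (buffer first, verify, then write), here is a rough sketch; it is not the actual implementation, and chunks/expected_md5 are hypothetical stand-ins for the downloaded data and the expected checksum.

import hashlib
import io
import sys

def write_stream_after_md5_check(chunks, expected_md5):
    # Buffer everything first: once bytes reach stdout they cannot be taken
    # back, so the md5 must be verified before anything is written out.
    buf = io.BytesIO()
    md5 = hashlib.md5()
    for chunk in chunks:
        md5.update(chunk)
        buf.write(chunk)
    if md5.hexdigest() != expected_md5:
        # Nothing was written to stdout, so the download can safely be retried.
        raise ValueError("md5 mismatch")
    sys.stdout.buffer.write(buf.getvalue())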

Contributor


This is kind of concerning to me given the size of files people will put into S3. Have you considered using a temporary file? You could only use temp files if the download is large, and it would have the same behavior as a normal file except it is eventually written to stdout and removed from disk. What about writing a message out to stderr and returning a non-zero exit code (leaving retries up to the calling script if they want to use stdout)? Any other ideas you considered?
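
One way to get the "temp file only when large" behavior suggested here is the standard library's tempfile.SpooledTemporaryFile, which stays in memory below a size threshold and rolls over to an actual temporary file above it. A sketch of the idea with an assumed 8 MB threshold, not a concrete proposal for this PR:

import tempfile

def buffer_download(chunks, spill_threshold=8 * 1024 * 1024):
    # Small downloads stay in memory; anything larger than spill_threshold is
    # transparently moved to a temporary file on disk.
    spool = tempfile.SpooledTemporaryFile(max_size=spill_threshold)
    for chunk in chunks:
        spool.write(chunk)
    spool.seek(0)
    return spool  # caller reads it back out; close() removes any temp file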

Contributor Author


This is for a single download, and the multipart threshold cutoff is 8 MB, so there will be at most that much in memory (for a non-multipart download) since you can only perform an operation on one file when streaming. The memory issue is more concerning with multipart operations, which I will discuss at the bottom of the comment section. On a side note, I like the idea of temporary files.

@danielgtaylor
Contributor

Overall I'd say this looks pretty good. My main concerns:

  1. Reading the entire file into memory
  2. It could use a little high-level implementation documentation. What happens internally when a stream is read in, or when a stream is output? How are chunks handled? How is multipart handled, and when is it used? Stuff like that, as it's a little tough to follow at the moment.

@kyleknap
Contributor Author

kyleknap commented Sep 4, 2014

Yeah that's a good idea. Here is a synopsis of all the different transfer scenarios with worst-case memory usage.

Upload:

  1. _pull_from_stream in s3handler reads data from stdin and inserts it into a BytesIO object.
  2. If the length of the BytesIO object is less than the multipart threshold, then you simply upload the BytesIO object (see the sketch below).

Maximum memory estimation: 8 MB (the maximum size of a non-multipart upload)
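
A simplified sketch of steps 1-2, assuming an 8 MB threshold constant in place of whatever the real code uses:

import io
import sys

MULTIPART_THRESHOLD = 8 * 1024 * 1024  # assumed 8 MB cutoff from above

def pull_initial_payload(stream=None):
    # Read up to the multipart threshold from stdin into a BytesIO object,
    # then peek one extra byte: if anything is left, a multipart upload is
    # needed; otherwise the BytesIO object is uploaded as-is.
    stream = stream if stream is not None else sys.stdin.buffer
    data = stream.read(MULTIPART_THRESHOLD)
    extra = stream.read(1)
    return io.BytesIO(data + extra), bool(extra)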

Multipart Upload:

  1. Repeat step 1 from the upload case.
  2. If there is more data to read from stdin, begin a multipart upload.
  3. Submit a create-multipart-upload task and, soon after, submit a task to upload the first part (the data pulled from stdin in step 1).
  4. Continue to pull from the stdin stream and place parts to upload in a queue to be processed. All of these parts are BytesIO objects, and the maximum size of the queue is 10 for this operation. The pull-from-stream operation must wait if the queue is full (see the sketch below).

Maximum memory estimation: 50 MB (5 MB chunks * 10 chunks in queue at a time)
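
The back-pressure in step 4 is essentially a bounded producer/consumer queue. A minimal sketch using the 5 MB part size and queue depth of 10 from the estimate above; upload_part is a hypothetical stand-in for whatever submits an UploadPart request, and the real task plumbing is omitted.

import io
import queue

PART_SIZE = 5 * 1024 * 1024    # 5 MB parts, per the estimate above
MAX_QUEUED_PARTS = 10          # producer blocks once 10 parts are waiting

part_queue = queue.Queue(maxsize=MAX_QUEUED_PARTS)  # shared, bounded queue

def produce_parts(stream, part_queue):
    # Pull fixed-size parts from the stream into BytesIO objects; put() blocks
    # when the queue is full, capping memory near PART_SIZE * MAX_QUEUED_PARTS.
    while True:
        data = stream.read(PART_SIZE)
        if not data:
            break
        part_queue.put(io.BytesIO(data))
    part_queue.put(None)       # sentinel: no more parts

def consume_parts(part_queue, upload_part):
    part_number = 1
    while True:
        part = part_queue.get()
        if part is None:
            break
        upload_part(part_number, part)
        part_number += 1

The producer would run alongside consumer threads; the bounded queue is what enforces the roughly 50 MB ceiling estimated above.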

Download:

  1. While downloading the object, calculate the md5 and store the data in a BytesIO object.
  2. Once we know the md5 is correct, we write to stdout.

Maximum memory estimation: 8 MB (the maximum size of a non-multipart download)

Multipart Download:

  1. Each thread begins performing a range-download on a specific part of the object.
  2. Each thread must wait until its turn has arrived (meaning the part it is currently downloading is the next part required in the stream).
  3. Once it is a thread's turn, it reads its part in chunks and places each chunk on a queue in order of being read.
  4. An IO thread takes these chunks off of the queue and writes them in the same order they came in (see the sketch below).

Maximum memory estimation: 20 MB = 1 MB (the size of items in the queue, which are chunks of a thread's specified part) * 20 (the maximum size of the write queue)
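
The "wait until it is your turn" coordination in steps 2-4 can be pictured as a shared counter guarded by a condition variable. This is a hypothetical simplification, not the actual s3handler code, and it folds the IO-thread handoff into a single writer object:

import threading

class OrderedWriter:
    # Lets download threads hand over their chunks only when their part is the
    # next one expected, so the output stream is written strictly in order.
    def __init__(self, outfile):
        self._outfile = outfile
        self._next_part = 0
        self._cond = threading.Condition()

    def write_part(self, part_number, chunks):
        with self._cond:
            while part_number != self._next_part:
                self._cond.wait()          # not this part's turn yet
            for chunk in chunks:
                self._outfile.write(chunk)
            self._next_part += 1
            self._cond.notify_all()        # wake the thread holding the next part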

Conclusion:
Currently 50 MB is the maximum amount of memory used when streaming. What are your thoughts on that amount? I originally did not think of using temporary files, but now that I think of it, they will be very useful when the streams get very large. Currently, with 5 MB parts, the maximum stream size that someone can upload is 5 GB (which is too small). That is why I added an --expect-size parameter so that the chunksize can be updated such that the entire upload fits in fewer than 1000 parts. The issue, though, is that the chunksizes will increase from the originally expected 5 MB size, so that may use too much memory if, say, the stream were a TB in size.

Given the fact that temporary files can save me memory, are there any drawbacks I should be aware of? If not, I will probably convert everywhere I use a BytesIO object to a temporary file.
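
For reference, the --expect-size adjustment described above boils down to dividing the expected size by the maximum part count and rounding up, never going below the default part size. A sketch using the 5 MB / 1000-part numbers from this comment (the real constants and helper name may differ):

import math

MIN_CHUNKSIZE = 5 * 1024 * 1024   # default 5 MB part size from this comment
MAX_PARTS = 1000                  # part-count ceiling assumed in this comment

def adjust_chunksize(expected_size):
    # Grow the part size just enough that the whole stream fits within
    # MAX_PARTS parts; never shrink below the default.
    return max(MIN_CHUNKSIZE, int(math.ceil(expected_size / float(MAX_PARTS))))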

@coveralls

Coverage Status

Coverage decreased (-0.04%) when pulling 19ea686 on kyleknap:streams into 999ad81 on aws:develop.

@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling 0e1ff2d on kyleknap:streams into 999ad81 on aws:develop.

@coveralls

Coverage Status

Coverage increased (+0.04%) when pulling 9022a59 on kyleknap:streams into 999ad81 on aws:develop.

@jamesls
Member

jamesls commented Sep 20, 2014

:shipit: Looks good.

This includes adding more tests, simplifying the code, and some PEP8 cleaning.
@coveralls

Coverage Status

Coverage increased (+0.05%) when pulling 4716948 on kyleknap:streams into ab363c3 on aws:develop.
