Using aws s3 cp in multiple process/threading #1018

Closed
delbinwen opened this issue Nov 21, 2014 · 6 comments

@delbinwen

Hi,

First, please let me describe our AWS CLI usage scenario. We store a set of files on S3 under the same prefix, say s3://my_bucket/, and a set of EC2 instances will copy each file, process it, and upload the result to another bucket. We use SWF, so multiple workers on EC2 actually perform the tasks, and of course we use aws s3 cp. I think this scenario is quite common.
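
For concreteness, a minimal sketch of what one of our workers does per task might look like this (the bucket names and the process_file step are placeholders, not our actual code):

    #!/bin/bash
    # Hypothetical per-task worker: download one object, process it locally,
    # and upload the result to a different bucket. process_file is a stand-in
    # for our real processing step.
    set -euo pipefail

    KEY="$1"    # object key handed to this worker by SWF

    aws s3 cp "s3://my_bucket/${KEY}" "/tmp/${KEY}"
    process_file "/tmp/${KEY}" "/tmp/${KEY}.out"
    aws s3 cp "/tmp/${KEY}.out" "s3://my_output_bucket/${KEY}.out"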

It works well with fewer than 5 files to process. However, if we process many files (100+), things can go wrong. There are two observations:

  1. aws s3 cp appears to get stuck; it can take several hours to download a single file (less than 1 GB), or it never returns. We use the basic command syntax, like aws s3 cp s3://my_bucket/file.1 /tmp/
  2. aws s3 cp returns exit code 0 but did not complete all parts. We capture the stdout in our log and it looks like below (see the sketch after this list):
    'stdout': u'Completed 1 of 685 part(s) with 1 file(s) remaining\rCompleted 2 of 685 part(s) with 1 file(s) remaining\rCompleted 3 of 685 part(s) with 1 file(s) remaining\rCompleted 4 of 685 part(s) with 1 file(s) remaining\rCompleted 5 of 685 part(s) with 1 file(s) remaining\rCompleted 6 of 685 part(s) with 1 file(s) remaining\rCompleted 7 of 685 part(s) with 1 file(s) remaining\rCompleted 8 of 685 part(s) with 1 file(s) remaining\rCompleted 9 of 685 part(s) with 1 file(s) remaining\rCompleted 10 of 685 part(s) with 1 file(s) remaining\r\n'
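
The progress updates are separated by carriage returns because the CLI rewrites a single terminal line in place, which is why the captured log collapses into one long string. A small sketch of one way to make such a capture readable, assuming stdout is piped to a file:

    # Translate the CLI's carriage-return progress separators into newlines
    # so each "Completed N of M part(s)" update lands on its own log line.
    aws s3 cp s3://my_bucket/file.1 /tmp/ | tr '\r' '\n' > cp.log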

I think the --debug option might be required to help the investigation. I will try to work on this, but the issue is quite urgent, so I want to know if there is any other information we could provide to assist.
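
A sketch of how we plan to capture it; --debug writes its verbose logging to stderr:

    # Capture the CLI's verbose debug logging, which goes to stderr,
    # so a hung or incomplete transfer leaves a trace we can attach here.
    aws s3 cp s3://my_bucket/file.1 /tmp/ --debug 2> /tmp/cp-debug.log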

Thanks,
Wesley

@kyleknap
Contributor

When do you try to download the files from S3? Is it after the EC2 instances copy the object to the new bucket, or while the EC2 instances are performing the copy?

If you are trying to download objects from a location in a bucket while objects are being copied to that same location at the same time, there could be potential problems in downloading the files. The main issue is that we perform only one ListObjects call (or more if pagination is required) to enumerate all of the objects in the bucket, so if objects are added while that listing is in progress, it may throw off the process.
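
To make the observable behavior concrete (this is an illustration, not the CLI's internals): a prefix-wide copy enumerates the keys once up front and operates on that snapshot, so anything added afterwards is not picked up. You can reproduce the same snapshot semantics explicitly with s3api, e.g.:

    # Snapshot the listing first (the CLI paginates automatically), then copy
    # exactly those keys; objects added after this point are simply not seen.
    # Assumes keys contain no whitespace.
    aws s3api list-objects --bucket my_bucket --query 'Contents[].Key' \
        --output text | tr '\t' '\n' > keys.txt

    while read -r key; do
        aws s3 cp "s3://my_bucket/${key}" /tmp/
    done < keys.txt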

@delbinwen
Author

We didn't upload new files to s3://my_bucket while the EC2 instances were processing. In the recent runs, we had uploaded the files to s3://my_bucket one or two weeks earlier; we only instructed the EC2 instances to process them, and we did not upload the processed files back to the same bucket.

As we observed, the problem occurs when multiple processes use aws s3 cp to copy files from the same bucket. Our current configuration runs 12 (EC2 instances) * 8 (processes per instance) = 96 processes accessing the same S3 bucket with the AWS CLI concurrently.

@kyleknap
Contributor

Interesting. I believe this is because aws s3 cp is already multi-threaded: it uses 10 threads to perform S3-related actions. So on a single EC2 instance you could have 80 threads (10 threads * 8 processes) in total, all trying to make requests to S3. This could cause your threads to starve, in the sense that there is not enough bandwidth to serve all of these requests simultaneously.

This explanation makes sense given your observation:
"It works well with less than 5 files to process. However, if we process many files (100+), things could go wrong."

If there are only 5 S3 objects being downloaded, there would only be 40 threads making requests (5 threads * 8 processes), provided the files are not transferred via multipart. Since this is half of the thread maximum I stated before, the threads starve less. However, if you are downloading 100+ files, you will be hitting that 80-thread maximum.

So I guess my question now is: Is it possible to have fewer processes running the aws s3 cp commands, say 1 or 2 as opposed to 8? I believe if you can cut this down, you will see better results.
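
A sketch of one way to bound the fan-out without changing the worker layout: drive the copies from a single controller process with a small process pool (the pool size of 2 and the keys.txt file are illustrative):

    # Run at most 2 aws s3 cp processes at a time instead of 8 independent
    # workers per instance; keys.txt holds one object key per line.
    xargs -P 2 -I {} aws s3 cp "s3://my_bucket/{}" /tmp/ < keys.txt

Later versions of the CLI also grew a configuration setting for the per-command thread pool (10 by default), e.g. aws configure set default.s3.max_concurrent_requests 2; limiting concurrency is tracked in #907, linked below.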

@delbinwen
Author

Thank you for the reply! It's good to know how aws s3 cp handles requests.

After doing more investigation, we found a network bandwidth issue because all EC2 instances are behind a NAT; it has been resolved.

In the recent runs we did not see the incomplete-download issue. I'm not sure whether the resolved bandwidth issue could explain it. Is there a timeout mechanism implemented that could be related to this issue?

@kyleknap
Contributor

kyleknap commented Dec 1, 2014

S3 usually throws RequestTimeout errors when the socket has not been read from or written to in a while. This could happen if the threads are starving each other, as I explained in my previous comment. However, this error will be displayed to the user. Are you seeing those types of errors, or a different type of error?
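
Until retries are configurable (retry configuration is tracked in #1092, linked below), a blunt workaround is to re-run the command on a non-zero exit; a rough sketch:

    #!/bin/bash
    # Naive retry wrapper: re-run aws s3 cp up to 5 times with a growing
    # backoff. Errors such as RequestTimeout surface as a non-zero exit plus
    # a message on stderr. Note this would NOT catch the exit-code-0 case
    # reported above.
    for attempt in 1 2 3 4 5; do
        if aws s3 cp "s3://my_bucket/file.1" /tmp/; then
            exit 0
        fi
        echo "attempt ${attempt} failed, retrying..." >&2
        sleep $((attempt * 5))
    done
    exit 1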

@jamesls
Member

jamesls commented Jan 15, 2015

Consolidating issues into tracking issues. There are three that are relevant here:

  1. Limit bandwidth: Add ability to limit bandwidth for S3 uploads/downloads #1090
  2. Set retry configuration: Add ability for S3 commands to increase retry count #1092
  3. Limit concurrency: Limit number of parallel s3 transfers #907

Closing in favor of those issues.
