S3 sync: s3 -> local redownloads unchanged files #648
Comments
Could you run If it's syncing files it will print a log message as to why it's doing so:
I'd be curious to see what the log messages say. |
Output:
These files exist locally: the first
Additional: the files in my log example are large movie files, but the issue doesn't appear to be isolated to one file size or type. There are plenty of small thumbnail images and other file types, too. |
I discovered this separately and opened a case with AWS support (Case 163857941). The problem with the aws s3 sync CLI is with dates: when mirroring from S3 to a local EC2 Linux file system with aws s3 sync, some of the files in the local file system do not get their timestamp updated to match the timestamp on S3. See the log file previously mentioned in Case 163857941: s3://xxxx/case163857941/debugSyncLog.txt Michael |
I'm having a similar issue, but going in the opposite direction. I want to back up a server. There's 500+ GB of data, so I run the command: aws s3 sync /mnt/main/backup s3://mybucket/backup The first time, it rolls through pretty well. It gets to 2,200+ parts completed and stops. I figured there's probably a limit on the number of parts transferred per session. No biggie, just start the sync again, right? Wrong. The aws client starts resyncing from the top, gets 2,200+ parts in, and stops. Needless to say, I have only 1 subdirectory of the backup synced. I have versioning turned on, and I can confirm that new versions of these files are being created every time sync runs. I suspended versioning to see if it would help, but it appears to be doing the same thing. The only thing I can think of to do next is to install s3fs, mount the bucket as a filesystem mountpoint, and run rsync -azut. I'd rather not have to do that. Any ideas?! |
My recollection is that the date-related problem occurs both ways. |
I believe this is fixed in the latest version of the CLI? Can anyone confirm if you're still seeing this on later versions of the CLI (>= 1.4.1)? If so, I'll reopen and take another look. |
Hard to say. I updated awscli and ran sync a couple times just to be sure but it was still downloading files again. It still seems to only affect a subset of files every time but there aren't patterns apparent within the problem files.... files of all types, sizes, and dates are affected.
I also ran a limited sync on a different box where there has always been a problem file (eet_poster.pdf) and it's possible this case is something related to case sensitivity
But I was still having the same problem (files not found when they actually exist) with plenty of other files:
|
Not sure why this has been marked closed. Currently experiencing a similar issue: |
@jamesls I'm also not sure this is fixed. I can run an s3 sync twice in a row. Out of ~10k files, the same ~80 get re-downloaded every time. Nothing comes up for the 'comparator' grep, though. |
I am having the same issue once "--exact-timestamps" is turned on. Randomly, multiple old files in the same bucket download to the local machine. On the next sync they're fine. I ran it with --debug | grep 'modified time' and noticed that the timestamps were off by 1 second every time this anomaly occurs. The comparison is in this file, and I think I'll just modify it to ignore diffs of < 2 seconds. That'll work for me, anyway. Hope this helps someone else. |
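The tolerance idea in the comment above can be sketched roughly as follows. This is only an illustration and not the code in awscli's exacttimestamps.py: the function name timestamps_match and the tolerance_seconds parameter are hypothetical, and the 2-second threshold is taken from the comment.

```python
from datetime import datetime, timedelta, timezone


def timestamps_match(src_time, dst_time, tolerance_seconds=2):
    """Treat two modification times as equal when they differ by
    less than tolerance_seconds (hypothetical helper, not awscli API)."""
    delta = abs((src_time - dst_time).total_seconds())
    return delta < tolerance_seconds


# Example values are made up for illustration.
s3_time = datetime(2017, 3, 1, 12, 0, 0, tzinfo=timezone.utc)
local_time = s3_time + timedelta(seconds=1)

print(timestamps_match(s3_time, local_time))                      # 1 second apart
print(timestamps_match(s3_time, s3_time + timedelta(seconds=5)))  # 5 seconds apart
```

With this rule a 1-second drift is treated as unchanged, while a 5-second difference still counts as a modification.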
Even with --size-only I get dupes. Some of them are files that have the same filename with different capitalizations, so that may be one issue. |
Different capitalization == different file on Linux, and thus probably with awscli too, even on Windows. I would definitely expect that behavior on Linux. |
@reverie - run the sync with debug on and try to find the lines where the program is comparing size and stamp and filename. Look for the reason it wants to sync a particular file. |
Re: capitalization, I'm not surprised that it's treating them as different files. I'm surprised that it's re-syncing them every time. Anyway, that's only some of them. Looks from the log like "file does not exist at destination" is the given reason for syncing. |
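One possible explanation for the capitalization cases (an assumption, not something confirmed in this thread) is that two S3 keys differing only in case map to a single file on a case-insensitive local filesystem, so one of them always looks missing or changed. A minimal sketch for spotting such keys is below; find_case_collisions is a hypothetical helper, the key names are made up, and str.casefold() only approximates how a real filesystem folds case.

```python
from collections import defaultdict


def find_case_collisions(keys):
    """Group keys that are identical once case is ignored.

    Returns a list of groups, each containing two or more keys that
    would share one path on a case-insensitive filesystem.
    """
    groups = defaultdict(list)
    for key in keys:
        groups[key.casefold()].append(key)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical key listing for illustration only.
example_keys = ["docs/Report.pdf", "docs/report.pdf", "img/a.png"]
print(find_case_collisions(example_keys))
```

The list of keys could come from any bucket listing; any group this reports is worth checking by hand against the files that keep re-syncing.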
Looks like I've created a workaround for my issue by changing line 40 in exacttimestamps.py from |
Actually, |
This continues to be an issue, with the latest aws cli installed I still have about 100 files out of roughly 200k that continue to download on every sync. |
Did you edit exacttimestamps.py as indicated above? This should fix the issue until you update awscli again |
On one of my servers this file is located at /usr/local/lib/python2.7/dist-packages/awscli/customizations/s3/syncstrategy/exacttimestamps.py |
I wasn't using the --exact-timestamps flag, but I'll check it out |
That didn't fix it for me, I also tried with "--dryrun --debug 2>&1 | grep comparator" and got no output |
Do you get anything relevant without piping to grep? Also, in sizeonly.py you could modify the debug output to include src_file.size and dst_file.size to see exactly what the two sizes are for the files that are continually resyncing. |
OK, I created a second local directory so I could target one of the directories instead of running through 260k files every time. I have the same problem, so I grepped for a specific filename and got this (I added the asterisk): MainThread - awscli.customizations.s3.syncstrategy.base - DEBUG - syncing: //wp-content/uploads/2017/03/DOC_image10-1-1-150x150.png -> /Volumes/Backup/tmp_backup/DOC_image10-1-1-150x150.png, file does not exist at destination. I've run it 3 times and get the same result every time (I run an actual sync in between the debug runs). |
So evidently the files are never being created locally. I double-checked and that seems to be the case: that file is never added. |
Huh. Well, that explains why they try to download every time, but not why they're failing to write in the first place. Don't suppose the destination is Windows and the total filename length, including folders, is too long? If Windows, run procmon to see why it's failing to write. If Linux, do you have full ownership of the target dir? |
Nope, destination is a mac, and that's not even the longest filename in the directory, I checked and there are longer filenames that sync successfully |
I've also checked S3 to make sure the file isn't corrupt or something, but I can view it without issue. The only thing I can see is that most of the other files have been updated at least once since this bucket was set to be revisioned, but I'm not sure why that would cause an issue |
How about deleting and re-uploading the problem files in the bucket? |
Wow, no luck with that either. Well, at least it's Friday; maybe I'll have a eureka moment over the weekend |
Happening for me too. When I do a sync from the bucket to a local disc it re-downloads every single file each time. No laughs. |
Saw this too, and it was because the files were from the future. In my case, 5 of 10,000 files kept re-downloading; these 5 files were dated 2018, and it's currently the year 2017. |
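To check whether future-dated files are involved, the local copy can be scanned for modification times later than the current time. The sketch below is an assumption-laden illustration: find_future_files is a hypothetical helper, it only inspects local files, and it does not look at the timestamps stored in S3.

```python
import os
import time


def find_future_files(root, now=None):
    """Return sorted paths under root whose mtime is later than now.

    now defaults to the current time in seconds since the epoch.
    """
    now = time.time() if now is None else now
    future = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > now:
                future.append(path)
    return sorted(future)
```

Calling find_future_files on the sync destination and comparing the result with the files that keep re-downloading would show whether this cause applies.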
Seeing the same. |
Similar issue here when mirroring a local folder to an S3 bucket. The AWS CLI re-uploads the same modified file several times (EC2 instance, CentOS Linux release 7.5.1804, aws-cli/1.16.89). |
Same problem from S3 to S3: all the files have already been copied using sync, and when I re-run the same sync lots of files are copied again! I'm using |
Same issue when syncing local directory to S3 on Windows. Of about 8GB of data, it wants to re-copy about 1GB every time. Even though the files exist on the destination the dryrun debug command is reporting that the files don't exist. Here's just one example:
|
We store a pile of files in S3 and it's handy to have a local copy of our S3 buckets for development and backup. At first glance aws s3 sync looks like it'll work.
I ran sync on our entire bucket and it completed successfully; it downloaded the whole bucket to local disk. The second time I ran the command, it redownloaded some files that haven't changed (on S3 or locally) alongside the new ones.
These files were just downloaded with the first sync. The local modified time & size match S3's values.
While I never rule out the possibility of user error, I don't see an obvious cause. The first S3->Local sync completed normally; I run it again and it redownloads some files every time that haven't changed. Not all, just some. And it's the same files redownloaded every time.
My cli version is aws-cli/1.2.13 Python/2.7.6 Darwin/10.8.0
This may or may not be related to issue #599, but I won't personally make that call.