s3 sync repeatedly downloads files without modification #5730

Closed
shankerwangmiao opened this issue Nov 17, 2020 · 9 comments

Comments

@shankerwangmiao

shankerwangmiao commented Nov 17, 2020

When s3 sync is used with --exact-timestamps to keep a local copy of a remote S3 bucket up to date, files that have not been modified in the bucket are downloaded again on every run.

It is caused by the following code snippet:

last_update_tuple = self._last_modified_time.timetuple()
mod_timestamp = time.mktime(last_update_tuple)
set_file_utime(filename, int(mod_timestamp))

The modified time is written to the local filesystem at one-second resolution. On the next sync, the local timestamp therefore cannot match the one reported for the object in the remote S3 bucket, which has sub-second resolution.
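A minimal sketch of the effect, assuming a hypothetical remote timestamp with a fractional second. It repeats the timetuple() / mktime() / int() steps from the snippet above and then reads the value back the way a later sync might; the names `remote`, `stored`, and `local` are illustrative only and are not taken from the CLI source.

```python
import time
from datetime import datetime

# Hypothetical remote modification time with a fractional second
# (naive local time, only to keep the round trip simple).
remote = datetime(2019, 1, 20, 20, 27, 29, 921000)

# Same steps as the snippet above: timetuple() has no sub-second field,
# so the fractional part is already gone before mktime() is called.
stored = int(time.mktime(remote.timetuple()))

# What a later run would read back from the local filesystem.
local = datetime.fromtimestamp(stored)

lost = remote - local
exact_match = (local == remote)
print("stored locally:", local)
print("difference    :", lost)
print("exact match   :", exact_match)
```

Under these assumptions an exact comparison never succeeds unless the remote timestamp happens to fall on a whole second.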

@kdaily
Member

kdaily commented Nov 23, 2020

Hi @shankerwangmiao,

I'm not quite following the scenario here. The current behavior should be that if the file sizes are the same and the last modified time in S3 is greater (newer) than that of the local file, a sync is not performed. Could you describe the scenario more completely and/or provide debug logs demonstrating the behavior?

@kdaily kdaily added guidance Question that needs advice or information. response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. labels Nov 23, 2020
@kdaily
Member

kdaily commented Nov 23, 2020

This issue (#599) might be useful.

@kdaily kdaily added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. and removed response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. labels Nov 23, 2020
@shankerwangmiao
Author

Hi, thanks for the information. My use case is building a local mirror of an S3 bucket. Sometimes, especially for things like apt repos, files are updated in the bucket but their size is unchanged. The default behavior is therefore not what I want, so I use --exact-timestamps.

However, this option introduces another problem. As I pointed out in my original issue, when s3 sync stores the last-modified timestamp in the local filesystem, the resolution is one second. On the next invocation of s3 sync, the timestamp read back from the local filesystem can seldom equal the one stored in the S3 attribute, since the resolution of the latter is much higher. As a result, nearly all files are downloaded from the S3 bucket again.

I suggest two changes: (1) when storing a timestamp in the local filesystem, use a higher resolution; (2) when comparing timestamps, take into account the differing resolutions of the local filesystem and S3.
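The second suggestion could look roughly like the sketch below. The helper `same_second` is hypothetical and is not part of the AWS CLI; it only illustrates comparing two timezone-aware timestamps after dropping their sub-second parts.

```python
from datetime import datetime, timezone


def same_second(a: datetime, b: datetime) -> bool:
    """Hypothetical helper: compare two aware datetimes at one-second resolution."""
    # Drop the sub-second part of both values before comparing.
    return a.replace(microsecond=0) == b.replace(microsecond=0)


# Illustrative values: the remote time carries a fractional second,
# the local time was stored at one-second resolution.
remote = datetime(2019, 1, 20, 20, 27, 29, 921000, tzinfo=timezone.utc)
local = datetime(2019, 1, 20, 20, 27, 29, tzinfo=timezone.utc)
newer = datetime(2019, 1, 20, 20, 27, 30, 5000, tzinfo=timezone.utc)

print(remote == local)              # exact comparison
print(same_second(remote, local))   # comparison at one-second resolution
print(same_second(newer, local))    # a genuinely newer object still differs
```

With a comparison like this, an object modified in a later second would still be detected, while an unchanged object would no longer be treated as different.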

@shankerwangmiao
Author

I've updated my original issue to mention the option I'm using. Sorry for leaving out that detail.

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Nov 24, 2020
@kdaily
Member

kdaily commented Nov 24, 2020

Hi @shankerwangmiao, thanks for the update and clarification!

Are you using an AWS S3 bucket or a third-party implementation? I'm asking specifically because of this part of your comment about the timestamp in the S3 bucket:

"since the resolution of the latter is much higher"

This also sounds like issue #5369, since AWS S3 object timestamps are stored at one-second resolution. See this comment: #5369 (comment)

@kdaily kdaily added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. s3sync s3syncstrategy labels Nov 24, 2020
@shankerwangmiao
Author

Hi, thanks for the information. I can confirm that my symptom is exactly the same as in #5369. However, I don't know the exact implementation on the server side. The reply from the server contains timestamps with sub-second resolution (milliseconds in the example below):

<Contents>
  <Key>favicon.ico</Key>
  <LastModified>2019-01-20T20:27:29.921Z</LastModified>
  <ETag>&quot;2c60955a31d74d6b554cc43434088b0c&quot;</ETag>
  <Size>15086</Size>
  <StorageClass>STANDARD</StorageClass>
  <Type>Normal</Type>
</Contents>
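As a rough illustration of what a client sees in this reply, the sketch below parses a shortened copy of the fragment above and inspects the fractional part of LastModified. The format string is an assumption based on this one sample, and the fragment omits the XML namespace that a full listing response would normally carry.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Shortened copy of the listing entry shown above.
fragment = """<Contents>
  <Key>favicon.ico</Key>
  <LastModified>2019-01-20T20:27:29.921Z</LastModified>
  <Size>15086</Size>
</Contents>"""

entry = ET.fromstring(fragment)
raw = entry.findtext("LastModified")

# Assumed format: ISO 8601 in UTC with a fractional second and a trailing "Z".
parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)

# The same instant as it would look after being stored at one-second resolution.
truncated = parsed.replace(microsecond=0)

print(parsed.microsecond)   # fractional part, expressed in microseconds
print(parsed == truncated)  # whether the stored value still matches exactly
```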

@shankerwangmiao
Author

Since aws s3 sync is so useful, it would be nice if rounding on the CLI side could be considered.

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Nov 24, 2020
@kdaily kdaily removed the guidance Question that needs advice or information. label Dec 3, 2020
@kdaily
Member

kdaily commented Dec 3, 2020

Hi @shankerwangmiao,

Since the AWS S3 standard stores time at one-second resolution, any other implementation would need to do the same to ensure compatibility. Changing that in the CLI could cause incompatibility, especially if the AWS standard were to change as well.

I'll pass along this feedback to the S3 team to see about making this implementation detail more visible. I appreciate your information!

@kdaily kdaily closed this as completed Dec 3, 2020
@github-actions

github-actions bot commented Dec 3, 2020

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.
