Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cached download should be ignored if URL changes #349

Open
dbnicholson opened this issue Sep 12, 2023 · 4 comments
Open

Cached download should be ignored if URL changes #349

dbnicholson opened this issue Sep 12, 2023 · 4 comments
Assignees
Labels

Comments

@dbnicholson
Copy link

Describe the bug

If the URL has changed but the destination path hasn't then the existing destination file should be ignored for the purposes of onlyIfModified. Otherwise, if the timestamp of the new URL is older than the existing download, it will be skipped and you'll continue using the wrong download.

Sample build script

plugins {
    id 'de.undercouch.download' version '5.5.0'
}

def welcomeScreenVersion = '6.46.0'

tasks.register('downloadWelcomeScreen', Download) {
    src "https://github.com/endlessm/kolibri-explore-plugin/releases/download/v${welcomeScreenVersion}/welcome-screen.zip"
    dest layout.buildDirectory.file('welcome-screen.zip')
    onlyIfModified true
}

tasks.register('clean', Delete) {
    delete buildDir
}

If I run the task and then change welcomeScreenVersion to 6.44.0, the task is skipped:

$ ls -l build/welcome-screen.zip 
-rw-rw-r-- 1 dan dan 5336916 Sep 12 16:32 build/welcome-screen.zip
$ ./gradlew -i downloadWelcomeScreen
...
> Task :downloadWelcomeScreen UP-TO-DATE
Caching disabled for task ':downloadWelcomeScreen' because:
  Build cache is disabled
Task ':downloadWelcomeScreen' is not up-to-date because:
  Task.upToDateWhen is false.
Watching 3 directories to track changes
Download https://github.com/endlessm/kolibri-explore-plugin/releases/download/v6.44.0/welcome-screen.zip
Not modified. Skipping 'https://github.com/endlessm/kolibri-explore-plugin/releases/download/v6.44.0/welcome-screen.zip'
Watching 4 directories to track changes

BUILD SUCCESSFUL in 1s
1 actionable task: 1 up-to-date
$ ls -l build/welcome-screen.zip 
-rw-rw-r-- 1 dan dan 5336916 Sep 12 16:32 build/welcome-screen.zip
@dbnicholson
Copy link
Author

While I think it's nice to use the timestamp of the destination file as the source for the If-Modified-Since header, it loses the URL. It would be better to keep a map of URL to timestamp like is done with ETags. That way if the URL changes, the timestamp won't be found and no If-Modified-Since header will be added.

@dbnicholson
Copy link
Author

An extension of this problem is that if you set useETag, then it won't be used because the URL isn't in the cached JSON and then it's operating entirely on Last-Modified. So, even though the ETag of the new URL doesn't match the previous download, it's essentially being ignored.

@michel-kraemer
Copy link
Owner

michel-kraemer commented Sep 23, 2023

Thanks for reporting this. I can see that this behaviour seems odd but everything works as it's supposed to.

I understand that saving the URL in a map like it's done with the ETag would solve your particular example, but I'm not sure if it would address the root of the problem. It is true that the plugin uses the file's timestamp as a fallback if it cannot find an ETag for a given URL. However, if it can find one, it also skips downloading.

Try this: Enable useETag and then download welcomeScreenVersion 6.44.0 first. Change the version to 6.46.0 and download this too. The plugin will now know both ETags. If you change the version back to 6.44.0, you will see the exact same behaviour you're seeing in your example: the task will be skipped.

The problem is even worse here, by the way. If you do it the other way around, the plugin will not be able to know which of the two ETags is newer and will not overwrite version 6.46.0 with 6.44.0 like it would do if useETag was disabled and it only compared the timestamps.

The only way to really solve the root cause would be to know if the task's inputs have changed since the last call, and if they had, skip any up-to-date checks and redownload the file. This might become possible in the future when Gradle's configuration cache becomes enabled by default, but it is certainly not possible now.

From my point of view, if the user changes the URL of the file to download, they should manually remove the destination file or call the clean task. Another (more preferable) possibility would be to include the version number in the destination file name. If you need the filename in a subsequent task, you can always get it through downloadWelcomeScreen.dest or downloadWelcomeScreen.outputs.

Let me know if this helps. I'm open to feedback and ideas!

P.S.: I'm currently out of office with limited Internet access. It might take me some days to reply.

@dbnicholson
Copy link
Author

I understand that saving the URL in a map like it's done with the ETag would solve your particular example, but I'm not sure if it would address the root of the problem. It is true that the plugin uses the file's timestamp as a fallback if it cannot find an ETag for a given URL. However, if it can find one, it also skips downloading.

This just seems like a bug in the cache implementation. It's only storing the URL but what matters is the ETag of the saved file. That's why you use the timestamp of the saved file for the Last-Modified handling, right? Without the saved file path, you can't make an accurate decision about when to invalidate the cache entries since you don't have enough information. You have the previous ETag of a URL, but you don't know if the file your using it to make decisions about was created from one of those URLs or not. Knowing the ETag of a URL at some point in time doesn't actually give you any useful information if it's not associated to a file you have locally.

I think the ETag cache should be keyed by the saved path:

{
  "path/to/downloaded/file": {
    "URL": "https://someurl",
    "ETag": "\"someetag\""
  }
}

This has 2 effects:

  • A strong connection is maintained between the saved file and the parameters used to form it.
  • The previous cache entry is automatically deleted when you overwrite that paths entry it after a new download. You can no longer accidentally get invalid cache data for the file at that path.

From my point of view, if the user changes the URL of the file to download, they should manually remove the destination file or call the clean task. Another (more preferable) possibility would be to include the version number in the destination file name. If you need the filename in a subsequent task, you can always get it through downloadWelcomeScreen.dest or downloadWelcomeScreen.outputs.

That's fine if the person making the changes and doing the builds is always the same person, but it breaks down otherwise. What if my CI system has the download cached? What if another user in my team has the download cached and they didn't see that the URL changed? The only reliable way to handle that is to not cache at all or to always clean, which are effectively the same thing.

I'll change my tasks so they include the version number in the saved filename. I was trying to avoid it before so that when the code references the asset it can always use the generic name, but I don't actually need to reference it from the code right now.

To circle back to one thing I said in the issue description - I think there should be an option to ignore Last-Modified if you're using an ETag and the server supplies one. That would put it in line with the HTTP behavior for If-Modified-Since:

When used in combination with If-None-Match, it is ignored, unless the server doesn't support If-None-Match.

Which makes sense. If I say "I have a file that has this identifier, tell me if the one you have matches", then I don't care if it's older or newer. I only care whether the content is the same. It's like what a file archiver would do with checksums.

dbnicholson added a commit to dbnicholson/kolibri-chaquopy that referenced this issue Sep 23, 2023
download-task doesn't handle cached files with changed URLs well[1], so
include the version in the saved file.

1. michel-kraemer/gradle-download-task#349
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants