Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use CacheControl instead of custom cache code #1748

Merged
merged 2 commits into from
May 9, 2014

Conversation

dstufft
Copy link
Member

@dstufft dstufft commented Apr 24, 2014

  • Bundle CacheControl and lockfile
  • Use cross platform standard path locations
  • Some way to disable caching?
  • Soft fail if we cannot write to our cache directory
  • Ensure tests do not use a cache, unless we're testing the cache.
  • Write tests
  • Limit how long we'll cache a PackageFinder response.
  • Ensure PyPI uses sane CacheControl headers.
  • Deprecate the --download-cache option & removes the download cache code.
  • Remove the in memory page cache on the index
  • Use CacheControl to cache all cacheable HTTP requests to the filesystem.
    • Properly handles CacheControl headers for unconditional caching.
    • Use ETag and Last-Modified headers to attempt to do a conditional HTTP request to speed up cache misses and turn them into cache hits.
    • Remove some concurrency unsafe code in the download cache accesses.

Closes #1732
Closes #1734
Closes #1634
Closes #1673
Closes #1141
Closes #1287
Closes #685

"1.8",
"--download-cache has been deprecated and will be removed in "
" the future. Pip now has a generic HTTP Cache which will be "
"used automatically."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be cleaner if the second sentence was "Pip now automatically uses and configures its cache"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(or something like that, I don't think the HTTP component is important,is my point)

@alex
Copy link
Member

alex commented Apr 24, 2014

(awesome stuff)

@dstufft
Copy link
Member Author

dstufft commented May 7, 2014

Currently this branch is adding two new options, --cache-dir and --no-cache. Is there any reasonable way we can collapse that down? I've thought about something like --cache-dir=off or something of the like to disable the cache but the magic string-ness of that is kind of offputting.

What do people think?

@alex
Copy link
Member

alex commented May 7, 2014

Doesn't pip already have CLI flags/env vars for configuring the cache? Why does this require adding new flags.

@dstufft
Copy link
Member Author

dstufft commented May 7, 2014

So two primary reasons:

  1. It was best practice to set --download-cache in order to speed things up, however it's best practice not to set --cache-dir unless you actually need to move it somewhere non standard. There isn't a very good way to communicate "hey we reversed what you should do with this option" besides deprecating and making a new option.
  2. The --download-cache was conceptually a cache of downloaded packages, whereas --cache-dir is a cache for anything we end up caching. In this PR we have <cache_dir>/http to store the HTTP cache, in a future PR we might have <cache_dir>/wheel to store automatically built wheels by default.

And a secondary reason:

  • The cache format has changed, the two formats could live side by side in the same directory, however I think it's much clearer for it to have a new directory.

@dstufft
Copy link
Member Author

dstufft commented May 7, 2014

Oh, and also:

  1. I think it's confusing to have something called "download cache" which will be used for things that are not downloads. I also think it's confusing to have a different cache option for each kind of cache we end up having.

@dstufft
Copy link
Member Author

dstufft commented May 7, 2014

@pypa/pip-developers Thoughts on --cache-dir off or --cache-dir none or something similar?

@pfmoore
Copy link
Member

pfmoore commented May 7, 2014

I don't dislike having --cache-dir xxx and --no-cache-dir - they feel like a single parameter with a "no" form and a "yes" form with an argument (I know that's not how it's implemented, but that's what it feels like). So the clutter isn't as bad as it might seem.

Having said that, I don't object to --cache-dir off or --cache-dir none particularly, if you prefer one of them...

@dstufft
Copy link
Member Author

dstufft commented May 8, 2014

Ok, I went with --no-cache-dir. I'd love to refactor our options sometimes so it'd support automatic support for "off" settings like --no-whatever.

@dstufft
Copy link
Member Author

dstufft commented May 8, 2014

@pfmoore If you can try this out on Windows that'd be awesome. All you need to do is run this branch, and try to like, pip install Django or something big twice in a row. First time should be slow second time should be instant. Also you should see a C:\Users\<username>\AppData\Local\pip\Cache directory created with nested folders inside of it and files named after hashes.

@dstufft
Copy link
Member Author

dstufft commented May 8, 2014

Test failures fixed in psf/cachecontrol#30

@pfmoore
Copy link
Member

pfmoore commented May 8, 2014

@dstufft Worked fine. There's a definite improvement in speed but second install of django is still far from instant... (57 sec with cache, 66 sec cold) All the time is in the actual install, not the download though, so that's not relevant.

So yeah, looks good to me.

@dstufft
Copy link
Member Author

dstufft commented May 8, 2014

Err, yea sorry I meant the download itself should be nearly instant, sorry for not being clear.

Thanks a lot for testing that out for me!


Like all options, :ref:`--download-cache <install_--download-cache>`, can also
be set as an environment variable, or placed into the pip config file. See the
:ref:`Configuration` section.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's always been common for people to be surprised that pip still crawls PyPI for packages, even after they are fully cached to fulfill their requirements. paragraphs 2-4 explains that, and explains how to achieve a local-only lookup. we should keep that portion and apply it to docs on the new general cache.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, it doesn't really explain how to achieve local lookup, it just directs you to the cookbook area which does explain how to achieve local lookup.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point on the local lookup bit, however I'm not sure where this would go right now. At the moment we don't have any specific docs on the cache other than what is generated automatically. The idea is that this cache should just work and people shouldn't generally need to mess with it or even know about it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it disabuses people of the notion that the download-cache was a cache by package name, and then provides the link on how to achieve the speedy install they really wanted.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but there is no more download cache with this PR, it's just a generic HTTP cache now like a browser has.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the cache has expired, it'll just do a conditional GET instead of a full GET.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's no more "download cache", but there's still a url-based cache, that walks PyPI regardless of how full your cache is, right? that's what I want people to still understand somewhere in the docs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might walk PyPI it might not touch the network at all, it depends.

It uses the same cache semantics as the browser. If PyPI has a CacheControl: max-age: 600 header in the response to an url, then for the next 600 seconds requesting that same URL again will not touch PyPI. After 600 seconds has gone by the item is still in the cache, but instead of just using the item from the cache unconditionally it will make a conditional GET, which will hit PyPI and say "I want this URL, but only if it doesn't match this ETag". PyPI will look what ETag they sent, and either return a 304 with an empty response if the ETags match, or it'll send a 200 with the content like normal. If we get a 304 response then we'll just refresh the timer on the cache for another 600 seconds (or whatever the max-age is on that response) and return the data we have stored in the cache.

Does that make sense?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added documentation about the Caching.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool, thanks

@dstufft
Copy link
Member Author

dstufft commented May 9, 2014

Ok, so @qwcode was worried about a server misconfiguration and causing us to cache it for an unreasonably long time. I've been thinking about this and looking at options and I realized we can specify a maximum time we're willing to accept a cache for by adding a CacheControl: max-age=NNN header to the request headers like:

session.get(url, headers={"CacheControl": "max-age=60"})

If we want to reject any cached items older than 60 (for example) even if the server says we can. With that in mind here's what I'm thinking:

  • PackageFinder: We set a max cache time of 5 minutes, we'll still get efficient etag support outside of that 5 minutes however the longest a person has to wait for a new release to show up is 5 minutes (and then only if they already tried to install it within the last 5 minutes).
  • Packages: We don't set a max cache time and we just trust this server. If someone does something bad like deletes an upload and replaces it then people will need to purge their cache or wait for the server's cache time to expire. Otherwise these URLs shouldn't change.

@qwcode
Copy link
Contributor

qwcode commented May 9, 2014

@dstufft ok, nice, that sounds good to me.

@dstufft
Copy link
Member Author

dstufft commented May 9, 2014

Ok, I went with 10 minutes instead of 5 for PackageFinder.

I'm going to adjust PyPI and then merge this later tonight.

* Deprecates the --download-cache option & removes the download
  cache code.
* Removes the in memory page cache on the index
* Uses CacheControl to cache all cacheable HTTP requests to the
  filesystem.
  * Properly handles CacheControl headers for unconditional
    caching.
  * Will use ETag and Last-Modified headers to attempt to do a
    conditional HTTP request to speed up cache misses and turn
    them into cache hits.
  * Removes some concurrency unsafe code in the download cache
    accesses.
  * Uses a Cache-Control request header to limit the maximum
    length of time a cache is valid for.
* Adds pip.appdirs to handle platform specific application
  directories such as cache, config, data, etc.
@dstufft
Copy link
Member Author

dstufft commented May 9, 2014

Okies, stuff got deployed.

dstufft added a commit that referenced this pull request May 9, 2014
Use CacheControl instead of custom cache code
@dstufft dstufft merged commit a26293b into pypa:develop May 9, 2014
@dstufft dstufft deleted the http-cache branch May 9, 2014 23:55
@pfmoore pfmoore mentioned this pull request Aug 21, 2014
@lock lock bot added the auto-locked Outdated issues that have been locked by automation label Jun 4, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Jun 4, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
auto-locked Outdated issues that have been locked by automation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants