Use CacheControl instead of custom cache code #1748

dstufft · 2014-04-24T01:11:27Z

Closes #1732
Closes #1734
Closes #1634
Closes #1673
Closes #1141
Closes #1287
Closes #685

alex · 2014-04-24T01:14:34Z

pip/commands/install.py

+                "1.8",
+                "--download-cache has been deprecated and will be removed in "
+                " the future. Pip now has a generic HTTP Cache which will be "
+                "used automatically."


I think it would be cleaner if the second sentence was "Pip now automatically uses and configures its cache"

(or something like that, I don't think the HTTP component is important,is my point)

alex · 2014-04-24T01:18:16Z

(awesome stuff)

dstufft · 2014-05-07T13:03:32Z

Currently this branch is adding two new options, --cache-dir and --no-cache. Is there any reasonable way we can collapse that down? I've thought about something like --cache-dir=off or something of the like to disable the cache but the magic string-ness of that is kind of offputting.

What do people think?

alex · 2014-05-07T13:05:59Z

Doesn't pip already have CLI flags/env vars for configuring the cache? Why does this require adding new flags.

dstufft · 2014-05-07T13:17:56Z

So two primary reasons:

It was best practice to set --download-cache in order to speed things up, however it's best practice not to set --cache-dir unless you actually need to move it somewhere non standard. There isn't a very good way to communicate "hey we reversed what you should do with this option" besides deprecating and making a new option.
The --download-cache was conceptually a cache of downloaded packages, whereas --cache-dir is a cache for anything we end up caching. In this PR we have <cache_dir>/http to store the HTTP cache, in a future PR we might have <cache_dir>/wheel to store automatically built wheels by default.

And a secondary reason:

The cache format has changed, the two formats could live side by side in the same directory, however I think it's much clearer for it to have a new directory.

dstufft · 2014-05-07T13:20:32Z

Oh, and also:

I think it's confusing to have something called "download cache" which will be used for things that are not downloads. I also think it's confusing to have a different cache option for each kind of cache we end up having.

dstufft · 2014-05-07T21:15:20Z

@pypa/pip-developers Thoughts on --cache-dir off or --cache-dir none or something similar?

pfmoore · 2014-05-07T22:01:27Z

I don't dislike having --cache-dir xxx and --no-cache-dir - they feel like a single parameter with a "no" form and a "yes" form with an argument (I know that's not how it's implemented, but that's what it feels like). So the clutter isn't as bad as it might seem.

Having said that, I don't object to --cache-dir off or --cache-dir none particularly, if you prefer one of them...

dstufft · 2014-05-08T00:29:37Z

Ok, I went with --no-cache-dir. I'd love to refactor our options sometimes so it'd support automatic support for "off" settings like --no-whatever.

dstufft · 2014-05-08T02:48:46Z

@pfmoore If you can try this out on Windows that'd be awesome. All you need to do is run this branch, and try to like, pip install Django or something big twice in a row. First time should be slow second time should be instant. Also you should see a C:\Users\<username>\AppData\Local\pip\Cache directory created with nested folders inside of it and files named after hashes.

dstufft · 2014-05-08T03:19:47Z

Test failures fixed in psf/cachecontrol#30

pfmoore · 2014-05-08T08:56:29Z

@dstufft Worked fine. There's a definite improvement in speed but second install of django is still far from instant... (57 sec with cache, 66 sec cold) All the time is in the actual install, not the download though, so that's not relevant.

So yeah, looks good to me.

dstufft · 2014-05-08T11:09:25Z

Err, yea sorry I meant the download itself should be nearly instant, sorry for not being clear.

Thanks a lot for testing that out for me!

qwcode · 2014-05-08T18:45:24Z

docs/reference/pip_install.rst

-
-Like all options, :ref:`--download-cache <install_--download-cache>`, can also
-be set as an environment variable, or placed into the pip config file.  See the
-:ref:`Configuration` section.


it's always been common for people to be surprised that pip still crawls PyPI for packages, even after they are fully cached to fulfill their requirements. paragraphs 2-4 explains that, and explains how to achieve a local-only lookup. we should keep that portion and apply it to docs on the new general cache.

Hmm, it doesn't really explain how to achieve local lookup, it just directs you to the cookbook area which does explain how to achieve local lookup.

Fair point on the local lookup bit, however I'm not sure where this would go right now. At the moment we don't have any specific docs on the cache other than what is generated automatically. The idea is that this cache should just work and people shouldn't generally need to mess with it or even know about it.

it disabuses people of the notion that the download-cache was a cache by package name, and then provides the link on how to achieve the speedy install they really wanted.

Right, but there is no more download cache with this PR, it's just a generic HTTP cache now like a browser has.

If the cache has expired, it'll just do a conditional GET instead of a full GET.

there's no more "download cache", but there's still a url-based cache, that walks PyPI regardless of how full your cache is, right? that's what I want people to still understand somewhere in the docs.

It might walk PyPI it might not touch the network at all, it depends.

It uses the same cache semantics as the browser. If PyPI has a CacheControl: max-age: 600 header in the response to an url, then for the next 600 seconds requesting that same URL again will not touch PyPI. After 600 seconds has gone by the item is still in the cache, but instead of just using the item from the cache unconditionally it will make a conditional GET, which will hit PyPI and say "I want this URL, but only if it doesn't match this ETag". PyPI will look what ETag they sent, and either return a 304 with an empty response if the ETags match, or it'll send a 200 with the content like normal. If we get a 304 response then we'll just refresh the timer on the cache for another 600 seconds (or whatever the max-age is on that response) and return the data we have stored in the cache.

Does that make sense?

I added documentation about the Caching.

cool, thanks

dstufft · 2014-05-09T17:22:21Z

Ok, so @qwcode was worried about a server misconfiguration and causing us to cache it for an unreasonably long time. I've been thinking about this and looking at options and I realized we can specify a maximum time we're willing to accept a cache for by adding a CacheControl: max-age=NNN header to the request headers like:

session.get(url, headers={"CacheControl": "max-age=60"})

If we want to reject any cached items older than 60 (for example) even if the server says we can. With that in mind here's what I'm thinking:

PackageFinder: We set a max cache time of 5 minutes, we'll still get efficient etag support outside of that 5 minutes however the longest a person has to wait for a new release to show up is 5 minutes (and then only if they already tried to install it within the last 5 minutes).
Packages: We don't set a max cache time and we just trust this server. If someone does something bad like deletes an upload and replaces it then people will need to purge their cache or wait for the server's cache time to expire. Otherwise these URLs shouldn't change.

qwcode · 2014-05-09T17:44:40Z

@dstufft ok, nice, that sounds good to me.

dstufft · 2014-05-09T21:46:55Z

Ok, I went with 10 minutes instead of 5 for PackageFinder.

I'm going to adjust PyPI and then merge this later tonight.

* Deprecates the --download-cache option & removes the download cache code. * Removes the in memory page cache on the index * Uses CacheControl to cache all cacheable HTTP requests to the filesystem. * Properly handles CacheControl headers for unconditional caching. * Will use ETag and Last-Modified headers to attempt to do a conditional HTTP request to speed up cache misses and turn them into cache hits. * Removes some concurrency unsafe code in the download cache accesses. * Uses a Cache-Control request header to limit the maximum length of time a cache is valid for. * Adds pip.appdirs to handle platform specific application directories such as cache, config, data, etc.

dstufft · 2014-05-09T23:55:24Z

Okies, stuff got deployed.

Use CacheControl instead of custom cache code

alex reviewed Apr 24, 2014
View reviewed changes

Ivoz mentioned this pull request Apr 27, 2014

IOError: File name too long #1634

Closed

qwcode reviewed May 8, 2014
View reviewed changes

dstufft added 2 commits May 9, 2014 18:17

Bundle CacheControl and lockfile

077fa14

dstufft added a commit that referenced this pull request May 9, 2014

Merge pull request #1748 from dstufft/http-cache

a26293b

Use CacheControl instead of custom cache code

dstufft merged commit a26293b into pypa:develop May 9, 2014

dstufft deleted the http-cache branch May 9, 2014 23:55

pfmoore mentioned this pull request Aug 21, 2014

Standardize our Paths #1733

Closed

lock bot added the auto-locked Outdated issues that have been locked by automation label Jun 4, 2019

lock bot locked as resolved and limited conversation to collaborators Jun 4, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use CacheControl instead of custom cache code #1748

Use CacheControl instead of custom cache code #1748

dstufft commented Apr 24, 2014

alex Apr 24, 2014

alex Apr 24, 2014

alex commented Apr 24, 2014

dstufft commented May 7, 2014

alex commented May 7, 2014

dstufft commented May 7, 2014

dstufft commented May 7, 2014

dstufft commented May 7, 2014

pfmoore commented May 7, 2014

dstufft commented May 8, 2014

dstufft commented May 8, 2014

dstufft commented May 8, 2014

pfmoore commented May 8, 2014

dstufft commented May 8, 2014

qwcode May 8, 2014

dstufft May 8, 2014

dstufft May 8, 2014

qwcode May 8, 2014

dstufft May 8, 2014

dstufft May 8, 2014

qwcode May 8, 2014

dstufft May 8, 2014

dstufft May 8, 2014

qwcode May 8, 2014

dstufft commented May 9, 2014

qwcode commented May 9, 2014

dstufft commented May 9, 2014

dstufft commented May 9, 2014

Use CacheControl instead of custom cache code #1748

Use CacheControl instead of custom cache code #1748

Conversation

dstufft commented Apr 24, 2014

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alex commented Apr 24, 2014

dstufft commented May 7, 2014

alex commented May 7, 2014

dstufft commented May 7, 2014

dstufft commented May 7, 2014

dstufft commented May 7, 2014

pfmoore commented May 7, 2014

dstufft commented May 8, 2014

dstufft commented May 8, 2014

dstufft commented May 8, 2014

pfmoore commented May 8, 2014

dstufft commented May 8, 2014

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dstufft commented May 9, 2014

qwcode commented May 9, 2014

dstufft commented May 9, 2014

dstufft commented May 9, 2014