The fast-deps feature is not a fast way to obtain dependencies #8670
By applying this patch
diff --git a/src/pip/_internal/network/lazy_wheel.py b/src/pip/_internal/network/lazy_wheel.py
index c2371bf5..c9244bb5 100644
--- a/src/pip/_internal/network/lazy_wheel.py
+++ b/src/pip/_internal/network/lazy_wheel.py
@@ -60,6 +60,7 @@ class LazyZipOverHTTP(object):
def __init__(self, url, session, chunk_size=CONTENT_CHUNK_SIZE):
# type: (str, PipSession, int) -> None
+ self._count = 0
head = session.head(url, headers=HEADERS)
raise_for_status(head)
assert head.status_code == 200
@@ -158,6 +159,7 @@ class LazyZipOverHTTP(object):
def __exit__(self, *exc):
# type: (*Any) -> Optional[bool]
+ print(self._count, 'requests to fetch metadata from', self._url[107:])
return self._file.__exit__(*exc)
@contextmanager
@@ -192,6 +194,7 @@ class LazyZipOverHTTP(object):
def _stream_response(self, start, end, base_headers=HEADERS):
# type: (int, int, Dict[str, str]) -> Response
"""Return HTTP response to a range request from start to end."""
+ self._count += 1
headers = {'Range': 'bytes={}-{}'.format(start, end)}
headers.update(base_headers)
return self._session.get(self._url, headers=headers, stream=True)

I obtained the number of requests made to fetch the metadata from each wheel:

$ pip install tensorflow --no-cache-dir | grep 'requests to fetch metadata from' | sort -k7
10 requests to fetch metadata from astunparse-1.6.3-py2.py3-none-any.whl
17 requests to fetch metadata from cachetools-4.1.1-py3-none-any.whl
1 requests to fetch metadata from gast-0.3.3-py2.py3-none-any.whl
4 requests to fetch metadata from google_auth-1.20.0-py2.py3-none-any.whl
1 requests to fetch metadata from google_auth_oauthlib-0.4.1-py2.py3-none-any.whl
4 requests to fetch metadata from google_pasta-0.2.0-py3-none-any.whl
13 requests to fetch metadata from grpcio-1.30.0-cp38-cp38-manylinux2010_x86_64.whl
14 requests to fetch metadata from h5py-2.10.0-cp38-cp38-manylinux1_x86_64.whl
16 requests to fetch metadata from Keras_Preprocessing-1.1.2-py2.py3-none-any.whl
4 requests to fetch metadata from Markdown-3.2.2-py3-none-any.whl
23 requests to fetch metadata from numpy-1.18.5-cp38-cp38-manylinux1_x86_64.whl
17 requests to fetch metadata from oauthlib-3.1.0-py2.py3-none-any.whl
10 requests to fetch metadata from opt_einsum-3.3.0-py3-none-any.whl
16 requests to fetch metadata from protobuf-3.12.4-cp38-cp38-manylinux1_x86_64.whl
7 requests to fetch metadata from pyasn1-0.4.8-py2.py3-none-any.whl
20 requests to fetch metadata from pyasn1_modules-0.2.8-py2.py3-none-any.whl
7 requests to fetch metadata from requests_oauthlib-1.3.0-py2.py3-none-any.whl
1 requests to fetch metadata from rsa-4.6-py3-none-any.whl
14 requests to fetch metadata from scipy-1.4.1-cp38-cp38-manylinux1_x86_64.whl
1 requests to fetch metadata from six-1.15.0-py2.py3-none-any.whl
20 requests to fetch metadata from tensorboard-2.3.0-py3-none-any.whl
17 requests to fetch metadata from tensorboard_plugin_wit-1.7.0-py3-none-any.whl
20 requests to fetch metadata from tensorflow-2.3.0-cp38-cp38-manylinux2010_x86_64.whl
14 requests to fetch metadata from tensorflow_estimator-2.3.0-py2.py3-none-any.whl
1 requests to fetch metadata from Werkzeug-1.0.1-py2.py3-none-any.whl
1 requests to fetch metadata from wheel-0.34.2-py2.py3-none-any.whl

I don't yet know what to do with this information, but it might help in deciding the multiplier (of the chunk size) below which range requests should be avoided.

Edit: attempting to download up to the chunk size at a time seems to work better:
diff --git a/src/pip/_internal/network/lazy_wheel.py b/src/pip/_internal/network/lazy_wheel.py
index c2371bf5..d5967057 100644
--- a/src/pip/_internal/network/lazy_wheel.py
+++ b/src/pip/_internal/network/lazy_wheel.py
@@ -60,6 +60,7 @@ class LazyZipOverHTTP(object):
def __init__(self, url, session, chunk_size=CONTENT_CHUNK_SIZE):
# type: (str, PipSession, int) -> None
+ self._count = 0
head = session.head(url, headers=HEADERS)
raise_for_status(head)
assert head.status_code == 200
@@ -109,8 +110,10 @@ class LazyZipOverHTTP(object):
all bytes until EOF are returned. Fewer than
size bytes may be returned if EOF is reached.
"""
+ download_size = max(size, self._chunk_size)
start, length = self.tell(), self._length
- stop = start + size if 0 <= size <= length-start else length
+ stop = length if size < 0 else min(start+download_size, length)
+ start = max(0, stop-download_size)
self._download(start, stop-1)
return self._file.read(size)
@@ -158,6 +161,7 @@ class LazyZipOverHTTP(object):
def __exit__(self, *exc):
# type: (*Any) -> Optional[bool]
+ print(self._count, 'requests to fetch metadata from', self._url[107:])
return self._file.__exit__(*exc)
@contextmanager
@@ -192,6 +196,7 @@ class LazyZipOverHTTP(object):
def _stream_response(self, start, end, base_headers=HEADERS):
# type: (int, int, Dict[str, str]) -> Response
"""Return HTTP response to a range request from start to end."""
+ self._count += 1
headers = {'Range': 'bytes={}-{}'.format(start, end)}
headers.update(base_headers)
return self._session.get(self._url, headers=headers, stream=True)

$ pip install tensorflow --no-cache-dir | grep 'requests to fetch metadata from' | sort -k7
2 requests to fetch metadata from astunparse-1.6.3-py2.py3-none-any.whl
2 requests to fetch metadata from cachetools-4.1.1-py3-none-any.whl
1 requests to fetch metadata from gast-0.3.3-py2.py3-none-any.whl
3 requests to fetch metadata from google_auth-1.20.0-py2.py3-none-any.whl
2 requests to fetch metadata from google_auth_oauthlib-0.4.1-py2.py3-none-any.whl
2 requests to fetch metadata from google_pasta-0.2.0-py3-none-any.whl
2 requests to fetch metadata from grpcio-1.30.0-cp38-cp38-manylinux2010_x86_64.whl
6 requests to fetch metadata from h5py-2.10.0-cp38-cp38-manylinux1_x86_64.whl
2 requests to fetch metadata from Keras_Preprocessing-1.1.2-py2.py3-none-any.whl
2 requests to fetch metadata from Markdown-3.2.2-py3-none-any.whl
19 requests to fetch metadata from numpy-1.18.5-cp38-cp38-manylinux1_x86_64.whl
3 requests to fetch metadata from oauthlib-3.1.0-py2.py3-none-any.whl
2 requests to fetch metadata from opt_einsum-3.3.0-py3-none-any.whl
12 requests to fetch metadata from protobuf-3.12.4-cp38-cp38-manylinux1_x86_64.whl
2 requests to fetch metadata from pyasn1-0.4.8-py2.py3-none-any.whl
4 requests to fetch metadata from pyasn1_modules-0.2.8-py2.py3-none-any.whl
2 requests to fetch metadata from requests_oauthlib-1.3.0-py2.py3-none-any.whl
2 requests to fetch metadata from rsa-4.6-py3-none-any.whl
15 requests to fetch metadata from scipy-1.4.1-cp38-cp38-manylinux1_x86_64.whl
2 requests to fetch metadata from six-1.15.0-py2.py3-none-any.whl
16 requests to fetch metadata from tensorboard-2.3.0-py3-none-any.whl
2 requests to fetch metadata from tensorboard_plugin_wit-1.7.0-py3-none-any.whl
11 requests to fetch metadata from tensorflow-2.3.0-cp38-cp38-manylinux2010_x86_64.whl
5 requests to fetch metadata from tensorflow_estimator-2.3.0-py2.py3-none-any.whl
2 requests to fetch metadata from Werkzeug-1.0.1-py2.py3-none-any.whl
2 requests to fetch metadata from wheel-0.34.2-py2.py3-none-any.whl

Exiting right after resolution completes (see the diff below), the patched version takes only 20s while the current implementation takes 30s. I'm filing a PR for this.
$ git diff
diff --git a/src/pip/_internal/resolution/resolvelib/resolver.py b/src/pip/_internal/resolution/resolvelib/resolver.py
index 43ea2486..aad532df 100644
--- a/src/pip/_internal/resolution/resolvelib/resolver.py
+++ b/src/pip/_internal/resolution/resolvelib/resolver.py
@@ -125,6 +125,7 @@ class Resolver(BaseResolver):
error = self.factory.get_installation_error(e)
six.raise_from(error, e)
+ exit()
req_set = RequirementSet(check_supported_wheels=check_supported_wheels)
for candidate in self._result.mapping.values():
ireq = candidate.get_install_requirement()
Noting that the (redesigned) JSON API and the simple API (pypi/warehouse#8254) may provide a more efficient and performance-wise more deterministic way to fetch metadata.
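For illustration, a minimal sketch of fetching dependency metadata from the PyPI JSON API; the URL pattern and response fields are the documented ones, but the helper function and the example project/version are made up for this sketch:

import requests

def get_requires_dist(project, version):
    # Hypothetical helper: ask the PyPI JSON API for a release's metadata
    # instead of issuing range requests against the wheel itself.
    url = f"https://pypi.org/pypi/{project}/{version}/json"
    response = requests.get(url)
    response.raise_for_status()
    # "requires_dist" is a list of PEP 508 requirement strings (or None).
    return response.json()["info"]["requires_dist"] or []

print(get_requires_dist("tensorflow", "2.3.0"))

One round trip per release is what makes this more predictable than a variable number of range requests against the wheel.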
Closed by #8681?
Should be!
I'm not exactly sure about closing this: whilst GH-8681 does make fast-deps faster, I don't have any benchmark to prove (at least to myself) that it's always or mostly faster than downloading the whole wheel, especially since most pure-Python packages are rather small. I'd prefer to keep this one open and close it myself when I'm more certain about the performance.
Here ya go. :)
Yay! I've been running some benchmarks lately and, funnily enough, dependency resolution with …
How does it perform in cases w/ backtracking though? Try |
How much slower?
Please check out the benchmark here. I couldn't get …
@McSinyx are you still interested in this? I modified lazy-wheel in
In https://github.com/pypa/pip/blob/main/src/pip/_internal/network/lazy_wheel.py#L95 I changed it to …
This gets us down to four range requests:
As a further optimization, I read all the bytes between the header offset of the file I want and the next header offset. Instead of letting …
That gets us down to the optimal two requests. (We should emit up to two Range requests for the footer, depending on whether the footer is smaller than the chunk size or not, and then we can emit one per compressed file.)
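For illustration, a rough sketch of that prefetch idea, assuming a lazy file object like pip's LazyZipOverHTTP with its internal _download(start, stop) and _length; the helper itself and the member-selection logic are made up here, not pip's actual code:

from zipfile import ZipFile

def prefetch_member(lazy_file, name_suffix="METADATA"):
    # Fetch everything between the wanted member's local header and the next
    # member's header in one range request, so zipfile's own reads hit the
    # already-downloaded region instead of triggering more requests.
    with ZipFile(lazy_file) as zf:  # central directory is already buffered
        infos = sorted(zf.infolist(), key=lambda info: info.header_offset)
        for i, info in enumerate(infos):
            if info.filename.endswith(name_suffix):
                start = info.header_offset
                # End of this member is the start of the next one; for the
                # last member, fall back to the end of the file.
                if i + 1 < len(infos):
                    end = infos[i + 1].header_offset
                else:
                    end = lazy_file._length  # pip-internal attribute, assumed here
                lazy_file._download(start, end - 1)  # pip-internal helper, assumed here
                return zf.read(info.filename)
    return None

In pip itself the metadata lives under the *.dist-info directory, so a real implementation would select members by that directory name rather than by a bare suffix.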
The feature also makes an unnecessary HEAD request.
Unfortunately …
I think that checks out, because at least …
I'm re-using this code to look at … It looks like this code limits Range requests to no more than CHUNK_SIZE bytes? Since metadata tends to be small it doesn't usually matter, but it would be nice if there were no upper limit.
Code to eliminate the HEAD request:

headers = dict(HEADERS)  # base headers from lazy_wheel.py; assumed to be copied here
headers["Range"] = f"bytes=-{CONTENT_CHUNK_SIZE}"
# if CONTENT_CHUNK_SIZE is bigger than the file:
# In [8]: response.headers["Content-Range"]
# Out[8]: 'bytes 0-3133374/3133375'
tail = session.get(url, headers=headers, stream=True)
# e.g. {'accept-ranges': 'bytes', 'content-length': '10240',
# 'content-range': 'bytes 12824-23063/23064', 'last-modified': 'Sat, 16
# Apr 2022 13:03:02 GMT', 'date': 'Thu, 21 Apr 2022 11:34:04 GMT'}
if tail.status_code != 206:
raise HTTPRangeRequestUnsupported("range request is not supported")
self._session, self._url, self._chunk_size = session, url, chunk_size
self._length = int(tail.headers["Content-Range"].partition("/")[-1])
self._file = NamedTemporaryFile()
self.truncate(self._length)
# length is also in Content-Length and Content-Range header
with self._stay():
self.seek(self._length - len(tail.content))
self._file.write(tail.content)
self._left: List[int] = [self._length - len(tail.content)]
self._right: List[int] = [self._length - 1]
Lazier wheel is here. It should be compatible, but I haven't tested it yet. It avoids the HEAD request. If you find that the last 10k of the file isn't a good enough heuristic, you can also look up the content range for the desired …
I had a chance to test it. I had to handle … This reduces the number of requests.
I added in the … With minimal requests (prefetch):
Originally posted over at Python Discuss:
Moreover, unlike the normal wheel download, the lazy implementation performs multiple requests. On unstable networks like mine, this makes it a lot slower than downloading the same amount of data in one single response (citation needed, I know this is generally believed but I'd love to read an article explaining the reason in detail). The first step to tackle this that I have in mind is to refuse to use range requests when a wheel is smaller than a certain size (a few times the chunk size, perhaps?). More experiments might be needed to further optimize this, but this is the first thing I can think of. I'd really appreciate any suggestions, even mere ideas of what to explore.
If possible, please assign this issue to me so I can better keep track of my GSoC project.
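As a rough illustration of the size-threshold idea above (the multiplier, the helper name, and the use of a plain requests-style session are assumptions for this sketch, not pip's actual decision point):

SIZE_MULTIPLIER = 4  # "a few times the chunk size"; the exact value is a guess

def should_use_range_requests(session, url, chunk_size):
    # Only bother with lazy (range-request) metadata fetching when the wheel
    # is large enough that skipping the full download can actually pay off.
    head = session.head(url, allow_redirects=True)
    head.raise_for_status()
    if head.headers.get("Accept-Ranges") != "bytes":
        return False  # the server does not advertise range support
    size = int(head.headers.get("Content-Length", 0))
    return size > SIZE_MULTIPLIER * chunk_size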