Index pip-installed packages #109

Draft: wants to merge 1 commit into main
Conversation

hoodmane (Member) commented May 24, 2024

This isn't very efficient; we should use PEP 658 metadata, which PyPI now provides, so we don't have to redownload the wheel. We should also do the fetches concurrently and asyncio.gather them. But it gets the basic idea across.

Would resolve #107.

cc @jaraco @ryanking13 do either of you see anything wrong with this concept?
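The two improvements mentioned above could be sketched roughly like this. This is a hedged illustration: `pep658_metadata_url`, `fetch_all_metadata`, and `fetch_text` are made-up names, not micropip's real API.

```python
# Sketch of the two fixes: PEP 658 serves a wheel's METADATA at the
# wheel URL plus a ".metadata" suffix, so we can avoid redownloading
# the whole wheel, and asyncio.gather runs the fetches concurrently.
# fetch_text is a stand-in for whatever async HTTP helper is used.
import asyncio
from collections.abc import Awaitable, Callable

def pep658_metadata_url(wheel_url: str) -> str:
    """PEP 658: metadata lives alongside the wheel under a fixed suffix."""
    return wheel_url + ".metadata"

async def fetch_all_metadata(
    wheel_urls: list[str],
    fetch_text: Callable[[str], Awaitable[str]],
) -> list[str]:
    # Kick off every fetch before awaiting any of them.
    tasks = [fetch_text(pep658_metadata_url(url)) for url in wheel_urls]
    return await asyncio.gather(*tasks)
```

Results come back in input order, which keeps the lock-entry bookkeeping simple.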

jaraco (Contributor) left a comment

This is an interesting idea, but as you point out, the querying and download of wheels seems expensive. Have we yet explored how important it is for the file_name and sha256 fields to be populated in the result? Could those fields instead be made optional for installed packages that no longer have their attribution? i.e. Why not make file_name: str | None?

file_name=wheel.url,
install_dir="site",
sha256=wheel.sha256,
imports=[],
Contributor:

It looks like imports is stubbed here. It should probably query top_level.txt, same as micropip.
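A minimal sketch of querying top_level.txt, in the spirit of what micropip does (the helper name is illustrative). Build backends other than setuptools may not write this file at all, hence the None check; `dist` would be an `importlib.metadata.Distribution`.

```python
# Read top_level.txt from an installed distribution's .dist-info.
# dist is duck-typed here; in practice it would be an
# importlib.metadata.Distribution.
def imports_from_top_level(dist) -> list[str]:
    text = dist.read_text("top_level.txt")
    if text is None:
        return []  # no top_level.txt; need some other fallback
    return text.split()
```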

Member Author:

It's a bit unfortunate that top_level.txt is setuptools only. As I understand it, the other build backends don't have anything similar. Do you know why?

Member:

I think top_level.txt is not very good for querying import names, and we should probably stop using it in micropip.

Ref: pyodide/pyodide#3006 (review)

Contributor:

Several good examples in that thread. Also, namespace packages aren't reflected accurately either:

 @ pip-run jaraco.functools -- -c "import importlib.metadata as md; print(md.distribution('jaraco.functools').read_text('top_level.txt'))"
jaraco

(jaraco.functools is one of many packages that exposes the top-level jaraco package; the unique offering of jaraco.functools is the python package of the same name)
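One alternative in the spirit of the linked review is to derive top-level names from the installed file list (`Distribution.files`, backed by RECORD) instead of top_level.txt, which handles the namespace-package case above. The helper name and filtering heuristics here are illustrative, not micropip's real logic.

```python
# Derive importable top-level names from the installed file list.
# For jaraco.functools this yields {"jaraco"} from the actual files,
# rather than trusting whatever top_level.txt claims.
def top_level_from_files(dist) -> set[str]:
    names: set[str] = set()
    for path in dist.files or []:
        top = path.parts[0]
        if top.endswith((".dist-info", ".data")) or top.startswith(".."):
            continue  # metadata dirs and files outside site-packages
        if top.endswith(".py"):
            names.add(top[:-3])  # single-module distribution
        else:
            names.add(top)       # package dir, including namespace pkgs
    return names
```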

micropip/_commands/freeze.py (resolved review threads)
ver = resp.releases.get(Version(dist.version), None)
if ver is None:
    return None
wheel = next(ver)
Contributor:

This approach seems risky. What if the package doesn't have a wheel, but only an sdist? What if the sdist is chosen over the wheel? What if the package was installed from a local checkout that advertises the same version (but is different)? What if the package was installed from a local checkout that advertises a local version (that's not present in PyPI)?

Member Author:

Well, I agree with these concerns. There needs to be some sort of architecture to our packaging system; currently there is no architecture, merely an accumulation of behaviors. I will answer your questions in defense of this approach because I think the answers contain interesting information about our domain logic, not because this approach is any good.

What if the package doesn't have a wheel, but only an sdist?

This is a great question: it will fail. If the package is platformed, then the pip build frontend cannot build it; the only way to build a Pyodide-platformed wheel is with our own pyodide build frontend. I looked into patching pip so that it uses our frontend, but I am pretty sure it is not a reasonable task without support from the pip team. A non-platformed py3-none-any package, on the other hand, can just upload an sdist and pip will download and install it. The problem is that this whole lock-file idea is predicated on the packages having come from somewhere, not from a local build. If we find packages that were installed from sdist, freeze() should probably raise an error complaining specifically about this.
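That raise-on-sdist behavior could be as simple as the sketch below. The names and the shape of `files` (a list of dicts with a "filename" key, as in the PyPI JSON API) are assumptions for illustration, not this PR's actual code.

```python
# Refuse to freeze a package whose release has no wheel: emitting an
# sdist URL into the lock file would not reproduce this environment.
class CannotLockError(RuntimeError):
    pass

def pick_lockable_wheel(files: list[dict]) -> dict:
    wheels = [f for f in files if f["filename"].endswith(".whl")]
    if not wheels:
        raise CannotLockError(
            "release has only an sdist; freeze() cannot produce a wheel URL"
        )
    return wheels[0]
```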

What if the package was installed from a local checkout that advertises the same version (but is different)?

Same issue: we can only install platformed packages from a URL. So yes, we would indeed freeze a different wheel than was actually installed. We should probably check for this. Note that pip freeze also can't save you from this situation, although in normal Python it is fine if pip itself installed the package from an sdist. But the fact that if you do

pip freeze > requirements.txt
python -m venv .venv-new
.venv-new/bin/pip install -r requirements.txt

then .venv-new can end up different from the current environment indicates that pip freeze does not generate a true lock but something weaker. Our aspiration is to generate a lock, and we should error with an explanation if we cannot generate one.

url = dist.read_text("PYODIDE_URL")
if url:
    pkg_entry = get_pkg_entry_micropip(dist, url)
elif IN_VENV:
Contributor:

I'm unsure why IN_VENV is relevant. Packages can be installed without virtualenvs (e.g. on PYTHONPATH). And virtualenvs can expose site-packages.

What is the motivation for varying on IN_VENV?

Member Author:

Yes, it is wrong to branch on IN_VENV. I meant to remove it but evidently failed to before I put up this PR. I think the correct thing to check is dist.read_text("INSTALLER") to see whether the package was installed by pip or micropip.

Packages can be installed without virtualenvs (e.g. on PYTHONPATH).

Right... some day it would be good to test our system against this; it is a good feature. The Pyodide python CLI will correctly pick up PYTHONPATH. Though I'm also not sure whether we should pick up the current PYTHONPATH when locking; it's not 100% clear to me what the right logic is there.

if ver is None:
    return None
wheel = next(ver)
await wheel.download({})
Contributor:

I see this method doesn't yet exist. This call could be very expensive and would need to manage the lifecycle of the downloaded content. It may also need to re-implement things that pip does like caching or re-using pip's cache.

Member Author:

As I said, this PR is intended as a worst possible implementation to get the discussion moving forward. I think the situation is that we only need to download the wheel if the index does not implement PEP 658. Since PyPI finally does implement PEP 658 we should do that if possible.

As far as I know, .dist-info is too weak to lock against other indexes because the index is not written anywhere into the .dist-info directory. Maybe we could write a packaging PEP asking for an additional INDEX file that records which index, if any, a file was installed from? It would complement direct_url.json, which covers the case where we installed from a URL rather than an index. There's a third case where we installed from a file-system path; I think that goes into direct_url.json too, but we can't lock it in any case.

caching or re-using pip's cache

Yes, that's a great idea. Does pip expose any of its caching logic? If not, is it reasonably stable?

Contributor:

Pip does expose a list command that will list all matching packages in the cache:

 ~ @ pip cache list foo.bar-0.2 --format abspath
/Users/jaraco/Library/Caches/pip/wheels/0b/ec/8b/f669ae78d94a691e6c63872909196857158b3c20f045058e7b/foo.bar-0.2-py3-none-any.whl
/Users/jaraco/Library/Caches/pip/wheels/10/f7/18/a4a94460da8bf74f841d7df4f10984b730907bea49783615a4/foo.bar-0.2-py3-none-any.whl
/Users/jaraco/Library/Caches/pip/wheels/11/9f/15/3a750be9586c80770d8355ac4e9dc3c58237b1d994f36a2a21/foo.bar-0.2-py3-none-any.whl
/Users/jaraco/Library/Caches/pip/wheels/1b/9d/16/254558b9de54233baf2ef9a35eac16ad3c23a8c5fe4122374f/foo.bar-0.2-py3-none-any.whl
/Users/jaraco/Library/Caches/pip/wheels/20/96/a7/21d787cb6f024ae978823ce36306736a68ffc7c728077edf4d/foo.bar-0.2-py3-none-any.whl
...

It does appear as if that "wheel cache" is only for locally-built wheels.

I haven't found a way to query for pip's http cache. Here's what pip gives for cache info:

 draft @ pip cache info
Package index page cache location (pip v23.3+): /Users/jaraco/Library/Caches/pip/http-v2
Package index page cache location (older pips): /Users/jaraco/Library/Caches/pip/http
Package index page cache size: 883.2 MB
Number of HTTP files: 10736
Locally built wheels location: /Users/jaraco/Library/Caches/pip/wheels
Locally built wheels size: 21.2 MB
Number of locally built wheels: 440

It's possible the http-v2 cache is managed by some library. If not, there's seemingly no API to access it. My guess, based on the fact that there are two cache locations, is that the http-v2 cache format is reasonably stable.

jaraco (Contributor) commented May 24, 2024

See #110 where I've done some work to refactor the functionality to make it easier to alter behavior such as proposed here.

hoodmane (Member Author) commented:
Yeah, thanks for the review; this is definitely a DO NOT MERGE PR. The point is that it passes the one test in the worst possible way and gives us a baseline for writing a good implementation.

version=dist.version,
file_name=wheel.url,
install_dir="site",
sha256=wheel.sha256,
Member:

So IIUC, we are fetching the installed package again to get the URL and the sha256? I am not sure that is reliable behavior. What happens if a user installed packages from non-PyPI repositories or from a URL?

Member Author:

It works if they installed from a URL (at least in principle it can be made to work) because we can get the URL from direct_url.json. The multiple-indexes problem is bigger because we don't have metadata indicating which index we got a package from. I think micropip should store the index URL, perhaps in an INDEX file.

@pradyunsg @henryiii is there something I'm missing? If someone has a pip.conf with an extra-index-url, or passes it as a CLI flag, is it possible to reconstruct which packages in the venv came from which indices? If not, maybe we could write a packaging PEP adding INDEX to package metadata.


Successfully merging this pull request may close these issues.

Micropip.freeze no work?
3 participants