Index pip-installed packages #109

Draft: wants to merge 1 commit into main
124 changes: 100 additions & 24 deletions micropip/_commands/freeze.py
@@ -1,12 +1,110 @@
import importlib.metadata
import json
import sys
from copy import deepcopy
from typing import Any
from importlib.metadata import Distribution
from typing import TypedDict

from packaging.utils import canonicalize_name
from packaging.version import Version

from .._compat import REPODATA_INFO, REPODATA_PACKAGES
from .._utils import fix_package_dependencies
from ..package_index import query_package

IN_VENV = sys.prefix != sys.base_prefix


class PkgEntry(TypedDict):
    name: str
    version: str
    file_name: str
    install_dir: str
    sha256: str | None
    imports: list[str]
    depends: list[str]


def get_pkg_entry_micropip(dist: Distribution, url: str) -> PkgEntry:
    name = dist.name
    version = dist.version
    sha256 = dist.read_text("PYODIDE_SHA256")
    assert sha256
    imports = (dist.read_text("top_level.txt") or "").split()
    requires = dist.read_text("PYODIDE_REQUIRES")
    if not requires:
        fix_package_dependencies(name)
        requires = dist.read_text("PYODIDE_REQUIRES")
    if requires:
        depends = json.loads(requires)
    else:
        depends = []

    return dict(
        name=name,
        version=version,
        file_name=url,
        install_dir="site",
        sha256=sha256,
        imports=imports,
        depends=depends,
    )


async def get_pkg_entry_pip(dist: Distribution) -> PkgEntry | None:
    resp = await query_package(dist.name)
    ver = resp.releases.get(Version(dist.version), None)
    if ver is None:
        return None
    wheel = next(ver)
Contributor:
This approach seems risky. What if the package doesn't have a wheel, but only an sdist? What if the sdist is chosen over the wheel? What if the package was installed from a local checkout that advertises the same version (but is different)? What if the package was installed from a local checkout that advertises a local version (that's not present in PyPI)?

Member Author:
Well, I agree with these concerns. I agree that there needs to be some sort of architecture to our packaging system; currently there is no architecture, merely an accumulation of behaviors. I will answer your questions in defense of this approach because I think the answers contain interesting information about our domain logic, not because this approach is any good.

What if the package doesn't have a wheel, but only an sdist?

This is a great question. It will fail. If the wheel is platformed, then the pip build frontend cannot build it; the only way to build a Pyodide-platformed wheel is with our own pyodide build frontend. I looked into patching pip so that it uses our frontend, but I am pretty sure it is not a reasonable task without support from the pip team. For a non-platformed py3-none-any package, though, the author can just upload an sdist and pip will download and install it. The problem is that this whole lock file idea is predicated on the idea that the packages came from somewhere, and that we didn't do a local build. If we find packages that were installed from an sdist, freeze() should probably raise an error complaining specifically about this.

What if the package was installed from a local checkout that advertises the same version (but is different)?

Same issue: we can only install from a URL. So yes, we would indeed freeze a different wheel than was actually installed; we should probably check for this. Note that pip freeze also can't save you from this situation, although in normal Python it is fine if pip itself installed the package from an sdist. But the fact that if you do

pip freeze > requirements.txt
python -m venv .venv-new
.venv-new/bin/pip install -r requirements.txt

then .venv-new can end up being different from the current environment indicates that pip freeze does not generate a true lock but something weaker. Our aspiration is to generate a lock, and we should error with an explanation if we cannot generate one.

    await wheel.download({})
Contributor:
I see this method doesn't yet exist. This call could be very expensive and would need to manage the lifecycle of the downloaded content. It may also need to re-implement things that pip does like caching or re-using pip's cache.

Member Author:
As I said, this PR is intended as a worst possible implementation to get the discussion moving forward. I think the situation is that we only need to download the wheel if the index does not implement PEP 658. Since PyPI finally does implement PEP 658 we should do that if possible.

As far as I know .dist-info is too weak to lock against other indices because the index is not written anywhere into the .dist-info directory. Maybe we could have a packaging PEP asking for an additional INDEX file that reports the index the file was installed from if any? It would be a bit of a complement to direct_url.json which tells us what to do when we didn't install from an index but rather a url. There's the third case where we installed from a file system path, I think that goes into direct_url.json too? But we can't lock it in any case.

caching or re-using pip's cache

Yes, that's a great idea. Does pip expose any of its caching logic? If not, is it reasonably stable?
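The PEP 658 route mentioned a few lines up could look roughly like this: a supporting index serves the wheel's METADATA document at the wheel URL plus a ".metadata" suffix, so Requires-Dist can be read without downloading the wheel. This is a sketch; real code would check the index advertises the metadata file and verify its hash:

```python
from email.parser import Parser
from urllib.request import urlopen


def parse_requires_dist(metadata_text: str) -> list[str]:
    """Extract Requires-Dist entries from a core-metadata (METADATA) document."""
    msg = Parser().parsestr(metadata_text)
    return msg.get_all("Requires-Dist") or []


def fetch_requires_dist(wheel_url: str) -> list[str]:
    # PEP 658: the metadata sidecar lives at "<wheel-url>.metadata" on the index.
    with urlopen(wheel_url + ".metadata") as resp:
        return parse_requires_dist(resp.read().decode())
```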

Contributor:
Pip does expose a list command that will list all matching packages in the cache:

 ~ @ pip cache list foo.bar-0.2 --format abspath
/Users/jaraco/Library/Caches/pip/wheels/0b/ec/8b/f669ae78d94a691e6c63872909196857158b3c20f045058e7b/foo.bar-0.2-py3-none-any.whl
/Users/jaraco/Library/Caches/pip/wheels/10/f7/18/a4a94460da8bf74f841d7df4f10984b730907bea49783615a4/foo.bar-0.2-py3-none-any.whl
/Users/jaraco/Library/Caches/pip/wheels/11/9f/15/3a750be9586c80770d8355ac4e9dc3c58237b1d994f36a2a21/foo.bar-0.2-py3-none-any.whl
/Users/jaraco/Library/Caches/pip/wheels/1b/9d/16/254558b9de54233baf2ef9a35eac16ad3c23a8c5fe4122374f/foo.bar-0.2-py3-none-any.whl
/Users/jaraco/Library/Caches/pip/wheels/20/96/a7/21d787cb6f024ae978823ce36306736a68ffc7c728077edf4d/foo.bar-0.2-py3-none-any.whl
...

It does appear as if that "wheel cache" is only for locally-built wheels.

I haven't found a way to query for pip's http cache. Here's what pip gives for cache info:

 draft @ pip cache info
Package index page cache location (pip v23.3+): /Users/jaraco/Library/Caches/pip/http-v2
Package index page cache location (older pips): /Users/jaraco/Library/Caches/pip/http
Package index page cache size: 883.2 MB
Number of HTTP files: 10736
Locally built wheels location: /Users/jaraco/Library/Caches/pip/wheels
Locally built wheels size: 21.2 MB
Number of locally built wheels: 440

It's possible the http-v2 cache is managed using some library. If not, there's seemingly no API to access it. My guess, based on the fact that there are two cache locations, is that the http-v2 cache format is reasonably stable.
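Building on the pip cache list output shown above, reusing pip's wheel cache would have to go through the CLI, since pip exposes no public Python API. A rough sketch (cached_wheels and parse_cache_list are hypothetical names):

```python
import subprocess
import sys


def parse_cache_list(output: str) -> list[str]:
    """Parse ``pip cache list --format abspath`` output into a list of wheel paths."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def cached_wheels(pattern: str) -> list[str]:
    # Shell out to pip; only locally-built wheels show up here, per the discussion above.
    proc = subprocess.run(
        [sys.executable, "-m", "pip", "cache", "list", pattern, "--format", "abspath"],
        capture_output=True, text=True, check=True,
    )
    return parse_cache_list(proc.stdout)
```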

    requires = [req.name for req in wheel.requires(set())]
    return dict(
        name=dist.name,
        version=dist.version,
        file_name=wheel.url,
        install_dir="site",
        sha256=wheel.sha256,
Member:
So IIUC, we are fetching the installed package again to get the URL and the sha256? I am not sure that this is reliable behavior. What happens if a user installed packages from non-PyPI repositories or from a URL?

Member Author:
It works if they installed from a URL (at least in principle it can be made to work) because we can get the URL from direct_url.json. The multiple-indexes problem is bigger because we don't have metadata indicating which index we got a package from. I think micropip should store the index URL, maybe into an INDEX file.

@pradyunsg @henryiii is there something I'm missing? If someone has a pip.conf with an extra-index-url, or passes this as a CLI option, is it possible to reconstruct which packages in the venv came from which indices? If not, maybe we could make a packaging PEP adding INDEX to package metadata.

        imports=[],
Contributor:
It looks like imports is stubbed here. It should probably query top_level.txt, same as micropip.

Member Author:
It's a bit unfortunate that top_level.txt is setuptools-only. As I understand it, the other build backends don't have anything similar. Do you know why?

Member:
I think top_level.txt is not very good for querying import names, and probably we should stop using it in micropip.

Ref: pyodide/pyodide#3006 (review)

Contributor:
Several good examples in that thread. Also, namespace packages aren't reflected accurately either:

 @ pip-run jaraco.functools -- -c "import importlib.metadata as md; print(md.distribution('jaraco.functools').read_text('top_level.txt'))"
jaraco

(jaraco.functools is one of many packages that exposes the top-level jaraco package; the unique offering of jaraco.functools is the python package of the same name)

        depends=requires,
    )


async def freeze2() -> str:
    """Produce a json string which can be used as the contents of the
    ``repodata.json`` lock file.

    If you later load Pyodide with this lock file, you can use
    :js:func:`pyodide.loadPackage` to load packages that were loaded with
    :py:mod:`micropip` this time. Loading packages with
    :js:func:`~pyodide.loadPackage` is much faster and you will always get
    consistent versions of all your dependencies.

    You can use your custom lock file by passing an appropriate url to the
    ``lockFileURL`` of :js:func:`~globalThis.loadPyodide`.
    """
hoodmane marked this conversation as resolved.
    packages = deepcopy(REPODATA_PACKAGES)
    for dist in importlib.metadata.distributions():
        name = dist.name
        url = dist.read_text("PYODIDE_URL")
        if url:
            pkg_entry = get_pkg_entry_micropip(dist, url)
        elif IN_VENV:
Contributor:
I'm unsure why IN_VENV is relevant. Packages can be installed without virtualenvs (e.g. on PYTHONPATH). And virtualenvs can expose site-packages.

What is the motivation for varying on IN_VENV?

Member Author:
Yes, it is wrong to branch on IN_VENV; I meant to remove it but I guess I failed to before I put up this PR. I think the correct thing to check is actually dist.read_text("INSTALLER"), for whether the installer was pip or micropip.

Packages can be installed without virtualenvs (e.g. on PYTHONPATH).

Right... some day it would be good to test our system against this; it is a good feature. The Pyodide Python CLI will correctly pick up PYTHONPATH. Though I'm also not sure whether we want to pick up the current PYTHONPATH when locking; it's not 100% clear to me what the right logic is there.
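The INSTALLER-based dispatch suggested above could be sketched like this: installers record their own name in *.dist-info/INSTALLER (pip writes "pip"), so freeze() could branch on that rather than on IN_VENV. installer_of is a hypothetical helper:

```python
from importlib.metadata import Distribution


def installer_of(dist: Distribution) -> str:
    """Name of the tool that installed ``dist`` ("pip", "micropip", ...), or ""."""
    return (dist.read_text("INSTALLER") or "").strip().lower()
```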

            res = await get_pkg_entry_pip(dist)
            if not res:
                continue
            pkg_entry = res
        else:
            continue

        packages[canonicalize_name(name)] = pkg_entry

    # Sort
    packages = dict(sorted(packages.items()))
    package_data = {
        "info": REPODATA_INFO,
        "packages": packages,
    }
    return json.dumps(package_data)
hoodmane marked this conversation as resolved.


def freeze() -> str:
@@ -24,32 +122,10 @@ def freeze() -> str:
    packages = deepcopy(REPODATA_PACKAGES)
    for dist in importlib.metadata.distributions():
        name = dist.name
        version = dist.version
        url = dist.read_text("PYODIDE_URL")
        if url is None:
            continue

        sha256 = dist.read_text("PYODIDE_SHA256")
        assert sha256
        imports = (dist.read_text("top_level.txt") or "").split()
        requires = dist.read_text("PYODIDE_REQUIRES")
        if not requires:
            fix_package_dependencies(name)
            requires = dist.read_text("PYODIDE_REQUIRES")
        if requires:
            depends = json.loads(requires)
        else:
            depends = []

        pkg_entry: dict[str, Any] = dict(
            name=name,
            version=version,
            file_name=url,
            install_dir="site",
            sha256=sha256,
            imports=imports,
            depends=depends,
        )
        pkg_entry = get_pkg_entry_micropip(dist, url)
        packages[canonicalize_name(name)] = pkg_entry

# Sort