Better wheel hosting solution #3049

Open
rth opened this issue Aug 31, 2022 · 30 comments

@rth
Member

rth commented Aug 31, 2022

Now that we have switched to wheels for packages, our current distribution model is reaching its limits, as it doesn't allow:

  • having multiple available versions of a package for a given Python/Emscripten version
  • making a package accessible as soon as it's built
  • letting users upload wheels from their CI (e.g. using cibuildwheel once it works).

This issue aims to discuss what an improved hosting solution could look like and evaluate several possible directions. Of course, the question remains whether the added benefits would be worth the extra maintenance effort.

From a usage perspective, I think what we want is a PyPI-like mirror, where users (us included) can upload packages with twine using an API key. Internally, packages would still be uploaded to the same S3 bucket and distributed via JsDelivr. So the system distributing packages (i.e. S3 + JsDelivr) is separate from the website of the package index. This way we can guarantee very good package availability, even if there is some maintenance downtime on the package index website.

Aside from the above-listed functionality, I think we would need,

  • the ability to upload files that are not wheels (e.g. .zip files) -> though this needs to be discussed, maybe wrapping those files as wheels would be better
  • users can create accounts to upload their packages -> to be discussed, this would add maintenance cost
  • ability to upload repodata.json ideally without using S3 credentials directly
  • need to figure out what we do with unvendored test packages, so they are still part of the dependency tree but maybe they shouldn't have a separate page

Bonus features,

  • have some way to inject extra information on the package page (e.g. extra comments about what works and what doesn't from the meta.yaml, or automatic package analysis about potentially unsupported functionality)

If we want to use an existing OSS solution, there are the following possibilities,

  • PyPI warehouse: it's fairly specific to PyPI.org, very complex, and is not intended to be used for anything else than PyPI.org
  • devpi
  • pypi-server
  • code behind piwheels
  • custom solution with only the subset of the functionality we need (still a lot of work)

I'll later add below my evaluation of these solutions.

At the same time, we should keep in mind that in the long term wasm/emscripten wheels might be supported by PyPI, so maybe it's not worth spending too much effort on this. However, even if PyPI does support them, it might not include web-specific optimization that we are able to do.

@bollwyvl
Contributor

bollwyvl commented Sep 7, 2022

Yeah, multiple versions are pretty sticky: in jupyterlite, we're currently working through the growing pains of a major version of a "special" upstream (ipywidgets) which has a javascript-level dependency (widgetsnbextension), and will need to do some weird shims.

It's not that bad, though, as we've been maintaining a "collapsed" warehouse-like file structure which doesn't have some of the features of the conda-like repodata.json, but does allow for multiple versions of multiple packages.

To that end: the eventual solution should:

  • be something that speaks an existing, well-known, efficient (as possible) package management format and syntax
  • allow for multiple sources of (multiple versions of) packages
  • track ABI-level data
  • be automatable to a great extent with existing tooling

The closest I've come to this, is, no surprise here, the conda, conda-build, and ultimately, conda-forge ecosystem.

This might work, today on anaconda.org: it will actually host wheels, and provide a pip-compatible (but probably not a warehouse-compatible) API:

https://docs.anaconda.com/anacondaorg/user-guide/tasks/work-with-packages/#using-package-managers

It offers free (as in beer), (basically) unlimited package hosting for open source, as well as private packages (though it gets a bit hairy). There is an on-going effort to provide an open source implementation of that site's capabilities at https://github.com/mamba-org/quetz, but to my knowledge, it doesn't offer wheels.

Of course... the real step there would then be to eschew the wheel ecosystem entirely, and use the conda(-forge) ecosystem more directly. The double-edged sword of the "real" repodata.json is that it works pretty well for a channel of a few thousand package * arch * python * os * version combinations, but starts to fall apart at the hundreds of thousands (conda-forge/noarch/repodata.json is 200 MB, uncompressed). Even making that 10x better would still be unreasonable for fully client-side solving.

@hoodmane
Member

hoodmane commented Sep 7, 2022

@bollwyvl I think we want to stick to wheels. Most Python packages can make wheels automatically with python -m build in the package directory. With pyodide build it's possible to build emscripten wheels for a lot of packages without any modification or with minor fixes. Our hope is to get package maintainers to build and test emscripten wheels in their CI. Would it be possible to get package maintainers to build and test .conda archives? As far as I can tell, none of the major packages do this.

@bollwyvl
Contributor

bollwyvl commented Sep 7, 2022

hope is to get package maintainers to

I salute your optimism!

possible to get package maintainers to

No, they won't do that, either, but a few thousand packages in, conda-forge has demonstrated that a community-based, distributed model with heavy automation on donated CI, and donated cold storage can achieve substantial, reproducible things at scale, and keep them up-to-date with best practices.

@rth
Member Author

rth commented Sep 7, 2022

Thanks for your summary about packaging on the jupyter side, it's helpful!

Emscripten-forge is going in the conda direction, with a planned contribution to conda-forge and I think it's great that people are exploring this idea.

On our side, we decided to go in the direction of wheels. Things have improved a lot there as well with cibuildwheel and related projects. A lot of packages are on PyPI only, and people still need the ability to install packages from there (or from some private repositories that are never going to be on conda-forge, particularly pure Python ones). About hosting, I think we would rather rely on some community solution; we have a good agreement with the JsDelivr CDN, with no limits on the bandwidth so far.

@freakboy3742

As a data point - BeeWare uses an anaconda.org repository to host iOS binary wheels, and it's been working really well. Having a similar collection available for Pyodide/emscripten wheels would be exceedingly helpful for some of BeeWare's packaging workflows.

@rth
Member Author

rth commented Apr 25, 2023

Yes, I'm aware of that possibility, and I hear good things about it. For Pyodide we currently need more than wheels (there are also various .js files), for which we probably need JsDelivr in any case. Given the weight of Anaconda in this space, I'm also a bit reluctant to host our packages there to avoid too much centralization (and Pyscript might do something in that direction eventually I guess). As for using it to host user-built wheels, certainly, we should probably mention this possibility.

So in the end I'm not sure what the outcome of this issue would be. There seems to be a growing sentiment that we don't want to manage or review a service where people can arbitrarily upload packages. We still need JsDelivr in any case for JS files. And PyPI support seems within reach in the medium term. Better support for third party hosting services is something we also need to improve in pyodide/micropip#62.

So maybe the outcome is just to recommend wheel hosting services people can use to host wheels (including anaconda.org)

@ryanking13
Member

So maybe the outcome is just to recommend wheel hosting services people can use to host wheels (including anaconda.org)

Yes, that was what I was thinking in pyodide/micropip#62. What we do is to support standard PyPI APIs (Simple API) + allow people to use alternative registries. Then people can choose any hosting solutions they want until PyPI supports hosting Emscripten wheels.
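To make "support standard PyPI APIs (Simple API)" a bit more concrete: a client that speaks the HTML flavour of the Simple API fetches a project page and reads the anchors on it. Below is a minimal sketch of that reading step using only the standard library. It is not micropip's actual code; the function name `wheels_from_simple_page` and the `example.invalid` URLs are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class _AnchorCollector(HTMLParser):
    """Collect (link text, href) pairs for every <a> element on a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def wheels_from_simple_page(page_url, page_html):
    """Return the wheel links found on one project page of a simple index.

    Hypothetical helper: hrefs are resolved against page_url, and the URL
    fragment (e.g. "sha256=...") is split off as the expected hash.
    """
    collector = _AnchorCollector()
    collector.feed(page_html)
    collector.close()
    wheels = []
    for filename, href in collector.links:
        if filename.endswith(".whl"):
            url, fragment = urldefrag(urljoin(page_url, href))
            wheels.append(
                {"filename": filename, "url": url, "hash": fragment or None}
            )
    return wheels
```

Because the page is just static HTML, any host that can serve such files (with suitable CORS headers for browser use) could act as an alternative registry.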

@freakboy3742

@rth Totally understood that you might not want to visibly use an Anaconda service for this; I mention it only because (a) BeeWare is using it for other purposes, and (b) if the constraint was having access to a free simple index hosting option, the option is there.

In terms of my own personal wishlist - I acknowledge that there's a need for other non-wheel files to be hosted, but having the existing, officially published binary wheels available as a simple index would (AFAICT) do everything I'm currently looking for. Maintaining an unofficial mirror of the published Pyodide wheels is one of the options on the table, but I'd vastly prefer to avoid it if I can. Longer term, having the ability for other users to upload wheels would also be great (but also moves into the territory of "build your own PyPI"); but in the short term, having pip-compatible access to the wheels that already exist would be more than sufficient.

@hoodmane
Member

Since we're already generating a simple index I agree that we should deploy it somewhere so you can use it.

@hoodmane
Member

hoodmane commented Apr 26, 2023

Longer term, having the ability for other users to upload wheels would also be great (but also moves into the territory of "build your own PyPI")

From my discussions with pypa members at PyCon, I am optimistic that we will most likely be able to upload Pyodide wheels to pypi a year from now.

@rth
Member Author

rth commented Apr 26, 2023

a simple index I agree that we should deploy it somewhere so you can use it.

The simple index is really simple BTW; nothing really prevents us from exposing a simple index for the files we already build. Unfortunately JsDelivr will not allow us to distribute .html files for security reasons. But we can probably allow access to those via a different subdomain if it's useful. The other alternative is to put up a very simple service that takes a repodata.json and translates it into a simple API.
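As a rough illustration of the "translate a repodata.json into a simple API" idea, here is a minimal sketch that turns a repodata-like dictionary into static PEP 503-style pages. It assumes each entry under `packages` carries `name`, `file_name` and `sha256` keys; the function names and the `example.invalid` base URL are hypothetical, and this is not the generator Pyodide actually uses.

```python
import html
import re


def normalize(name: str) -> str:
    """Normalize a project name the way PEP 503 describes."""
    return re.sub(r"[-_.]+", "-", name).lower()


def simple_index_pages(repodata: dict, base_url: str) -> dict:
    """Map relative page paths to HTML for a static simple index.

    Assumption: repodata["packages"] maps a key to a dict holding
    "name", "file_name" and "sha256", with one file per package.
    """
    base = base_url.rstrip("/")
    pages = {}
    for entry in repodata["packages"].values():
        file_name = entry["file_name"]
        if not file_name.endswith(".whl"):
            continue  # skip archives that are not wheels
        project = normalize(entry["name"])
        # The hash goes into the URL fragment, as on PyPI's simple pages.
        href = f"{base}/{file_name}#sha256={entry['sha256']}"
        anchor = f'<a href="{html.escape(href)}">{html.escape(file_name)}</a>'
        pages[f"{project}/index.html"] = (
            f"<!DOCTYPE html><html><body>{anchor}</body></html>"
        )
    # The root page lists every project that has at least one wheel.
    projects = sorted(path.split("/")[0] for path in pages)
    root = "".join(f'<a href="{p}/">{p}</a>' for p in projects)
    pages["index.html"] = f"<!DOCTYPE html><html><body>{root}</body></html>"
    return pages
```

The anchors point at absolute file URLs, so the HTML pages could live on a separate subdomain while the wheels themselves stay on the CDN.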

@hoodmane
Member

We already are generating a simple index we just don't serve it.

@rth
Member Author

rth commented Apr 28, 2023

@hoodmane Do you know whom we could ask if it's possible to add CORS headers to Anaconda.org wheel hosting solution? We tried it with @lesteve but it's missing CORS currently.

@hoodmane
Member

hoodmane commented Apr 28, 2023

Can you paste the URL that is missing the headers here and I'll ask someone to look at it?

@rth
Member Author

rth commented Apr 28, 2023

For instance if one does,

pip install httpie
http --follow GET https://anaconda.org/beeware/regex/2021.8.28/download/regex-2021.8.28-0-cp310-cp310-ios_12_0.whl

there are no CORS headers in response. As opposed to PyPI,

http --follow GET https://files.pythonhosted.org/packages/bb/4f/65c14619af6e6640d8eac947a8322582207beed27fb29fbb927653a51b38/regex-2023.3.23-cp310-cp310-musllinux_1_1_aarch64.whl

which yields,

Access-Control-Allow-Headers: Range
Access-Control-Allow-Methods: GET, OPTIONS
Access-Control-Allow-Origin: *

I think just the last line would be sufficient to set in the return headers.

For anaconda.org, since it returns a 302 redirect to https://binstar-cio-packages-prod.s3.amazonaws.com I think both endpoints would need to return the Access-Control-Allow-Origin header.
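The manual check above can be written down as a small rule: for a simple cross-origin GET without credentials, every response in the redirect chain needs an acceptable Access-Control-Allow-Origin value. The sketch below applies that rule to header dictionaries collected by hand (for instance from the httpie output). The function names are hypothetical, and a browser's real CORS check covers more cases (credentials, preflight requests) than this simplification.

```python
def cors_allows(headers: dict, origin: str) -> bool:
    """True if these response headers let `origin` read the response.

    Simplified: only Access-Control-Allow-Origin is considered, and only
    the no-credentials case where "*" is acceptable.
    """
    lowered = {key.lower(): value for key, value in headers.items()}
    allowed = lowered.get("access-control-allow-origin")
    return allowed is not None and allowed.strip() in ("*", origin)


def redirect_chain_allows(hops: list, origin: str) -> bool:
    """Apply the check to each response of a redirect chain, in order."""
    return all(cors_allows(headers, origin) for headers in hops)
```

With this rule, a 302 response without the header fails the chain even if the final storage endpoint sends `Access-Control-Allow-Origin: *`.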

@hoodmane
Member

I think pip needs range because of lazy wheel. Not a bad thing to add anyways.

@rth
Member Author

rth commented Apr 28, 2023

For anaconda.org, since it returns a 302 redirect

Though looking at the MDN docs it's not clear if CORS in combination with a redirect to a different domain is even allowed.

@lesteve
Contributor

lesteve commented May 23, 2023

@hoodmane Do you know whom we could ask if it's possible to add CORS headers to Anaconda.org wheel hosting solution? We tried it with @lesteve but it's missing CORS currently.

@hoodmane did you hear back on this by any chance? In scikit-learn we are now building a Pyodide wheel in the CI scikit-learn/scikit-learn#26374. It would be very nice to upload it to https://anaconda.org/scipy-wheels-nightly as we (and other scientific Python packages) do for other development wheels and have it installable in a notebook similar to something like this:

%pip install https://anaconda.org/scipy-wheels-nightly/scikit-learn/1.3.dev0/download/scikit_learn-1.3.dev0-cp311-cp311-emscripten_3_1_32_wasm32.whl 
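The file name in that URL already encodes which Python and Emscripten versions the wheel targets, following the wheel naming convention `{distribution}-{version}(-{build})?-{python}-{abi}-{platform}.whl`. As a minimal sketch (the helper names are hypothetical, and the regular expression assumes the `emscripten_X_Y_Z_wasm32` tag shape seen in this thread), the tags can be pulled apart like this:

```python
import re
from typing import NamedTuple, Optional, Tuple


class WheelTags(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]
    python: str
    abi: str
    platform: str


def parse_wheel_filename(filename: str) -> WheelTags:
    """Split a wheel file name into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        dist, version, python, abi, platform = parts
        build = None
    elif len(parts) == 6:
        dist, version, build, python, abi, platform = parts
    else:
        raise ValueError(f"unexpected wheel file name: {filename!r}")
    return WheelTags(dist, version, build, python, abi, platform)


def emscripten_version(platform_tag: str) -> Optional[Tuple[int, ...]]:
    """Return (major, minor, patch) for an Emscripten platform tag, else None."""
    match = re.fullmatch(r"emscripten_(\d+)_(\d+)_(\d+)_wasm32", platform_tag)
    return tuple(int(part) for part in match.groups()) if match else None
```

For the scikit-learn wheel above this yields the `cp311` Python tag and Emscripten version (3, 1, 32), which is the pair that has to match the running Pyodide build.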

@rth
Member Author

rth commented Jun 22, 2023

A gentle ping @hoodmane on the above :) Or just please give us the contact of the person whom we could ask this. It would really be nice to upload dev wheels for scientific packages there.

@hoodmane
Member

I made the request and I think they added them.

@hoodmane
Member

hoodmane commented Jun 23, 2023

Never mind, it has not been done; I tested wrong.

@mahmoud

mahmoud commented Nov 29, 2023

We already are generating a simple index we just don't serve it.

Hey there! Just checking in re: whether the simple index found hosting somewhere. Thanks!

@lesteve
Contributor

lesteve commented May 31, 2024

Hi @fpliger, would it be possible to have an idea whether having CORS headers for anaconda.org is still kind of moving forward? This is a follow-up of pyodide/micropip#101 (comment) but using this issue to try to keep the conversation in a centralized place.

Of course, I completely understand there might be technical complexities + political challenges and that it is probably not at the top of Anaconda priorities.

It still would be great to have a sense whether this is something that may happen at one point, although I completely understand that it is hard to give a precise timeline.

In my main use case for scikit-learn (that I tried to sum up in pyodide/micropip#101 (comment)) if I get some kind of signal that this is going to happen in something like say 6 months, I may actually look for other work-arounds in the mean-time. The first that comes to mind is to have Pyodide wheels in a github repo and use jsdelivr CDN to be able to micropip.install it in a JupyterLite notebook.
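For the GitHub repo plus jsdelivr work-around, the only moving part is the URL. jsDelivr serves files from GitHub repositories under `https://cdn.jsdelivr.net/gh/<owner>/<repo>@<ref>/<path>`, so a minimal sketch of building that URL looks like the following. The helper name and the `example-org/wheels` repository are made up for illustration.

```python
from urllib.parse import quote


def jsdelivr_gh_url(owner: str, repo: str, ref: str, path: str) -> str:
    """Build the jsDelivr CDN URL for a file in a GitHub repository.

    `ref` is a tag, branch name or commit hash; `path` is the file's path
    inside the repository (slashes are kept as path separators).
    """
    return (
        "https://cdn.jsdelivr.net/gh/"
        f"{quote(owner)}/{quote(repo)}@{quote(ref)}/{quote(path)}"
    )


url = jsdelivr_gh_url(
    "example-org",  # hypothetical GitHub organization
    "wheels",  # hypothetical repository holding the built wheels
    "main",
    "dist/scikit_learn-1.3.dev0-cp311-cp311-emscripten_3_1_32_wasm32.whl",
)
```

In a JupyterLite notebook, the resulting URL could then be passed to `micropip.install`, which accepts wheel URLs. Pinning `ref` to a tag or commit rather than a branch should make the served file stable.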

If the discussion is somewhat easier to have outside of a public issue tracker, I'd love to be part of it, and I guess others like @rth may be as well.

@hoodmane
Member

I think we're getting quite close to being able to upload wheels to pypi. I'd appreciate it if the pypa people can share their opinions on what else needs to be done first @henryiii @di. @henryiii suggested that in his opinion the most important step was inclusion in cibuildwheel which we finally merged just the other day:
pypa/cibuildwheel#1456

@agriyakhetarpal can you tell us which scientific computing packages already test against us in CI and would be ready to upload a wheel to pypi, and which ones are on your list to add?

@agriyakhetarpal
Member

agriyakhetarpal commented May 31, 2024

Thanks for the ping, @hoodmane! Here's a list that I keep maintained, by no means exhaustive – I have yet to do some of these, and some of them (like awkward by @henryiii and scikit-learn by @lesteve) have already been implemented by others:

Note

This table is also mirrored at Quansight-Labs/czi-scientific-python-mgmt#18

| Package name | Out-of-tree WASM builds | Anaconda.org scheduled uploads |
|---|---|---|
| NumPy | numpy/numpy#25894, numpy/numpy#26564, numpy/numpy#26570 | numpy/numpy#26134, numpy/numpy#27353 |
| PyWavelets | PyWavelets/pywt#701, PyWavelets/pywt#744 | PyWavelets/pywt#710 |
| pandas | pandas-dev/pandas#57896 | pandas-dev/pandas#58647 |
| awkward and awkward-cpp | scikit-hep/awkward#2062 (not by me) | Planned |
| scikit-learn | ✅ (improvement via scikit-learn/scikit-learn#29791 in progress) | Planned |
| scikit-image | ✅ (setup: scikit-image/scikit-image#7350, improvement: scikit-image/scikit-image#7525) | In progress at scikit-image/scikit-image#7440 |
| statsmodels | ✅ (setup: statsmodels/statsmodels#9270, improvement: statsmodels/statsmodels#9343) | MacPython/statsmodels-wheels#161 |
| Zarr | zarr-developers/zarr-python#1903, needs #4817 to be released | Planned |
| numcodecs | zarr-developers/numcodecs#529, ready for review | Planned |
| SciPy | Planned | Planned |
| SymPy | sympy/sympy#27183 | sympy/sympy#27186 (implemented by a maintainer), python-flint (dependency of SymPy) WASM builds left – discussion underway in flintlib/python-flint#234 |
| Matplotlib | matplotlib/matplotlib#27870, being tracked in matplotlib/matplotlib#29093 (not implemented by me) | Planned in matplotlib/matplotlib#29093 |
| h5py and libhdf5 | h5py/h5py#2397 | Planned |
| PyTables | Planned | Planned |

Based on https://anaconda.org/scientific-python-nightly-wheels/, I haven't looked into Xarray or Uproot so far, but happy to do so if needed.

Also cc: @rgommers. I have tested NumPy's WASM build with cibuildwheel (numpy/numpy#26570 ✅), PyWavelets (PyWavelets/pywt#744 ✅), pandas has one PR doing two things (pandas-dev/pandas#58647 ✅), and Scikit-HEP packages like boost-histogram have been completed as well (scikit-hep/boost-histogram#935 ✅, scikit-hep/boost-histogram#938 ✅).

@lesteve
Contributor

lesteve commented May 31, 2024

(side-comment: for scikit-learn there are no anaconda.org scheduled uploads because I was kind of waiting for the anaconda.org CORS headers situation to clarify. Adding scheduled uploads is probably doable in finite time though)

@agriyakhetarpal
Member

agriyakhetarpal commented May 31, 2024

Ah, I looked at pyodide/micropip#80 and I thought they were being uploaded – updated the entry in the table to "Planned". I'm happy to write a PR for that if you'd like, or step back if you wish to do this yourself!

@henryiii
Contributor

henryiii commented May 31, 2024

iminuit also reports TypeError: getWasmTableEntry(...) is not a function: https://github.com/scikit-hep/iminuit/actions/runs/9321728796/job/25661476682?pr=995

Are these out-of-tree builds with Pyodide 0.26? So far I'm 0 for 2 on 0.26 builds. Both packages are scikit-build-core based, use pybind11, and have exceptions enabled. At least boost-histogram was fine with <0.26 out-of-tree. I can't find the awkward build mentioned; the Awkward out-of-tree build is still 3.11 in the docs.

Maybe #2964 is related?

@agriyakhetarpal
Member

agriyakhetarpal commented May 31, 2024

Are these out-of-tree builds with Pyodide 0.26?

Yes – they are for NumPy, PyWavelets, scikit-image, Zarr, numcodecs, and pandas (available in the PRs for the latter four, not yet merged). I see that awkward is yet to be upgraded to 0.26.0 and it doesn't have WASM nightlies (updated the table), but I notice that awkward-cpp in-tree is failing, because of boost-histogram (in #4816).

@lesteve
Contributor

lesteve commented Jun 1, 2024

I think we're getting quite close to being able to upload wheels to pypi.

Great to hear that! Note that even if uploading Pyodide wheels to PyPI becomes possible, anaconda.org would still be a "nice to have" for nightly wheels. To add more details, in an ideal world the scikit-learn dev website would use the nightly wheel for its interactive examples, which is useful for examples using new features in the development branch. We would like to use anaconda.org because that's where all the other nightly wheels are, and PyPI does not seem a good fit for this.
