Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: out-of-tree Pyodide builds in CI for pandas #57891

Closed
1 of 3 tasks
agriyakhetarpal opened this issue Mar 18, 2024 · 11 comments · Fixed by #57896
Closed
1 of 3 tasks

ENH: out-of-tree Pyodide builds in CI for pandas #57891

agriyakhetarpal opened this issue Mar 18, 2024 · 11 comments · Fixed by #57896
Labels
Build Library building on various platforms Enhancement

Comments

@agriyakhetarpal
Copy link
Contributor

agriyakhetarpal commented Mar 18, 2024

Feature Type

  • Adding new functionality to pandas
  • Changing existing functionality in pandas
  • Removing existing functionality in pandas

Problem Description

This feature proposes adding out-of-tree Pyodide builds for pandas on its own CI. As opposed to in-tree builds in the Pyodide packages repository (see the recipe), this will enable building WASM wheels (which are 32-bit)1 via the Emscripten toolchain on GitHub Actions.

In my most recent work assignment, I am working on improving the interoperability for the Scientific Python ecosystem of packages with Pyodide and with each other (Quansight-Labs/czi-scientific-python-mgmt#18), which shall culminate with efforts towards bringing interactive documentation for these packages where they can then be run in JupyterLite notebooks, through nightly builds and wheels for these packages pushed to PyPI-like indices on Anaconda.org (Quansight-Labs/czi-scientific-python-mgmt#19) at and during a later phase during the project.

Feature Description

  1. A CI pipeline via GitHub Actions that builds and tests pandas in an activated virtual environment created and handled by Pyodide.
  2. Fixing up the tests wherever applicable, conforming to limitations that come from the Emscripten toolchain (current limitations include the lack of threading support and for running subprocesses, limited file system access, and others).

Alternative Solutions

N/A

Additional Context

I notice that the build and the tests are going to be split into separate packages in #53007, which can be handled (I assume that the imminent pandas-tests package will be a pure Python one, which does not need compilation for Pyodide – they are supported by default).

cc @lithomas1 for any additional comments here, as required. Thomas and I have been in conversation about this just about a fortnight ago, and we had agreed that reducing the size of the wheels would be helpful for the WebAssembly rendition of pandas as well – it is particularly useful where reducing bandwidth usage is in the question. I am happy to follow along on further developments on that regard.

Footnotes

  1. It is to be noted that DEPS: drop 32-bit support #15889 removed 32-bit support by removing the wheels, but if I understand correctly, pandas continues to be supported when building from source on 32-bit platforms.

@agriyakhetarpal agriyakhetarpal added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 18, 2024
@agriyakhetarpal
Copy link
Contributor Author

I have experimented with this in a branch on my fork, where I have managed to bring the test suite down to just 31 failures by skipping certain tests and fixing a few other ones – I can open a draft PR for this for the purposes of visibility as of now, and work towards getting down to zero failures with the help and support of the maintainers and core developers.

@agriyakhetarpal
Copy link
Contributor Author

xref some other discussions about Pyodide as follows (some of them coming from the PyArrow requirement that revolves around PDEP-10):

  1. FEEDBACK: PyArrow as a required dependency and PyArrow backed strings #54466
  2. Make pyarrow a required dependency #52509
  3. PDEP-10: Add pyarrow as a required dependency #52711
  4. DOC: Add an interactive shell powered by JupyterLite to the website #47428

The issue at hand with PyArrow as a required dependency is the increased size of the download, which brings negative implications for using pandas in a JupyterLite-enabled kernel/notebook – I haven't followed up on the NumPy string DType improvements that have come up recently, but I would happy if someone could fill me in on the latest developments :)

@lithomas1 lithomas1 added Build Library building on various platforms and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 18, 2024
@phofl
Copy link
Member

phofl commented Mar 18, 2024

I am not super optimistic about having to debug pyodide stuff in my PRs, I am really hesitant on adding official support for the platform

@agriyakhetarpal
Copy link
Contributor Author

Hi, @phofl – I totally understand that. It does come with its concerns related to maintenance; if patches might be required to disable certain functionality and debugging them is indeed painful. I would be happy to take on that sort of maintenance if needed, but realistically, yes, it won't be possible to do so for me or anyone else every time there is a failure or if any critical changes are required in a jiffy. I will keep my PR marked as draft for now, but I hope that the team changes their mind :)

That being said, if it is ever enabled it will still be regarded as an "experimental" platform though, and it will stay that way for a long while until things in the Pyodide ecosystem become stable and we have a v1 release. But I suppose it also depends upon the popularity of the platform continues to evolve in the coming years. Here's what the scikit-learn developers think about that if you wish to have a read: scikit-learn/scikit-learn#23727

@mroeschke
Copy link
Member

How far is Numpy's progress on this front? I would also be hesitant with officially supporting this too until pandas' required dependencies have WASM support (at least CI testing)

@lithomas1
Copy link
Member

numpy has a workflow testing this already
https://github.com/numpy/numpy/blob/main/.github/workflows/emscripten.yml.

@agriyakhetarpal
Copy link
Contributor Author

agriyakhetarpal commented Mar 19, 2024

Thank you, @lithomas1. Yes, @mroeschke, NumPy is currently testing against Pyodide for all PRs with the link to the workflow as shared above, restarting with this PR: numpy/numpy#25894 – thanks to @rgommers for his help. It did require a bit of back and forth with the migration to the new build system, but that's not really required with pandas and with my limited experience I have seen that Emscripten handles Cython code well enough. A lot of the tests for NumPy were being skipped through a WASM platform check and a pytest decorator/marker already before the PR came to fruition, which would ideally be the case here too (that's what has been done in my PR wherever applicable).

@rgommers
Copy link
Contributor

A bit of context:

  • Both NumPy and scikit-learn have had a CI job for running with Pyodide for quite a while now (maybe 1.5 years, sklearn maybe longer).
    • NumPy only disabled it during the transition from setuptools to Meson, to not have to think about it for a while. Other than that it's been quite unproblematic.
  • Adding a CI job is not a commitment, so I wouldn't call it "officially supported". Putting wheels up on PyPI for a new platform is a commitment, because once they're there and users rely on them, you can't easily stop producing them. However, PyPI does not allow wasm32 wheels, so that is not what is proposed here.
    • For this CI job, the numpy (and pyarrow and other deps) to build/test against would come from Pyodide itself.
  • Why is this a useful exercise? At the moment, to ensure the Pandas test suite passes on Pyodide. Later, to also enable interactive docstring examples with little extra work, through support in the pydata-sphinx-theme and jupyterlite-sphinx.
  • If you don't want to add the CI job before other projects have more experience or before the interactive docs are fully ready, then just merging the test suite improvements is also a useful thing to do.

Since @agriyakhetarpal is ready to do the work, including changes/improvements after initial merge, it seems quite low-risk to merge the job - and if it turns out to be a burden, then you can simply decide to disable the job again.

@lithomas1
Copy link
Member

I am not super optimistic about having to debug pyodide stuff in my PRs, I am really hesitant on adding official support for the platform

That's a fair point.

Is there an official Docker image for pyodide? I personally would be comfortable debugging if this was available, not sure about others, though.

@mroeschke
Copy link
Member

Since other pydata libraries have been testing it, I would be OK adding it to the CI but, if it comes to it, allow it to fail without blocking a merge

@agriyakhetarpal
Copy link
Contributor Author

Is there an official Docker image for pyodide? I personally would be comfortable debugging if this was available, not sure about others, though.

@lithomas1 there is a Pyodide Docker image, yes: https://pyodide.org/en/stable/development/building-from-sources.html#using-docker. I don't use it personally and I prefer debugging on GitHub Actions because they don't have an aarch64 image yet (pyodide/pyodide#1386), and emulation is slow up to multiple magnitudes for me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Build Library building on various platforms Enhancement
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants