Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore(Python): Centralize Python dependencies in pyproject.toml #9535

Closed
wants to merge 33 commits into from

Conversation

zundertj
Copy link
Collaborator

@zundertj zundertj commented Jun 24, 2023

I have noticed several inconsistencies between pyproject.toml and the various requirements files:

  1. matplotlib was in requirements.txt, but not in pyproject.toml EDIT: this was pydantic, not matplotlib
  2. in the requirements file we capped the connectorx version due to a bug in the latest connectorx version, but not in pyproject.toml. I.e., we know we should not use the latest version, but we don't tell our users who do pip install polars[all] and hit the bug we have already identified.
  3. the minimum version of deltalake was inconsistent.

These, and possibly more, point out that having the same information in multiple places will inevitably lead to discrepancies over time.

I started with moving everything to pyproject.toml under the relevant tags. Then, I went into the rabbit hole of pip install + maturin, coming to realise that we cannot install dependencies from pyproject.toml using pip cli without avoiding the Polars build. The polars build uses the slow maturin build command. This seems a known limitation of pip, we cannot modify this behaviour. So I decided to write a bit of code to pull out the dependencies of pyproject.toml, resolve the polars[ ] tags, and pass to pip install.

Pros:

  • no longer inconsistencies between developers and users
  • all dependencies in one place

Cons:

  • some code we have to maintain.
  • Python < 3.11 we need tomlli dependency (for developers only!)

Haven't done docs yet, open to thoughts to see if this is a good way forward.

Important to note that even as a developer, you continue to interact with make, so it is all behind the scenes.

@zundertj zundertj marked this pull request as draft June 24, 2023 10:48
@github-actions github-actions bot added the chore Maintenance work that does not impact the user label Jun 24, 2023
@stinodego
Copy link
Contributor

stinodego commented Jun 24, 2023

I can see the value in this, though I'm not really sure this is the way to go. Have to think on it a little bit.

Don't forget that there is also a requirements.txt in the py-polars/docs directory.

Thinking on it a bit more: I don't see why lint dependencies / docs dependencies should be in the pyproject.toml. These should not be visible for end users - and anything in pyproject.toml will become 'visible'.

The dev dependencies (matplotlib etc.) should be available, and they already were. Maybe we should just abolish requirements-dev.txt and use pip install polars[all] instead?

@zundertj
Copy link
Collaborator Author

zundertj commented Jun 24, 2023

Don't forget that there is also a requirements.txt in the py-polars/docs directory.

I know, that is why I put above:

Haven't done docs yet, open to thoughts to see if this is a good way forward.

Thinking on it a bit more: I don't see why lint dependencies / docs dependencies should be in the pyproject.toml. These should not be visible for end users - and anything in pyproject.toml will become 'visible'.

PEP621 recommends development dependencies to be under a "dev" tag. https://peps.python.org/pep-0621/#recommend-that-tools-put-development-related-dependencies-into-a-dev-extra (EDIT: it was suggested in the discussions as a common workflow, see also Poetry, but it is not recommend for or against in this specific PEP). Note that we do not advertise the tags to end users, so I don't see a problem with it either? In an ideal situation, users can do pip install polars[dev] when they find a bug and want to fix themselves. That is not possible because of the below:

The dev dependencies (matplotlib etc.) should be available, and they already were. Maybe we should just abolish requirements-dev.txt and use pip install polars[all] instead?

This is the problem, you can't do that without triggering maturin build, resulting in very long wait times. I wish this would work fast like our current workflow, then we could just use standard tools rather than custom scripts.

@zundertj
Copy link
Collaborator Author

Some tests fail due to connectorx, the message is to install >=3.1.0 but in the tests we resort to an alpha version of 3.2. So effectively Polars only works with 3.2 alpha versions? Anyways, small detail, can fix if we agree if this PR has any legs.

@zundertj
Copy link
Collaborator Author

zundertj commented Jun 24, 2023

Ok, I think this is now good for review, added docs dependencies, dried the code and fixed small bugs. There is a test error due to utf8 / csv parsing, I dont see how that could be due to this set of changes.

@zundertj zundertj marked this pull request as ready for review June 24, 2023 20:21
@@ -3150,12 +3150,12 @@ def write_database(
"'sqlalchemy' not found. Install polars with 'pip install polars[sqlalchemy]'."
) from exc

engine = create_engine(connection_uri)
engine_sa = create_engine(connection_uri)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mypy complians (rightly so) when SQL Alchemy is installed.

@stinodego
Copy link
Contributor

I've given it some thought, and I really dislike this custom tech. It's an additional hurdle for new contributors, and it's another thing to maintain and document (it's currently not documented anywhere).

In general, I try to avoid custom tech like this unless there is really no way around it. In fact, often the first thing I do when I start working on a project is getting rid of stuff like this. What usually happens is the guy that built it is no longer around and no one really remembers what it was supposed to do. And then after spending hours on finding out what it does, it turns out the whole thing can be replaced using some handy feature that the guy didn't know about, or wasn't available at the time.

However, it's a good find that there is a discrepancy between pyproject.toml and requirements-dev.txt that we should address. There is actually more duplication that could cause issues:

  • show_versions utility function
  • dependencies.py
  • test-python CI workflow (import check)

To me, the first step to address this would be to simply fix the discrepancies (use the right versions everywhere) and document somewhere that these should align: a well-placed comment somewhere should do at first.

Then you can try to get rid of the "dependencies" part of requirements-dev.txt and rely on pyproject.toml. You've looked into this, and it looks like this is not easily achieved because of some limitation with maturin. So if it's really not possible, I'd open an issue with maturin and see if there's a way around this. If not, looks like we're stuck with the duplication for now. Writing your own Poetry-lite script really seems overkill for this issue.

And then maybe there's a way to eliminate duplication in the other places as well - not sure if you've looked into this as well.

So, in my view, the approach in this PR is not the way to go. But as always, I am willing to be convinced otherwise.

@zundertj
Copy link
Collaborator Author

Ok, seems I missed the extras flag for maturin develop, I was focusing on pip install....

@zundertj zundertj marked this pull request as draft June 25, 2023 11:41
@stinodego
Copy link
Contributor

Ok, seems I missed the extras flag for maturin develop, I was focusing on pip install....

Good find, that indeed looks like what we need!

@stinodego
Copy link
Contributor

We should probably keep linting and docs requirements separate. We don't want to have to compile and install Polars every time we run linting / build the docs.

I wish we could use Poetry, it allows you to install dependency groups separately from the main package...

I'd say focus on the development dependencies for now, that already solves all issues listed in the PR description.

@zundertj
Copy link
Collaborator Author

zundertj commented Jul 1, 2023

Ok, this approach has some drawbacks.

First, maturin develop in CI where we currently only lint Python code or docs, etc, takes up a couple extra minutes. When we lint, we don't need Rust/maturin overhead. I guess one could argue that the tests anyways have to build, and all is parallel, so no user time is lost. Or maybe there is more caching possible; I'm not familiar with GH actions though. But overall keeping CI as fast as possible would be a good thing to strive for. We could also leave those alone for now.

Second, pip supports self references (i.e. see pypa/pip#10393). However, somewhere this breaks,
as doing maturin develop --extras all will refer to the released Polars package, not to the local one. So I end up having a list of dev dependencies in various places, which is sort of a half way house between scattering requirements.txt around and the ideal pyproject.toml with self references working solution. Note that maturin develop calls pip to install the dependencies.

Circling back to the original idea I had was to simply use pip install -e .[all,dev-test], but that calls maturin as the build backend to build Polars. I was hoping we could avoid the build, manage that separately, and go straight to the dependencies instead, and a proposal like pypa/pip#11440 would do the trick (see specifically the post here where someone has the exact same use case: pypa/pip#11440 (comment)). The last post in that thread is a script that does the same as what I did.

I have some other ideas still, but it seems we are hitting the limits of what is being supported by pip unfortunately.

zundertj added a commit to zundertj/polars that referenced this pull request Jul 1, 2023
Pyproject.toml + requirements-dev.txt + show_versions aligned. See pola-rs#9535.
zundertj added a commit to zundertj/polars that referenced this pull request Jul 1, 2023
Subset of pola-rs#9535, only moving package dependencies and leaving test+lint+docs alone.
@zundertj
Copy link
Collaborator Author

zundertj commented Jul 1, 2023

In the category hacks that I considered but rejected, maybe useful for others: use the dry-run mode and parse the output to get the list of non-polars packages to install

pip install .[all] --dry-run | tail -1 | sed 's/Would install//g' | sed 's/polars-.*\w//g'

@zundertj
Copy link
Collaborator Author

I'm closing this, as the existing tooling is not sufficient for our use case per the above. The cleanest I can come up with is parse pyproject.toml to grab all our dependencies, which would solve the duplication of dependency specifications, at the expense of custom tooling.

@zundertj zundertj closed this Jul 14, 2023
@stinodego stinodego added the internal An internal refactor or improvement label Jul 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
chore Maintenance work that does not impact the user internal An internal refactor or improvement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants