Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

build: make NumPy, pandas, and Arrow deps optional #152

Merged
merged 11 commits into from
Sep 13, 2024

Conversation

jitingxu1
Copy link
Collaborator

@jitingxu1 jitingxu1 commented Sep 5, 2024

For Ibis default installation, it removed numpy and pandas as required dependencies: ibis-project/ibis#9564

I did not have experience with package dependency managment, I have two options

  1. add pandas and numpy as required dependency in IbisML
  2. add a backend (duckdb) for ibis installation, anyhow, we need to install one backend (duckdb) to run IbisML

I have no idea which one is better, or there is other better ones,

I took option 2, please take a look @lostmygithubaccount @deepyaman

If it is not urgent, we could wait for Deepyaman back.

----update---

Final decision is to make numpy, pandas, pyarrow imported in the function scope

resolve #151

@codecov-commenter
Copy link

codecov-commenter commented Sep 5, 2024

Codecov Report

Attention: Patch coverage is 82.14286% with 5 lines in your changes missing coverage. Please review.

Project coverage is 85.24%. Comparing base (2f2aaaf) to head (6c7f9f1).
Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
ibis_ml/core.py 80.00% 3 Missing ⚠️
ibis_ml/__init__.py 0.00% 1 Missing ⚠️
tests/test_optional_dependencies.py 90.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #152      +/-   ##
==========================================
+ Coverage   84.94%   85.24%   +0.29%     
==========================================
  Files          26       27       +1     
  Lines        1920     1938      +18     
==========================================
+ Hits         1631     1652      +21     
+ Misses        289      286       -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@deepyaman
Copy link
Collaborator

deepyaman commented Sep 9, 2024

We should not force the user to install the DuckDB backend, when IbisML can work without it.

Option 1 is a better answer, but it still forces the user to install NumPy and pandas; do they need these dependencies unless they are using pandas or NumPy inputs or scikit-learn?

The correct answer is probably to move the imports into the scope where they're used (see how it's done for something like Polars).

@jitingxu1
Copy link
Collaborator Author

local import numpy pandas pyarrow

Copy link
Collaborator

@deepyaman deepyaman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor change request. Can you also add a test case to make sure things work without the pandas/NumPy dependency?

@@ -181,8 +190,11 @@ def _(X, y=None, maintain_order=False):
return ibis.memtable(table), targets, index


@normalize_table.register(pa.Table)
@normalize_table.register("pa.Table")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes for PyArrow are not necessary, as it is a required dependency. Can you undo them?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not find it is required dependency in Ibis and IbisML

https://github.com/ibis-project/ibis/blob/main/pyproject.toml#L49 was marked as optional.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, makes sense; I forgot about ibis-project/ibis#9552.

Guess this is fine then, to avoid having to maintain PyArrow bounds.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bad idea to not explicitly include dependencies that you are explicitly importing. You shouldn't really ever depend on another project to have specific dependencies.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bad idea to not explicitly include dependencies that you are explicitly importing. You shouldn't really ever depend on another project to have specific dependencies.

Do you mean we should explicitly include the dependency in IbisML. It is convenient users successfully imported IbisML, but when they use it, they'll get ModuleNotFound error, have to install it.

@deepyaman do you have any comment?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't explicitly require pandas/NumPy; we don't need to include those dependencies, unless a user is using a backend/ML framework that requires it. Those are specified in extras.

For PyArrow... trying to think, can you reasonably use IbisML without PyArrow? I took some looks at the linked PR in Ibis, and it seems there was some use case from people using Ibis internally in their product that could not need the PyArrow dependency. Is that also possible with IbisML? I wasn't completely sure...

In this case, I'm personally OK with leaving it as is, since IbisML isn't just getting a transitive dependency from Ibis—it completely relies on Ibis under the hood.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To unblock the release, let's keep it as it is now. We could change this later if we have new findings.

@lostmygithubaccount
Copy link
Member

should we be raising something if the relevant libraries aren't installed? or is the import error a user would get fine? idk what the standard practice would be

@deepyaman
Copy link
Collaborator

should we be raising something if the relevant libraries aren't installed? or is the import error a user would get fine? idk what the standard practice would be

Import error is fine IMO for now. I'm not sure how far we're getting realistically without a backend.

@jitingxu1
Copy link
Collaborator Author

jitingxu1 commented Sep 12, 2024

should we be raising something if the relevant libraries aren't installed? or is the import error a user would get fine? idk what the standard practice would be

Added tests:

  • one test to make the optional dependency non-blocking. User need to install them when they need it, it will throw an ModuleNotFoundError
  • one test to import IbisML (@deepyaman could you check this one too)

@jitingxu1 jitingxu1 requested a review from deepyaman September 12, 2024 20:34
tests/test_init.py Outdated Show resolved Hide resolved
tests/test_init.py Outdated Show resolved Hide resolved
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you do this in a separate file instead of, say, test_core.py? Is there a practical reason?

If it's in a separate file, I can't see how if "ibis_ml" in sys.modules can even be True entering the tests. Did this not work properly in test_core.py, or you didn't try?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not try puting it in the test_core, the reason I put it in the new file becuase 1) some other files, such as _discretize.py also dependes on numpy 2) It is testing how the dependency impact the IbisML loading. I named it test_dependenct.py before, I am not sure if it is a good name.

tests/test_init.py Outdated Show resolved Hide resolved
@jitingxu1 jitingxu1 requested a review from deepyaman September 13, 2024 16:22
@deepyaman deepyaman changed the title fix: add duckdb as dependency build: make NumPy, pandas, and Arrow deps optional Sep 13, 2024
@deepyaman deepyaman merged commit aa71647 into ibis-project:main Sep 13, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

No module named 'numpy' on default install
5 participants