Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(python): streamline lazy imports #5302

Merged
merged 1 commit into from
Oct 24, 2022
Merged

refactor(python): streamline lazy imports #5302

merged 1 commit into from
Oct 24, 2022

Conversation

alexander-beedie
Copy link
Collaborator

@alexander-beedie alexander-beedie commented Oct 22, 2022

Maintains the fantastic new import speedups (tuna graph below to validate that claim ;) but also preserves more natural interaction with the lazy-loaded modules - import pandas, numpy, pyarrow, etc from polars.dependencies and then treat like any other module. (I think the name change from "import_check" to "dependencies" makes sense?)

Comparison:

  • Before

    from polars.import_check import _NUMPY_AVAILABLE, lazy_isinstance, numpy_mod
    if _NUMPY_AVAILABLE and lazy_isinstance(
        item, "numpy", lambda: numpy_mod().ndarray
    ):
        import numpy as np
        item = cast("np.ndarray[Any, Any]", item)
        ...
  • After (all lazy features preserved)

    from polars.dependencies import _NUMPY_TYPE, numpy as np
    if _NUMPY_TYPE(item) and isinstance(item, np.ndarray):
        ...

    eg: polars/internals/dataframe/frame.py:1377 -
    image

Performance:

Attribute access into lazy-loaded modules should be (marginally) faster; after first access they exist in sys.modules and aren't mediated by a function call containing the import directive. Import speeds match the previous patch.

Bonus:

Able to remove module imports from TYPE_CHECKINGblock in quite a few modules (as this gets centralised in polars.dependencies).

Import times:

No regressions -

polars import cost


Todo:

  • take care of remaining lint
  • sphinx docs signature issue with from_pandas
  • "loader must define exec_module()" error coming from Check import without optional dependencies CI pass
  • don't lose lazy_isinstance behaviour (where module isn't imported unless the type may actually match)

Update: done...

Misc: docs requirements should also include modules that are referenced (pandas/numpy/pyarrow) - added.

@github-actions github-actions bot added python Related to Python Polars refactor Code improvements that do not change functionality labels Oct 22, 2022
@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented Oct 22, 2022

Looks like some typing-related lint, a complaint from sphinx docs, and a "must define exec_module()" gremlin. Will dig-in and take care of it; all looks solvable ;)

@ghuls
Copy link
Collaborator

ghuls commented Oct 22, 2022

Will those isinstance lines, not just import the module if it is available? The lazy instance way only would import it, when the data object would be an np.ndarray. Maybe I don't understand the code, but it looks to me that the import of polars might be fast, but calling any function which has a *_Available, section will cause importing of modules, even when not needed (depending on the arguments).

elif _NUMPY_AVAILABLE and isinstance(data, np.ndarray):

@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented Oct 22, 2022

Will those isinstance lines, not just import the module if it is available? The lazy instance way only would import it, when the data object would be an np.ndarray. Maybe I don't understand the code, but it looks to me that the import of polars might be fast, but calling any function which has a *_Available, section will cause importing of modules, even when not needed (depending on the arguments).

elif _NUMPY_AVAILABLE and isinstance(data, np.ndarray):

Ahh, I get what you mean - I'll see if it's possible to black-magic it, or maybe tweak/reinstate lazy_isinstance; added to the "todo" above (and tweaked the summary to mention that).

@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented Oct 22, 2022

@ghuls: solved, without having to reinstate lazy_isinstance; just needed a large cup of tea and a fresh take on it. New pattern (preserving all the laziness of the original) looks like this - nice and clean:

if _NUMPY_TYPE(data) and isinstance(data, np.ndarray):

The _NUMPY_TYPE func combines _NUMPY_AVAILABLE with the non-importing check on the object type's string repr. If it doesn't pass that, the isinstance check is never triggered and the module doesn't import as no attribute access occurs (as you'd expect, there are also corresponding _PYARROW_TYPE and _PANDAS_TYPE functions).

Just need to address the sphinx issue (determining from_pandas signature) and then it's good to go.
(Very late here now, so will take care of that tomorrow AM!)

Update: done - all lint/docs/tests happy...

@ritchie46
Copy link
Member

Thanks @alexander-beedie. I think it is good to separate the lazy string type checking from isinstance indeed. At least mypy likes it better. And cool that python already has lazy imports!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
python Related to Python Polars refactor Code improvements that do not change functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants