-
Notifications
You must be signed in to change notification settings - Fork 92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf(datasets): lazily load datasets in init files #277
Conversation
741f2b4
to
5d9db24
Compare
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
a1e8d15
to
c8c3541
Compare
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
6b42018
to
7d22db1
Compare
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
a74ec81
to
587b2c2
Compare
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Trying to link issues properly. Wonder how different is this approach compare to kedro-org/kedro#2702. Does this also affect Conceal tracebacks for managed exceptions#2401? In addition, Kedro's CLI is quite slow especially with plugins installed. Potentially helps with this too? kedro-org/kedro#1476 I also wonder could it be useful for lazy loading pipelines? Sorry for lots of questions, I haven't had time to play with the library yet, just some quick thoughts coming out top of my head. |
This comment was marked as duplicate.
This comment was marked as duplicate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! This library is a great find. I think the error message has also massively improved 👍
kedro-org/kedro-viz#1159 |
It's quite similar. The code for
Not as far as I can understand from that writeup, but I could be missing something.
Potentially yes, I think so.
I think we need to be careful around how exactly we implement this, since we may not want to delay flagging potential problems by lazy importing.
|
Quite possibly can help, yes, without thinking too much about it. :) |
* perf(datasets): lazily load datasets in init files (api) Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * perf(datasets): lazily load datasets in init files (pandas) Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * fix(datasets): fix no name in module in api/pandas Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * perf(datasets): lazily load datasets in init files (biosequence) Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * perf(datasets): lazily load datasets in init files (dask) Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * perf(datasets): lazily load datasets in init files (databricks) Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * perf(datasets): lazily load datasets in init files (email) Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * perf(datasets): lazily load datasets in init files (geopandas) Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * perf(datasets): lazily load datasets in init files (holoviews) Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * perf(datasets): lazily load datasets in init files (json) Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * fix(datasets): resolve "too few public attributes" Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * perf(datasets): lazily load datasets in init files (matplotlib) Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * perf(datasets): lazily load datasets in init files (networkx) Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * perf(datasets): lazily load datasets in init files (pickle) Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * perf(datasets): lazily load datasets in init files (pillow) Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * perf(datasets): lazily load datasets in init files (plotly) Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * perf(datasets): lazily load datasets in init files (polars) Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * perf(datasets): lazily load datasets in init files (redis) Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * perf(datasets): lazily load datasets in init files (snowflake) Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * perf(datasets): lazily load datasets in init files (spark) Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * perf(datasets): lazily load datasets in init files (svmlight) Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * perf(datasets): lazily load datasets in init files (tensorflow) Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * perf(datasets): lazily load datasets in init files (text) Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * perf(datasets): lazily load datasets in init files (tracking) Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * perf(datasets): lazily load datasets in init files (video) Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * perf(datasets): lazily load datasets in init files (yaml) Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Update RELEASE.md --------- Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Description
pandas.ExcelDataSet
has dependencies (that are installed) that are very expensive to load, but you only needpandas.CSVDataSet
, you won't spend time importingpandas.ExcelDataSet
dependencies.pandas.ExcelDataSet
installed, you should get an error corresponding to the missing modules, not a cryptic message saying that theExcelDataSet
attribute doesn't exist.Development notes
Valid import behavior
On
perf/datasets/lazy-loader
branch:Output of
python -X importtime -c'import kedro_datasets.pandas;kedro_datasets.pandas.CSVDataSet'
:On
main
branch:Output of
python -X importtime -c'import kedro_datasets.pandas;kedro_datasets.pandas.CSVDataSet'
:Note the import of everything in
kedro_datasets.pandas
. I happen to have SQLAlchemy installed in this test env (testing #281), so all those imports also trigger. You can easily infer that this is going to be even slower if you have a lot of extra dependencies in your env, that let you successfully import each of these dataset modules, even though they're not needed.Invalid import behavior
On
perf/datasets/lazy-loader
branch:Output of
python -X importtime -c'import kedro_datasets.spark;kedro_datasets.spark.SparkDataSet'
:On
main
branch:Output of
python -X importtime -c'import kedro_datasets.spark;kedro_datasets.spark.SparkDataSet'
:Note that
kedro_datasets.spark.spark_dataset
,kedro_datasets.spark.spark_hive_dataset
,kedro_datasets.spark.spark_jdbc_dataset
,kedro_datasets.spark.deltatable_dataset
, andkedro_datasets.spark.spark_streaming_dataset
were all attempted to be imported. Also note that the resulting error (AttributeError: module 'kedro_datasets.spark' has no attribute 'SparkDataSet'
) is less helpful, and arguably inaccurate.Checklist
RELEASE.md
file