polars.EagerPolarsDataset fails to read parquet #590
It's because Kedro opens the file and somehow it isn't recognized as `bytes` / `io.BufferedIOBase` / `io.RawIOBase` (either by Kedro or by Polars), and therefore it gets sent to … If you put this in `load_args`:

```yaml
use_pyarrow: true
```

Kedro sends it off to Polars here: https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/kedro_datasets/polars/eager_polars_dataset.py#L156 And here Polars checks whether the file is already opened: https://github.com/pola-rs/polars/blob/py-0.20.13/py-polars/polars/io/parquet/functions.py#L156 Then Polars decides that it isn't opened and goes here: https://github.com/pola-rs/polars/blob/py-0.20.13/py-polars/polars/io/parquet/functions.py#L171

Either it is a Kedro bug or a Polars bug, but I don't know which. @astrojuanlu Apparently this has already been reported.
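As a simplified, hypothetical model of that type check (the function name and return values below are illustrative, not Polars' actual API): a string or `Path` source can still be glob-expanded, while an already-opened buffer is just a single stream, so any `*` in the original path is lost.

```python
import io
from pathlib import Path

def normalize_source(source):
    """Simplified model of a reader dispatching on its input type:
    string/Path sources can be glob-expanded, while an already-opened
    buffer is read as one stream (no globbing possible)."""
    if isinstance(source, (str, Path)):
        return "path"    # a glob like '*.parquet' can still be expanded
    if isinstance(source, (io.IOBase, bytes)):
        return "buffer"  # a single opened stream; '*' is just bytes now
    raise TypeError(f"unsupported source type: {type(source).__name__}")

# Kedro opens the file with fsspec and hands Polars a buffer,
# so the glob pattern in the original path is never expanded:
print(normalize_source("data/*.parquet"))    # path
print(normalize_source(io.BytesIO(b"...")))  # buffer
```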
Thanks @grofte and sorry @mark-druffel for the delay, this fell off my radar. I'm labeling this accordingly and adding it to our backlog.
@grofte I could be doing it wrong, but I'm still getting an error with pyarrow. @astrojuanlu Definitely no rush on my side, this isn't holding anything up for my team. Just wanted to report it in case it was helpful / unknown.
I ran
@mark-druffel I thought it was the same thing I ran into, so I didn't read your error messages closely enough. Sorry.
If you wanted it to work with Eager, I think you would have to use the dataset factory https://docs.kedro.org/en/stable/data/kedro_dataset_factories.html, then in your pipeline find out what files are there, create a node for each of them, and feed them into a node with a vstack to join them. Much better to use the Lazy dataset for ingestion and then call `.collect()`.
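For example, a catalog entry along these lines (the dataset name and path are illustrative, and exact parameters may vary across kedro-datasets versions) would use the lazy dataset so that the glob is handled by `pl.scan_parquet`:

```yaml
model_input_table:
  type: polars.LazyPolarsDataset
  filepath: data/03_primary/model_input_table.pq/*.parquet
  file_format: parquet
```

The pipeline node then receives a `LazyFrame` and calls `.collect()` when it actually needs the data.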
@grofte @mark-druffel finally spent some time reflecting on this. Given that the underlying … Thoughts?
I think it would be nice to have a method that checks load/save arguments on the Kedro classes, since Kedro is a layer of indirection in front of Polars, to better inform users when they hit these gotchas (path globbing and dtypes). I know you could do this in documentation, but that also seems awkward. But I don't think eager should support globbing when Polars doesn't.
Sorry, I'm not sure I follow the point about Polars not globbing, probably because I have a flawed understanding of what's happening under the hood. Both of these work for me in a Python environment; is this not what's happening with the two Kedro datasets?
@astrojuanlu Regardless of your response, I'm happy and unblocked. I would lean on your team with respect to what direction makes the most sense, I'm still too much of a novice (w/ kedro & polars) to have an authoritative opinion on what the code should do. Just let me know if you want me to close this issue, and thanks!
(edited) There is some discussion of …

```python
# This works
import polars as pl
path = '/workspace/kedro-plugins/polar-eager/data/03_primary/model_input_table.pq/*.parquet'
pl.read_parquet(path)

# This fails
catalog.load("model_input_table")  # an eager dataset with filepath: '/workspace/kedro-plugins/polar-eager/data/03_primary/model_input_table.pq/*.parquet'
```

There are claims that polars doesn't work with globbing, but see kedro-plugins/kedro-datasets/kedro_datasets/polars/eager_polars_dataset.py, lines 157 to 158 at 65f28b5. If you change this in the dataset to `return load_method(load_path, **self._load_args)`, the raw path is handed to Polars instead of an opened file object.
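A minimal stdlib sketch (no Polars or Kedro needed) of why passing the raw path through behaves differently from opening it first: `open()` treats a glob pattern as a literal filename and fails, while a glob-aware reader can expand the same string. The file names below are placeholders for real parquet files.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Stand-ins for a partitioned parquet directory.
    for i in range(2):
        with open(os.path.join(tmp, f"part-{i}.parquet"), "wb") as f:
            f.write(b"fake")

    pattern = os.path.join(tmp, "*.parquet")

    # What an fsspec-style open does, in effect: treat the pattern
    # as a literal file name. No such file exists, so it fails.
    try:
        open(pattern, "rb").close()
        opened = True
    except FileNotFoundError:
        opened = False
    print("literal open succeeded:", opened)  # False

    # What a glob-aware reader can do with the raw string path:
    matches = sorted(glob.glob(pattern))
    print("glob matched files:", len(matches))  # 2
```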
@mark-druffel If the Lazy dataset works with cloud buckets and globbing and Eager doesn't with globbing, then I guess Kedro could just invoke scan and run `.collect()` before returning it? Maybe Eager could just inherit from Lazy?
Thanks for investigating @noklam 💯 Like @grofte, I was also confused by the lack of mentions of globbing in the … As far as I remember, Polars supports remote paths with its own native methods. Maybe we should ditch fsspec for this particular dataset?
This is what we do with Spark. If the support is good enough we can definitely do this. Though we have to be careful …
Yeah, when I say "ditch fsspec for this particular dataset", I mean our specific boilerplate in the dataset's code. I don't mind what mechanism the underlying library uses.
It was noted in backlog grooming today by @noklam that our …
Description
Trying to load a parquet with `polars.EagerPolarsDataset` via the data catalog results in an error saying `no such file or directory`.
Context
I'm working on a new pipeline; the bug is brand new to me. I'm trying to load parquets into my pipeline using Polars. I realized lazy Polars seems to work fine, so this isn't preventing me from doing anything at the moment.
Steps to Reproduce
I reproduced the issue with the kedro starter pipeline in a conda environment.
```shell
kedro new --name polars_issue --tools all --example y
kedro run
```

to populate `data/`.
Expected Result
The code below should work the same as `pl.read_parquet()` and `LazyPolarsDataset` did in the steps to reproduce.
Actual Result
Your Environment
Include as many relevant details about the environment in which you experienced the bug:
- Kedro version used (`pip show kedro` or `kedro -V`): kedro, version 0.19.3
- Kedro plugin and plugin version used (`pip show kedro-airflow`): kedro-datasets 2.1.0
- Python version used (`python -V`): Python 3.10.13