Deactivate dask string conversion #349
Conversation
Good for me, although we could open an issue on the Dask repo regarding this. If I understand correctly, the issue happens when using `read_parquet` (i.e. when reading bytes data from the cloud), not when writing? Could it be that we need to provide this info when using `read_parquet`?
I see there's a `dtype_backend` argument: https://docs.dask.org/en/stable/generated/dask.dataframe.read_parquet.html
They have an issue for feedback on this. But I'm not sure my understanding is correct yet. I'm doing some local testing but can't reproduce what's happening in the pipeline. When I try to create a minimal reproducible example, I already get the error on writing. I'll dig a bit deeper, but for now I'll merge this PR to resolve the issue.
Ok, so I found out why this was so hard to reproduce. The download_images component has Python 3.8 installed, for which Dask dropped support in 2023.5.1, while the caption_images component has Python 3.10 installed. I think the solution in my PR is still the best way forward, since otherwise users would be required to set the correct PyArrow types in their code. E.g. … would no longer work (it would be converted to string by Dask) and the user would have to define … instead. I don't think we lose any benefits by disabling the string conversion, unless the user doesn't mark their string columns as string. And even then, we set the correct schema based on the component spec every time we write to parquet.
Since version 2023.7.1, Dask DataFrames automatically convert object data types to `string[pyarrow]` ([changelog](https://docs.dask.org/en/stable/changelog.html#v2023-7-1)). This leads to Dask trying to read image data as strings, which results in the following error:

> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

I tried several things to force Dask to keep this data as bytes:

- Write a metadata file with the schema in `to_parquet`
- Cast to bytes explicitly after reading, using `read_parquet(...).astype(...)`

But none of these worked. I believe this is because pandas represents binary data using the `object` dtype, which is then automatically converted to string by Dask.

So I believe our only option is to deactivate this behavior completely:

```python
dask.config.set({"dataframe.convert-string": False})
```

I don't think this has any downside for us, as textual data will still be converted to strings since we explicitly set the schema based on the component specification.
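The root cause can be reproduced without Dask at all. A minimal sketch (the bytes below are fake JPEG header bytes, used only for illustration):

```python
import pandas as pd

# JPEG files start with 0xFF, which is not a valid UTF-8 start byte
image_bytes = b"\xff\xd8\xff\xe0"
df = pd.DataFrame({"image": [image_bytes]})
print(df["image"].dtype)  # object: pandas has no dedicated binary dtype

# Treating the bytes as text raises the same error seen in the pipeline
try:
    image_bytes.decode("utf-8")
except UnicodeDecodeError as err:
    print(err)
```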