Deactivate dask string conversion #349

RobbeSneyders · 2023-08-09T12:50:19Z

Since version 2023-7-1, Dask DataFrames automatically convert object data types to string[pyarrow] (changelog)

This leads to Dask trying to read image data as strings, which results in the following error:

unicodedecodeerror: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

I tried several things to force Dask to keep this data as bytes:

Write metadata file with schema in to_parquet
Cast to bytes explicitly after reading using read_parquet.astype(...)

But none of these worked. I believe this is because pandas represents binary data using the object dtype, which is then automatically converted to string by Dask.

So I believe our only option is to deactivate this behavior completely:

dask.config.set({"dataframe.convert-string": False})

I don't think this has any downside for us, as textual data will still be converted to strings since we explicitly set the schema based on the component specification.

NielsRogge

Good for me, although we could open an issue on the Dask repo regarding this. If I understand correctly, the issue happens when using read_parquet (i.e. when reading bytes data from the cloud), not when writing? Could it be that we need to provide this info when using read_parquet?

I see there's a dtype_backend argument: https://docs.dask.org/en/stable/generated/dask.dataframe.read_parquet.html

RobbeSneyders · 2023-08-09T14:54:33Z

They have an issue for feedback on this.

But I'm not sure my understanding is correct yet. I'm doing some local testing but can't reproduce what's happening in the pipeline. When I try to create a minimal reproducible example, I do get the error on writing already.

Will dig a bit deeper, but for now will already merge this PR to resolve the issue.

RobbeSneyders · 2023-08-09T16:07:00Z

Ok, so found out why this was so hard to reproduce. The download_images component has Python 3.8 installed, for which Dask dropped support in 2023.5.1, while the caption_images component has Python 3.10 installed.
This is why we only get the error when reading in the caption_images component, and not when writing in the download_images component.

I think the solution in my PR is still the best way forward. Since otherwise, users would be required to set the correct PyArrow types in their code.

Eg.

meta={0: bytes}

would no longer work (it would be converted to string by Dask) and the user would have to define

import pandas as pdimport pyarrow as pameta={0: pd.ArrowDtype(pa.binary())}

instead.

I don't think we lose any benefits by disabling the string conversion, unless the user doesn't mark their string columns as string. And even then, we set the correct schema based on the component spec every time we write to parquet.

Since version 2023-7-1, Dask DataFrames automatically convert object data types to string[pyarrow] ([changelog](https://docs.dask.org/en/stable/changelog.html#v2023-7-1)) This leads to `Dask` trying to read image data as strings, which results in the following error: > unicodedecodeerror: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte --- I tried several things to force Dask to keep this data as bytes: - Write metadata file with schema in `to_parquet` - Cast to bytes explicitly after reading using `read_parquet.astype(...)` But none of these worked. I believe this is because `pandas` represents binary data using the `object` dtype, which is then automatically converted to string by `Dask`. --- So I believe our only option is to deactivate this behavior completely: ```python dask.config.set({"dataframe.convert-string": False}) ``` I don't think this has any downside for us, as textual data will still be converted to strings since we explicitly set the schema based on the component specification.

Deactivate dask string conversion

08d0325

RobbeSneyders requested review from GeorgesLorre, PhilippeMoussalli and NielsRogge August 9, 2023 12:50

GeorgesLorre approved these changes Aug 9, 2023

View reviewed changes

NielsRogge approved these changes Aug 9, 2023

View reviewed changes

RobbeSneyders merged commit ecf9e62 into main Aug 9, 2023
5 checks passed

RobbeSneyders deleted the bugfix/dask-string-conversion branch August 9, 2023 14:54

PhilippeMoussalli mentioned this pull request Aug 23, 2023

Bugfix data-explorer images #382

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deactivate dask string conversion #349

Deactivate dask string conversion #349

RobbeSneyders commented Aug 9, 2023 •

edited

Loading

NielsRogge left a comment •

edited

Loading

RobbeSneyders commented Aug 9, 2023

RobbeSneyders commented Aug 9, 2023

Deactivate dask string conversion #349

Deactivate dask string conversion #349

Conversation

RobbeSneyders commented Aug 9, 2023 • edited Loading

NielsRogge left a comment • edited Loading

Choose a reason for hiding this comment

RobbeSneyders commented Aug 9, 2023

RobbeSneyders commented Aug 9, 2023

RobbeSneyders commented Aug 9, 2023 •

edited

Loading

NielsRogge left a comment •

edited

Loading