Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deactivate dask string conversion #349

Merged
merged 1 commit into from
Aug 9, 2023

Conversation

RobbeSneyders
Copy link
Member

@RobbeSneyders RobbeSneyders commented Aug 9, 2023

Since version 2023-7-1, Dask DataFrames automatically convert object data types to string[pyarrow] (changelog)

This leads to Dask trying to read image data as strings, which results in the following error:

unicodedecodeerror: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte


I tried several things to force Dask to keep this data as bytes:

  • Write metadata file with schema in to_parquet
  • Cast to bytes explicitly after reading using read_parquet.astype(...)

But none of these worked. I believe this is because pandas represents binary data using the object dtype, which is then automatically converted to string by Dask.


So I believe our only option is to deactivate this behavior completely:

dask.config.set({"dataframe.convert-string": False})

I don't think this has any downside for us, as textual data will still be converted to strings since we explicitly set the schema based on the component specification.

Copy link
Contributor

@NielsRogge NielsRogge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good for me, although we could open an issue on the Dask repo regarding this. If I understand correctly, the issue happens when using read_parquet (i.e. when reading bytes data from the cloud), not when writing? Could it be that we need to provide this info when using read_parquet?

I see there's a dtype_backend argument: https://docs.dask.org/en/stable/generated/dask.dataframe.read_parquet.html

@RobbeSneyders
Copy link
Member Author

They have an issue for feedback on this.

But I'm not sure my understanding is correct yet. I'm doing some local testing but can't reproduce what's happening in the pipeline. When I try to create a minimal reproducible example, I do get the error on writing already.

Will dig a bit deeper, but for now will already merge this PR to resolve the issue.

@RobbeSneyders RobbeSneyders merged commit ecf9e62 into main Aug 9, 2023
5 checks passed
@RobbeSneyders RobbeSneyders deleted the bugfix/dask-string-conversion branch August 9, 2023 14:54
@RobbeSneyders
Copy link
Member Author

Ok, so found out why this was so hard to reproduce. The download_images component has Python 3.8 installed, for which Dask dropped support in 2023.5.1, while the caption_images component has Python 3.10 installed.
This is why we only get the error when reading in the caption_images component, and not when writing in the download_images component.

I think the solution in my PR is still the best way forward. Since otherwise, users would be required to set the correct PyArrow types in their code.

Eg.

meta={0: bytes}

would no longer work (it would be converted to string by Dask) and the user would have to define

import pandas as pdimport pyarrow as pameta={0: pd.ArrowDtype(pa.binary())}

instead.

I don't think we lose any benefits by disabling the string conversion, unless the user doesn't mark their string columns as string. And even then, we set the correct schema based on the component spec every time we write to parquet.

Hakimovich99 pushed a commit that referenced this pull request Oct 16, 2023
Since version 2023-7-1, Dask DataFrames automatically convert object
data types to string[pyarrow]
([changelog](https://docs.dask.org/en/stable/changelog.html#v2023-7-1))

This leads to `Dask` trying to read image data as strings, which results
in the following error:
> unicodedecodeerror: 'utf-8' codec can't decode byte 0xff in position
0: invalid start byte

---

I tried several things to force Dask to keep this data as bytes:
- Write metadata file with schema in `to_parquet`
- Cast to bytes explicitly after reading using
`read_parquet.astype(...)`

But none of these worked. I believe this is because `pandas` represents
binary data using the `object` dtype, which is then automatically
converted to string by `Dask`.

---

So I believe our only option is to deactivate this behavior completely:

```python
dask.config.set({"dataframe.convert-string": False})
```

I don't think this has any downside for us, as textual data will still
be converted to strings since we explicitly set the schema based on the
component specification.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants