pyarrow.dataset.write_dataset does not preserve order #39030

Open
xquyvu opened this issue Dec 1, 2023 · 8 comments

@xquyvu

xquyvu commented Dec 1, 2023

Describe the bug, including details regarding any error messages, version, and platform.

As the title says, when a dataset is written with pyarrow.dataset.write_dataset, the row order is not preserved. I have tested this with both the parquet and csv file formats.

import pyarrow.parquet as pq
import numpy as np
import pandas as pd
import pyarrow.dataset
from pathlib import Path


data_load_path = './data.parquet'
pyarrow_dataset_write_path = './pyarrow_saved_data.parquet'

data = pd.DataFrame({'col': np.arange(1e7)})
data.to_parquet(data_load_path)

# Check if data loaded with pandas and pyarrow are the same
pyarrow_dataset = pyarrow.dataset.dataset(data_load_path, format='parquet')
pyarrow_dataset_df = pyarrow_dataset.to_table().to_pandas()

print((pyarrow_dataset_df['col'] == data['col']).all()) # True

# Write with pyarrow.dataset.write_dataset
pyarrow.dataset.write_dataset(
    pyarrow_dataset,
    pyarrow_dataset_write_path,
    format='parquet',
)

loaded_pyarrow_dataset = pyarrow.dataset.dataset(pyarrow_dataset_write_path, format='parquet')
loaded_pyarrow_dataset_df = loaded_pyarrow_dataset.to_table().to_pandas()
print((loaded_pyarrow_dataset_df['col'] == data['col']).all()) # False
print((loaded_pyarrow_dataset_df['col'] == data['col']).mean()) # 0.29

# Write with pq.write_to_dataset
pq.write_to_dataset(
    pyarrow_dataset,
    'x.parquet',
    existing_data_behavior='delete_matching'
)

(pyarrow.dataset.dataset('x.parquet').to_table().to_pandas()['col'] == data['col']).all() # True

Component(s)

Python

@mapleFU
Member

mapleFU commented Dec 1, 2023

Just curious, does to_parquet guarantee the ordering?

@xquyvu
Author

xquyvu commented Dec 1, 2023

Just curious, does to_parquet guarantee the ordering?

Yes.

@xquyvu
Author

xquyvu commented Jan 9, 2024

Hello, any updates on this? Thanks!

@u3Izx9ql7vW4

Interested in this as well. It would be great if there were a way to ensure ordering for datasets.

@csantosbh

csantosbh commented Nov 20, 2024

This seems like quite a serious bug - a good portion of my tools rely on the assumption that write_dataset preserves the original data order. Are there workarounds for this?

@xquyvu
Author

xquyvu commented Nov 20, 2024

I finally got around this by adding an index column to the data. Not great, but it makes the pipeline much more robust.

@adamreeve
Contributor

Are there workarounds for this?

I believe you can work around this by setting use_threads=False in write_dataset.

This issue looks like a duplicate of #26818, which @EnricoMi is currently working on fixing in #44470.

@csantosbh

Thanks for the suggestion, Adam. It does seem that use_threads=False helps, though in my tests it didn't fully preserve the original dataset order. There's a chance some external factor led to the behavior I observed, so I'll try to reproduce it with a minimal test case when I have some time.
Also, thanks for pointing out the relevant issue/PR, that will be great for visibility 👍
