pyarrow.dataset.write_dataset do not preserve order #39030

xquyvu · 2023-12-01T14:13:45Z

Describe the bug, including details regarding any error messages, version, and platform.

As the title says, pyarrow.dataset.write_dataset does not preserve row order when writing a file. I have tested this with both the Parquet and CSV file formats.

import pyarrow.parquet as pq
import numpy as np
import pandas as pd
import pyarrow.dataset
from pathlib import Path


data_load_path = './data.parquet'
pyarrow_dataset_write_path = './pyarrow_saved_data.parquet'

data = pd.DataFrame({'col': np.arange(1e7)})
data.to_parquet(data_load_path)

# Check if data loaded with pandas and pyarrow are the same
pyarrow_dataset = pyarrow.dataset.dataset(data_load_path, format='parquet')
pyarrow_dataset_df = pyarrow_dataset.to_table().to_pandas()

print((pyarrow_dataset_df['col'] == data['col']).all()) # True

# Write with pyarrow.dataset.write_dataset
pyarrow.dataset.write_dataset(
    pyarrow_dataset,
    pyarrow_dataset_write_path,
    format='parquet',
)

loaded_pyarrow_dataset = pyarrow.dataset.dataset(pyarrow_dataset_write_path, format='parquet')
loaded_pyarrow_dataset_df = loaded_pyarrow_dataset.to_table().to_pandas()
print((loaded_pyarrow_dataset_df['col'] == data['col']).all()) # False
print((loaded_pyarrow_dataset_df['col'] == data['col']).mean()) # 0.29

# Write with pq.write_to_dataset
pq.write_to_dataset(
    pyarrow_dataset,
    'x.parquet',
    existing_data_behavior='delete_matching'
)

print((pyarrow.dataset.dataset('x.parquet').to_table().to_pandas()['col'] == data['col']).all()) # True

Component(s)

Python


mapleFU · 2023-12-01T15:20:11Z

Just curious, does to_parquet guarantee the ordering?

xquyvu · 2023-12-01T15:30:28Z

> Just curious, does to_parquet guarantee the ordering?

Yes.

xquyvu · 2024-01-09T17:52:28Z

Hello, any updates on this? Thanks!

u3Izx9ql7vW4 · 2024-07-12T20:54:33Z

Interested in this as well. It would be great if there were a way to ensure ordering for datasets.


csantosbh · 2024-11-20T04:11:18Z

This seems like quite a serious bug - a good portion of my tools rely on the assumption that write_dataset preserves the original data order. Are there workarounds for this?

xquyvu · 2024-11-20T12:14:23Z

I finally got around this by adding an index column to the data. Not great, but it makes the pipeline much more robust.

adamreeve · 2024-11-20T22:38:48Z

> Are there workarounds for this?

I believe you can work around this by setting use_threads=False in write_dataset.

This issue looks like a duplicate of #26818, which @EnricoMi is currently working on fixing in #44470

csantosbh · 2024-12-13T12:18:34Z

Thanks for the suggestion, Adam. It seems that use_threads=False does help, though in my tests it didn't fully preserve the original dataset order. There's a chance some external factor led to the behavior I observed, so I'll try to reproduce it with a minimal test case when I have some time.
Also, thanks for pointing out the relevant issue/PR, that will be great for visibility 👍

xquyvu added the Type: bug label Dec 1, 2023

github-actions bot added the Component: Python label Dec 1, 2023

mikeburkat mentioned this issue Jul 12, 2024

Write monotonic sequence, but read is non monotonic delta-io/delta-rs#2659

Closed

This was referenced Jul 12, 2024

[C++][Dataset] Preserve order when writing dataset #26818

Open

[Python] Dataset sorting_columns support request #43239

Open

ds283 added a commit to ds283/SecondaryGWKit that referenced this issue Oct 4, 2024

Add comment to document that sorting in PyArrow is currently pointles…

b6d924c

…s, because (as of now) it does not preserve ordering on a filesystem write. apache/arrow#26818 apache/arrow#39030

gitmodimo mentioned this issue Oct 9, 2024

GH-41706: [C++][Acero] Enhance asof_join to work in multi-threaded execution by sequencing input #44083

Merged
