It would be helpful if you could provide a reproducible example.
In the channel, the order-preserving mechanism depends on the serialization task handles, which are awaited one by one. Because the handles are created in the order of the data stream, awaiting them in that same order preserves the output order.
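To make that concrete, here is a minimal std-only sketch of the pattern. It is an analogue, not the DataFusion code: it uses OS threads and std::sync::mpsc::sync_channel in place of tokio tasks and tokio's mpsc channel, and serialize_batch and write_in_order are hypothetical names invented for illustration. Earlier batches are deliberately made slower, so the workers tend to finish out of order, yet the output stays in input order.

```rust
use std::sync::mpsc::sync_channel;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical stand-in for serializing one RecordBatch to CSV text.
/// Earlier batches sleep longer, so later workers tend to finish first.
fn serialize_batch(index: usize, total: usize, rows: Vec<u32>) -> String {
    thread::sleep(Duration::from_millis(5 * (total - index) as u64));
    rows.iter().map(|r| format!("{}\n", r)).collect()
}

/// Serializes batches in parallel, but concatenates the results in input order.
fn write_in_order(batches: Vec<Vec<u32>>) -> String {
    let total = batches.len();
    // The channel carries *handles*, not finished results. Its small capacity
    // blocks the producer when the consumer falls behind (backpressure).
    let (tx, rx) = sync_channel::<JoinHandle<String>>(2);

    let producer = thread::spawn(move || {
        for (index, rows) in batches.into_iter().enumerate() {
            let handle = thread::spawn(move || serialize_batch(index, total, rows));
            // Handles enter the FIFO queue in stream order, regardless of
            // the order in which the workers actually complete.
            if tx.send(handle).is_err() {
                break; // the receiver went away
            }
        }
        // `tx` is dropped here, which closes the channel.
    });

    let mut out = String::new();
    // Join the handles one by one, in the order they were sent.
    for handle in rx {
        out.push_str(&handle.join().expect("serialization worker panicked"));
    }
    producer.join().expect("producer panicked");
    out
}
```

The key design point is that ordering comes from the queue of handles rather than from the scheduler: a worker that finishes early simply waits in the queue until every handle ahead of it has been joined.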
@metesynnada it looks like you are right! I tested writing out large (1GB+) sorted CSV files and they came out sorted correctly over many runs. So it does seem that the tasks are scheduled and placed in the queue in a consistent order. This is somewhat surprising to me, but I suppose it does make sense that tokio tasks are scheduled more consistently vs. threads.
Closing this issue now unless anyone else sees a situation in which sort order is not preserved.
Describe the bug
The initial implementation of #7452 was intended to preserve the ordering of rows in CSV/JSON files when a user runs a query like:
It is reasonable to expect that the CSV should be ordered by my_col. When this function (https://github.com/apache/arrow-datafusion/blob/561e0d7e87825aba224bf2eb9c3b8b5e1b725597/datafusion/core/src/datasource/file_format/write.rs#L310-L393) was updated to include the mpsc::channel, I believe we lost the guarantee of preserving the expected file order. The channel is nice since it introduces backpressure and ensures memory requirements do not grow without bound if ObjectStore writes fall behind, but I am not sure how to preserve the ordering of serialized RecordBatches in the channel construct.
To Reproduce
I have not yet verified a specific case where order is not preserved (TODO), but I don't see any reason why it is guaranteed, since the channel does not preserve ordering (it will depend on how tokio schedules the tasks).
Expected behavior
We should guarantee that file ordering is preserved regardless of parallelization.
Additional context
No response