[Python][Parquet] ParquetWriter reports incorrect schema when specified at wrong order #39241

Open
rongcuid opened this issue Dec 15, 2023 · 6 comments

@rongcuid

Describe the bug, including details regarding any error messages, version, and platform.

When I write a table to a ParquetWriter whose schema has the same columns in a different order, I get an error:

ValueError: Table schema does not match schema used to create file: 
table:
cmdline: list<item: string>
  child 0, item: string
time: double
lock: string vs. 
file:
time: double
cmdline: list<item: string>
  child 0, item: string
lock: string

Component(s)

Parquet, Python

@mapleFU
Member

mapleFU commented Dec 15, 2023

Table schema does not match schema used to create file

The message seems to state the reason: the table must have the same schema as the file.

If you want, you can add a select to reorder the columns. This should be lightweight, because I believe it does not copy the underlying ChunkedArray: https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.select

@rongcuid
Author

rongcuid commented Dec 15, 2023

I think the Parquet spec doesn't require the schema to have the same order. Did I misunderstand the spec?

EDIT: https://issues.apache.org/jira/browse/PARQUET-188 OK, maybe I misread the spec

@rongcuid
Author

Although I linked to that issue, I can't actually find any mention of column order in the spec.

@mapleFU
Member

mapleFU commented Dec 16, 2023

Hmm, as a file format, Parquet cannot reorder columns: it needs to keep one consistent schema for the whole file. So under the Parquet standard, columns cannot be reordered within a file.

When writing Parquet from the library, we don't support reordering for now, because handling it in the Parquet writer might be tricky. So I think using select as a projection would be convenient, efficient, and fine here.

@rongcuid
Author

I see one comment in a source file saying that it is ordered. Could this info be added to the API documentation?

@kou changed the title from "[Python][Parquet]ParquetWriter reports incorrect schema when specified at wrong order" to "[Python][Parquet] ParquetWriter reports incorrect schema when specified at wrong order" on Dec 17, 2023
@mapleFU
Member

mapleFU commented Dec 18, 2023

If you find it confusing that reordering is not allowed, I think we can add this info.
