[Data] Schema error while writing Parquet files #48102
Comments
@bveeramani the inferred schema has to be enforced on all blocks; if any block doesn't adhere, we have to fail and defer to the user to provide it explicitly. There's an obvious caveat for nullability -- if types differ only in nulls, we'd assume the column is nullable, relaxing the schema.
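For illustration, here's a minimal PyArrow sketch of that nullability relaxation (not Ray code; it assumes `pa.unify_schemas` resolves fields that differ only in nullability to the nullable variant, while raising on genuine type conflicts):

```python
import pyarrow as pa

# Two block schemas whose "value" field differs only in nullability.
s1 = pa.schema([pa.field("value", pa.int64(), nullable=False)])
s2 = pa.schema([pa.field("value", pa.int64(), nullable=True)])

# Genuine type conflicts raise, but a nullability-only difference
# is resolved by relaxing the field to nullable.
unified = pa.unify_schemas([s1, s2])
print(unified.field("value").nullable)  # True
```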
Yup, this issue is referring to the null situation.
Actually - how could a user do this?
@rickyyx haven't tested this out, but you can specify a […]
I see, but in the context of this issue, there's no way for the user to provide a schema when writing, right? E.g., if I have a column that's actually float but some of the data is int, the write_parquet() API would still fail, while it could technically be okay if users could hint that it's actually float.
Yeah, you're right. If you do something like […]
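(A hypothetical illustration of the scenario above -- two blocks whose column types genuinely differ, not just in nullability; the path is made up:)

```python
import pyarrow as pa
import ray

# One block infers "x" as int64, the other as float64.
t_int = pa.table({"x": pa.array([1, 2], type=pa.int64())})
t_float = pa.table({"x": pa.array([3.0, 4.5], type=pa.float64())})

ds = ray.data.from_arrow([t_int, t_float])
# With no way to hint that "x" is really float, the Parquet writer
# sees conflicting block schemas and the write fails.
ds.write_parquet("/tmp/mixed_types")  # schema error
```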
I see - thoughts on extending the […]?
I think that sounds reasonable, too.
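For concreteness, the kind of extension being discussed might look like the following. This is hypothetical -- `schema` is not an existing `write_parquet()` parameter here, it's the proposal (continuing the `ds` from the snippet above):

```python
import pyarrow as pa

# Hypothetical: let the user hint the intended schema at write time,
# so int blocks would be cast to float instead of failing the write.
ds.write_parquet(
    "/tmp/out",
    schema=pa.schema([pa.field("x", pa.float64())]),  # proposed kwarg, not real
)
```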
What's the primary concern with inferring the schema? I think the UX might be better if you didn't have to manually specify a schema in this scenario.
Yeah, I think for the nullable columns, auto-inferring is reasonable. But the fundamental issue here is that users might have implicit assumptions about the actual types of the schema, which they currently have no way to specify. I'm not sure how much this is an actual requirement, though - so the current auto-inferring-nullable-columns approach seems reasonable to me.
Another question on the requirement here: should we allow auto-appending of null columns? E.g., if one block is missing a column entirely, should the unified schema be allowed to add it as all-null (see the sketch below)? I think this is where auto-inferring might not be the best, since we don't really know how permissive users want the schema unification to be.
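(A hedged sketch of what auto-appending null columns would mean in PyArrow terms; the tables and column names are made up, while `pa.unify_schemas`, `pa.nulls`, and `Table.append_column` are standard PyArrow calls:)

```python
import pyarrow as pa

t1 = pa.table({"a": [1, 2], "b": [1.0, 2.0]})
t2 = pa.table({"a": [3, 4]})  # missing column "b" entirely

# Permissive unification takes the union of fields across schemas...
unified = pa.unify_schemas([t1.schema, t2.schema])

# ...and each block missing a field would then get an all-null column
# appended before casting -- possibly more than the user wants.
t2_filled = t2.append_column(
    pa.field("b", pa.float64()), pa.nulls(len(t2), pa.float64())
)
```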
@rickyyx @bveeramani let's start with a simpler fix here:
…es (#48478)

## Why are these changes needed?

When writing blocks to Parquet, there might be blocks with fields that differ ONLY in nullability. By default, this is rejected, since some blocks may have a different schema than the ParquetWriter. However, we can allow it by tweaking the schema. This PR goes through all blocks before writing them to Parquet and merges schemas that differ only in the nullability of their fields. It also casts each table to the newly merged schema so that the write can happen.

## Related issue number

Closes #48102

Signed-off-by: rickyx <rickyx@anyscale.com>
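In PyArrow terms, the merge-and-cast approach described above might be sketched like this (not the PR's actual code; blocks are simplified to `pyarrow.Table` objects, and the nullability relaxation is assumed to come from `pa.unify_schemas`):

```python
import pyarrow as pa

def unify_and_cast(blocks: list[pa.Table]) -> list[pa.Table]:
    # unify_schemas raises if field types genuinely conflict;
    # nullability-only differences are relaxed to nullable.
    unified = pa.unify_schemas([b.schema for b in blocks])
    # Cast every block to the merged schema so the ParquetWriter
    # sees one consistent schema across all blocks.
    return [b.cast(unified) for b in blocks]
```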
What happened + What you expected to happen
I'm writing Parquet files, with the number of rows per file configured. I'm getting the error below:
Versions / Dependencies
2.37
Reproduction script
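A hedged sketch of what a reproduction might look like, based on the nullability discussion above (the path, data, and `num_rows_per_file` value are assumptions):

```python
import pyarrow as pa
import ray

# Two blocks whose "a" field differs only in nullability.
t1 = pa.table({"a": [1, 2]}).cast(
    pa.schema([pa.field("a", pa.int64(), nullable=False)])
)
t2 = pa.table({"a": pa.array([3, None], type=pa.int64())})  # nullable

ds = ray.data.from_arrow([t1, t2])
# Pre-fix, this could raise a schema error once blocks with differing
# nullability reached the same ParquetWriter.
ds.write_parquet("/tmp/repro", num_rows_per_file=2)
```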
Issue Severity
None