Check schema compatibility when building parquet readers #5434
Conversation
Signed-off-by: sperlingxx <lovedreamf@gmail.com>
LGTM
build
// }
// TODO: Add below converters for INT32. Converters work when evolving schema over cuDF
// table read from Parquet file. https://github.com/NVIDIA/spark-rapids/issues/5445
if (dt == DataTypes.ByteType || dt == DataTypes.ShortType || dt == DataTypes.DateType) {
I added downcast converters for INT32 in this PR to close #5445, since the unit-test cases for parquet writing fail if these combinations are disabled.
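For reference, a minimal sketch of what such an INT32 downcast could look like using the cudf Java API (ai.rapids.cudf); the helper name and structure here are illustrative, not the PR's actual code:

import ai.rapids.cudf.{ColumnVector, DType}
import org.apache.spark.sql.types.{ByteType, DataType, DateType, ShortType}

// Hypothetical helper: downcast an INT32 column read from Parquet to the
// narrower type requested by the Spark read schema.
def downcastInt32(col: ColumnVector, dt: DataType): ColumnVector = dt match {
  case ByteType  => col.castTo(DType.INT8)            // int -> byte
  case ShortType => col.castTo(DType.INT16)           // int -> short
  case DateType  => col.castTo(DType.TIMESTAMP_DAYS)  // int -> date (days since epoch)
  case other     => throw new IllegalArgumentException(s"unexpected target type: $other")
}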
build
restore the approval
Fixes #5200 #5445
Currently, there is no check of schema compatibility between the file schema (Parquet types) and the read schema (Spark types) when building GPU parquet readers. As a result, reading parquet data with incompatible Spark types (for example, reading an int column as long) is actually allowed on the GPU, which leads to a lot of undefined behavior.
The purpose of this PR is to add that schema check, mirroring the checking process of Spark. Converters downcasting INT32 to Byte/Short/Date are added in this PR as well.
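As a rough illustration of the idea (a simplified sketch, not the PR's exact code), the check walks the file schema and the read schema together and rejects unsupported combinations before any data is read:

import org.apache.parquet.schema.PrimitiveType
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName._
import org.apache.spark.sql.types._

// Simplified: verify that a Parquet primitive column can be read as the
// requested Spark type; the real check covers many more combinations.
def checkPrimitive(pt: PrimitiveType, dt: DataType): Unit = {
  val compatible = (pt.getPrimitiveTypeName, dt) match {
    case (INT32, IntegerType | ByteType | ShortType | DateType) => true
    case (INT64, LongType) => true
    case (FLOAT, FloatType) => true
    case (DOUBLE, DoubleType) => true
    case (BOOLEAN, BooleanType) => true
    case _ => false
  }
  if (!compatible) {
    throw new RuntimeException(
      s"Unable to read Parquet ${pt.getPrimitiveTypeName} column as Spark type $dt")
  }
}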
Note: This PR uses some deprecated parquet-mr APIs in order to accommodate Spark 3.1.
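(As a concrete illustration of that note, and an assumption on my part about which APIs are meant: parquet-mr deprecated OriginalType in favor of LogicalTypeAnnotation, but LogicalTypeAnnotation only arrived in parquet-mr 1.11, while Spark 3.1 still bundles an older parquet-mr, so code that must run there falls back to the deprecated accessor:)

import org.apache.parquet.schema.{OriginalType, PrimitiveType}

// getOriginalType is deprecated in newer parquet-mr, but it is the variant
// available across the parquet-mr versions shipped with Spark 3.1.
def isDate(pt: PrimitiveType): Boolean =
  pt.getOriginalType == OriginalType.DATE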
Signed-off-by: sperlingxx <lovedreamf@gmail.com>