Check schema compatibility when building parquet readers #5434
Conversation
Signed-off-by: sperlingxx <lovedreamf@gmail.com>
LGTM
build
// }
// TODO: Add below converters for INT32. Converters work when evolving schema over cuDF
// table read from Parquet file. https://github.com/NVIDIA/spark-rapids/issues/5445
if (dt == DataTypes.ByteType || dt == DataTypes.ShortType || dt == DataTypes.DateType) {
I added downcast converters for INT32 in this PR to close #5445, since the unit-test cases for parquet writing fail if these combinations are disabled.
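For reference, a minimal sketch of what such an INT32 downcast could look like using the cudf Java API (ai.rapids.cudf); the helper name and structure here are illustrative, not the PR's actual code:

import ai.rapids.cudf.{ColumnVector, DType}
import org.apache.spark.sql.types.{ByteType, DataType, DateType, ShortType}

// Hypothetical helper: downcast an INT32 column read from Parquet to the
// narrower type requested by the Spark read schema.
def downcastInt32(col: ColumnVector, dt: DataType): ColumnVector = dt match {
  case ByteType  => col.castTo(DType.INT8)            // int -> byte
  case ShortType => col.castTo(DType.INT16)           // int -> short
  case DateType  => col.castTo(DType.TIMESTAMP_DAYS)  // int -> date (days since epoch)
  case other     => throw new IllegalArgumentException(s"unexpected target type: $other")
}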
build
restore the approval
Fixes #5200 #5445
Currently, there is no check of schema compatibility between the file schema (Parquet types) and the read schema (Spark types) when building GPU parquet readers. As a result, reading parquet data with incompatible Spark types (for example, reading an int column as long) is actually allowed on the GPU, which leads to a lot of undefined behavior.
The purpose of this PR is to add that schema check, mirroring the checking process of Spark. Converters downcasting INT32 to Byte/Short/Date are added in this PR as well.
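As a rough illustration of the idea (a simplified sketch, not the PR's exact code), the check walks the file schema and the read schema together and rejects unsupported combinations before any data is read:

import org.apache.parquet.schema.PrimitiveType
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName._
import org.apache.spark.sql.types._

// Simplified: verify that a Parquet primitive column can be read as the
// requested Spark type; the real check covers many more combinations.
def checkPrimitive(pt: PrimitiveType, dt: DataType): Unit = {
  val compatible = (pt.getPrimitiveTypeName, dt) match {
    case (INT32, IntegerType | ByteType | ShortType | DateType) => true
    case (INT64, LongType) => true
    case (FLOAT, FloatType) => true
    case (DOUBLE, DoubleType) => true
    case (BOOLEAN, BooleanType) => true
    case _ => false
  }
  if (!compatible) {
    throw new RuntimeException(
      s"Unable to read Parquet ${pt.getPrimitiveTypeName} column as Spark type $dt")
  }
}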
Note: This PR uses some deprecated parquet-mr APIs in order to accommodate Spark 3.1.
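(As a concrete illustration of that note, and an assumption on my part about which APIs are meant: parquet-mr deprecated OriginalType in favor of LogicalTypeAnnotation, but LogicalTypeAnnotation only arrived in parquet-mr 1.11, while Spark 3.1 still bundles an older parquet-mr, so code that must run there falls back to the deprecated accessor:)

import org.apache.parquet.schema.{OriginalType, PrimitiveType}

// getOriginalType is deprecated in newer parquet-mr, but it is the variant
// available across the parquet-mr versions shipped with Spark 3.1.
def isDate(pt: PrimitiveType): Boolean =
  pt.getOriginalType == OriginalType.DATE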
Signed-off-by: sperlingxx <lovedreamf@gmail.com>