Fix count() in avro failed when reader_types is coalescing #6225

thirtiseven · 2022-08-04T08:42:46Z

`spark.read.format("avro").load(data_path).count()` fails with `QueryExecutionException: Expected 0 columns but read 8 from ArrayBuffer` when reader_types is COALESCING. This happens because the Avro reader always uses the schema from the file when doing a coalescing read, so it always reads every column of data, even when the query needs none.

This PR uses `readDataSchema` instead of the schema from the file as a quick fix for this bug.
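The mismatch described above can be illustrated with a small pure-Python model. This is only a sketch under stated assumptions: `read_batch`, `check_columns`, and the 8-column `FILE_SCHEMA` are hypothetical names made up for illustration and are not part of the plugin. It shows why a `count()`, which requests zero columns, trips the column check when the reader is handed the file schema, and why passing the read schema avoids it.

```python
# Hypothetical model of the coalescing read path; none of these names
# come from the plugin itself.
FILE_SCHEMA = [f"c{i}" for i in range(8)]  # assume the file has 8 columns


def read_batch(rows, schema_to_read):
    """Return each row restricted to the columns named in schema_to_read."""
    return [{name: row[name] for name in schema_to_read} for row in rows]


def check_columns(batch, expected_schema):
    """Fail if the batch carries a different number of columns than expected."""
    actual = len(batch[0]) if batch else 0
    if actual != len(expected_schema):
        raise RuntimeError(
            f"Expected {len(expected_schema)} columns but read {actual}"
        )
    return len(batch)  # the row count


rows = [{name: i for name in FILE_SCHEMA} for i in range(3)]
read_data_schema = []  # count() does not need any column values

# Before the fix: the reader is given the file schema, so it returns 8 columns.
buggy_error = None
try:
    check_columns(read_batch(rows, FILE_SCHEMA), read_data_schema)
except RuntimeError as err:
    buggy_error = str(err)

# After the fix: the reader is given the read schema, so it returns 0 columns.
fixed_count = check_columns(read_batch(rows, read_data_schema), read_data_schema)
```

In this model the row count survives even though no column is materialized, which is the behavior `count()` relies on.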

Longer term, we should build an evolved schema from readschema and dataSchema to do type checking and filtering, just as the ORC and Parquet readers do. Because readschema is also currently used when reading files with cudf, that change is somewhat more involved. Filed follow-up issue #6226 to track it.
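One possible reading of "evolved schema" is sketched below, purely as an assumption: the source does not describe how the ORC and Parquet readers actually build theirs. The function `build_evolved_schema`, the dict-based schema representation, and the small `ALLOWED_PROMOTIONS` table are all hypothetical. The idea shown is to keep only the requested fields that exist in the file (filtering) and to reject a field whose file type cannot be read as the requested type (type checking).

```python
# Illustrative subset of type promotions (file type -> requested type).
# This table is an assumption for the sketch, not the plugin's rule set.
ALLOWED_PROMOTIONS = {("int", "long"), ("float", "double")}


def build_evolved_schema(read_schema, file_schema):
    """Hypothetical helper: intersect the requested fields with the file's.

    Both schemas are plain dicts mapping field name -> type name.
    The result keeps the file's types, since that is what is read from disk.
    """
    evolved = {}
    for name, wanted_type in read_schema.items():
        if name not in file_schema:
            # Field is absent from this file; the caller would fill it later.
            continue
        file_type = file_schema[name]
        compatible = (
            file_type == wanted_type
            or (file_type, wanted_type) in ALLOWED_PROMOTIONS
        )
        if not compatible:
            raise TypeError(
                f"Field {name!r}: cannot read {file_type} as {wanted_type}"
            )
        evolved[name] = file_type
    return evolved


file_schema = {"a": "int", "b": "string"}
evolved = build_evolved_schema({"a": "long", "z": "string"}, file_schema)
empty = build_evolved_schema({}, file_schema)  # the count() case
```

With an empty read schema the evolved schema is also empty, so the `count()` case from this PR would fall out of the same logic.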

Fixes #6131

Signed-off-by: thirtiseven <ntlihy@gmail.com>

thirtiseven · 2022-08-04T08:56:32Z

build

firestarman

Better to file a follow-up issue for building the evolved schema you mentioned.

integration_tests/src/main/python/avro_test.py

Signed-off-by: thirtiseven <ntlihy@gmail.com>

thirtiseven · 2022-08-04T09:21:37Z

Filed a follow-up issue: #6226.

sameerz · 2022-08-04T17:46:33Z

@GaryShen2008 is this for 22.08? If so, it will need to be approved and merged by Friday Aug 5.

firestarman

LGTM

Fix count() in avro failed when reader_types is coalescing

891187f

Signed-off-by: thirtiseven <ntlihy@gmail.com>

firestarman reviewed Aug 4, 2022

integration_tests/src/main/python/avro_test.py (outdated)

simplify test

162516b

Signed-off-by: thirtiseven <ntlihy@gmail.com>

sameerz added the bug Something isn't working label Aug 4, 2022

thirtiseven requested a review from firestarman August 5, 2022 01:10

firestarman approved these changes Aug 5, 2022

thirtiseven marked this pull request as ready for review August 5, 2022 01:51

thirtiseven merged commit bea7fc6 into NVIDIA:branch-22.08 Aug 5, 2022

thirtiseven self-assigned this Aug 24, 2022
