Fix count() in avro failed when reader_types is coalescing #6225

Merged
merged 2 commits into from
Aug 5, 2022

@thirtiseven thirtiseven commented Aug 4, 2022

spark.read.format("avro").load(data_path).count() fails with QueryExecutionException: Expected 0 columns but read 8 from ArrayBuffer when reader_types is COALESCING. The cause is that the Avro reader always uses the schema taken from a file when doing a coalescing read, so it always reads every column of data.

This PR uses readDataSchema instead of the schema from a file as a quick fix for this bug.
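
The mismatch can be illustrated with a small toy model. This is only a sketch in plain Python, not the plugin's code: `Field`, `read_columns`, `check_column_count`, and the 8-column `FILE_SCHEMA` are hypothetical names made up for illustration, and the error text is modeled on the message reported above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    dtype: str


# Hypothetical file with 8 columns, matching the "read 8" in the error.
FILE_SCHEMA = [Field(f"c{i}", "int") for i in range(8)]

# count() needs no column data, so the schema Spark asks for is empty.
READ_DATA_SCHEMA = []


def read_columns(file_schema, requested_schema):
    """Return the names of the columns a reader would materialize."""
    file_names = {f.name for f in file_schema}
    return [f.name for f in requested_schema if f.name in file_names]


def check_column_count(expected_schema, columns_read):
    """Fail when the reader returned a different number of columns than expected."""
    if len(columns_read) != len(expected_schema):
        raise ValueError(
            f"Expected {len(expected_schema)} columns but read {len(columns_read)}"
        )
    return columns_read


def coalescing_read_buggy():
    # Models the old behavior: the file's own schema drives the read.
    cols = read_columns(FILE_SCHEMA, FILE_SCHEMA)
    return check_column_count(READ_DATA_SCHEMA, cols)


def coalescing_read_fixed():
    # Models the quick fix: the requested read schema drives the read.
    cols = read_columns(FILE_SCHEMA, READ_DATA_SCHEMA)
    return check_column_count(READ_DATA_SCHEMA, cols)
```

In this model `coalescing_read_buggy()` raises because 8 columns come back where 0 were expected, while `coalescing_read_fixed()` returns an empty column list.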

Going further, we should build an evolved schema from readschema and dataSchema to do type checking and filtering, just as the ORC and Parquet readers do. We currently also use readschema when reading files with cudf, so making this change will be somewhat complicated. Filed follow-up issue #6226 to track it.
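
As a rough idea of what building an evolved schema could involve, here is a minimal Python sketch. It is an assumption-heavy illustration, not how the ORC or Parquet readers actually do it: `evolve_schema` is a hypothetical helper, schemas are modeled as lists of `(name, type)` pairs, and the type check is plain equality with no promotion rules.

```python
def evolve_schema(read_schema, file_schema):
    """Keep the requested columns that exist in the file, checking their types.

    Both schemas are lists of (name, type) pairs. Columns requested but
    absent from the file are filtered out; a type mismatch is an error.
    """
    file_types = dict(file_schema)
    evolved = []
    for name, dtype in read_schema:
        if name not in file_types:
            # Filtering: the file has no such column, so there is nothing to read.
            continue
        if file_types[name] != dtype:
            # Type checking: simplified to strict equality for this sketch.
            raise TypeError(
                f"column {name!r}: requested {dtype}, file has {file_types[name]}"
            )
        evolved.append((name, dtype))
    return evolved
```

With an empty read schema, as in the count() case, the evolved schema is also empty, so no columns would be read from the file.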

Fixes #6131

Signed-off-by: thirtiseven ntlihy@gmail.com

Signed-off-by: thirtiseven <ntlihy@gmail.com>
@thirtiseven
Collaborator Author

build

Collaborator

@firestarman firestarman left a comment

Better to file a follow-up issue for building the evolved schema you mentioned.

Signed-off-by: thirtiseven <ntlihy@gmail.com>
@thirtiseven
Collaborator Author

Filed a follow-up issue: #6226.

@sameerz sameerz added the bug Something isn't working label Aug 4, 2022
@sameerz
Collaborator

sameerz commented Aug 4, 2022

@GaryShen2008 is this for 22.08? If so, it will need to be approved and merged by Friday Aug 5.

@thirtiseven thirtiseven requested a review from firestarman August 5, 2022 01:10
Collaborator

@firestarman firestarman left a comment

LGTM

@thirtiseven thirtiseven marked this pull request as ready for review August 5, 2022 01:51
@thirtiseven thirtiseven merged commit bea7fc6 into NVIDIA:branch-22.08 Aug 5, 2022
@thirtiseven thirtiseven self-assigned this Aug 24, 2022
Labels
bug Something isn't working
Development

Successfully merging this pull request may close these issues.

[BUG] count() in avro failed when reader_types is coalescing
3 participants