
[SPARK-25595] Ignore corrupt Avro files if flag IGNORE_CORRUPT_FILES enabled #22611

Closed · 3 commits

Conversation

@gengliangwang (Member) commented Oct 2, 2018

What changes were proposed in this pull request?

With the flag IGNORE_CORRUPT_FILES enabled, schema inference should ignore corrupt Avro files, which is consistent with the Parquet and ORC data sources.

How was this patch tested?

Unit test
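The pattern described above can be sketched outside Spark. This is a minimal, self-contained illustration (not Spark's actual code; `inferSchema` and the fake reader are hypothetical names): try each candidate file in turn, and when the flag is enabled, swallow the IOException from a corrupt file and move on to the next one instead of failing the whole query.

```scala
import java.io.IOException

// Hedged sketch of the IGNORE_CORRUPT_FILES behavior: attempt each file,
// skip unreadable ones when the flag is set, rethrow when it is not.
def inferSchema(files: Seq[String], ignoreCorruptFiles: Boolean)(
    read: String => String): Option[String] = {
  var result: Option[String] = None
  for (f <- files if result.isEmpty) {
    try result = Some(read(f))
    catch {
      // Corrupt file: ignore it and keep looking for a readable one.
      case _: IOException if ignoreCorruptFiles => ()
    }
  }
  result
}

// Fake reader standing in for real Avro schema reading: only ".avro"
// files are "readable" in this toy setup.
val readFake: String => String = f =>
  if (f.endsWith(".avro")) "schema-of-" + f
  else throw new IOException("corrupt: " + f)

// With the flag on, the corrupt file is skipped and inference succeeds.
assert(inferSchema(Seq("corrupt.txt", "users.avro"), ignoreCorruptFiles = true)(readFake)
  == Some("schema-of-users.avro"))
```

With the flag off, the same corrupt file makes inference throw, which mirrors the error-out behavior the PR's negative test exercises.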

SparkQA commented Oct 2, 2018

Test build #96850 has finished for PR 22611 at commit e96ea20.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 2, 2018

Test build #96851 has finished for PR 22611 at commit 404b1a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

files: Seq[FileStatus],
conf: Configuration,
ignoreExtension: Boolean): Schema = {
val ignoreCorruptFiles = SQLConf.get.ignoreCorruptFiles
@HyukjinKwon (Member) commented Oct 3, 2018

How about matching it to sparkSession.sessionState.conf.ignoreCorruptFiles like other occurrences?

if (!ignoreExtension && !path.getName.endsWith(".avro")) {
None
} else {
val in = new FsInput(path, conf)

Not a big deal, but we can use Utils.tryWithResource.

val corruptFile = new File(dir, "corrupt.avro")
val writer = new BufferedWriter(new FileWriter(corruptFile))
writer.write("corrupt")
writer.close()

ditto for tryWithResource
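The suggestion in both comments is the loan pattern. A minimal sketch of what a `tryWithResource` helper does (the signature here is assumed for illustration, not copied from Spark's `Utils`): create the resource, hand it to a function, and guarantee `close()` runs even if the function throws.

```scala
import java.io.{BufferedWriter, Closeable, File, FileWriter}

// Loan-pattern helper: the resource is always closed, success or failure.
def tryWithResource[R <: Closeable, T](createResource: => R)(f: R => T): T = {
  val resource = createResource
  try f(resource) finally resource.close()
}

// The test helper above could then drop its explicit writer.close():
val corruptFile = File.createTempFile("corrupt", ".avro")
tryWithResource(new BufferedWriter(new FileWriter(corruptFile))) { writer =>
  writer.write("corrupt")
}
assert(scala.io.Source.fromFile(corruptFile).mkString == "corrupt")
corruptFile.delete()
```

This removes the chance of leaking a file handle when `write` throws, which is exactly why the reviewer prefers it over a bare `writer.close()`.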

val schema = df.schema
val result = df.collect()
// Schema inference picks a random readable sample file.
// Here we use a loop to eliminate the randomness.
@HyukjinKwon (Member) commented Oct 3, 2018

Actually, I don't think there's randomness in this test. HDFS lists files in alphabetical order under the hood, although as far as I know that isn't guaranteed. The picking order here, at least, is deterministic.

withTempPath { dir =>
createDummyCorruptFile(dir)
val message = intercept[org.apache.spark.SparkException] {
spark.read.format("avro").load(dir.getAbsolutePath).schema

`.schema` probably wouldn't be needed.

@HyukjinKwon (Member) left a comment

LGTM otherwise


SparkQA commented Oct 3, 2018

Test build #96891 has finished for PR 22611 at commit 692334a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang (Member, Author)

@HyukjinKwon Thanks for the review :)

@HyukjinKwon (Member)

Merged to master.

@asfgit closed this in 928d073 on Oct 3, 2018
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…enabled

## What changes were proposed in this pull request?

With flag `IGNORE_CORRUPT_FILES` enabled, schema inference should ignore corrupt Avro files, which is consistent with Parquet and Orc data source.

## How was this patch tested?

Unit test

Closes apache#22611 from gengliangwang/ignoreCorruptAvro.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
gengliangwang added a commit to gengliangwang/spark that referenced this pull request Apr 25, 2020
…enabled

otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
…enabled

(cherry picked from commit 928d073)

RB=1517504
BUG=LIHADOOP-43202
R=fli,mshen,yezhou,edlu
A=fli
3 participants