
[SPARK-25595] Ignore corrupt Avro files if flag IGNORE_CORRUPT_FILES enabled #22611

Closed · 3 commits

Conversation

@gengliangwang (Member) commented Oct 2, 2018

What changes were proposed in this pull request?

With the flag IGNORE_CORRUPT_FILES enabled, schema inference should ignore corrupt Avro files, which is consistent with the Parquet and ORC data sources.

How was this patch tested?

Unit test
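The pattern described above can be sketched outside Spark. This is a minimal, self-contained illustration (not Spark's actual code; `inferSchema` and the fake reader are hypothetical names): try each candidate file in turn, and when the flag is enabled, swallow the IOException from a corrupt file and move on to the next one instead of failing the whole query.

```scala
import java.io.IOException

// Hedged sketch of the IGNORE_CORRUPT_FILES behavior: attempt each file,
// skip unreadable ones when the flag is set, rethrow when it is not.
def inferSchema(files: Seq[String], ignoreCorruptFiles: Boolean)(
    read: String => String): Option[String] = {
  var result: Option[String] = None
  for (f <- files if result.isEmpty) {
    try result = Some(read(f))
    catch {
      // Corrupt file: ignore it and keep looking for a readable one.
      case _: IOException if ignoreCorruptFiles => ()
    }
  }
  result
}

// Fake reader standing in for real Avro schema reading: only ".avro"
// files are "readable" in this toy setup.
val readFake: String => String = f =>
  if (f.endsWith(".avro")) "schema-of-" + f
  else throw new IOException("corrupt: " + f)

// With the flag on, the corrupt file is skipped and inference succeeds.
assert(inferSchema(Seq("corrupt.txt", "users.avro"), ignoreCorruptFiles = true)(readFake)
  == Some("schema-of-users.avro"))
```

With the flag off, the same corrupt file makes inference throw, which mirrors the error-out behavior the PR's negative test exercises.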

SparkQA commented Oct 2, 2018

Test build #96850 has finished for PR 22611 at commit e96ea20.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 2, 2018

Test build #96851 has finished for PR 22611 at commit 404b1a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

files: Seq[FileStatus],
conf: Configuration,
ignoreExtension: Boolean): Schema = {
val ignoreCorruptFiles = SQLConf.get.ignoreCorruptFiles
@HyukjinKwon (Member) commented Oct 3, 2018

How about matching it to sparkSession.sessionState.conf.ignoreCorruptFiles like other occurrences?

if (!ignoreExtension && !path.getName.endsWith(".avro")) {
None
} else {
val in = new FsInput(path, conf)

Not a big deal, but we can use Utils.tryWithResource.

val corruptFile = new File(dir, "corrupt.avro")
val writer = new BufferedWriter(new FileWriter(corruptFile))
writer.write("corrupt")
writer.close()

ditto for tryWithResource
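The suggestion in both comments is the loan pattern. A minimal sketch of what a `tryWithResource` helper does (the signature here is assumed for illustration, not copied from Spark's `Utils`): create the resource, hand it to a function, and guarantee `close()` runs even if the function throws.

```scala
import java.io.{BufferedWriter, Closeable, File, FileWriter}

// Loan-pattern helper: the resource is always closed, success or failure.
def tryWithResource[R <: Closeable, T](createResource: => R)(f: R => T): T = {
  val resource = createResource
  try f(resource) finally resource.close()
}

// The test helper above could then drop its explicit writer.close():
val corruptFile = File.createTempFile("corrupt", ".avro")
tryWithResource(new BufferedWriter(new FileWriter(corruptFile))) { writer =>
  writer.write("corrupt")
}
assert(scala.io.Source.fromFile(corruptFile).mkString == "corrupt")
corruptFile.delete()
```

This removes the chance of leaking a file handle when `write` throws, which is exactly why the reviewer prefers it over a bare `writer.close()`.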

val schema = df.schema
val result = df.collect()
// Schema inference picks a random readable sample file.
// Here we use a loop to eliminate the randomness.
@HyukjinKwon (Member) commented Oct 3, 2018

Actually, I don't think there's randomness in this test. HDFS lists files in alphabetical order under the hood, although as far as I know that isn't guaranteed. The picking order here, at least, is deterministic.

withTempPath { dir =>
createDummyCorruptFile(dir)
val message = intercept[org.apache.spark.SparkException] {
spark.read.format("avro").load(dir.getAbsolutePath).schema

`.schema` probably wouldn't be needed.

@HyukjinKwon (Member) left a comment

LGTM otherwise


SparkQA commented Oct 3, 2018

Test build #96891 has finished for PR 22611 at commit 692334a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang (Member, Author)

@HyukjinKwon Thanks for the review :)

@HyukjinKwon (Member)

Merged to master.

@asfgit closed this in 928d073 on Oct 3, 2018
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…enabled

## What changes were proposed in this pull request?

With flag `IGNORE_CORRUPT_FILES` enabled, schema inference should ignore corrupt Avro files, which is consistent with Parquet and Orc data source.

## How was this patch tested?

Unit test

Closes apache#22611 from gengliangwang/ignoreCorruptAvro.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
gengliangwang added a commit to gengliangwang/spark that referenced this pull request Apr 25, 2020
…enabled

otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
…enabled

(cherry picked from commit 928d073)

RB=1517504
BUG=LIHADOOP-43202
R=fli,mshen,yezhou,edlu
A=fli
3 participants