[SPARK-25595] Ignore corrupt Avro files if flag IGNORE_CORRUPT_FILES enabled #22611
Conversation
Test build #96850 has finished for PR 22611 at commit
Test build #96851 has finished for PR 22611 at commit
    files: Seq[FileStatus],
    conf: Configuration,
    ignoreExtension: Boolean): Schema = {
  val ignoreCorruptFiles = SQLConf.get.ignoreCorruptFiles
How about matching it to sparkSession.sessionState.conf.ignoreCorruptFiles, like the other occurrences?
if (!ignoreExtension && !path.getName.endsWith(".avro")) {
  None
} else {
  val in = new FsInput(path, conf)
Not a big deal, but we can use Utils.tryWithResource.
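The helper mentioned above is Spark's internal `org.apache.spark.util.Utils.tryWithResource`, which guarantees the resource is closed even when the body throws. A minimal sketch of the pattern (this is a simplified stand-in for illustration, not Spark's actual implementation), applied to the `FsInput` in the snippet under review:

```scala
// Simplified stand-in for org.apache.spark.util.Utils.tryWithResource:
// acquire a resource, run the body, and close the resource even on failure.
def tryWithResource[R <: AutoCloseable, T](createResource: => R)(f: R => T): T = {
  val resource = createResource
  try f(resource) finally resource.close()
}

// Applied to the reviewed snippet, the FsInput would be closed even when
// opening a corrupt Avro file throws (identifiers sketched from the diff):
//
// val schema = tryWithResource(new FsInput(path, conf)) { in =>
//   DataFileReader.openReader(in, new GenericDatumReader[GenericRecord]()).getSchema
// }
```

On newer Scala versions, `scala.util.Using` provides the same pattern in the standard library.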
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala
val corruptFile = new File(dir, "corrupt.avro")
val writer = new BufferedWriter(new FileWriter(corruptFile))
writer.write("corrupt")
writer.close()
ditto for tryWithResource
val schema = df.schema
val result = df.collect()
// Schema inference picks random readable sample file.
// Here we use a loop to eliminate randomness.
Actually, I don't think it's randomness in this test. HDFS lists files in alphabetical order under the hood, although as far as I know that's not guaranteed. The picking order here is at least deterministic.
withTempPath { dir =>
  createDummyCorruptFile(dir)
  val message = intercept[org.apache.spark.SparkException] {
    spark.read.format("avro").load(dir.getAbsolutePath).schema
`.schema` probably wouldn't be needed.
LGTM otherwise
Test build #96891 has finished for PR 22611 at commit
@HyukjinKwon Thanks for the review :)
Merged to master.
…enabled

## What changes were proposed in this pull request?
With flag `IGNORE_CORRUPT_FILES` enabled, schema inference should ignore corrupt Avro files, which is consistent with the Parquet and ORC data sources.

## How was this patch tested?
Unit test.

Closes apache#22611 from gengliangwang/ignoreCorruptAvro.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
What changes were proposed in this pull request?

With flag `IGNORE_CORRUPT_FILES` enabled, schema inference should ignore corrupt Avro files, which is consistent with the Parquet and ORC data sources.

How was this patch tested?

Unit test.
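The behavior this PR adds can be sketched from the user's side as follows. This is a hypothetical usage example assuming an active SparkSession and an example directory path; `spark.sql.files.ignoreCorruptFiles` is the SQL configuration backing the `IGNORE_CORRUPT_FILES` flag:

```scala
// Hypothetical sketch: with ignoreCorruptFiles enabled, schema inference over
// a directory containing a corrupt .avro file skips the unreadable file and
// infers the schema from the remaining readable files. With the flag disabled,
// load() throws a SparkException during schema inference instead.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
val df = spark.read.format("avro").load("/path/to/dir-with-a-corrupt-file")
df.printSchema()  // schema inferred from the readable files only
```

This matches the existing behavior of the Parquet and ORC data sources under the same flag, as the PR description notes.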