
[SPARK-10063][SQL] Remove DirectParquetOutputCommitter #12229

Closed
wants to merge 2 commits

Conversation

@rxin (Contributor) commented Apr 7, 2016

What changes were proposed in this pull request?

This patch removes DirectParquetOutputCommitter. This was initially created by Databricks as a faster way to write Parquet data to S3. However, given how the underlying S3 Hadoop implementation behaves, this committer is only safe when there are no failures. If there are multiple attempts of the same task (e.g. due to speculation, task failures, or node failures), the output data can be corrupted. I don't think this performance optimization outweighs the correctness issue.
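For context, users opted into this committer in Spark 1.x through a SQL configuration; the sketch below shows the general shape of that opt-in (the fully qualified class name is illustrative only, since its package moved between 1.x releases):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Spark 1.6-style setup (illustrative); SQLContext was the entry point before SparkSession.
val sc = new SparkContext(new SparkConf().setAppName("direct-committer-example"))
val sqlContext = new SQLContext(sc)

// Opt in to the committer this patch removes. Because it writes directly to the final
// output location instead of a temporary attempt directory, any second attempt of a task
// (speculation, task failure, node failure) can leave corrupted or partial output.
sqlContext.setConf(
  "spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter")
```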

How was this patch tested?

Removed the related tests also.

@rxin (Contributor, Author) commented Apr 7, 2016

cc @davies

@davies (Contributor) commented Apr 7, 2016

LGTM

@SparkQA commented Apr 7, 2016

Test build #55182 has finished for PR 12229 at commit 8719c26.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 7, 2016

Test build #55192 has finished for PR 12229 at commit c5de86b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor, Author) commented Apr 7, 2016

Merging in master.

@asfgit closed this in 9ca0760 on Apr 7, 2016
@@ -129,16 +129,17 @@ private[sql] abstract class BaseWriterContainer(
       outputWriterFactory.newInstance(path, bucketId, dataSchema, taskAttemptContext)
     } catch {
       case e: org.apache.hadoop.fs.FileAlreadyExistsException =>
-        if (outputCommitter.isInstanceOf[parquet.DirectParquetOutputCommitter]) {
+        if (outputCommitter.getClass.getName.contains("Direct")) {
           // Spark-11382: DirectParquetOutputCommitter is not idempotent, meaning on retry
Review comment from a Contributor on the changed line above:
This is pretty brittle/ugly. Is there any other way, such as having an interface covering commit semantics, so that the code could do .isInstanceOf[NonAtomicCommitter]?
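A minimal sketch of that marker-trait idea, under the assumption that nothing like it exists in Spark; NonAtomicCommitter and SomeDirectCommitter are hypothetical names used only for illustration:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.{OutputCommitter, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter

// Hypothetical marker trait: mixed in by committers whose commit step is not atomic
// (e.g. they write straight to the final output location), so task retries are unsafe.
trait NonAtomicCommitter { self: OutputCommitter => }

// A direct-style committer would declare its semantics by mixing the trait in.
class SomeDirectCommitter(outputPath: Path, context: TaskAttemptContext)
  extends FileOutputCommitter(outputPath, context) with NonAtomicCommitter

// The string check in the diff above could then become a type check:
def retryIsUnsafe(committer: OutputCommitter): Boolean =
  committer.isInstanceOf[NonAtomicCommitter]
```

One trade-off is that such a trait only helps for committer implementations that can be modified to mix it in; a name-based check may have been the pragmatic choice for committers outside Spark's control.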

@mortada (Contributor) commented Aug 22, 2016

@rxin so it seems that DirectParquetOutputCommitter has been removed in Spark 2.0; is there a recommended replacement?

(I'm in the process of migrating from Spark 1.6 to 2.0.)
