
[SPARK-10063][SQL] Remove DirectParquetOutputCommitter #12229

Closed
wants to merge 2 commits

Conversation

@rxin (Contributor) commented Apr 7, 2016

What changes were proposed in this pull request?

This patch removes DirectParquetOutputCommitter. This was initially created by Databricks as a faster way to write Parquet data to S3. However, given how the underlying S3 Hadoop implementation behaves, this committer is only safe when there are no failures. If there are multiple attempts of the same task (e.g. due to speculation, task failures, or node failures), the output data can be corrupted. I don't think this performance optimization outweighs the correctness issue.
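For context, users opted into this committer in Spark 1.x through a SQL configuration; the sketch below shows the general shape of that opt-in (the fully qualified class name is illustrative only, since its package moved between 1.x releases):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Spark 1.6-style setup (illustrative); SQLContext was the entry point before SparkSession.
val sc = new SparkContext(new SparkConf().setAppName("direct-committer-example"))
val sqlContext = new SQLContext(sc)

// Opt in to the committer this patch removes. Because it writes directly to the final
// output location instead of a temporary attempt directory, any second attempt of a task
// (speculation, task failure, node failure) can leave corrupted or partial output.
sqlContext.setConf(
  "spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter")
```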

How was this patch tested?

Removed the related tests also.

@rxin (Contributor, Author) commented Apr 7, 2016

cc @davies

@davies (Contributor) commented Apr 7, 2016

LGTM

@SparkQA commented Apr 7, 2016

Test build #55182 has finished for PR 12229 at commit 8719c26.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 7, 2016

Test build #55192 has finished for PR 12229 at commit c5de86b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor, Author) commented Apr 7, 2016

Merging in master.

@asfgit closed this in 9ca0760 on Apr 7, 2016
@@ -129,16 +129,17 @@ private[sql] abstract class BaseWriterContainer(
       outputWriterFactory.newInstance(path, bucketId, dataSchema, taskAttemptContext)
     } catch {
       case e: org.apache.hadoop.fs.FileAlreadyExistsException =>
-        if (outputCommitter.isInstanceOf[parquet.DirectParquetOutputCommitter]) {
+        if (outputCommitter.getClass.getName.contains("Direct")) {
           // Spark-11382: DirectParquetOutputCommitter is not idempotent, meaning on retry
Review comment from a Contributor on the changed line above:
This is pretty brittle/ugly. Is there any other way, such as having an interface covering commit semantics, so that the code could do .isInstanceOf[NonAtomicCommitter]?
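A minimal sketch of that marker-trait idea, under the assumption that nothing like it exists in Spark; NonAtomicCommitter and SomeDirectCommitter are hypothetical names used only for illustration:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.{OutputCommitter, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter

// Hypothetical marker trait: mixed in by committers whose commit step is not atomic
// (e.g. they write straight to the final output location), so task retries are unsafe.
trait NonAtomicCommitter { self: OutputCommitter => }

// A direct-style committer would declare its semantics by mixing the trait in.
class SomeDirectCommitter(outputPath: Path, context: TaskAttemptContext)
  extends FileOutputCommitter(outputPath, context) with NonAtomicCommitter

// The string check in the diff above could then become a type check:
def retryIsUnsafe(committer: OutputCommitter): Boolean =
  committer.isInstanceOf[NonAtomicCommitter]
```

One trade-off is that such a trait only helps for committer implementations that can be modified to mix it in; a name-based check may have been the pragmatic choice for committers outside Spark's control.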

@mortada (Contributor) commented Aug 22, 2016

@rxin so it seems that DirectParquetOutputCommitter has been removed in Spark 2.0; is there a recommended replacement?

(I'm in the process of migrating from Spark 1.6 to 2.0.)
