
[SPARK-41708][SQL] Pull v1write information to WriteFiles #39277

Closed
wants to merge 6 commits

Conversation

ulysses-you (Contributor, Author) commented Dec 29, 2022

What changes were proposed in this pull request?

This PR pulls the v1 write information out of V1WriteCommand and into WriteFiles:

case class WriteFiles(child: LogicalPlan)

=>

case class WriteFiles(
    child: LogicalPlan,
    fileFormat: FileFormat,
    partitionColumns: Seq[Attribute],
    bucketSpec: Option[BucketSpec],
    options: Map[String, String],
    staticPartitions: TablePartitionSpec)

Also, this PR cleans up WriteSpec, which is no longer necessary.

Why are the changes needed?

After this PR, WriteFiles will hold the write information, which can help developers work with v1 writes at the plan level.
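
For illustration, a rule or extension could then read the write spec straight off the plan node. A minimal sketch, assuming the field names proposed above and that the node lives under org.apache.spark.sql.execution.datasources:

import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.datasources.WriteFiles

// Hypothetical helper, not part of this PR: summarize a planned v1 write.
def describeWrite(plan: LogicalPlan): Option[String] = plan match {
  case w: WriteFiles =>
    Some(s"format=${w.fileFormat.getClass.getSimpleName}, " +
      s"partitions=${w.partitionColumns.map(_.name).mkString(",")}, " +
      s"bucketSpec=${w.bucketSpec}, staticPartitions=${w.staticPartitions}")
  case _ => None
}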

Does this PR introduce any user-facing change?

no

How was this patch tested?

Pass CI

@github-actions github-actions bot added the SQL label Dec 29, 2022
@ulysses-you ulysses-you changed the title [SPARK-41708][SQL] Pull v1write information to write file node [SPARK-41708][SQL] Pull v1write information to WriteFiles Dec 30, 2022
* The call side should create `stagingDir` before using `externalTmpPath` and
* delete `stagingDir` at the end.
*/
protected def getExternalTmpPath(
ulysses-you (Author):

This is the key change for Hive insertion. Before, this method had a side effect of creating the stagingDir. Now, it returns two paths: one is the staging dir to be created, and the other is the original externalTmpPath.

try {
  if (!FileUtils.mkdir(fs, dir, true, hadoopConf)) {
    throw new IllegalStateException("Cannot create staging directory '" + dir.toString + "'")
  }
  createdTempDir = Some(dir)
ulysses-you (Author) commented Dec 30, 2022:

The global variable createdTempDir is really a hack. Since we now have the staging dir explicitly, we can pass it to deleteExternalTmpPath, and then we no longer need it.

try {
  processInsert(sparkSession, externalCatalog, hadoopConf, tableDesc, tmpLocation, child)
  processInsert(sparkSession, externalCatalog, child)
ulysses-you (Author):

now the code looks like:

create stagingDir
try {
  processInsert
} finally {
  delete stagingDir
}
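
A slightly more concrete sketch of that caller-side pattern, using only the Hadoop FileSystem API (names and structure assumed for illustration, not the exact merged code):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Sketch only: the caller owns the staging dir lifecycle; the write body only sees externalTmpPath.
def withStagingDir(hadoopConf: Configuration, stagingDir: Path, externalTmpPath: Path)
    (processInsert: Path => Unit): Unit = {
  val fs = stagingDir.getFileSystem(hadoopConf)
  if (!fs.mkdirs(stagingDir)) {
    throw new IllegalStateException(s"Cannot create staging directory '$stagingDir'")
  }
  try {
    processInsert(externalTmpPath)
  } finally {
    fs.delete(stagingDir, true) // best-effort cleanup of the staging dir and its files
  }
}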

ulysses-you (Author):

cc @cloud-fan

    partitionColumns: Seq[Attribute],
    bucketSpec: Option[BucketSpec],
    options: Map[String, String],
    requiredOrdering: Seq[SortOrder]) extends UnaryNode {
Contributor:

This doesn't seem like logical write information, but more like internal information. Do we really need it here?

ulysses-you (Author):

How about pulling out partitionSpec instead? partitionColumns does not contain the information about the insertion's partition spec.
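
For context, a quick illustration of the difference (the SQL and values are hypothetical; TablePartitionSpec is a Map[String, String]):

// INSERT INTO TABLE t PARTITION (p1 = '2022', p2) SELECT ...
// partitionColumns: the partition output attributes, i.e. Seq(p1, p2)
// static partition spec: only the values pinned in the statement
val staticPartitions: Map[String, String] = Map("p1" -> "2022") // p2 remains dynamic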

Contributor:

SGTM

    options: Map[String, String],
    fileFormat: FileFormat,
    externalTmpPath: String,
    @transient stagingDir: Path,
Contributor:

what's the difference between externalTmpPath and stagingDir?

ulysses-you (Author):

For old Hive versions, externalTmpPath and stagingDir are the same.

val hiveVersion = externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog].client.version
val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging")
val scratchDir = hadoopConf.get("hive.exec.scratchdir", "/tmp/hive")
if (hiveVersionsUsingOldExternalTempPath.contains(hiveVersion)) {
  oldVersionExternalTempPath(path, hadoopConf, scratchDir)
} else if (hiveVersionsUsingNewExternalTempPath.contains(hiveVersion)) {
  newVersionExternalTempPath(path, hadoopConf, stagingDir)

For new Hive versions:

private def newVersionExternalTempPath(
    path: Path,
    hadoopConf: Configuration,
    stagingDir: String): Path = {
  val extURI: URI = path.toUri
  if (extURI.getScheme == "viewfs") {
    getExtTmpPathRelTo(path, hadoopConf, stagingDir)
  } else {
    new Path(getExternalScratchDir(extURI, hadoopConf, stagingDir), "-ext-10000")

  • externalTmpPath: new Path(getExternalScratchDir(extURI, hadoopConf, stagingDir), "-ext-10000")
  • stagingDir: getExternalScratchDir(extURI, hadoopConf, stagingDir)
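
So for the new-version layout, the two values typically look like this (paths are illustrative only; the staging dir name comes from hive.exec.stagingdir plus an execution-specific suffix):

import org.apache.hadoop.fs.Path

val stagingDir = new Path("/warehouse/db.db/tbl/.hive-staging_hive_2022-12-30_00-00-00_000_123-1")
val externalTmpPath = new Path(stagingDir, "-ext-10000") // i.e. $stagingDir/-ext-10000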

/**
* Return two paths:
* 1. The first path is `stagingDir` which can be the parent path of `externalTmpPath`
* 2. The second path is `externalTmpPath`, e.g. `$stagingDir/-ext-10000`
Contributor:

Does Hadoop Path provide an API to get the parent? If it does, then we don't need to return two paths.

ulysses-you (Author):

It is not always the parent: for old Hive versions, the two are the same. So if we wanted to return only one path, we would have to check the Hive version again before using it.

val writeFilesSpec: WriteFilesSpec = writeSpec.asInstanceOf[WriteFilesSpec]

override protected def doExecuteWrite(
    writeFilesSpec: WriteFilesSpec): RDD[WriterCommitMessage] = {
Contributor:

Should WriteFilesSpec include less information, since some of it is already available in WriteFilesExec?

ulysses-you (Author):

It seems a bit hard. Look at the current information:

case class WriteFilesSpec(
    description: WriteJobDescription,
    committer: FileCommitProtocol,
    concurrentOutputWriterSpecFunc: SparkPlan => Option[ConcurrentOutputWriterSpec])
  • ConcurrentOutputWriterSpec and FileCommitProtocol contain the output spec, so we cannot replace them.
  • WriteJobDescription contains a lot of information, including what we pull out, but if we want to remove something from it, we would need to create a new class to hold the rest. I'm not sure it's worth doing that.
class WriteJobDescription(
    val uuid: String,
    val serializableHadoopConf: SerializableConfiguration,
    val outputWriterFactory: OutputWriterFactory,
    val allColumns: Seq[Attribute],
    val dataColumns: Seq[Attribute],
    val partitionColumns: Seq[Attribute],
    val bucketSpec: Option[WriterBucketSpec],
    val path: String,
    val customPartitionLocations: Map[TablePartitionSpec, String],
    val maxRecordsPerFile: Long,
    val timeZoneId: String,
    val statsTrackers: Seq[WriteJobStatsTracker])

* 1. The first path is `stagingDir` which can be the parent path of `externalTmpPath`
* 2. The second path is `externalTmpPath`, e.g. `$stagingDir/-ext-10000`
* The call side should create `stagingDir` before using `externalTmpPath` and
* delete `stagingDir` at the end.
Contributor:

Instead of adding a lot of comments to explain it, let's create a wrapper class

class HiveTableTempPath(session: SparkSession, conf: HadoopConf, path: Path) {
  ...
  def stagingDir: Path = ...
  def externalTempPath: Path = ...
}

ulysses-you (Author):

Wrapped it as HiveTempPath since it will also be used by InsertIntoHiveDirCommand.

    bucketSpec: Option[BucketSpec],
    options: Map[String, String],
    fileFormat: FileFormat,
    @transient externalTmpPath: HiveTempPath
Contributor:

Suggested change:
- @transient externalTmpPath: HiveTempPath
+ @transient hiveTmpPath: HiveTempPath

import org.apache.spark.sql.hive.HiveExternalCatalog
import org.apache.spark.sql.hive.client.{HiveClientImpl, HiveVersion}

class HiveTempPath(session: SparkSession, val hadoopConf: Configuration, path: Path)
Contributor:

can we move it to a new file?

@@ -105,4 +227,33 @@ trait V1WritesHiveUtils {
.map(_ => Map(BucketingUtils.optionForHiveCompatibleBucketWrite -> "true"))
.getOrElse(Map.empty)
}

def setupCompression(
Contributor:

I think setupHadoopConfForCompression is more accurate.

} finally {
  // Attempt to delete the staging directory and the inclusive files. If failed, the files are
  // expected to be dropped at the normal termination of VM since deleteOnExit is used.
  deleteExternalTmpPath(hadoopConf)
  deleteExternalTmpPath(stagingDir, hadoopConf)
Contributor:

Can we add def createTempPath() and def deleteTempPath() in HiveTempPath? Then we don't even need to expose the stagingDir, which makes the interface cleaner.
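
For illustration, the call site this would enable might look roughly like the following (method names taken from this suggestion and the earlier HiveTempPath sketch; the merged code may differ):

// Hypothetical call-site sketch; HiveTempPath hides the staging dir entirely.
val hiveTmpPath = new HiveTempPath(sparkSession, hadoopConf, tableLocation)
hiveTmpPath.createTempPath() // replaces the old side effect inside getExternalTmpPath
try {
  processInsert(sparkSession, externalCatalog, child) // write goes to hiveTmpPath.externalTempPath
} finally {
  hiveTmpPath.deleteTempPath() // cleanup without exposing stagingDir to the command
}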

ulysses-you (Author):

addressed

}

def deleteIfNotStagingDir(path: Path, fs: FileSystem): Unit = {
  if (Option(path) != stagingDirForCreating) fs.delete(path, true)
ulysses-you (Author):

One more method for InsertIntoHiveDirCommand so we can hide the staging dir.

cloud-fan (Contributor):

thanks, merging to master!

@cloud-fan cloud-fan closed this in 27e20fe Jan 6, 2023
@@ -294,3 +285,40 @@ case class InsertIntoHiveTable(
  override protected def withNewChildInternal(newChild: LogicalPlan): InsertIntoHiveTable =
    copy(query = newChild)
}

object InsertIntoHiveTable extends V1WritesHiveUtils with Logging {
Contributor:

what do we log inside this object?

ulysses-you (Author):

Oh, I missed cleaning that up. I will remove it when I touch the related code.

cloud-fan pushed a commit that referenced this pull request Jan 10, 2023
…ew query

### What changes were proposed in this pull request?

This is a followup of #39277 and does three things:
- replace WriteFiles attribute exprId using new query to avoid potential issue
- remove unnecessary explain info with `WriteFiles`
- cleanup unnecessary `Logging`

### Why are the changes needed?

Improve the implementation of `WriteFiles`

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

add test

Closes #39468 from ulysses-you/SPARK-41708-followup.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan added a commit that referenced this pull request Feb 7, 2023
… before write command

### What changes were proposed in this pull request?

This is a followup of #39277. With planned write, the write command requires neither columnar nor row-based execution. It invokes a new API `executeWrite`, which returns commit messages, not columnar or row-based data.

This PR updates `ApplyColumnarRulesAndInsertTransitions` to take this case into consideration.

### Why are the changes needed?

If people replace `WriteFilesExec` with a columnar version, the plan can't be executed due to an extra columnar-to-row transition between `WriteFilesExec` and the write command.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

new test

Closes #39922 from cloud-fan/write.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan added a commit that referenced this pull request Feb 7, 2023
… before write command

Closes #39922 from cloud-fan/write.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 56dd20f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
… before write command

Closes apache#39922 from cloud-fan/write.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 56dd20f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>