
[SPARK-48761][SQL] Introduce clusterBy DataFrameWriter API for Scala #47301

Closed
wants to merge 10 commits

Conversation

@zedtang (Contributor) commented Jul 11, 2024

What changes were proposed in this pull request?

Introduce a new clusterBy DataFrame API in Scala. This PR adds the API for both the DataFrameWriter V1 and V2, as well as Spark Connect.

Why are the changes needed?

Introduce more ways for users to interact with clustered tables.

Does this PR introduce any user-facing change?

Yes, it adds a new clusterBy DataFrame API in Scala to allow specifying the clustering columns when writing DataFrames.
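
For illustration, a minimal sketch of how the new API could be called from both writer paths (the table names and columns are hypothetical, and the varargs signature is assumed to mirror partitionBy's):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("clusterByDemo").getOrCreate()
val df = spark.range(0, 1000).selectExpr("id", "id % 10 AS bucket")

// DataFrameWriter (V1): declare clustering columns when creating the table.
df.write
  .clusterBy("id")                  // assumed signature: clusterBy(colName: String, colNames: String*)
  .saveAsTable("demo_clustered_v1") // hypothetical table name

// DataFrameWriterV2: the same intent via the writeTo/CreateTableWriter path.
df.writeTo("demo_clustered_v2")
  .clusterBy("id")
  .create()
```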

How was this patch tested?

New unit tests.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Jul 11, 2024
@zedtang zedtang changed the title [SPARK-48761][SQL] Add clusterBy DataFrame API for Scala [SPARK-48761][SQL] Introduce clusterBy DataFrame API for Scala Jul 11, 2024
@zedtang
Copy link
Contributor Author

zedtang commented Jul 11, 2024

Hi @cloud-fan, @imback82, @dabao521, this PR is ready for review.

@zedtang zedtang force-pushed the clusterby-scala-api branch 3 times, most recently from 0ac92ea to bbc7002 Compare July 15, 2024 22:50
@zedtang zedtang changed the title [SPARK-48761][SQL] Introduce clusterBy DataFrame API for Scala [SPARK-48761][SQL] Introduce clusterBy DataFrameWriter API for Scala Jul 15, 2024
@@ -201,6 +201,22 @@ final class DataFrameWriter[T] private[sql] (ds: Dataset[T]) {
this
}

/**
* Clusters the output by the given columns on the file system. The rows with matching values in

Contributor:
let's be a bit more general, as data sources are not always based on a file system. How about "... given columns on the storage."?

Contributor Author:
sure, updated here and below

@@ -201,6 +201,22 @@ final class DataFrameWriter[T] private[sql] (ds: Dataset[T]) {
this
}

/**
* Clusters the output by the given columns on the file system. The rows with matching values in
* the specified clustering columns will be consolidated within the same file.

Contributor:
ditto, ... will be consolidated within the same group.

*
* @param clusterBySpec : existing ClusterBySpec to be converted to properties.
*/
def toProperties(clusterBySpec: ClusterBySpec): Map[String, String] = {

Contributor:
what's the difference between this and toProperty?

Contributor Author:
Besides the different return type, toProperty additionally validates the clustering columns against the table schema, which is why it takes two more input parameters (schema and resolver).

I updated the comments.
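
To make the distinction concrete, here is a rough sketch of the two helpers as described in this thread (signatures are approximated from the discussion, not copied from the final code):

```scala
// Sketch, as if inside object ClusterBySpec:

// Validating variant: normalizes the clustering columns against the table
// schema (via the resolver) before serializing them into a property value.
def toProperty(
    schema: StructType,
    clusterBySpec: ClusterBySpec,
    resolver: Resolver): (String, String) =
  CatalogTable.PROP_CLUSTERING_COLUMNS ->
    normalizeClusterBySpec(schema, clusterBySpec, resolver).toJson

// Non-validating variant: serializes the spec as-is, so it needs neither
// the schema nor a resolver.
def toProperties(clusterBySpec: ClusterBySpec): Map[String, String] =
  Map(CatalogTable.PROP_CLUSTERING_COLUMNS -> clusterBySpec.toJson)
```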

@@ -274,6 +283,18 @@ trait CreateTableWriter[T] extends WriteConfigMethods[CreateTableWriter[T]] {
*/
def partitionedBy(column: Column, columns: Column*): CreateTableWriter[T]

/**
* Clusters the output by the given columns on the file system. The rows with matching values in
* the specified clustering columns will be consolidated within the same file.

Contributor:
can we update the API doc everywhere?

Contributor Author:
Oops, updated here and below

@@ -209,10 +221,26 @@ object ClusterBySpec {
normalizeClusterBySpec(schema, clusterBySpec, resolver).toJson
}

/**
* Converts a ClusterBySpec to a map of table properties used to store the clustering

Contributor:
I'm confused, why do we prefer a map with only one entry over a single tuple2 like toProperty does?

Contributor Author:
No preference here, this is just a bit more friendly for the call site.

Contributor:
then let's be consistent here and return tuple2. The name can be toPropertyWithoutValidation

Contributor Author:
Updated!

@zedtang zedtang requested a review from cloud-fan July 23, 2024 06:21
* @return a map entry for the clustering column property.
*/
def toPropertyWithoutValidation(clusterBySpec: ClusterBySpec): (String, String) = {
val columnValue = mapper.writeValueAsString(clusterBySpec.columnNames.map(_.fieldNames))

Contributor:
is it the same as clusterBySpec.toJson? If yes then we can simply do

CatalogTable.PROP_CLUSTERING_COLUMNS -> clusterBySpec.toJson

Contributor Author:
Yeah, you're right. Updated
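
After this exchange, the helper presumably collapses to a single expression along the lines the reviewer suggested (a sketch, not the merged code):

```scala
// Non-validating variant, renamed per review, returning a single tuple2.
def toPropertyWithoutValidation(clusterBySpec: ClusterBySpec): (String, String) =
  CatalogTable.PROP_CLUSTERING_COLUMNS -> clusterBySpec.toJson
```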

@@ -708,7 +746,10 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
}.getOrElse(Seq.empty[Transform])
val bucketing =
getBucketSpec.map(spec => CatalogV2Implicits.BucketSpecHelper(spec).asTransform).toSeq
partitioning ++ bucketing
val clustering = clusteringColumns.map { colNames =>
ClusterByTransform(colNames.map(col => FieldReference(col)))

Contributor:
Suggested change:
- ClusterByTransform(colNames.map(col => FieldReference(col)))
+ ClusterByTransform(colNames.map(FieldReference(_)))
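
Applied to the hunk above, the transform assembly would read roughly as follows (a sketch; `clusteringColumns` is assumed here to be an `Option[Seq[String]]`):

```scala
val clustering = clusteringColumns.map { colNames =>
  // One ClusterByTransform carrying every clustering column as a field reference.
  ClusterByTransform(colNames.map(FieldReference(_)))
}
// The Option folds into the sequence: empty when clusterBy was not called.
partitioning ++ bucketing ++ clustering
```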

@@ -1373,6 +1373,65 @@ abstract class DDLSuite extends QueryTest with DDLSuiteBase {
}
}

test("Clustering columns should match when appending to existing data source tables") {

Contributor:
shall we put it in DataFrameReaderWriterSuite?

Contributor Author:
sure, moved there.
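
For reference, a sketch of what the moved test might look like in DataFrameReaderWriterSuite (names, the exact failure mode, and the error message are illustrative assumptions, not the suite's actual code):

```scala
test("Clustering columns should match when appending to existing data source tables") {
  withTable("t") {
    val df = spark.range(10).selectExpr("id AS a", "id AS b")
    // Create the table clustered by `a`.
    df.write.clusterBy("a").saveAsTable("t")
    // Appending with a different clustering spec should fail analysis.
    val e = intercept[AnalysisException] {
      df.write.clusterBy("b").mode("append").saveAsTable("t")
    }
    assert(e.getMessage.toLowerCase.contains("cluster"))
  }
}
```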

@cloud-fan (Contributor) left a comment:

LGTM except for some minor comments

@cloud-fan (Contributor):

thanks, merging to master!

@cloud-fan cloud-fan closed this in bafce5d Jul 25, 2024
@zedtang zedtang deleted the clusterby-scala-api branch July 26, 2024 02:01
ilicmarkodb pushed a commit to ilicmarkodb/spark that referenced this pull request Jul 29, 2024
fusheng-rd pushed a commit to fusheng-rd/spark that referenced this pull request Aug 6, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024

Closes apache#47301 from zedtang/clusterby-scala-api.

Authored-by: Jiaheng Tang <jiaheng.tang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>