

Iceberg/Comet integration POC #9841

Open · wants to merge 20 commits into base: main
Conversation

@huaxingao (Contributor) commented Mar 1, 2024

This PR shows how I will integrate Comet with Iceberg. The PR doesn't compile yet because we haven't released Comet, but it shows how we plan to change the Iceberg code to integrate Comet. Also, Comet doesn't have Spark 3.5 support yet, so I am doing this on 3.4; we will add 3.5 support in Comet.

In VectorizedSparkParquetReaders.buildReader, if the Comet library is available, a CometIcebergColumnarBatchReader will be created, which will use the Comet batch reader to read data. We can also add a property later to control whether to use Comet.
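A minimal sketch of how such a fallback could look. The reader class names are from this PR, but the reflection-based detection helper and the probed Comet class name are hypothetical illustrations, not the PR's actual code:

```java
// Hypothetical sketch: pick the Comet reader only when the Comet jar is on the classpath.
final class ReaderSelector {
  // Assumed class name, for illustration only; the real probe target may differ.
  private static final String COMET_READER_CLASS = "org.apache.comet.parquet.BatchReader";

  static boolean cometAvailable() {
    try {
      Class.forName(COMET_READER_CLASS);
      return true;
    } catch (ClassNotFoundException e) {
      return false;
    }
  }

  static String chooseReader(boolean cometEnabledProperty) {
    // Fall back to the regular Iceberg vectorized reader when Comet is absent or disabled.
    return (cometEnabledProperty && cometAvailable())
        ? "CometIcebergColumnarBatchReader"
        : "ColumnarBatchReader";
  }
}
```

This keeps the Comet dependency `compileOnly`: nothing breaks at runtime when the Comet jar is not deployed.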

The logic in CometIcebergVectorizedReaderBuilder is very similar to VectorizedReaderBuilder, but it builds a Comet column reader instead of an Iceberg column reader.

The delete logic in CometIcebergColumnarBatchReader is exactly the same as in ColumnarBatchReader. I will extract the common code into a base class.
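The planned refactoring could be sketched roughly as follows. The class and method names here are illustrative placeholders, not the PR's actual signatures; only the shape (shared delete handling in a base class, subclass-specific column reading) reflects the plan described above:

```java
// Illustrative sketch of pulling shared delete handling into a base class.
abstract class BaseBatchReader {
  // Shared logic: compute the row positions that survive delete filtering.
  int[] liveRows(boolean[] isDeleted) {
    int n = 0;
    for (boolean d : isDeleted) {
      if (!d) n++;
    }
    int[] live = new int[n];
    int idx = 0;
    for (int i = 0; i < isDeleted.length; i++) {
      if (!isDeleted[i]) {
        live[idx++] = i;
      }
    }
    return live;
  }

  // Each subclass supplies its own column-reading strategy.
  abstract String readerKind();
}

class IcebergBatchReader extends BaseBatchReader {
  @Override String readerKind() { return "iceberg"; }
}

class CometBatchReader extends BaseBatchReader {
  @Override String readerKind() { return "comet"; }
}
```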

The main motivation of this PR is to improve performance through native execution. Comet's Parquet reader is a hybrid implementation: IO and decompression are done in the JVM while decoding is done natively. Native decoding yields some performance gain, but not much. However, by switching to the Comet Parquet reader, Comet will recognize the scan as a Comet scan and convert the Spark physical plan into a Comet plan for native execution. The major performance gain comes from this native execution.

@huaxingao (Contributor, Author)

cc @aokolnychyi @sunchao

@aokolnychyi (Contributor) left a comment

I think this is the right direction to take. I did an initial high-level pass. Looking forward to having a Comet release soon.

}

compileOnly "org.apache.comet:comet-spark-spark${sparkMajorVersion}_${scalaVersion}:0.1.0-SNAPSHOT"
Contributor

I assume this library will only contain the reader, not the operators.

Contributor (Author)

Right. This only contains the reader.

Member

Does it need to be Spark-version dependent? Just wondering.

Contributor (Author)

We are currently running some experiments to see if we can provide a Spark-version-independent jar.

Contributor

+1 for exploring that.

@github-actions github-actions bot added the API label Apr 18, 2024
api/src/main/java/org/apache/iceberg/ReaderType.java (outdated, resolved)
build.gradle (outdated)
@@ -45,6 +45,7 @@ buildscript {
}
}

String sparkMajorVersion = '3.4'
Contributor

I hope we can soon have a snapshot of a Comet jar independent of Spark to clean up the deps here. We can't have the parquet module depend on a jar with any Spark deps.

spark/v3.4/build.gradle (outdated, resolved)
}

compileOnly "org.apache.comet:comet-spark-spark${sparkMajorVersion}_${scalaVersion}:0.1.0-SNAPSHOT"
Contributor

+1 for exploring that.

gradle.properties (outdated, resolved)
import org.apache.spark.sql.vectorized.ColumnVector;
import org.apache.spark.sql.vectorized.ColumnarBatch;

@SuppressWarnings("checkstyle:VisibilityModifier")
Contributor

These changes would require a bit more time to review. I'll do that tomorrow. I think we would want to restructure the original implementation a bit. Not a concern for now.

Contributor

We would want to structure this a bit differently. Let me think more.

@github-actions github-actions bot removed the API label Apr 26, 2024
@huaxingao (Contributor, Author)

@aokolnychyi I have addressed the comments. Could you please take one more look when you have a moment? Thanks a lot!

@aokolnychyi (Contributor)

Will check today.


@cornelcreanga

@huaxingao Hi, is the Comet Parquet reader able to support page skipping / use page indexes? E.g., see #193 for the initial issue for the Iceberg Parquet reader.

@huaxingao (Contributor, Author)

@cornelcreanga The Comet Parquet reader doesn't support page skipping yet.

@huaxingao huaxingao closed this Jun 20, 2024
@huaxingao huaxingao reopened this Jun 20, 2024
@PaulLiang1

Hey @huaxingao, we are really interested in this feature. Is there anything we can do to help get it integrated?

@huaxingao (Contributor, Author)

@PaulLiang1 Thank you for your interest! We are currently working on a binary release of DataFusion Comet. Once the binary release is available, I will proceed with this PR.

@PaulLiang1

@huaxingao
I think we have an internal build of DataFusion Comet and publish a JAR internally.
Is there anything we can help with on that front?

Thanks

@huaxingao (Contributor, Author)

@PaulLiang1 Thanks! I'll check with my colleague tomorrow to find out where we are in the binary release process.

@huaxingao (Contributor, Author)

@PaulLiang1 We are pretty close to this and will have a binary release for Comet soon.

@PaulLiang1

> @PaulLiang1 Thanks! I'll check with my colleague tomorrow to find out where we are in the binary release process.

Got it, thanks for letting me know. Please feel free to let us know if there is anything we can help with. Thanks!

github-actions (bot)
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Nov 29, 2024
@github-actions github-actions bot removed the stale label Dec 5, 2024
@huaxingao (Contributor, Author)

@aokolnychyi Could you please take another look? I have temporarily changed the default to Comet to make sure all the tests run successfully with Comet; I will switch the default back to the regular Iceberg vectorization afterwards.

Also @szehon-ho

@@ -214,7 +215,7 @@ public void testWriteWithCaseSensitiveOption() throws NoSuchTableException, Pars
Assert.assertEquals(4, fields.size());
}

@Test
@Ignore
Contributor (Author)

I am ignoring three tests for now with the Comet configuration. This is only for testing purposes within Comet. I will revert these changes after the PR review.

In the followup PR, I will make SparkScan implement org.apache.comet.parquet.SupportsComet, and Comet will check the flag in SupportsComet to turn on the native operators. Additionally, Comet will allow conversion from int to long and from float to double based on that flag, which means these tests will pass after the followup PR.
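The planned flag mechanism could look roughly like the sketch below. The local SupportsComet interface and its isCometEnabled method are stand-ins written purely for illustration; the real interface lives in org.apache.comet.parquet and its exact signature may differ, and the planner classes here are hypothetical:

```java
// Local stand-in for org.apache.comet.parquet.SupportsComet, for illustration only.
interface SupportsComet {
  boolean isCometEnabled();
}

// Sketch: the scan advertises Comet support, and the planner checks the flag
// before rewriting the Spark physical plan into a native Comet plan.
class SketchSparkScan implements SupportsComet {
  private final boolean cometEnabled;

  SketchSparkScan(boolean cometEnabled) {
    this.cometEnabled = cometEnabled;
  }

  @Override
  public boolean isCometEnabled() {
    return cometEnabled;
  }
}

class SketchPlanner {
  // Only scans that implement the interface and report true get the native plan.
  static String plan(Object scan) {
    if (scan instanceof SupportsComet && ((SupportsComet) scan).isCometEnabled()) {
      return "comet-native-plan";
    }
    return "spark-jvm-plan";
  }
}
```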

@dpengpeng

dpengpeng commented Dec 19, 2024

@huaxingao I have used Comet on Spark and the results are very good. Now I see Iceberg and Comet working together, which is a great attempt. I look forward to Iceberg merging this POC code as soon as possible. In addition, does the current POC code support Comet writing Iceberg data?

@huaxingao (Contributor, Author)

@dpengpeng Thank you for your positive feedback! I’m glad to hear you’ve had good results with Comet on Spark. The current POC doesn't support writing Iceberg data yet, but we plan to add this capability in the future.

@parthchandra

@aokolnychyi @RussellSpitzer @szehon-ho Is this getting closer to completion? Asking for a friend :)
Also, the Comet team is looking at providing complex type support in the reader, and our choices are heavily influenced by this PR, so it would be good to know which APIs we must guarantee.
