

Iceberg/Comet integration POC #9841

Open · wants to merge 20 commits into base: main
Conversation

@huaxingao (Contributor) commented Mar 1, 2024

This PR shows how I will integrate Comet with Iceberg. The PR doesn't compile yet because we haven't released Comet, but it shows how we plan to change the Iceberg code to integrate Comet. Also, Comet doesn't have Spark 3.5 support yet, so I am doing this on 3.4; we will add 3.5 support in Comet.

In VectorizedSparkParquetReaders.buildReader, if the Comet library is available, a CometIcebergColumnarBatchReader will be created, which will use the Comet batch reader to read data. We can also add a property later to control whether to use Comet.
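A minimal sketch of how such a fallback could look. The reader class names are from this PR, but the reflection-based detection helper and the probed Comet class name are hypothetical illustrations, not the PR's actual code:

```java
// Hypothetical sketch: pick the Comet reader only when the Comet jar is on the classpath.
final class ReaderSelector {
  // Assumed class name, for illustration only; the real probe target may differ.
  private static final String COMET_READER_CLASS = "org.apache.comet.parquet.BatchReader";

  static boolean cometAvailable() {
    try {
      Class.forName(COMET_READER_CLASS);
      return true;
    } catch (ClassNotFoundException e) {
      return false;
    }
  }

  static String chooseReader(boolean cometEnabledProperty) {
    // Fall back to the regular Iceberg vectorized reader when Comet is absent or disabled.
    return (cometEnabledProperty && cometAvailable())
        ? "CometIcebergColumnarBatchReader"
        : "ColumnarBatchReader";
  }
}
```

This keeps the Comet dependency `compileOnly`: nothing breaks at runtime when the Comet jar is not deployed.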

The logic in CometIcebergVectorizedReaderBuilder is very similar to VectorizedReaderBuilder, but it builds a Comet column reader instead of an Iceberg column reader.

The delete logic in CometIcebergColumnarBatchReader is exactly the same as in ColumnarBatchReader. I will extract the common code into a base class.
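The planned refactoring could be sketched roughly as follows. The class and method names here are illustrative placeholders, not the PR's actual signatures; only the shape (shared delete handling in a base class, subclass-specific column reading) reflects the plan described above:

```java
// Illustrative sketch of pulling shared delete handling into a base class.
abstract class BaseBatchReader {
  // Shared logic: compute the row positions that survive delete filtering.
  int[] liveRows(boolean[] isDeleted) {
    int n = 0;
    for (boolean d : isDeleted) {
      if (!d) n++;
    }
    int[] live = new int[n];
    int idx = 0;
    for (int i = 0; i < isDeleted.length; i++) {
      if (!isDeleted[i]) {
        live[idx++] = i;
      }
    }
    return live;
  }

  // Each subclass supplies its own column-reading strategy.
  abstract String readerKind();
}

class IcebergBatchReader extends BaseBatchReader {
  @Override String readerKind() { return "iceberg"; }
}

class CometBatchReader extends BaseBatchReader {
  @Override String readerKind() { return "comet"; }
}
```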

The main motivation of this PR is to improve performance through native execution. Comet's Parquet reader is a hybrid implementation: IO and decompression are done in the JVM while decoding is done natively. Native decoding yields some performance gain, but not much. However, by switching to the Comet Parquet reader, Comet will recognize the scan as a Comet scan and convert the Spark physical plan into a Comet plan for native execution. The major performance gain comes from this native execution.

@huaxingao (Contributor, Author)

cc @aokolnychyi @sunchao

@aokolnychyi (Contributor) left a comment

I think this is the right direction to take. I did an initial high-level pass. Looking forward to having a Comet release soon.

}

compileOnly "org.apache.comet:comet-spark-spark${sparkMajorVersion}_${scalaVersion}:0.1.0-SNAPSHOT"
Contributor

I assume this library will only contain the reader, not the operators.

Contributor (Author)

Right. This only contains the reader.

Member

Does it need to be Spark-version dependent? Just wondering.

Contributor (Author)

We are currently running some experiments to see if we can provide a Spark-version-independent jar.

Contributor

+1 for exploring that.

@github-actions github-actions bot added the API label Apr 18, 2024
api/src/main/java/org/apache/iceberg/ReaderType.java (outdated, resolved)
build.gradle (outdated)
@@ -45,6 +45,7 @@ buildscript {
}
}

String sparkMajorVersion = '3.4'
Contributor

I hope we can soon have a snapshot of a Comet jar independent of Spark to clean up the deps here. We can't have the parquet module depend on a jar with any Spark deps.

spark/v3.4/build.gradle (outdated, resolved)
}

compileOnly "org.apache.comet:comet-spark-spark${sparkMajorVersion}_${scalaVersion}:0.1.0-SNAPSHOT"
Contributor

+1 for exploring that.

gradle.properties (outdated, resolved)
import org.apache.spark.sql.vectorized.ColumnVector;
import org.apache.spark.sql.vectorized.ColumnarBatch;

@SuppressWarnings("checkstyle:VisibilityModifier")
Contributor

These changes would require a bit more time to review. I'll do that tomorrow. I think we would want to restructure the original implementation a bit. Not a concern for now.

Contributor

We would want to structure this a bit differently. Let me think more.

@github-actions github-actions bot removed the API label Apr 26, 2024
@huaxingao (Contributor, Author)

@aokolnychyi I have addressed the comments. Could you please take one more look when you have a moment? Thanks a lot!

@aokolnychyi (Contributor)

Will check today.


@cornelcreanga

@huaxingao Hi, is the Comet Parquet reader able to support page skipping / use page indexes? E.g., see #193 for the initial issue for the Iceberg Parquet reader.

@huaxingao (Contributor, Author)

@cornelcreanga The Comet Parquet reader doesn't support page skipping yet.

@huaxingao huaxingao closed this Jun 20, 2024
@huaxingao huaxingao reopened this Jun 20, 2024
@PaulLiang1

Hey @huaxingao, we are really interested in this feature. Is there anything we can do to help get it integrated?

@huaxingao (Contributor, Author)

@PaulLiang1 Thank you for your interest! We are currently working on a binary release of DataFusion Comet. Once the binary release is available, I will proceed with this PR.

@PaulLiang1

@huaxingao
I think we have an internal build of DataFusion Comet and publish a JAR internally.
Is there anything we can help with on that front?

Thanks

@huaxingao (Contributor, Author)

@PaulLiang1 Thanks! I'll check with my colleague tomorrow to find out where we are in the binary release process.

@huaxingao (Contributor, Author)

@PaulLiang1 We are pretty close to this and will have a binary release for Comet soon.

@PaulLiang1

> @PaulLiang1 Thanks! I'll check with my colleague tomorrow to find out where we are in the binary release process.

Got it, thanks for letting me know. Please feel free to let us know if there is anything we can help with. Thanks!

github-actions (bot)
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Nov 29, 2024
@github-actions github-actions bot removed the stale label Dec 5, 2024
@huaxingao (Contributor, Author)

@aokolnychyi Could you please take another look? I have temporarily changed the default to Comet to make sure all the tests run successfully with Comet; I will switch the default back to the regular Iceberg vectorization afterwards.

Also @szehon-ho

@@ -214,7 +215,7 @@ public void testWriteWithCaseSensitiveOption() throws NoSuchTableException, Pars
Assert.assertEquals(4, fields.size());
}

@Test
@Ignore
Contributor (Author)

I am ignoring three tests for now with the Comet configuration. This is only for testing purposes within Comet. I will revert these changes after the PR review.

In the followup PR, I will make SparkScan implement org.apache.comet.parquet.SupportsComet, and Comet will check the flag in SupportsComet to turn on the native operators. Additionally, Comet will allow conversion from int to long and from float to double based on that flag, which means these tests will pass after the followup PR.
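The planned flag mechanism could look roughly like the sketch below. The local SupportsComet interface and its isCometEnabled method are stand-ins written purely for illustration; the real interface lives in org.apache.comet.parquet and its exact signature may differ, and the planner classes here are hypothetical:

```java
// Local stand-in for org.apache.comet.parquet.SupportsComet, for illustration only.
interface SupportsComet {
  boolean isCometEnabled();
}

// Sketch: the scan advertises Comet support, and the planner checks the flag
// before rewriting the Spark physical plan into a native Comet plan.
class SketchSparkScan implements SupportsComet {
  private final boolean cometEnabled;

  SketchSparkScan(boolean cometEnabled) {
    this.cometEnabled = cometEnabled;
  }

  @Override
  public boolean isCometEnabled() {
    return cometEnabled;
  }
}

class SketchPlanner {
  // Only scans that implement the interface and report true get the native plan.
  static String plan(Object scan) {
    if (scan instanceof SupportsComet && ((SupportsComet) scan).isCometEnabled()) {
      return "comet-native-plan";
    }
    return "spark-jvm-plan";
  }
}
```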

@dpengpeng

dpengpeng commented Dec 19, 2024

@huaxingao I have used Comet on Spark and the results are very good. Now I see Iceberg and Comet working together, which is a great attempt. I look forward to Iceberg merging this POC code as soon as possible. In addition, does the current POC code support Comet writing Iceberg data?

@huaxingao (Contributor, Author)

@dpengpeng Thank you for your positive feedback! I’m glad to hear you’ve had good results with Comet on Spark. The current POC doesn't support writing Iceberg data yet, but we plan to add this capability in the future.

@parthchandra

@aokolnychyi @RussellSpitzer @szehon-ho Is this getting closer to completion? Asking for a friend :)
Also, the Comet team is looking at providing complex type support in the reader, and our choices are heavily influenced by this PR, so it would be good to know which APIs we must guarantee.
