
feat: Supports UUID column #395

Merged: 2 commits merged into apache:main on May 21, 2024

Conversation

@huaxingao (Contributor)

Which issue does this PR close?

Closes #.

Rationale for this change

Supports UUID columns. This is for the Iceberg/Comet integration.

What changes are included in this PR?

How are these changes tested?

This has been tested locally using Iceberg.

@comphead (Contributor) left a comment

Thanks @huaxingao, should we get this covered by tests?

@huaxingao (Contributor, Author)

> Should we get this covered by tests?

This is for the Iceberg/Comet integration. I don't think there is an easy way to test this now. I have tested it locally, though.

@comphead (Contributor) commented May 7, 2024

> Should we get this covered by tests?
>
> This is for the Iceberg/Comet integration. I don't think there is an easy way to test this now. I have tested it locally, though.

Maybe we can return the uuid value and assert it somehow, although it's non-deterministic? I'm wondering whether we can be protected from regressions if anyone else changes this code later.
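One way to guard against regressions despite the non-determinism is to assert the shape of the value rather than the value itself. A minimal sketch; the helper name and the string it checks are hypothetical, not from this PR:

```java
import java.util.UUID;

// Format-only check: passes for any valid UUID text, so it is stable even
// though the generated value itself is non-deterministic.
static void assertLooksLikeUuid(String uuidString) {
  UUID parsed = UUID.fromString(uuidString); // throws if not parseable
  // Round-trip comparison pins the canonical 8-4-4-4-12 lowercase form.
  assert parsed.toString().equals(uuidString.toLowerCase());
}
```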

@huaxingao (Contributor, Author)

> Maybe we can return the uuid value and assert it somehow, although it's non-deterministic? I'm wondering whether we can be protected from regressions if anyone else changes this code later.

I am thinking of adding Iceberg tests in Comet after the Iceberg/Comet integration is complete, to ensure that these changes won't regress.

@huaxingao (Contributor, Author)

cc @viirya

@@ -169,6 +170,7 @@ public void close() {

/** Returns a decoded {@link CometDecodedVector Comet vector}. */
public CometDecodedVector loadVector() {

@viirya (Member)

Unnecessary change.

@huaxingao (Contributor, Author) commented May 9, 2024

Removed the extra line.

@@ -207,6 +214,7 @@ public CometDecodedVector loadVector() {
    DictionaryEncoding dictionaryEncoding = vector.getField().getDictionary();

    CometPlainVector cometVector = new CometPlainVector(vector, useDecimal128);
    cometVector.setIsUuid(isUuid);
@viirya (Member)

Why not put it into the constructor parameter list?

@huaxingao (Contributor, Author)

Added isUuid to the constructor parameter list.
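For concreteness, a minimal sketch of what that revision looks like; the class internals are assumed (including the Arrow ValueVector parameter type), with only vector, useDecimal128, and isUuid taken from the hunks above:

```java
import org.apache.arrow.vector.ValueVector;

// Sketch: isUuid is supplied at construction time instead of via a setter,
// so the vector is never observable in a half-initialized state.
public CometPlainVector(ValueVector vector, boolean useDecimal128, boolean isUuid) {
  super(vector, useDecimal128);
  this.isUuid = isUuid;
}
```

The call site from the earlier hunk then collapses to a single line: new CometPlainVector(vector, useDecimal128, isUuid).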

@codecov-commenter

Codecov Report

Attention: Patch coverage is 23.52941%, with 13 lines in your changes missing coverage. Please review.

Project coverage is 34.18%. Comparing base (9ab6c75) to head (1f3fa2d).
Report is 65 commits behind head on main.

Files Patch % Lines
...in/java/org/apache/comet/parquet/ColumnReader.java 0.00% 5 Missing ⚠️
...org/apache/comet/vector/CometDictionaryVector.java 0.00% 3 Missing ⚠️
...java/org/apache/comet/vector/CometPlainVector.java 50.00% 3 Missing ⚠️
...va/org/apache/comet/vector/CometDecodedVector.java 33.33% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #395      +/-   ##
============================================
+ Coverage     33.47%   34.18%   +0.70%     
- Complexity      795      851      +56     
============================================
  Files           110      116       +6     
  Lines         37533    38545    +1012     
  Branches       8215     8521     +306     
============================================
+ Hits          12563    13175     +612     
- Misses        22322    22606     +284     
- Partials       2648     2764     +116     


  return UTF8String.fromBytes(result);
} else {
  return UTF8String.fromString(convertToUuid(result).toString());
}
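convertToUuid itself is not shown in this hunk. Per the Parquet format spec a UUID is a 16-byte big-endian FIXED_LEN_BYTE_ARRAY, so a conversion along these lines is implied (a sketch, not necessarily the PR's exact code):

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// The first 8 bytes hold the most-significant bits, the last 8 the
// least-significant bits; ByteBuffer reads big-endian by default.
private static UUID convertToUuid(byte[] bytes) {
  ByteBuffer buf = ByteBuffer.wrap(bytes);
  return new UUID(buf.getLong(), buf.getLong());
}
```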
@viirya (Member)

For testing, I wonder if it is possible to create a Parquet file with a uuid column in a unit test and read it back? With this change, the column should be read as uuid instead of string.

@huaxingao (Contributor, Author)

Spark doesn't support the uuid data type. Can we create a table with a uuid column?

@viirya (Member)

We create parquet files in makeParquetFileAllTypes, for example. It uses the parquet writer to write parquet files directly instead of going through the Dataset/DataFrame API, so you don't need a uuid column in a table.
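For reference, writing a UUID-annotated column directly with parquet-mr looks roughly like this; a sketch assuming parquet 1.11+ (which added LogicalTypeAnnotation.uuidType()), with the file path and column name purely illustrative. Whether Comet's makeParquetFileAllTypes helper exposes this type is a separate question, raised below:

```java
import java.nio.ByteBuffer;
import java.util.UUID;

import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.io.api.Binary;
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;

public class UuidParquetWriteSketch {
  public static void main(String[] args) throws Exception {
    // One required FIXED_LEN_BYTE_ARRAY(16) column annotated as UUID.
    MessageType schema = Types.buildMessage()
        .required(PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY).length(16)
        .as(LogicalTypeAnnotation.uuidType()).named("uuid_col")
        .named("test_schema");

    SimpleGroupFactory factory = new SimpleGroupFactory(schema);
    try (ParquetWriter<Group> writer =
        ExampleParquetWriter.builder(new Path("/tmp/uuid_test.parquet"))
            .withType(schema)
            .build()) {
      // Parquet stores UUIDs as 16 big-endian bytes.
      UUID uuid = UUID.randomUUID();
      ByteBuffer buf = ByteBuffer.allocate(16);
      buf.putLong(uuid.getMostSignificantBits());
      buf.putLong(uuid.getLeastSignificantBits());

      Group row = factory.newGroup();
      row.add("uuid_col", Binary.fromConstantByteArray(buf.array()));
      writer.write(row);
    }
  }
}
```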

Contributor

+1 for creating a parquet file with the UUID logical annotation.

Or, maybe we can have an iceberg-integration module in the comet project and include iceberg as a dependency in that module. We could generate iceberg parquet files with a UUID column in that module directly and then test there.

@huaxingao (Contributor, Author)

I looked at makeParquetFileAllTypes; it seems I can only use the parquet types INT64, INT32, BOOLEAN, BINARY, FLOAT, DOUBLE, INT96, and FIXED_LEN_BYTE_ARRAY. It doesn't seem I can use a UUID logical type in the Parquet schema there. When I did my local test using Iceberg, I was using Iceberg's UUIDType.

@huaxingao (Contributor, Author)

There might be something wrong with the UUID value. I will check.

@huaxingao (Contributor, Author)

I fixed the UUID data problem, but now I got an illegalParquetTypeError from Spark. I don't think Spark supports Parquet's UUID; Iceberg maps UUID to UTF8String:

Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (192.168.50.140 executor driver): org.apache.spark.sql.AnalysisException: Illegal Parquet type: FIXED_LEN_BYTE_ARRAY (UUID).
	at org.apache.spark.sql.errors.QueryCompilationErrors$.illegalParquetTypeError(QueryCompilationErrors.scala:1762)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:206)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convertPrimitiveField$2(ParquetSchemaConverter.scala:310)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertPrimitiveField(ParquetSchemaConverter.scala:224)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertField(ParquetSchemaConverter.scala:187)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convertInternal$3(ParquetSchemaConverter.scala:147)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convertInternal$3$adapted(ParquetSchemaConverter.scala:117)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
	at scala.collection.immutable.Range.foreach(Range.scala:158)
	at scala.collection.TraversableLike.map(TraversableLike.scala:286)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertInternal(ParquetSchemaConverter.scala:117)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:87)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readSchemaFromFooter$2(ParquetFileFormat.scala:493)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readSchemaFromFooter(ParquetFileFormat.scala:493)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$2(ParquetFileFormat.scala:473)
	at scala.collection.immutable.Stream.map(Stream.scala:418)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1(ParquetFileFormat.scala:473)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1$adapted(ParquetFileFormat.scala:464)

@viirya (Member)

Hmm, okay, so we still use Spark's ParquetToSparkSchemaConverter to convert the Parquet schema to a Spark schema? I thought we might have a custom one in Comet.

@huaxingao (Contributor, Author)

It failed here. This is Spark code, not Comet code yet, which is why it uses Spark's ParquetToSparkSchemaConverter.

@viirya (Member)

The failure is from Spark reading the Parquet file using its own data source. Can you try to read it with Comet? If the Comet scan doesn't use ParquetToSparkSchemaConverter, I think we won't hit the above error.

@viirya (Member) left a comment

The change looks okay. Just wondering if it is possible to add a test.

typeAnnotation match {
  case _: DecimalLogicalTypeAnnotation =>
    makeDecimalType(Decimal.maxPrecisionForBytes(parquetType.getTypeLength))
  case _: UUIDLogicalTypeAnnotation => StringType
@huaxingao (Contributor, Author)

All the code is copied from Spark's ParquetToSparkSchemaConverter except this line.

Contributor

Are we adding this just for tests, or is it likely to be useful in other places?
Also, instead of copying, could we not just extend the Spark class and override convertField? We could then call our implementation of convertPrimitiveField for UUID and let the parent implementation handle the rest.
We are likely to miss changes made in Spark if we make a copy.

@huaxingao (Contributor, Author)

The reason I didn't override convertField is that it returns ParquetColumn, which only exists in Spark 3.3 and Spark 3.4. Since we need to support Spark 3.2, I made our own version, CometParquetColumn, and tried to make it work for Spark 3.2 too. It actually took quite some effort to make CometParquetColumn work across all three versions of Spark, because ParquetColumn uses different methods in Spark 3.3 (which uses parquet 1.12.x) than in Spark 3.4 (which uses parquet 1.13.x).

I had an offline discussion with @viirya. We will merge this PR without a test for now. After the iceberg integration is done, we can probably add some iceberg tests with uuid.

I will remove the tests for now.

@huaxingao force-pushed the uuid branch 2 times, most recently from 6d87d34 to 6e3cc14, on May 20, 2024 at 23:55
@viirya merged commit 7b0a7e0 into apache:main on May 21, 2024
40 checks passed
@viirya (Member) commented May 21, 2024

Merged. Thanks @huaxingao @advancedxy @comphead @parthchandra

@huaxingao (Contributor, Author)

Thanks, everyone!

@huaxingao deleted the uuid branch on May 21, 2024 at 16:12
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
* fix uuid

* address comments

---------

Co-authored-by: Huaxin Gao <huaxin.gao@apple.com>
(cherry picked from commit 7b0a7e0)