
Introduce IcebergPageSource and IcebergColumnHandle #1675

Merged
merged 2 commits into from
Oct 23, 2019

Conversation

lxynov
Member

@lxynov lxynov commented Oct 5, 2019

This PR is ready for review. Below are some explanations of it.

Read side

  • Partition key columns and regular columns
    • Partition key columns are those that are the source of identity partitioning. They are special in Iceberg because data files may not store values for them. In particular, for tables migrated from Hive, data files won't contain values for partitioning columns.
    • Regular columns are those that are not partition key columns.
  • Partition spec evolution in Iceberg
    • Iceberg tables support partition spec evolution, which means partition key columns may change over time, so different data files may have different partition key columns.
  • Logic on the read side
    • Partition key columns are determined at the split (file) level.
    • For each split, a list of partition keys is computed from metadata. Each partition key is a pair of an Iceberg column ID and a partition value.
    • IcebergPageSource prefills partition key columns and gets regular column values from a delegate.
    • Partition values are fetched from Iceberg metadata files, serialized into strings and passed to IcebergPageSource. IcebergPageSource deserializes them into Presto objects. Note that in Iceberg, only columns with primitive types can be used as the source of partitioning.
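
As a rough illustration of the prefilling step, here is a hypothetical, heavily simplified sketch. The real IcebergPageSource operates on Presto Pages, Blocks, and ColumnHandles; plain maps stand in for column data here, and the class name is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: partition key columns are prefilled from Iceberg
// metadata, while regular column values come from a delegate reader.
public class PrefilledRowAssembler {
    private final List<String> outputColumns;          // all columns, in output order
    private final Map<String, Object> partitionValues; // prefilled from Iceberg metadata

    public PrefilledRowAssembler(List<String> outputColumns, Map<String, Object> partitionValues) {
        this.outputColumns = outputColumns;
        this.partitionValues = partitionValues;
    }

    // Regular column values come from the delegate page source, keyed by name
    public List<Object> assembleRow(Map<String, Object> regularValues) {
        List<Object> row = new ArrayList<>();
        for (String column : outputColumns) {
            if (partitionValues.containsKey(column)) {
                row.add(partitionValues.get(column)); // partition key: from metadata
            } else {
                row.add(regularValues.get(column));   // regular: from delegate
            }
        }
        return row;
    }
}
```

Because partition key columns are determined per split, a new assembler (with a possibly different set of prefilled columns) would be built for each split.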

Write side

  • There aren't many changes on the write side.
  • Note that on the read side we no longer need to convert Presto types to Hive types, because we've gotten rid of HiveColumnHandle. But on the write side we still have to do this, since we use Hive's Parquet writer in Presto.
  • Neither row delete nor partition delete is supported yet, so I removed the testDelete test in TestIcebergDistributed. (Originally, Presto threw an exception saying "This connector only supports delete where one or more partitions are deleted entirely" for a delete operation on Iceberg tables. But actually, even partition delete is not supported.)
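
For context, the Presto-to-Hive type conversion on the write side amounts to mapping Presto type names onto Hive type names. A hypothetical, name-based sketch of that mapping for a few primitive types (the real conversion uses HiveTypeTranslator on Presto Type objects and also covers decimals, bounded varchars, and nested types):

```java
import java.util.Map;

// Hypothetical sketch of Presto-to-Hive type translation for primitive types.
// The class name and the name-based approach are illustrative only.
public class SimpleTypeTranslator {
    private static final Map<String, String> PRESTO_TO_HIVE = Map.of(
            "integer", "int",
            "bigint", "bigint",
            "real", "float",
            "double", "double",
            "varbinary", "binary",
            "varchar", "string");

    public static String toHiveTypeName(String prestoTypeName) {
        String hiveName = PRESTO_TO_HIVE.get(prestoTypeName);
        if (hiveName == null) {
            throw new IllegalArgumentException("Unsupported type: " + prestoTypeName);
        }
        return hiveName;
    }
}
```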

Issue: #1655
Umbrella issue: #1324
cc: @wagnermarkd @electrum @Parth-Brahmbhatt

@cla-bot cla-bot bot added the cla-signed label Oct 5, 2019
@lxynov lxynov added the WIP label Oct 5, 2019
@lxynov lxynov force-pushed the iceberg-simplification branch from 22f1062 to e314705 Compare October 7, 2019 20:55
@lxynov lxynov removed the WIP label Oct 7, 2019
@lxynov lxynov force-pushed the iceberg-simplification branch 2 times, most recently from 10c214a to ecaba2f Compare October 7, 2019 23:34
}
partitionKeys.add(new HivePartitionKey(name, partitionValue));
Member

How about storing a deserialized Presto type and value object here itself? It reduces the propagation of untyped values a bit.

Member Author


This is a good idea, but for now the value stored in IcebergPartitionKey is not technically a "Presto" value. It's more an "Iceberg" value. The deserializer (IcebergPageSource#deserializePartitionValue) not only deserializes but also converts values from the Iceberg representation to the Presto representation.

I think we should do the conversion first and then serialize/deserialize Presto values, but I'd prefer to do that in a separate future PR, because the value conversion also relates to DomainConverter and ExpressionConverter. We should do a uniform refactoring in the future.

@lxynov lxynov force-pushed the iceberg-simplification branch from ecaba2f to 46f8387 Compare October 21, 2019 22:19
@lxynov
Member Author

lxynov commented Oct 21, 2019

@phd3 Thank you for the review! I've updated this PR.

@electrum Could you help review this? I hope we can get this merged soon to avoid future conflicts with the master branch.

Member

@electrum electrum left a comment


A few minor comments, otherwise looks good. Thanks for cleaning this up and fixing partition handling.

Note that I merged #1561 but that should be a one line change in IcebergMetadata.applyFilter when you rebase.

@@ -35,7 +35,7 @@ protected boolean supportsViews()
@Override
public void testDelete()
{
assertQueryFails("DELETE FROM orders WHERE orderkey % 2 = 0", "This connector only supports delete where one or more partitions are deleted entirely");
Member

IcebergMetadata still implements delete. What happens if you run it after this PR?

Member Author

It throws an exception ("This connector does not support updates or deletes") from the default ConnectorMetadata#getUpdateRowIdColumnHandle.

@@ -93,6 +95,7 @@
public class IcebergPageSink
implements ConnectorPageSink
{
private static final TypeTranslator TYPE_TRANSLATOR = new HiveTypeTranslator();
Member

Let's fork this into Iceberg so that we can remove the TIMESTAMP_WITH_TIME_ZONE hack. It can be a static utility method. We should also be able to remove the binding in IcebergModule.

Member Author

Done. I've added a static utility method to TypeConverter.

}
}

private static Object deserializePartitionValue(Type type, String valueString, String name, TimeZoneKey timeZoneKey)
Member

I feel like there should be a better way to do this, especially the decimal part, but this seems like the best we can do here for now.
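
For context, the decimal case has to turn a string partition value back into Presto's internal representation, which for short decimals is an unscaled long. A hypothetical, self-contained sketch of just that piece (the class and method names are illustrative; the real code also handles long decimals, which Presto stores as Slices):

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical sketch: deserialize a short-decimal partition value string into
// the unscaled long that Presto uses internally for DECIMAL(p, s) with p <= 18.
public class DecimalPartitionValue {
    public static long deserializeShortDecimal(String value, int scale) {
        return new BigDecimal(value)
                .setScale(scale, RoundingMode.UNNECESSARY) // fail rather than silently round
                .unscaledValue()
                .longValueExact();
    }
}
```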

The parameter constraint in PartitionTable#cursor is targeted at
columns in the final partition table, rather than the original
data table.
@lxynov lxynov force-pushed the iceberg-simplification branch from 46f8387 to 7b29cd8 Compare October 22, 2019 22:06
@lxynov
Member Author

lxynov commented Oct 22, 2019

@electrum Thank you for the feedback. I've addressed the comments.

This commit simplifies logic and fixes bugs in the case of Iceberg
partition spec evolution.
@lxynov lxynov force-pushed the iceberg-simplification branch from 7b29cd8 to 2ea9bce Compare October 22, 2019 22:34
@lxynov lxynov changed the title Introduce IcebergPageSource, IcebergColumnHandle and IcebergPartitionKey Introduce IcebergPageSource and IcebergColumnHandle Oct 22, 2019
@electrum electrum merged commit f9e1dae into trinodb:master Oct 23, 2019
@electrum
Member

Merged, thanks!

@lxynov lxynov deleted the iceberg-simplification branch October 23, 2019 17:27