Iceberg Connector #1324

lxynov · 2019-08-19T18:03:23Z

TODOs for the Iceberg Connector

Update the README to reflect the current status, and convert it to proper connector documentation before announcing the connector as ready for use (Iceberg connector documentation #4537, Remove Iceberg README.md #5887)
Lower case all field names read from Iceberg metadata files.
Fix table listing to skip non-Iceberg tables. This will need a new metastore method to list tables filtered on a property name, similar to how view listing works in ThriftHiveMetastore.
Predicate pushdown is currently broken, which means delete is also broken. The code from the original getTableLayouts() implementation needs to be updated for applyFilter().
Delete is broken and should be fixed. Note that unlike Hive Connector, Iceberg Connector should support row-by-row deletion.
All of the HdfsContext calls that use /tmp need to be fixed.
HiveConfig needs to be removed. We might need to split out separate config classes in the Hive connector for the components that are reused in Iceberg.
We should try to remove HiveColumnHandle. This will require replacing or abstracting HivePageSource, which is currently used to handle schema evolution and prefilled column values (identity partitions).
Writing of decimals and timestamps is broken, since their representation in Parquet seems to be different for Iceberg and Hive. Reads are probably also broken, but this isn't tested yet since writes don't work. We will need a native Parquet writer to fix this.
~~UUID type is not implemented and will be dropped from the Iceberg specification.~~
Support Iceberg's UUID in the Iceberg connector #6663
Implement time type.
Partition table
History table
Snapshots table
Manifests table
Files table
Return table statistics so CBO can leverage them.
Add implementation and tests for table comments.
Add implementation and tests for column comments.
Needs complete tests for all data types and all partitioning transforms.
Needs integration tests (probably as product tests) for interoperability with Spark in both directions (write Spark -> read Presto, write Presto -> read Spark).
Needs correctness tests for partition pruning. (also validate the pushdown is happening by checking the query plans?) Iceberg: test partition pruning #2660
Add tests for CREATE TABLE LIKE.
Add test for creating NOT NULL columns.
Add tests for non-Iceberg tables: listing tables in a schema, listing columns in a schema, describing a table, selecting from a table) Add Iceberg tests for Hive tables in the same metastore #5459
Add product tests: Add Iceberg product tests with HDFS and metastore impersonation #2304
Add a procedure to migrate Hive tables to Iceberg #13196
Determine and support appropriate schema evolution semantics for Iceberg table with legacy Hive files #9843
Add procedure for rollback table to snapshot.
ORC support Iceberg Connector ORC support #2042
Add support for AVRO in Iceberg #12125
Allow querying Iceberg table by its location, without registering it in metastore #2298
NOT NULL enforcement
location or external_location table property Iceberg integration: allow to specify LOCATION property on CREATE TABLE #2501
Use metastore locking around read-modify-write operations for transaction commit Data loss due to lack of commit orchestration in Iceberg #9583
Iceberg commit retries Iceberg commit retries #9582
Add tests for truncate on numeric types Add Iceberg tests for truncate on numeric types #5456
Add tests for partition transforms on structured types Add Iceberg tests for partition transforms on structured types #5458
Add tests for Hive tables in the same metastore Add Iceberg tests for Hive tables in the same metastore #5459
Dereference Pushdown for Iceberg Connector Dereference Pushdown for Iceberg Connector #5179
Flaky test TestIcebergCreateTable.testCreateTable Flaky test TestIcebergCreateTable.testCreateTable #4864
Add support for partition evolution Add support for partition evolution in Iceberg. #7580
- Trino cannot read an Iceberg table that has dropped a partition field Trino cannot read an Iceberg table that has dropped a partition field #8284
Test bucketing consistency and stability, like Hive's TestHiveBucketing.
Support predicate pushdown and metadata deletion for non-partition columns Support predicate pushdown and deletion for non-identity partition columns in Iceberg #7905
Run Iceberg product tests with all tested Hive distributions Run Iceberg product tests with all tested Hive distributions #7898
Improve test coverage around partitioned tables and $partition system table Improve test coverage for partitioned tables and partition system table in Iceberg #7972
Add support for Trino views in Iceberg connector Add support for Trino views in Iceberg connector #8540
Add support for Iceberg void transform #8623
Fix reading of specific Iceberg snapshots Fix reading of specific Iceberg snapshots #8663
Accessing non-existent Iceberg system table should result in "table not found" #8690
Properly reject Iceberg tables in Hive connector #8693
Support use-preferred-write-partitioning for the Iceberg connector Support use-preferred-write-partitioning for the Iceberg connector #8682
Improve performance of Iceberg decimal bucket transform Improve performance of Iceberg decimal bucket transform #8724
Unexpected results when reading Iceberg Parquet table after nested field schema evolved Unexpected results when reading Iceberg Parquet table after nested field schema evolved #8750
Evaluate Apache Iceberg's support for predicate on structural types Evaluate Apache Iceberg's support for predicate on structural types #8759
Add $file hidden column in Iceberg connector Add $path hidden column in Iceberg connector #8769
Populate split_offsets in Iceberg data files Populate split_offsets in Iceberg metadata #9018
Iceberg partition pruning does not work for predicates not expressible by tuple domain Iceberg partition pruning does not work for predicates not expressible by tuple domain #9309
Support dynamic filtering in Iceberg connector #4115
Support Glue metastore in Iceberg connector Support Glue metastore in Iceberg connector #9363
Excessive metastore invocations when querying Iceberg table Excessive metastore invocations when querying Iceberg table #8675
Reject Hive configuration properties that have no meaning for Iceberg Reject Hive configuration properties that have no meaning for Iceberg #9607
IcebergSplitSource: Support large IN predicates IcebergSplitSource: Support large IN predicates #9743
Incorrect query results for Iceberg table partitioned on varbinary / binary Incorrect query results for Iceberg table partitioned on varbinary / binary #9755
Revamp Iceberg statistics reporting Iceberg table statistics are non-deterministic #9716
SHOW STATS fails with NPE when Iceberg file has no columns with stats SHOW STATS fails with NPE when Iceberg file has no columns with stats #9714
SHOW STATS fails if Iceberg metadata has no statistics for a file SHOW STATS fails if Iceberg metadata has no statistics for a file #9707
Incorrect values returned when Iceberg table partitioned by timestamp with time zone Incorrect values returned when Iceberg table partitioned by timestamp with time zone #9704
Query failure when reading from $partitions when Iceberg table partitioned on timestamp with time zone Query failure when reading from $partitions when Iceberg table partitioned on timestamp with time zone #9703
Garbage return value from Iceberg $partitions for varbinary non-partition column Garbage return value from Iceberg $partitions for varbinary non-partition column #9756
IcebergSplitSource throws away CombinedScanTask combinations IcebergSplitSource throws away CombinedScanTask combinations #8486
Iceberg Connector ORC writer writes incorrect file size #9810
Improve DeterminePreferredWritePartitioning for projection-based partitioning like in Iceberg #9852
Feature Request: support for Iceberg Dynamodb catalogues #9953
Feature request: Support for jdbc catalog in Iceberg Connector #9968
Support tinyint and smallint when reading Iceberg ORC files #8919
Use ZSTD by default in Iceberg #10058
Support Metrics mode when creating/writing Iceberg tables #9791
Add support to redirect table reads from Hive to Iceberg #10173
Add support to redirect table reads from Iceberg to Hive #10245
support of iceberg v2 format #10758
Respect hive.target-max-file-size in Iceberg #10786
Skip reading Parquet pages using Column Indexes for Iceberg #11000
Support migrating Iceberg v1 tables to v2 #12138
Implement UPDATE for the Iceberg connector #12026
Ability to OPTIMIZE a single time-based partition in Iceberg #12362
Support Iceberg time travel #10258
Iceberg snapshot queries use the latest schema of the table #12743
Filter Iceberg splits based on $path column predicates #12785
Iceberg table name on s3 #5632
Skip Glue archive in Iceberg table commits #13413

The text was updated successfully, but these errors were encountered:

manishmalhotrawork · 2019-08-19T22:31:20Z

@linxingyuan1102 should it also be a TODO for:

"Iceberg table should also allow to give table location?"
as its possible that, from same presto cluster I want to create tables pointing to different S3 account/clusters.

manishmalhotrawork · 2020-01-28T23:01:36Z

@lxynov #2660
added issue for ToDo
"Needs correctness tests for partition pruning. (also validate the pushdown is happening by checking the query plans?)"
Can you please link the issue with the todo.

lxynov · 2020-01-29T20:37:17Z

@manishmalhotrawork sure, done

Just a note, partition pruning in Iceberg is tricky because of partition spec evolution. We need more thoughts and discussion on this.

AbdullaevAPo · 2020-10-13T08:35:28Z

@lxynov Is it planned to add support of hdfs only iceberg tables (like in spark https://iceberg.apache.org/spark/ &spark.sql.catalog.hadoop_prod.type = hadoop ) ?

pPanda-beta · 2021-01-13T20:25:36Z

@lxynov
any update on what @AbdullaevAPo asked?
#1324 (comment)

This feature is a blocker to perfect read-write isolation, having a hive-metastore as a common point of contact between spark and presto is not a scalable solution.

pan3793 · 2021-03-15T01:25:03Z

Is there any plan for supporting HadoopCatalog?

caneGuy · 2021-06-25T09:20:31Z

https://github.com/trinodb/trino/pull/6977/files @pan3793 i think this is related work

KarlManong · 2021-07-09T06:48:55Z

Will you support table configuration properties ?

RomantsovArtur · 2021-12-14T14:02:30Z

Hey.

I don't see in the list support of the UPDATE or CHANGE statement for ALTER TABLE. It would be very handy, since data evolves a lot.

I can see that the functionality exists in the IcerbergAPI: https://iceberg.apache.org/javadoc/master/org/apache/iceberg/UpdateSchema.html

I might be missing something.

bitsondatadev · 2022-01-21T16:31:11Z

https://iceberg.apache.org/javadoc/master/org/apache/iceberg/UpdateSchema.html

@RomantsovArtur, for schema evolution in Trino you can use ALTER TABLE <table-name> ADD|DROP COLUMN ...

See my section on Schema Evolution in this blog: https://blog.starburst.io/trino-on-ice-ii-in-place-table-evolution-and-cloud-compatibility-with-iceberg

If you're looking for updates for partition evolution, we are already tracking #7580 here.

Feel free to reach out to me on Trino Slack if you're looking for something specific.

RomantsovArtur · 2022-01-21T16:53:43Z

@bitsondatadev Thank you for your reply!

We are looking for some logic like:
ALTER TABLE table_name CHANGE [COLUMN] col_name column_new_type

As you can see from the link I provided above - Iceberg API is available, but, unfortunately, Trino does not support this logic.

I read the doc you attached. Thank you for the beautiful blog post. The use case we are trying to achieve is the case when you have a table that is constantly written to and read by different clients, and we want to have an atomic type update rather than

Add a new column to the table
Update new column on converted value from the old column (I think Trino doesn't support update yet)
Rename old column or drop the old column
Rename the new column to the old column name

Please note that I'm speaking about the case when we need to evolve many tables on a regular basics. Some are very huge, 100 b + records.

bitsondatadev · 2022-01-21T17:33:19Z

@bitsondatadev Thank you for your reply!

We are looking for some logic like: ALTER TABLE table_name CHANGE [COLUMN] col_name column_new_type
...

Made a new issue for this. First step is to add the syntax. Then this should be easy to hook up to Iceberg.

RomantsovArtur · 2022-01-21T17:36:24Z

Thank you for the quick reply! Looks great 🚀

rimolive · 2022-02-23T20:28:00Z

Posting here as it seems the central location to enable full support for iceberg as a Trino connector: Is there already support for rewrite_data_files procedure?

nicor88 · 2022-03-22T20:07:35Z

Row Level Delete where added to Iceberg, this means we that DELETE/UPSERT/MERGO INTO are unlock.
I'm wondering when this feature will be included in trino connector (will it be cover by #10758)??

findepi · 2024-01-10T12:09:37Z

We don't use this issue for tracking Iceberg work anymore, so let me close it.
There will always be some work items within such a broad area as Iceberg.
Existing tickets can be found with

searching by label iceberg Iceberg connector
searching by text https://github.com/trinodb/trino/issues?q=is%3Aopen+iceberg

bitsondatadev · 2024-01-10T16:34:02Z

We don't use this issue for tracking Iceberg work anymore, so let me close it.

There will always be some work items within such a broad area as Iceberg.

Existing tickets can be found with

searching by label iceberg Iceberg connector

searching by text https://github.com/trinodb/trino/issues?q=is%3Aopen+iceberg

That being said, I really appreciate all the effort that was put into maintaining this initial roadmap.

That said, we should align in how we view larger efforts!

Thanks all!

findepi added the roadmap Top level issues for major efforts in the project label Aug 19, 2019

lxynov mentioned this issue Aug 21, 2019

Fix table listing in Iceberg to skip non-Iceberg tables #1354

Merged

This was referenced Sep 7, 2019

Fix fetching Hadoop configuration in Iceberg #1459

Merged

Replace HiveConfig with IcebergConfig #1469

Merged

This was referenced Sep 11, 2019

Fix Iceberg partition tables #1377

Merged

Simplify ParquetPageSource #1526

Merged

This was referenced Sep 18, 2019

Decouple ParquetPageSource from HiveColumnHandle #1552

Merged

Decouple OrcPageSource from HiveColumnHandle #1557

Merged

Praveen2112 mentioned this issue Sep 20, 2019

Iceberg predicate pushdown #1561

Merged

lxynov mentioned this issue Sep 23, 2019

Add support for Iceberg history tables #1575

Merged

lxynov mentioned this issue Oct 5, 2019

Introduce IcebergPageSource and IcebergColumnHandle #1675

Merged

lxynov mentioned this issue Oct 24, 2019

Add support for Iceberg snapshots tables #1850

Merged

lxynov mentioned this issue Nov 19, 2019

Iceberg Connector ORC support #2042

Merged

lxynov mentioned this issue Dec 6, 2019

Manifests and Files tables for Iceberg Connector #2223

Merged

jdintruff mentioned this issue Dec 18, 2019

Allow querying Iceberg table by its location, without registering it in metastore #2298

Open

findepi added the enhancement New feature or request label Mar 10, 2020

lxynov mentioned this issue Jun 17, 2020

Introduce IcebergFileWriter and collect file stats directly from ORC writers #4055

Merged

djsstarburst mentioned this issue Jun 18, 2020

Test iceberg create table like #4083

Merged

djsstarburst self-assigned this Jul 10, 2020

This was referenced Jul 11, 2020

Change Iceberg to use the new ParquetFileWriter #4424

Merged

Add support for Iceberg snapshot rollback #4283

Merged

This was referenced Jul 11, 2020

Bump Iceberg version to 0.8.0-incubating #4309

Merged

Support catalogs with Hive and Iceberg tables #4442

Closed

Fix HDFS impersonation in Iceberg Connector #4460

Merged

This was referenced Aug 5, 2020

Add support to redirect from a Hive catalog to an Iceberg catalog #4704

Closed

Add Avro support to Iceberg Connector #4776

Closed

electrum unassigned djsstarburst Oct 8, 2020

lxynov mentioned this issue Oct 29, 2020

Add table redirection support to SPI and Hive's redirection to Iceberg #5160

Closed

findepi mentioned this issue Nov 25, 2020

Update JDBI to 3.17.0 #6089

Merged

findepi mentioned this issue Oct 11, 2021

Data loss due to lack of commit orchestration in Iceberg #9583

Closed

findepi mentioned this issue Oct 22, 2021

Use dynamic filter to prune Iceberg splits based on partition values #9193

Merged

bitsondatadev mentioned this issue Jan 21, 2022

Add ALTER TABLE syntax to handle column type change #10736

Closed

alexjo2144 mentioned this issue Apr 28, 2022

Support updating an Iceberg table's partition scheme #12174

Closed

jebnix mentioned this issue Nov 19, 2022

Support JDBC catalog in Iceberg connector #11772

Merged

findepi closed this as completed Jan 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Iceberg Connector #1324

Iceberg Connector #1324

lxynov commented Aug 19, 2019 •

edited by ebyhr

Loading

manishmalhotrawork commented Aug 19, 2019 •

edited

Loading

manishmalhotrawork commented Jan 28, 2020 •

edited

Loading

lxynov commented Jan 29, 2020

AbdullaevAPo commented Oct 13, 2020 •

edited

Loading

pPanda-beta commented Jan 13, 2021

pan3793 commented Mar 15, 2021

caneGuy commented Jun 25, 2021

KarlManong commented Jul 9, 2021

RomantsovArtur commented Dec 14, 2021

bitsondatadev commented Jan 21, 2022

RomantsovArtur commented Jan 21, 2022

bitsondatadev commented Jan 21, 2022

RomantsovArtur commented Jan 21, 2022

rimolive commented Feb 23, 2022

nicor88 commented Mar 22, 2022

findepi commented Jan 10, 2024 •

edited

Loading

bitsondatadev commented Jan 10, 2024

Iceberg Connector #1324

Iceberg Connector #1324

Comments

lxynov commented Aug 19, 2019 • edited by ebyhr Loading

TODOs for the Iceberg Connector

manishmalhotrawork commented Aug 19, 2019 • edited Loading

manishmalhotrawork commented Jan 28, 2020 • edited Loading

lxynov commented Jan 29, 2020

AbdullaevAPo commented Oct 13, 2020 • edited Loading

pPanda-beta commented Jan 13, 2021

pan3793 commented Mar 15, 2021

caneGuy commented Jun 25, 2021

KarlManong commented Jul 9, 2021

RomantsovArtur commented Dec 14, 2021

bitsondatadev commented Jan 21, 2022

RomantsovArtur commented Jan 21, 2022

bitsondatadev commented Jan 21, 2022

RomantsovArtur commented Jan 21, 2022

rimolive commented Feb 23, 2022

nicor88 commented Mar 22, 2022

findepi commented Jan 10, 2024 • edited Loading

bitsondatadev commented Jan 10, 2024

lxynov commented Aug 19, 2019 •

edited by ebyhr

Loading

manishmalhotrawork commented Aug 19, 2019 •

edited

Loading

manishmalhotrawork commented Jan 28, 2020 •

edited

Loading

AbdullaevAPo commented Oct 13, 2020 •

edited

Loading

findepi commented Jan 10, 2024 •

edited

Loading