Support sorted writes in the Iceberg connector #14891

alexjo2144 · 2022-11-03T20:47:17Z

Description

Support sorting files during inserts to the Iceberg connector. This reuses the SortingFileWriter from the Hive connector.

Non-technical explanation

Sorting enables better performance during selective read queries, where a small range of values is needed from a high-cardinality column.

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Iceberg
* Support for Iceberg table sort orders. Tables can be created or altered to add a list of `sorted_by` columns which will be used to order files written to the table.

alexjo2144 · 2022-11-03T20:48:01Z

Note to self: rework the "Allow updating the sorted_by Iceberg table property" commit to give @osscm author credit.

findinpath · 2022-11-07T12:30:28Z

Please add compatibility tests with Spark.

findinpath · 2022-11-07T12:41:42Z

Sorting enables better performance during selective read queries, where a small range of values is needed from a high cardinality column.

I'd appreciate a demo test showing that fewer files are read when the table uses sorted_by. Source of inspiration: io.trino.plugin.iceberg.TestIcebergMetadataFileOperations
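To illustrate the effect such a demo test would look for, here is a minimal toy model in plain Java. It is not Trino or Iceberg code: `FilePruningSketch`, `FileStats`, `writeFiles`, and `filesToRead` are hypothetical names, and a "file" is reduced to the min and max of the sort column. It only shows why a range predicate overlaps fewer files when values are written in sorted order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model, not Trino code: a "file" is reduced to the min and max of the sort column,
// which is the information a reader can use to skip it.
final class FilePruningSketch
{
    static final class FileStats
    {
        final int min;
        final int max;

        FileStats(int min, int max)
        {
            this.min = min;
            this.max = max;
        }
    }

    // Cut the values into files of fileSize rows, in arrival order, and record min/max per file.
    static List<FileStats> writeFiles(List<Integer> values, int fileSize)
    {
        List<FileStats> files = new ArrayList<>();
        for (int start = 0; start < values.size(); start += fileSize) {
            List<Integer> chunk = values.subList(start, Math.min(start + fileSize, values.size()));
            files.add(new FileStats(Collections.min(chunk), Collections.max(chunk)));
        }
        return files;
    }

    // Count the files whose [min, max] range overlaps the predicate range [low, high].
    static long filesToRead(List<FileStats> files, int low, int high)
    {
        return files.stream()
                .filter(file -> file.max >= low && file.min <= high)
                .count();
    }

    public static void main(String[] args)
    {
        List<Integer> unsorted = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            unsorted.add((i * 37) % 100); // a fixed permutation of 0..99
        }
        List<Integer> sorted = new ArrayList<>(unsorted);
        Collections.sort(sorted);

        System.out.println("files read, unsorted: " + filesToRead(writeFiles(unsorted, 10), 40, 45));
        System.out.println("files read, sorted:   " + filesToRead(writeFiles(sorted, 10), 40, 45));
    }
}
```

In this model the sorted layout puts the values 40 to 45 into a single file, while the unsorted layout spreads them across several files whose ranges all overlap the predicate. A real test would need to count actual file reads through the connector instead.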

I don't see sortOrder in the IcebergTableHandle. Will sorted reads eventually come as a follow-up PR?

findinpath · 2022-11-07T11:53:28Z

plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/catalog/BaseTrinoCatalogTest.java

+            assertThat(catalog.listTables(SESSION, Optional.empty())).contains(schemaTableName);
+
+            Table icebergTable = catalog.loadTable(SESSION, schemaTableName);
+            assertEquals(icebergTable.name(), quotedTableName(schemaTableName));


The assertions related to the columns were already made in io.trino.plugin.iceberg.catalog.BaseTrinoCatalogTest#testCreateTable . Are they relevant here as well?

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/SortFields.java

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java

findinpath · 2022-11-07T12:08:24Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/SortingFileWriterConfig.java

+        return writerSortBufferSize;
+    }
+
+    @Config("hive.writer-sort-buffer-size")


Pre-existing: It would be beneficial to have the purpose of this property documented (in the code).

plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/BaseIcebergConnectorSmokeTest.java

findinpath · 2022-11-07T12:20:40Z

plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/BaseIcebergConnectorTest.java

-
-        dropTable("test_sorted_with_partition_table");
+                assertUpdate("INSERT INTO " + tableName + " VALUES (true, 1, 5), (false, 2, 4), (true, 3, 3), (false, 4, 2), (true, 5, 1)", 5);
+        assertQuery("SELECT * FROM " + tableName, "VALUES (true, 1, 5), (false, 2, 4), (true, 3, 3), (false, 4, 2), (true, 5, 1)");


What do we showcase in this test?
I think the test would also pass without the sorted_by property specified on the table.

gentle reminder

new reminder :)

plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergV2.java

alexjo2144 · 2022-11-07T22:21:06Z

Still working on test cases and Marius' comments, but added support for sorting during updates and during optimize.

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveConfig.java

findinpath · 2022-11-08T06:03:36Z

plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergParquetConnectorTest.java

+    void verifyFileIsSorted(String path, String sortColumnName)
+    {
+        Comparable previousMax = null;
+        try (ParquetFileReader parquetReader = ParquetFileReader.open(HadoopInputFile.fromPath(new Path(path), ConfigurationInstantiator.newEmptyConfiguration()))) {


Naive question: Why not read all the rows from the given Parquet file and verify that each row follows the sort order contract with respect to the previous row?

Hmmm, we could, it just sounds more complex to set up the test.
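For reference, the row-by-row check asked about above is small once the rows are in memory. The sketch below is plain Java with hypothetical names (`SortOrderCheckSketch`, `isSorted`); it assumes the sort column values have already been read into a list, and it does not cover the Parquet reading setup, which is the part described as more complex.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch only: assumes the sort column values were already read from the file into a list.
final class SortOrderCheckSketch
{
    // Returns true when every row compares greater than or equal to the row before it.
    static <T> boolean isSorted(List<T> rows, Comparator<? super T> order)
    {
        for (int i = 1; i < rows.size(); i++) {
            if (order.compare(rows.get(i - 1), rows.get(i)) > 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args)
    {
        // Ascending order with nulls first, expressed as a standard Comparator.
        Comparator<Integer> ascNullsFirst = Comparator.nullsFirst(Comparator.<Integer>naturalOrder());

        List<Integer> goodRows = Arrays.asList(null, 1, 2, 2, 5);
        List<Integer> badRows = Arrays.asList(1, 3, 2);

        System.out.println(isSorted(goodRows, ascNullsFirst));
        System.out.println(isSorted(badRows, ascNullsFirst));
    }
}
```

Passing the ordering as a `Comparator` keeps the same loop usable for descending order or nulls-last placement.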

plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/BaseIcebergConnectorTest.java

alexjo2144 · 2022-11-15T21:24:29Z

@findinpath @findepi @ebyhr I think this is ready for review. Please take a look when you get a chance

findinpath

LGTM % comments

plugin/trino-hive/src/test/java/io/trino/plugin/hive/TestHivePageSink.java

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergUtil.java

findinpath · 2022-11-17T12:04:06Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/SortFields.java

+
+    private static void parseSortField(SortOrderBuilder<?> builder, String field)
+    {
+        boolean matched = tryMatch(field, IDENTITY_ASC_NULLS_FIRST_PATTERN, match -> builder.asc(fromIdentifierToColumn(match.group(1).trim()), NullOrder.NULLS_FIRST)) ||


Should we extract fromIdentifierToColumn to a shared utility class?

Probably, but I'd rather leave it until we add transform support and clean it up then. I think there are a few other things that should be central between both partition and sort transform parsing.

#15088
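As a rough illustration of the pattern-based parsing discussed in this thread, here is a self-contained sketch. It is not the code in SortFields.java: the single regular expression, the `SortField` holder, and the default null ordering (nulls first for ascending, nulls last for descending) are assumptions made for this example, and identifier quoting and transforms are not handled.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the real SortFields class uses its own patterns and Iceberg's SortOrder builder.
final class SortFieldParseSketch
{
    // Assumed grammar: <column> [ASC|DESC] [NULLS FIRST|NULLS LAST]
    private static final Pattern SORT_FIELD = Pattern.compile(
            "\\s*(\\w+)(?:\\s+(ASC|DESC))?(?:\\s+NULLS\\s+(FIRST|LAST))?\\s*",
            Pattern.CASE_INSENSITIVE);

    static final class SortField
    {
        final String column;
        final boolean ascending;
        final boolean nullsFirst;

        SortField(String column, boolean ascending, boolean nullsFirst)
        {
            this.column = column;
            this.ascending = ascending;
            this.nullsFirst = nullsFirst;
        }
    }

    static SortField parse(String field)
    {
        Matcher matcher = SORT_FIELD.matcher(field);
        if (!matcher.matches()) {
            throw new IllegalArgumentException("Unable to parse sort field: [" + field + "]");
        }
        String column = matcher.group(1).toLowerCase(Locale.ENGLISH);
        // A missing direction is treated as ascending.
        boolean ascending = matcher.group(2) == null || matcher.group(2).equalsIgnoreCase("ASC");
        // Assumed default: nulls first when ascending, nulls last when descending.
        boolean nullsFirst = matcher.group(3) == null ? ascending : matcher.group(3).equalsIgnoreCase("FIRST");
        return new SortField(column, ascending, nullsFirst);
    }

    public static void main(String[] args)
    {
        SortField field = parse("order_key DESC NULLS LAST");
        System.out.println(field.column + " ascending=" + field.ascending + " nullsFirst=" + field.nullsFirst);
    }
}
```

A shared utility for partition and sort field parsing, as suggested above, would mostly centralize the identifier handling that this sketch reduces to `toLowerCase`.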

findinpath · 2022-11-17T12:06:49Z

plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/BaseIcebergConnectorTest.java

-
-        dropTable("test_sorted_with_partition_table");
+                assertUpdate("INSERT INTO " + tableName + " VALUES (true, 1, 5), (false, 2, 4), (true, 3, 3), (false, 4, 2), (true, 5, 1)", 5);
+        assertQuery("SELECT * FROM " + tableName, "VALUES (true, 1, 5), (false, 2, 4), (true, 3, 3), (false, 4, 2), (true, 5, 1)");


gentle reminder

findinpath · 2022-11-17T12:56:26Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergConfig.java

+        return sortedWritingEnabled;
+    }
+
+    @Config("iceberg.sorted-writing-enabled")


Is this a (temporary) fallback in case that the sorted writing does not work as expected?

It can also be useful if your writes are very small (streaming ingest, for example) such that sorting them would be a waste of time until they are compacted.

Whether writes are small sounds query-dependent, so it warrants a session toggle more than a catalog config.

Also, can a writer detect that the written data is small and not worth sorting?
OTOH, sorting a small amount of data doesn't sound like a big deal (as long as it happens fully in memory and doesn't add latency), so why would we care?

The config should still remain as a kill switch.

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSortingFileWriter.java

plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/BaseIcebergConnectorTest.java

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSourceProvider.java

alexjo2144 · 2023-02-23T14:55:39Z

Addressed doc comments.

I do need to fix the tests though. I moved some to the smoke tests so that they run against all supported file systems but they're not working yet.

alexjo2144 · 2023-02-27T22:06:13Z

@ebyhr @findepi I think I got everything green with this one. Can you kick off a build with secrets?

ebyhr · 2023-02-28T03:02:04Z

/test-with-secrets sha=0f09920b81b690612b026fcfd6e7c4cb252951ee

github-actions · 2023-02-28T07:16:52Z

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/4289597275

findepi · 2023-02-28T08:46:01Z

Run with secrets failed: #13199 (reopened)

Co-authored-by: Alex Jo <jo.alex2144@gmail.com>

findepi · 2023-02-28T08:48:09Z

rebased to resolve a conflict

findepi · 2023-02-28T09:23:52Z

thanks!

Cherry-pick of trinodb/trino#14891 Co-authored-by: Alexander Jo <jo.alex2144@gmail.com>

cla-bot bot added the cla-signed label Nov 3, 2022

alexjo2144 marked this pull request as draft November 3, 2022 20:48

alexjo2144 requested review from homar, ebyhr and findepi November 3, 2022 20:48

github-actions bot added the tests:hive label Nov 3, 2022

alexjo2144 self-assigned this Nov 4, 2022

findinpath reviewed Nov 7, 2022

alexjo2144 force-pushed the iceberg/sorted-writes branch from ca6d0fb to 105b42b Compare November 7, 2022 22:15

findinpath reviewed Nov 8, 2022

alexjo2144 force-pushed the iceberg/sorted-writes branch 5 times, most recently from 37d76ab to e1191f4 Compare November 14, 2022 21:22

findepi mentioned this pull request Nov 14, 2022

Added support for sorted_by while creating iceberg table #12872

Closed

alexjo2144 force-pushed the iceberg/sorted-writes branch 2 times, most recently from 2362877 to b0d4cf0 Compare November 15, 2022 19:39

alexjo2144 marked this pull request as ready for review November 15, 2022 21:23

alexjo2144 requested review from osscm and findinpath November 15, 2022 21:23

findinpath reviewed Nov 17, 2022

alexjo2144 mentioned this pull request Nov 17, 2022

Support Iceberg sort transforms #15088

Open

alexjo2144 force-pushed the iceberg/sorted-writes branch from b0d4cf0 to 4cdd588 Compare November 17, 2022 18:44

alexjo2144 force-pushed the iceberg/sorted-writes branch 2 times, most recently from c39baaa to c3f4fc1 Compare February 24, 2023 18:17

alexjo2144 added the iceberg Iceberg connector label Feb 24, 2023

findepi mentioned this pull request Feb 27, 2023

Add support for CREATE OR REPLACE TABLE statement #13681

Merged

alexjo2144 force-pushed the iceberg/sorted-writes branch 2 times, most recently from 33590e7 to 0f09920 Compare February 27, 2023 17:19

findepi mentioned this pull request Feb 28, 2023

Table being modified concurrently happens in Glue tests #13199

Closed

alexjo2144 and others added 3 commits February 28, 2023 09:46

Migrate Iceberg GCS test to TrinoFileSystem

9dd5385

Extract writer sorting properties from HiveConfig

90b1fc6

Support the Iceberg sorted_by table property

43f58d8

Co-authored-by: Alex Jo <jo.alex2144@gmail.com>

findepi force-pushed the iceberg/sorted-writes branch from 0f09920 to f9d5336 Compare February 28, 2023 08:48

alexjo2144 and others added 2 commits February 28, 2023 09:51

Support sorted writes to Iceberg tables

a6cb722

Document Iceberg sorted_by table property

b7adc4c

findepi force-pushed the iceberg/sorted-writes branch from f9d5336 to b7adc4c Compare February 28, 2023 09:23

findepi merged commit da230aa into trinodb:master Feb 28, 2023

github-actions bot added this to the 409 milestone Feb 28, 2023

alexjo2144 deleted the iceberg/sorted-writes branch February 28, 2023 15:13

colebow mentioned this pull request Mar 1, 2023

Add Trino 409 release notes #16335

Merged

ebyhr mentioned this pull request May 26, 2023

Iceberg's create table to support sort_order property #12447

Closed

alexjo2144 mentioned this pull request Aug 16, 2023

Approximate written bytes in Hive and Iceberg sorted writers #18706

Merged

evanvdia mentioned this pull request Feb 21, 2024

Add Support for Iceberg table sort orders prestodb/presto#21977

Merged

7 tasks

evanvdia added a commit to evanvdia/presto that referenced this pull request Feb 7, 2025

Add Support for Iceberg table sort orders

d0a639b

Cherry-pick of trinodb/trino#14891 Co-authored-by: Alexander Jo <jo.alex2144@gmail.com>

ZacBlanco pushed a commit to prestodb/presto that referenced this pull request Feb 7, 2025

Add Support for Iceberg table sort orders

003d86a

Cherry-pick of trinodb/trino#14891 Co-authored-by: Alexander Jo <jo.alex2144@gmail.com>
