Fix bug Delta column operations erases its properties on uppercase names #18123

ebyhr · 2023-07-04T10:31:48Z

Description

Fixes #18013
Fix the following issues:

* Show column comments and `NOT NULL` constraints correctly when the column name contains uppercase characters. Previously, incorrect information was returned.
* Fix `ADD COLUMN` erasing existing column properties when the column name contains uppercase characters.
* Fix `COMMENT ON COLUMN` erasing existing column properties when the column name contains uppercase characters.
* Store the exact column names in column statistics.

The changes are kept in a single commit because the affected code paths are tightly coupled.

Release notes

# Delta Lake
* Show column comments and `NOT NULL` constraints correctly when the column name contains uppercase characters. ({issue}`issuenumber`)
* Fix `ADD COLUMN` deleting column properties when the column name contains uppercase characters. ({issue}`issuenumber`)
* Fix `COMMENT ON COLUMN` deleting column properties when the column name contains uppercase characters. ({issue}`issuenumber`)

findepi · 2023-07-31T09:55:18Z

Fix column case sensitivity issue in Delta Lake

Can you please amend the title so it hints at what this is fixing?

...uct-tests/src/main/java/io/trino/tests/product/deltalake/TestDeltaLakeColumnMappingMode.java

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeMetadata.java

ebyhr · 2023-08-01T07:33:52Z

@findepi Updated commit and PR title & description.

findepi

(not full review)

findepi · 2023-08-01T08:52:56Z

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeColumnHandle.java

@@ -159,7 +160,7 @@ public boolean equals(Object obj)
    public String getColumnName()
    {
        checkState(isBaseColumn(), "Unexpected dereference: %s", this);
-        return baseColumnName;
+        return baseColumnName.toLowerCase(ENGLISH);


why lowercase here?

I was trying to preserve the previous behavior for some usages, e.g. access control in TableChangesFunction.
Moved this toLowerCase to TableChangesFunction.

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeColumnMetadata.java

findepi · 2023-08-01T08:56:20Z

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeMetadata.java

@@ -680,7 +681,7 @@ public Iterator<TableColumnsMetadata> streamTableColumns(ConnectorSession sessio
                        Map<String, Boolean> columnsNullability = getColumnsNullability(metadata);
                        Map<String, String> columnGenerations = getGeneratedColumnExpressions(metadata);
                        List<ColumnMetadata> columnMetadata = getColumns(metadata).stream()
-                                .map(column -> getColumnMetadata(column, columnComments.get(column.getColumnName()), columnsNullability.getOrDefault(column.getBaseColumnName(), true), columnGenerations.get(column.getBaseColumnName())))
+                                .map(column -> getColumnMetadata(column, columnComments.get(column.getBaseColumnName()), columnsNullability.getOrDefault(column.getBaseColumnName(), true), columnGenerations.get(column.getBaseColumnName())))


columnComments has non-lowercased (original) keys.
column.getBaseColumnName() is (now) lowercased, so a mismatch.
am i reading this wrong?

> columnComments has non-lowercased (original) keys.

Right.

> column.getBaseColumnName() is (now) lowercased, so a mismatch.

getBaseColumnName was lowercased previously. It returns the original column names after this change.
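
To illustrate the mismatch discussed in this thread, here is a minimal sketch. It assumes the property maps are keyed by original column names, as confirmed above. The class name and sample data are hypothetical and this is not the connector's code. It shows how a lowercased lookup key makes a comment look absent and how the `getOrDefault(..., true)` fallback hides a `NOT NULL` constraint.

```java
import java.util.Locale;
import java.util.Map;

final class ColumnPropertyLookupSketch
{
    private ColumnPropertyLookupSketch() {}

    // Mirrors the shape of columnsNullability.getOrDefault(name, true) from the diff:
    // a missing key silently falls back to "nullable".
    static boolean isNullable(Map<String, Boolean> columnsNullability, String lookupName)
    {
        return columnsNullability.getOrDefault(lookupName, true);
    }

    public static void main(String[] args)
    {
        // Hypothetical table metadata, keyed by the original (case-preserved) column names
        Map<String, String> columnComments = Map.of("UpperCol", "a comment");
        Map<String, Boolean> columnsNullability = Map.of("UpperCol", false);

        String original = "UpperCol";
        String lowercased = original.toLowerCase(Locale.ENGLISH);

        System.out.println(columnComments.get(lowercased)); // null: the comment appears to be missing
        System.out.println(isNullable(columnsNullability, lowercased)); // true: NOT NULL is not reported
        System.out.println(columnComments.get(original)); // a comment
        System.out.println(isNullable(columnsNullability, original)); // false
    }
}
```

The silent default is what makes this kind of bug easy to miss: nothing fails, the column is simply reported as nullable and uncommented.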

findepi · 2023-08-01T08:57:30Z

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeMetadata.java

-            ImmutableList.Builder<String> columnNames = ImmutableList.builderWithExpectedSize(tableMetadata.getColumns().size());
-            ImmutableMap.Builder<String, Object> columnTypes = ImmutableMap.builderWithExpectedSize(tableMetadata.getColumns().size());
-            for (ColumnMetadata columnMetadata : tableMetadata.getColumns()) {
-                if (columnMetadata.isHidden()) {


how does the removal of this relate to case sensitivity?

The previous logic used getTableMetadata and getPartitionedBy that lowercased column names internally.

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeMetadata.java

findepi · 2023-08-01T09:04:31Z

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeMetadata.java

@@ -2994,7 +2993,7 @@ private void updateTableStatistics(
    private static String toPhysicalColumnName(String columnName, Optional<Map<String, String>> physicalColumnNameMapping)
    {
        if (physicalColumnNameMapping.isPresent()) {
-            String physicalColumnName = physicalColumnNameMapping.get().get(columnName);
+            String physicalColumnName = physicalColumnNameMapping.get().get(columnName.toLowerCase(ENGLISH));


why lower-case here?

The columnName variable is lowercase when it comes from ComputedStatistics. Added a code comment.

it's not obvious that keys in physicalColumnNameMapping are lowercase.

can we map columnName to original/exact column name and then map to physical name as separate step?
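
A rough sketch of the two-step mapping suggested here, under stated assumptions: the class and the `lowercaseToOriginal` helper are hypothetical, the physical mapping is assumed to be keyed by original column names, and column names are assumed to be unique when compared case-insensitively. It is not the change that was eventually merged.

```java
import java.util.Collection;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;
import java.util.stream.Collectors;

final class PhysicalNameSketch
{
    private PhysicalNameSketch() {}

    // Step 0: index the original names by their lowercase form.
    // Collectors.toMap throws IllegalStateException if two names collide after lowercasing.
    static Map<String, String> lowercaseToOriginal(Collection<String> originalNames)
    {
        return originalNames.stream()
                .collect(Collectors.toMap(name -> name.toLowerCase(Locale.ENGLISH), Function.identity()));
    }

    static String toPhysicalColumnName(
            String lowercaseName,
            Map<String, String> lowercaseToOriginal,
            Optional<Map<String, String>> originalToPhysical)
    {
        // Step 1: lowercase name (e.g. from computed statistics) -> original name
        String original = lowercaseToOriginal.get(lowercaseName);
        if (original == null) {
            throw new IllegalArgumentException("Unknown column: " + lowercaseName);
        }
        // Step 2: original name -> physical name, only when column mapping is in use
        if (!originalToPhysical.isPresent()) {
            return original;
        }
        String physical = originalToPhysical.get().get(original);
        if (physical == null) {
            throw new IllegalArgumentException("No physical name for column: " + original);
        }
        return physical;
    }
}
```

Keeping the steps separate makes the casing of each map's keys visible in the parameter names instead of relying on a `toLowerCase` call at the lookup site.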

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeMetadata.java

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeOutputTableHandle.java

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakePageSource.java

findepi · 2023-08-01T09:27:11Z

Thanks for working on this. Even though the work is limited to Delta, it shows how lower-casing of String is a hard problem (#17).

I think it would help review & ensure we're doing the right thing if we can somehow express which String values or map keys are lower-cased and which are original. Perhaps practically all should be original except when interfacing with SPI, so we should eradicate most of the toLowerCase calls (still many present in the current state of this PR). If we cannot eradicate them, maybe we use some annotation or separate types as a demarcation between lower-cased and original, but at this point removing most of the lower-casing looks more appealing to me.
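
For reference, the "separate types" option could look roughly like the sketch below. The `OriginalName` and `LowercaseName` types are hypothetical and do not exist in the connector. The sketch only shows how the compiler could enforce the demarcation, assuming Java 16+ records are available.

```java
import java.util.Locale;
import java.util.Objects;

final class ColumnNameTypesSketch
{
    private ColumnNameTypesSketch() {}

    // A column name exactly as stored in the table metadata
    record OriginalName(String value)
    {
        OriginalName
        {
            Objects.requireNonNull(value, "value is null");
        }

        // The only sanctioned way to obtain a lowercase name
        LowercaseName toLowercase()
        {
            return new LowercaseName(value.toLowerCase(Locale.ENGLISH));
        }
    }

    // A column name known to be lowercase, e.g. at the SPI boundary
    record LowercaseName(String value)
    {
        LowercaseName
        {
            Objects.requireNonNull(value, "value is null");
            if (!value.equals(value.toLowerCase(Locale.ENGLISH))) {
                throw new IllegalArgumentException("Not lowercase: " + value);
            }
        }
    }
}
```

With such types, a `Map<OriginalName, String>` cannot be queried with a `LowercaseName` by accident, at the cost of wrapping and unwrapping at every boundary. That cost is one reason simply removing most lowercasing may be the more appealing route.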

ebyhr · 2023-08-02T03:13:30Z

Addressed comments. Removed toLowerCase calls from this PR as much as possible.

findepi · 2023-08-03T13:28:59Z

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeMetadata.java

        for (DataFileInfo info : dataFileInfos) {
            // using Hashmap because partition values can be null
            Map<String, String> partitionValues = new HashMap<>();
            for (int i = 0; i < partitionColumnNames.size(); i++) {
                partitionValues.put(partitionColumnNames.get(i), info.getPartitionValues().get(i));
            }
+
+            Optional<Map<String, Object>> minStats = toOriginalColumnNames(info.getStatistics().getMinValues(), toOriginalColumnNames);


toOriginalColumnNames has this comment: "Lowercase column names because statistics generated by Trino has lowercase names"
yet, we're operating on DataFileInfo that Delta connector created (not engine)

are they lowercase because ColumnChunkMetaData.getPath ends up being lowercase in
https://github.com/trinodb/trino/blob/e26bf49d57cc596014b051485c7e0898d20798d5/plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeWriter.java#L218?

please update the code comment in toOriginalColumnNames

Updated the code comment. The Parquet file contains the original names, but they are lowercased when reading the metadata at

trino/lib/trino-parquet/src/main/java/io/trino/parquet/reader/MetadataReader.java

Lines 139 to 141 in 023f8b4

    String[] path = metaData.path_in_schema.stream()
            .map(value -> value.toLowerCase(Locale.ENGLISH))
            .toArray(String[]::new);
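
Given that the reader lowercases the paths, the statistics keys have to be mapped back before they are written to the transaction log. A minimal sketch of that step follows. The method name mirrors `toOriginalColumnNames` from the diff, but the body, the class name, and the fallback for unknown keys are assumptions, not the merged implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

final class StatsKeySketch
{
    private StatsKeySketch() {}

    // Re-keys per-column statistics from lowercase names to original names.
    // Assumption: keys without a known original name are kept unchanged.
    static Optional<Map<String, Object>> toOriginalColumnNames(
            Optional<Map<String, Object>> statistics,
            Map<String, String> lowercaseToOriginal)
    {
        return statistics.map(stats -> {
            Map<String, Object> renamed = new LinkedHashMap<>();
            stats.forEach((name, value) -> renamed.put(lowercaseToOriginal.getOrDefault(name, name), value));
            return renamed;
        });
    }
}
```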

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeMetadata.java

findepi · 2023-08-03T13:40:41Z

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeMetadata.java

@@ -2994,7 +2993,7 @@ private void updateTableStatistics(
    private static String toPhysicalColumnName(String columnName, Optional<Map<String, String>> physicalColumnNameMapping)
    {
        if (physicalColumnNameMapping.isPresent()) {
-            String physicalColumnName = physicalColumnNameMapping.get().get(columnName);
+            String physicalColumnName = physicalColumnNameMapping.get().get(columnName.toLowerCase(ENGLISH));


it's not obvious that keys in physicalColumnNameMapping are lowercase.

can we map columnName to original/exact column name and then map to physical name as separate step?

findepi · 2023-08-03T13:43:44Z

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeWriter.java

@@ -187,7 +188,8 @@ public DataFileInfo getDataFileInfo()
    {
        TrinoInputFile inputFile = fileSystem.newInputFile(rootTableLocation.appendPath(relativeFilePath));
        Map<String, Type> dataColumnTypes = columnHandles.stream()
-                .collect(toImmutableMap(DeltaLakeColumnHandle::getBasePhysicalColumnName, DeltaLakeColumnHandle::getBasePhysicalType));
+                // Lowercase because the subsequent logic expects lowercase


Map<String, Type> -> Map</* lowercase */ String, Type>

same in readStatistics & mergeStats params

.. or switch to CanonicalColumnName as map keys

~~Can we switch back to the original names inside of the DeltaLakeFileStatistics implementations instead of doing it remotely in io.trino.plugin.deltalake.DeltaLakeMetadata#appendAddFileEntries ?~~

findepi · 2023-08-03T13:55:02Z

i looked thru where we still lowercase and added some comments

general comments

we have CanonicalColumnName to convey case-insensitive matches. The class should probably be renamed because these days we wouldn't call "lowercase" and "canonical" the same thing.
DeltaLakeColumnMetadata.getName is confusing as it implicitly returns the lowercase name.
- the class should be refactored not to have ColumnMetadata as a field; instead it should have a toColumnMetadata method.
- getName should return the actual name (just like DeltaLakeColumnHandle.getColumnName), and then we don't need a separate getOriginalName.

cc @alexjo2144 @findinpath

ebyhr · 2023-08-04T04:10:22Z

CI hit #17512

findinpath · 2023-08-04T16:10:18Z

...n/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/MetadataEntry.java

@@ -120,7 +120,7 @@ public List<String> getOriginalPartitionColumns()
     * For use in read-path. Returns lowercase partition column names.
     */
    @JsonIgnore
-    public List<String> getCanonicalPartitionColumns()
+    public List<String> getLowercasePartitionColumns()


~~IDE marks it as unused~~

IDE is wrong

However, the method is used only in tests.

This method is used by non-test methods as well.

findinpath · 2023-08-04T16:32:36Z

...no-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/TransactionLogUtil.java

@@ -42,8 +41,7 @@ public static Map<String, Optional<String>> canonicalizePartitionValues(Map<Stri
    {
        return partitionValues.entrySet().stream()
                .collect(toImmutableMap(
-                        // canonicalize partition keys to lowercase so they match column names used in DeltaLakeColumnHandle
-                        entry -> canonicalizeColumnName(entry.getKey()),
+                        Map.Entry::getKey,


Is this method still needed?
Same question for the field AddFileEntry#canonicalPartitionValues .

We don't have any canonicalization anymore here.

It's still needed because the value is still canonicalized in this method.

...src/main/java/io/trino/tests/product/deltalake/TestDeltaLakeChangeDataFeedCompatibility.java

...uct-tests/src/main/java/io/trino/tests/product/deltalake/TestDeltaLakeColumnMappingMode.java


mosabua · 2023-08-08T15:57:02Z

Can you confirm that this does NOT need release notes entries and the suggested text in the description is redundant @ebyhr ?

ebyhr · 2023-08-08T21:35:46Z

@mosabua This PR fixes user-facing issues. I updated the PR description.

mosabua · 2023-08-08T21:39:56Z

ok .. thanks .. @colebow can you run with this now that you are back ;-)

cla-bot bot added the cla-signed label Jul 4, 2023

github-actions bot added tests:hive delta-lake Delta Lake connector labels Jul 4, 2023

ebyhr force-pushed the ebi/delta-column-case-sensitivity branch 2 times, most recently from 2330e07 to e75ba5e Compare July 14, 2023 05:29

ebyhr force-pushed the ebi/delta-column-case-sensitivity branch from e75ba5e to 190d982 Compare July 26, 2023 05:10

ebyhr mentioned this pull request Jul 26, 2023

Fix column case sensitivity issue in Delta Lake - CI regression fixes and statistics adaptation #18358

Merged

ebyhr force-pushed the ebi/delta-column-case-sensitivity branch 6 times, most recently from 14d8eb9 to 5905a22 Compare July 31, 2023 04:02

ebyhr marked this pull request as ready for review July 31, 2023 04:04

ebyhr requested review from findepi and findinpath July 31, 2023 04:05

findinpath requested a review from pajaks July 31, 2023 06:51

findinpath reviewed Jul 31, 2023

...uct-tests/src/main/java/io/trino/tests/product/deltalake/TestDeltaLakeColumnMappingMode.java

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeMetadata.java

ebyhr force-pushed the ebi/delta-column-case-sensitivity branch from 5905a22 to 0f52ff5 Compare August 1, 2023 04:03

ebyhr changed the title ~~Fix column case sensitivity issue in Delta Lake~~ Fix bug Delta column operations erases its properties on uppercase names Aug 1, 2023

ebyhr force-pushed the ebi/delta-column-case-sensitivity branch from 8772297 to affa1ad Compare August 1, 2023 07:01

findepi reviewed Aug 1, 2023


ebyhr force-pushed the ebi/delta-column-case-sensitivity branch 3 times, most recently from 36f1107 to cdf33b4 Compare August 2, 2023 03:12

ebyhr force-pushed the ebi/delta-column-case-sensitivity branch from cdf33b4 to dea8f93 Compare August 3, 2023 02:28

ebyhr force-pushed the ebi/delta-column-case-sensitivity branch from dea8f93 to e26bf49 Compare August 3, 2023 06:56

findepi mentioned this pull request Aug 3, 2023

Remove redundant lowercasing in Delta error messages #18523

Merged

findepi reviewed Aug 3, 2023


ebyhr force-pushed the ebi/delta-column-case-sensitivity branch from e26bf49 to 8a1c8a9 Compare August 4, 2023 01:50

findepi approved these changes Aug 4, 2023


findinpath reviewed Aug 4, 2023

...src/main/java/io/trino/tests/product/deltalake/TestDeltaLakeChangeDataFeedCompatibility.java

findinpath reviewed Aug 4, 2023

...uct-tests/src/main/java/io/trino/tests/product/deltalake/TestDeltaLakeColumnMappingMode.java

findinpath approved these changes Aug 4, 2023


Fix bug Delta column operations erases its properties on uppercase names

6b39ecd

Also, show column properties correctly when column name contains uppercase characters and stores the exact column name in column statistics.

Co-Authored-By: Slawomir Pajak <slawomir.pajak@starburstdata.com>

ebyhr force-pushed the ebi/delta-column-case-sensitivity branch from 8a1c8a9 to 6b39ecd Compare August 7, 2023 06:34

ebyhr merged commit df9140a into master Aug 7, 2023

ebyhr deleted the ebi/delta-column-case-sensitivity branch August 7, 2023 09:06

github-actions bot added this to the 423 milestone Aug 7, 2023

mosabua mentioned this pull request Aug 7, 2023

Add Trino 423 release notes #18496

Merged

pajaks mentioned this pull request Aug 9, 2023

Populate stats when missing in transaction log #16743

Merged
