Fix bug Delta column operations erases its properties on uppercase names #18123
Can you please amend the title so it hints at what this is fixing?
@findepi Updated the commit and the PR title & description.
(not full review)
@@ -159,7 +160,7 @@ public boolean equals(Object obj)
     public String getColumnName()
     {
         checkState(isBaseColumn(), "Unexpected dereference: %s", this);
-        return baseColumnName;
+        return baseColumnName.toLowerCase(ENGLISH);
why lowercase here?
I was trying to preserve the previous behavior for some usages, e.g. access control in TableChangesFunction.
Moved this toLowerCase to TableChangesFunction.
@@ -680,7 +681,7 @@ public Iterator<TableColumnsMetadata> streamTableColumns(ConnectorSession sessio
         Map<String, Boolean> columnsNullability = getColumnsNullability(metadata);
         Map<String, String> columnGenerations = getGeneratedColumnExpressions(metadata);
         List<ColumnMetadata> columnMetadata = getColumns(metadata).stream()
-                .map(column -> getColumnMetadata(column, columnComments.get(column.getColumnName()), columnsNullability.getOrDefault(column.getBaseColumnName(), true), columnGenerations.get(column.getBaseColumnName())))
+                .map(column -> getColumnMetadata(column, columnComments.get(column.getBaseColumnName()), columnsNullability.getOrDefault(column.getBaseColumnName(), true), columnGenerations.get(column.getBaseColumnName())))
columnComments has non-lowercased (original) keys.
column.getBaseColumnName() is (now) lowercased, so a mismatch.
Am I reading this wrong?
> columnComments has non-lowercased (original) keys.
Right.
> column.getBaseColumnName() is (now) lowercased, so a mismatch.
getBaseColumnName was lowercased previously. It returns the original column names after this change.
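To make the mismatch under discussion concrete, here is a minimal sketch of a comment map keyed by the original column name and what happens when it is probed with a lowercased key. The class name, the helper commentFor, and the sample column "UserId" are all hypothetical and are not taken from the PR.

```java
import java.util.Locale;
import java.util.Map;

final class CommentLookupSketch
{
    private CommentLookupSketch() {}

    // Hypothetical helper: looks up a comment using whatever key the caller passes.
    // The map is assumed to be keyed by the exact (original-case) column name.
    static String commentFor(Map<String, String> commentsByOriginalName, String columnName)
    {
        return commentsByOriginalName.get(columnName);
    }

    public static void main(String[] args)
    {
        Map<String, String> comments = Map.of("UserId", "primary key");

        // Exact name: the comment is found.
        System.out.println(commentFor(comments, "UserId"));

        // Lowercased name: the lookup silently misses and returns null,
        // which is how a column property can appear to be "erased".
        System.out.println(commentFor(comments, "UserId".toLowerCase(Locale.ENGLISH)));
    }
}
```

The lookup fails silently rather than throwing, so a key-case mismatch of this kind tends to show up only as missing comments or properties on columns whose names contain uppercase characters.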
ImmutableList.Builder<String> columnNames = ImmutableList.builderWithExpectedSize(tableMetadata.getColumns().size());
ImmutableMap.Builder<String, Object> columnTypes = ImmutableMap.builderWithExpectedSize(tableMetadata.getColumns().size());
for (ColumnMetadata columnMetadata : tableMetadata.getColumns()) {
    if (columnMetadata.isHidden()) {
How does the removal of this relate to case sensitivity?
The previous logic used getTableMetadata and getPartitionedBy, which lowercased column names internally.
@@ -2994,7 +2993,7 @@ private void updateTableStatistics(
     private static String toPhysicalColumnName(String columnName, Optional<Map<String, String>> physicalColumnNameMapping)
     {
         if (physicalColumnNameMapping.isPresent()) {
-            String physicalColumnName = physicalColumnNameMapping.get().get(columnName);
+            String physicalColumnName = physicalColumnNameMapping.get().get(columnName.toLowerCase(ENGLISH));
why lower-case here?
The columnName variable is lowercase when it comes from ComputedStatistics. Added the code comment.
It's not obvious that keys in physicalColumnNameMapping are lowercase.
Can we map columnName to the original/exact column name and then map to the physical name as a separate step?
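One way the two-step lookup suggested here could look is sketched below. This is an illustration only, not the code that ended up in the PR: the class and method names are hypothetical, and it assumes the physical-name mapping is keyed by the exact column name from the Delta schema.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

final class PhysicalNameLookupSketch
{
    private PhysicalNameLookupSketch() {}

    // Step 1 support: index the schema's exact column names by their lowercase form.
    static Map<String, String> lowercaseToOriginal(List<String> originalNames)
    {
        Map<String, String> result = new LinkedHashMap<>();
        for (String name : originalNames) {
            String previous = result.put(name.toLowerCase(Locale.ENGLISH), name);
            if (previous != null) {
                // Two names that differ only by case cannot be told apart from a lowercase key.
                throw new IllegalArgumentException("Ambiguous column names: " + previous + ", " + name);
            }
        }
        return result;
    }

    // Step 1: lowercase name -> original name. Step 2: original name -> physical name.
    // Each map states its key convention in its parameter name, so no hidden lowercasing is needed.
    static Optional<String> toPhysicalColumnName(
            String lowercaseName,
            Map<String, String> lowercaseToOriginal,
            Map<String, String> originalToPhysical)
    {
        return Optional.ofNullable(lowercaseToOriginal.get(lowercaseName))
                .map(originalToPhysical::get);
    }

    public static void main(String[] args)
    {
        Map<String, String> index = lowercaseToOriginal(List.of("UserId", "name"));
        // The physical names below are made-up placeholders.
        Map<String, String> physical = Map.of("UserId", "col-1", "name", "col-2");
        System.out.println(toPhysicalColumnName("userid", index, physical));
    }
}
```

Keeping the two steps separate means the only place that knows about lowercase keys is the index built in step 1, which addresses the concern that the key convention of physicalColumnNameMapping is not obvious at the call site.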
Thanks for working on this. Even though the work is limited to Delta, it shows how lower-casing of String is a hard problem (#17). I think it would help review & ensure we're doing the right thing if we can somehow express which
Addressed comments. Removed
for (DataFileInfo info : dataFileInfos) {
    // using Hashmap because partition values can be null
    Map<String, String> partitionValues = new HashMap<>();
    for (int i = 0; i < partitionColumnNames.size(); i++) {
        partitionValues.put(partitionColumnNames.get(i), info.getPartitionValues().get(i));
    }

    Optional<Map<String, Object>> minStats = toOriginalColumnNames(info.getStatistics().getMinValues(), toOriginalColumnNames);
toOriginalColumnNames has this comment: "Lowercase column names because statistics generated by Trino has lowercase names",
yet we're operating on DataFileInfo that the Delta connector created (not the engine).
Are they lowercase because ColumnChunkMetaData.getPath ends up being lowercase in
https://github.com/trinodb/trino/blob/e26bf49d57cc596014b051485c7e0898d20798d5/plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeWriter.java#L218?
Please update the code comment in toOriginalColumnNames.
Updated the code comment. The Parquet file contains the original names, but they are lowercased when reading the metadata at
trino/lib/trino-parquet/src/main/java/io/trino/parquet/reader/MetadataReader.java
Lines 139 to 141 in 023f8b4
    String[] path = metaData.path_in_schema.stream()
            .map(value -> value.toLowerCase(Locale.ENGLISH))
            .toArray(String[]::new);
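Because the reader hands back lowercased paths, statistics collected from the file footer arrive keyed by lowercase names and have to be re-keyed before they are written to the transaction log with the exact column names. The following is a rough sketch of such a re-keying step. The method name restoreOriginalNames is hypothetical (it is not the PR's toOriginalColumnNames), and failing on an unknown key is an assumption made for this sketch rather than the connector's documented behavior.

```java
import java.util.HashMap;
import java.util.Map;

final class StatisticsKeySketch
{
    private StatisticsKeySketch() {}

    // Re-keys statistics from lowercase names (as produced when reading Parquet metadata)
    // to the exact column names from the table schema.
    static Map<String, Object> restoreOriginalNames(
            Map<String, Object> statsByLowercaseName,
            Map<String, String> lowercaseToOriginal)
    {
        Map<String, Object> result = new HashMap<>();
        for (Map.Entry<String, Object> entry : statsByLowercaseName.entrySet()) {
            String original = lowercaseToOriginal.get(entry.getKey());
            if (original == null) {
                // Assumption for this sketch: an unmatched key is a bug, so fail loudly.
                throw new IllegalArgumentException("No column matches statistics key: " + entry.getKey());
            }
            result.put(original, entry.getValue());
        }
        return result;
    }

    public static void main(String[] args)
    {
        Map<String, Object> minValues = Map.of("userid", 1L);
        System.out.println(restoreOriginalNames(minValues, Map.of("userid", "UserId")));
    }
}
```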
@@ -187,7 +188,8 @@ public DataFileInfo getDataFileInfo()
     {
         TrinoInputFile inputFile = fileSystem.newInputFile(rootTableLocation.appendPath(relativeFilePath));
         Map<String, Type> dataColumnTypes = columnHandles.stream()
-                .collect(toImmutableMap(DeltaLakeColumnHandle::getBasePhysicalColumnName, DeltaLakeColumnHandle::getBasePhysicalType));
+                // Lowercase because the subsequent logic expects lowercase
Map<String, Type> -> Map</* lowercase */ String, Type>
Same in readStatistics & mergeStats params
... or switch to CanonicalColumnName as map keys
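The second option, a dedicated key type, lets the compiler track which maps are keyed by lowercase names instead of relying on a comment. A minimal sketch of such a wrapper follows. LowercaseColumnName is a hypothetical stand-in written for this illustration and is not claimed to match the CanonicalColumnName class mentioned in the comment.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Objects;

final class LowercaseColumnName
{
    private final String value;

    private LowercaseColumnName(String value)
    {
        this.value = value;
    }

    // The only way to create an instance, so every instance is guaranteed to be lowercase.
    static LowercaseColumnName of(String columnName)
    {
        Objects.requireNonNull(columnName, "columnName is null");
        return new LowercaseColumnName(columnName.toLowerCase(Locale.ENGLISH));
    }

    String value()
    {
        return value;
    }

    @Override
    public boolean equals(Object obj)
    {
        return obj instanceof LowercaseColumnName
                && value.equals(((LowercaseColumnName) obj).value);
    }

    @Override
    public int hashCode()
    {
        return value.hashCode();
    }

    @Override
    public String toString()
    {
        return value;
    }

    public static void main(String[] args)
    {
        // A map declared with this key type cannot be probed with a raw String by accident.
        Map<LowercaseColumnName, String> types = Map.of(LowercaseColumnName.of("UserId"), "bigint");
        System.out.println(types.get(LowercaseColumnName.of("USERID")));
    }
}
```

With this shape, a signature such as Map<LowercaseColumnName, Type> documents the key convention in the type itself, at the cost of wrapping and unwrapping names at the boundaries.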
Can we switch back to the original names inside of the DeltaLakeFileStatistics implementations instead of doing it remotely in io.trino.plugin.deltalake.DeltaLakeMetadata#appendAddFileEntries?
I looked through where we still lowercase and added some general comments.
CI hit #17512
@@ -120,7 +120,7 @@ public List<String> getOriginalPartitionColumns()
      * For use in read-path. Returns lowercase partition column names.
      */
     @JsonIgnore
-    public List<String> getCanonicalPartitionColumns()
+    public List<String> getLowercasePartitionColumns()
IDE marks it as unused
IDE is wrong
However, the method is used only in tests.
@@ -42,8 +41,7 @@ public static Map<String, Optional<String>> canonicalizePartitionValues(Map<Stri
     {
         return partitionValues.entrySet().stream()
                 .collect(toImmutableMap(
-                        // canonicalize partition keys to lowercase so they match column names used in DeltaLakeColumnHandle
-                        entry -> canonicalizeColumnName(entry.getKey()),
+                        Map.Entry::getKey,
Is this method still needed?
Same question for the field AddFileEntry#canonicalPartitionValues.
We don't have any canonicalization anymore here.
It's still needed because the value is still canonicalized in this method.
Also, show column properties correctly when the column name contains uppercase characters, and store the exact column name in column statistics. Co-Authored-By: Slawomir Pajak <slawomir.pajak@starburstdata.com>
Can you confirm that this does NOT need release notes entries and that the suggested text in the description is redundant, @ebyhr?
@mosabua This PR fixes user-facing issues. I updated the PR description.
ok .. thanks .. @colebow can you run with this now that you are back ;-)
Description
Fixes #18013
Fix the following issue
The commit isn't split because the changes are tightly coupled.
Release notes