Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug Delta column operations erases its properties on uppercase names #18123

Merged
merged 1 commit into from
Aug 7, 2023

Conversation

ebyhr
Copy link
Member

@ebyhr ebyhr commented Jul 4, 2023

Description

Fixes #18013
Fix the following issue

  • Show column comments, NOT NULL constraints correctly when column name contains uppercase characters. Previously, it returned incorrect information.
  • Fix issue when ADD COLUMN statement erases existing column properties when the column name contained uppercase characters.
  • Fix issue when COMMENT ON COLUMN statement erases existing column properties when the column name contained uppercase characters.
  • Store the exact column names in column statistics

The commit isn't separated because the code is tightly related to each other.

Release notes

# Delta Lake
* Show column comments, `NOT NULL` constraint correctly when the column name contains uppercase characters. ({issue}`issuenumber`)
* Fix `ADD COLUMN` statement not to delete column properties when the column name contains uppercase characters. ({issue}`issuenumber`)
* Fix `COMMENT ON COLUMN` statement not to delete column properties when the column name contains uppercase characters. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Jul 4, 2023
@github-actions github-actions bot added tests:hive delta-lake Delta Lake connector labels Jul 4, 2023
@ebyhr ebyhr force-pushed the ebi/delta-column-case-sensitivity branch 2 times, most recently from 2330e07 to e75ba5e Compare July 14, 2023 05:29
@ebyhr ebyhr force-pushed the ebi/delta-column-case-sensitivity branch from e75ba5e to 190d982 Compare July 26, 2023 05:10
@ebyhr ebyhr force-pushed the ebi/delta-column-case-sensitivity branch 6 times, most recently from 14d8eb9 to 5905a22 Compare July 31, 2023 04:02
@ebyhr ebyhr marked this pull request as ready for review July 31, 2023 04:04
@ebyhr ebyhr requested review from findepi and findinpath July 31, 2023 04:05
@findinpath findinpath requested a review from pajaks July 31, 2023 06:51
@findepi
Copy link
Member

findepi commented Jul 31, 2023

Fix column case sensitivity issue in Delta Lake

Can you please amend the title so it hints at what is this fixing?

@ebyhr ebyhr force-pushed the ebi/delta-column-case-sensitivity branch from 5905a22 to 0f52ff5 Compare August 1, 2023 04:03
@ebyhr ebyhr changed the title Fix column case sensitivity issue in Delta Lake Fix bug Delta column operations erases its properties on uppercase names Aug 1, 2023
@ebyhr ebyhr force-pushed the ebi/delta-column-case-sensitivity branch from 8772297 to affa1ad Compare August 1, 2023 07:01
@ebyhr
Copy link
Member Author

ebyhr commented Aug 1, 2023

@findepi Updated commit and PR title & description.

Copy link
Member

@findepi findepi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(not full review)

@@ -159,7 +160,7 @@ public boolean equals(Object obj)
public String getColumnName()
{
checkState(isBaseColumn(), "Unexpected dereference: %s", this);
return baseColumnName;
return baseColumnName.toLowerCase(ENGLISH);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why lowercase here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to preserve the previous behavior for some usages, e.g. access control in TableChangesFunction.
Moved this toLowerCase to TableChangesFunction.

@@ -680,7 +681,7 @@ public Iterator<TableColumnsMetadata> streamTableColumns(ConnectorSession sessio
Map<String, Boolean> columnsNullability = getColumnsNullability(metadata);
Map<String, String> columnGenerations = getGeneratedColumnExpressions(metadata);
List<ColumnMetadata> columnMetadata = getColumns(metadata).stream()
.map(column -> getColumnMetadata(column, columnComments.get(column.getColumnName()), columnsNullability.getOrDefault(column.getBaseColumnName(), true), columnGenerations.get(column.getBaseColumnName())))
.map(column -> getColumnMetadata(column, columnComments.get(column.getBaseColumnName()), columnsNullability.getOrDefault(column.getBaseColumnName(), true), columnGenerations.get(column.getBaseColumnName())))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

columnComments has non-lowercased (original) keys.
column.getBaseColumnName() is (now) lowercased, so a mismatch.
am i reading this wrong?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

columnComments has non-lowercased (original) keys.

Right.

column.getBaseColumnName() is (now) lowercased, so a mismatch.

getBaseColumnName was lowercased previously. It returns the original column names after this change.

ImmutableList.Builder<String> columnNames = ImmutableList.builderWithExpectedSize(tableMetadata.getColumns().size());
ImmutableMap.Builder<String, Object> columnTypes = ImmutableMap.builderWithExpectedSize(tableMetadata.getColumns().size());
for (ColumnMetadata columnMetadata : tableMetadata.getColumns()) {
if (columnMetadata.isHidden()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how removal of this relates to case sensitivity?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous logic used getTableMetadata and getPartitionedBy that lowercased column names internally.

@@ -2994,7 +2993,7 @@ private void updateTableStatistics(
private static String toPhysicalColumnName(String columnName, Optional<Map<String, String>> physicalColumnNameMapping)
{
if (physicalColumnNameMapping.isPresent()) {
String physicalColumnName = physicalColumnNameMapping.get().get(columnName);
String physicalColumnName = physicalColumnNameMapping.get().get(columnName.toLowerCase(ENGLISH));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why lower-case here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The columnName variable is lowercase when it comes from ComputedStatistics. Added the code comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not obvious that keys in physicalColumnNameMapping are lowercase.

can we map columnName to original/exact column name and then map to physical name as separate step?

@findepi
Copy link
Member

findepi commented Aug 1, 2023

Thanks for working on this. Even though the work is limited to Delta, it shows how lower-casing of String is a hard problem (#17).

I think it would help review & ensure we're doing the right thing if we can somehow express which String values or map keys are lower-cased and which are original. Perhaps practically all should be original except when interfacing with SPI, so we should eradicate most of toLowerCase calls (still many present in current state of this PR). If we cannot eradicate them, maybe we use some annotation or separate types classes as a demarkation between lower-cased and original, but at this point removing most of lower-casing looks more appealing to me.

@ebyhr ebyhr force-pushed the ebi/delta-column-case-sensitivity branch 3 times, most recently from 36f1107 to cdf33b4 Compare August 2, 2023 03:12
@ebyhr
Copy link
Member Author

ebyhr commented Aug 2, 2023

Addressed comments. Removed toLowerCase calls from this PR as much as possible.

@ebyhr ebyhr force-pushed the ebi/delta-column-case-sensitivity branch from cdf33b4 to dea8f93 Compare August 3, 2023 02:28
for (DataFileInfo info : dataFileInfos) {
// using Hashmap because partition values can be null
Map<String, String> partitionValues = new HashMap<>();
for (int i = 0; i < partitionColumnNames.size(); i++) {
partitionValues.put(partitionColumnNames.get(i), info.getPartitionValues().get(i));
}

Optional<Map<String, Object>> minStats = toOriginalColumnNames(info.getStatistics().getMinValues(), toOriginalColumnNames);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

toOriginalColumnNames has this comment: "Lowercase column names because statistics generated by Trino has lowercase names"
yet, we're operating on DataFileInfo that Delta connector created (not engine)

are they lowercase because ColumnChunkMetaData.getPath ends up being lowercase in
https://github.com/trinodb/trino/blob/e26bf49d57cc596014b051485c7e0898d20798d5/plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeWriter.java#L218?

please update the code comment in toOriginalColumnNames

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the code comment. Parquet file contains the original names, but it's lowercased when reading the metadata at

String[] path = metaData.path_in_schema.stream()
.map(value -> value.toLowerCase(Locale.ENGLISH))
.toArray(String[]::new);

@@ -2994,7 +2993,7 @@ private void updateTableStatistics(
private static String toPhysicalColumnName(String columnName, Optional<Map<String, String>> physicalColumnNameMapping)
{
if (physicalColumnNameMapping.isPresent()) {
String physicalColumnName = physicalColumnNameMapping.get().get(columnName);
String physicalColumnName = physicalColumnNameMapping.get().get(columnName.toLowerCase(ENGLISH));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not obvious that keys in physicalColumnNameMapping are lowercase.

can we map columnName to original/exact column name and then map to physical name as separate step?

@@ -187,7 +188,8 @@ public DataFileInfo getDataFileInfo()
{
TrinoInputFile inputFile = fileSystem.newInputFile(rootTableLocation.appendPath(relativeFilePath));
Map<String, Type> dataColumnTypes = columnHandles.stream()
.collect(toImmutableMap(DeltaLakeColumnHandle::getBasePhysicalColumnName, DeltaLakeColumnHandle::getBasePhysicalType));
// Lowercase because the subsequent logic expects lowercase
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Map<String, Type> -> Map</* lowercase */ String, Type>

same in readStatistics & mergeStats params

.. or switch to CanonicalColumnName as map keys

Copy link
Contributor

@findinpath findinpath Aug 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we switch back to the original names inside of the DeltaLakeFileStatistics implementations instead of doing it remotely in io.trino.plugin.deltalake.DeltaLakeMetadata#appendAddFileEntries ?

@findepi
Copy link
Member

findepi commented Aug 3, 2023

i looked thru where we still lowercase and added some comments

general comments

  • we have CanonicalColumnName to convey case-insensitive matches. The class should probably be renamed because these days we wouldn't call "lowercase" and "canonical" the same thing.
  • DeltaLakeColumnMetadata.getName is confusing as it returns lowercase name, implicitly.
    • the class should be refactored not to have ColumnMetadata as a field; instead it should have toColumnMetadata method.
    • the getName should return actual name (jusjt like DeltaLakeColumnHandle.getColumnName), and then we don't need separate getOriginalName.

cc @alexjo2144 @findinpath

@ebyhr ebyhr force-pushed the ebi/delta-column-case-sensitivity branch from e26bf49 to 8a1c8a9 Compare August 4, 2023 01:50
@ebyhr
Copy link
Member Author

ebyhr commented Aug 4, 2023

CI hit #17512

@@ -120,7 +120,7 @@ public List<String> getOriginalPartitionColumns()
* For use in read-path. Returns lowercase partition column names.
*/
@JsonIgnore
public List<String> getCanonicalPartitionColumns()
public List<String> getLowercasePartitionColumns()
Copy link
Contributor

@findinpath findinpath Aug 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IDE marks it as unused

IDE is wrong

However, the method is used only in tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is used by non-test methods as well.

Screenshot 2023-08-07 at 15 27 01

@@ -42,8 +41,7 @@ public static Map<String, Optional<String>> canonicalizePartitionValues(Map<Stri
{
return partitionValues.entrySet().stream()
.collect(toImmutableMap(
// canonicalize partition keys to lowercase so they match column names used in DeltaLakeColumnHandle
entry -> canonicalizeColumnName(entry.getKey()),
Map.Entry::getKey,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this method still needed?
Same question for the field AddFileEntry#canonicalPartitionValues .

We don't have any canonicalization anymore here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still needed because the value is still canonicalized in this method.

Also, show column properties correctly when column name contains
uppercase characters and stores the exact column name
in column statistics.

Co-Authored-By: Slawomir Pajak <slawomir.pajak@starburstdata.com>
@ebyhr ebyhr force-pushed the ebi/delta-column-case-sensitivity branch from 8a1c8a9 to 6b39ecd Compare August 7, 2023 06:34
@ebyhr ebyhr merged commit df9140a into master Aug 7, 2023
@ebyhr ebyhr deleted the ebi/delta-column-case-sensitivity branch August 7, 2023 09:06
@github-actions github-actions bot added this to the 423 milestone Aug 7, 2023
@mosabua
Copy link
Member

mosabua commented Aug 8, 2023

Can you confirm that this does NOT need release notes entries and the suggested text in the description is redundant @ebyhr ?

@ebyhr
Copy link
Member Author

ebyhr commented Aug 8, 2023

@mosabua This PR fixes user-facing issues. I updated the PR description.

@mosabua
Copy link
Member

mosabua commented Aug 8, 2023

ok .. thanks .. @colebow can you run with this now that you are back ;-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed delta-lake Delta Lake connector
Development

Successfully merging this pull request may close these issues.

Column properties are ignored when the column name contains uppercase characters
4 participants