Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure that bucketing and sort column names correspond to table column names #16796

Conversation

findinpath
Copy link
Contributor

@findinpath findinpath commented Mar 29, 2023

Description

In the metastore, the bucketing and sorting column names can differ in case from its corresponding table column names. This change makes certain that, even though a table can be delivered by the metastore with such inconsistencies, Trino will make use of exactly the same bucketing and sort column names as their corresponding data column names.

Additional context and related issues

Reproduction scenario

testing/bin/ptl env up --environment 'singlenode-spark-hive'  --config 'config-default'

Connect to jdbc:hive2://localhost:10213/default
Spark

create table bucket_table(
                                    `row_id` int,
                                    `SEGMENT_ID` int
)
    partitioned by (`part` string)
    clustered by (`SEGMENT_ID`) into 10 buckets;

➜  ~ docker container exec -it ptl-hadoop-master /bin/bash
(reverse-i-search)`': ^C
[root@hadoop-master /]# mysql -D metaastore -u root -proot
MariaDB [metastore]> select * from COLUMNS_V2 where cd_id = 56;
+-------+---------+-------------+-----------+-------------+
| CD_ID | COMMENT | COLUMN_NAME | TYPE_NAME | INTEGER_IDX |
+-------+---------+-------------+-----------+-------------+
|    56 | NULL    | row_id      | int       |           0 |
|    56 | NULL    | segment_id  | int       |           1 |
+-------+---------+-------------+-----------+-------------+
2 rows in set (0.00 sec)


MariaDB [metastore]> select * from BUCKETING_COLS;
+-------+-----------------+-------------+
| SD_ID | BUCKET_COL_NAME | INTEGER_IDX |
+-------+-----------------+-------------+
|    56 | SEGMENT_ID      |           0 |
+-------+-----------------+-------------+

The inconsistency in case for the data column segment_id and the bucketing column SEGMENT_ID was causing in Trino the issue:

java.lang.IllegalArgumentException: Cannot find column 'SEGMENT_ID' in bucket_table at io.trino.plugin.hive.util.HiveBucketing.lambda$isSupportedBucketing$4(HiveBucketing.java:324)

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive
* Ensure that bucketing and sort column names correspond to table column names. ({issue}`issuenumber`)

List<String> bucketColumnNames = storageDescriptor.getBucketCols().stream()
// Ensure that the names used for the bucket columns are the same as the names used for the table columns
.map(bucketColumnName -> dataColumns.stream().filter(column -> column.getName().equalsIgnoreCase(bucketColumnName))
.findFirst()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there should be exactly one such

also, linear search over column list is OK provided that there are only few bucketing columns, which is a reasonable assumption (otherwise we would build a set of data column names). add a code comment.

however, i wonder whether we have to do this validation here at all
i think it should be sufficient to lowercase the column names.

storageDescriptor.getBucketCols().stream()
 .map(name -> name.toLowerCase(ENGLISH))
 .collect(toImmList())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I followed your suggestion and lowercased the bucketing and sorting columns to target them matching the data column names

@@ -52,7 +53,7 @@ public HiveBucketProperty(
this.sortedBy = ImmutableList.copyOf(requireNonNull(sortedBy, "sortedBy is null"));
}

public static Optional<HiveBucketProperty> fromStorageDescriptor(Map<String, String> tableParameters, StorageDescriptor storageDescriptor, String tablePartitionName)
public static Optional<HiveBucketProperty> fromStorageDescriptor(Map<String, String> tableParameters, StorageDescriptor storageDescriptor, String tablePartitionName, List<Column> dataColumns)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it legal to bucket on a partitioning column? (i know it doesn't make sense)

@findinpath
Copy link
Contributor Author

CI hit #14637

In the metastore, the bucketing and sorting column names can differ
in case from its corresponding table column names.
This change makes certain that, even though a table can be
delivered by the metastore with such inconsistencies, Trino will lowercase
the same bucketing and sort column names to ensure they correspond to the
data column names.
@findinpath findinpath force-pushed the findinpath/hive-bucketing-case-insensitive branch from e4f3034 to 7e043a7 Compare March 31, 2023 16:03
@findinpath findinpath requested a review from findepi March 31, 2023 16:05
@findinpath
Copy link
Contributor Author

CI hit #12818

@findepi findepi merged commit f61f5e5 into trinodb:master Apr 3, 2023
@findepi findepi mentioned this pull request Apr 3, 2023
@findinpath findinpath added this to the 412 milestone Apr 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed hive Hive connector
Development

Successfully merging this pull request may close these issues.

2 participants