Ensure that bucketing and sort column names correspond to table column names #16796

findinpath · 2023-03-29T21:28:51Z

Description

In the metastore, the bucketing and sorting column names can differ in case from their corresponding table column names. This change ensures that, even when the metastore delivers a table with such inconsistencies, Trino uses exactly the same bucketing and sort column names as the corresponding data column names.

Additional context and related issues

Reproduction scenario

testing/bin/ptl env up --environment 'singlenode-spark-hive'  --config 'config-default'

Connect to jdbc:hive2://localhost:10213/default
Spark

create table bucket_table(
                                    `row_id` int,
                                    `SEGMENT_ID` int
)
    partitioned by (`part` string)
    clustered by (`SEGMENT_ID`) into 10 buckets;

➜  ~ docker container exec -it ptl-hadoop-master /bin/bash
[root@hadoop-master /]# mysql -D metastore -u root -proot
MariaDB [metastore]> select * from COLUMNS_V2 where cd_id = 56;
+-------+---------+-------------+-----------+-------------+
| CD_ID | COMMENT | COLUMN_NAME | TYPE_NAME | INTEGER_IDX |
+-------+---------+-------------+-----------+-------------+
|    56 | NULL    | row_id      | int       |           0 |
|    56 | NULL    | segment_id  | int       |           1 |
+-------+---------+-------------+-----------+-------------+
2 rows in set (0.00 sec)


MariaDB [metastore]> select * from BUCKETING_COLS;
+-------+-----------------+-------------+
| SD_ID | BUCKET_COL_NAME | INTEGER_IDX |
+-------+-----------------+-------------+
|    56 | SEGMENT_ID      |           0 |
+-------+-----------------+-------------+

The case mismatch between the data column segment_id and the bucketing column SEGMENT_ID caused the following error in Trino:

java.lang.IllegalArgumentException: Cannot find column 'SEGMENT_ID' in bucket_table at io.trino.plugin.hive.util.HiveBucketing.lambda$isSupportedBucketing$4(HiveBucketing.java:324)
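To illustrate the failure mode, here is a minimal sketch of a case-sensitive column lookup. The class `BucketColumnLookup` and the method `indexOfColumn` are hypothetical names made up for this example; they are not the code in `HiveBucketing`, only an assumption about the kind of exact-match lookup that produces a message like the one above.

```java
import java.util.List;

// Hypothetical illustration only; not Trino's HiveBucketing implementation.
class BucketColumnLookup {
    // Looks up a bucketing column among the data column names using an exact,
    // case-sensitive match and fails when no data column has that exact name.
    static int indexOfColumn(List<String> dataColumns, String bucketColumn, String tableName) {
        int index = dataColumns.indexOf(bucketColumn);
        if (index < 0) {
            throw new IllegalArgumentException(
                    String.format("Cannot find column '%s' in %s", bucketColumn, tableName));
        }
        return index;
    }

    public static void main(String[] args) {
        // Names as stored in COLUMNS_V2 (lowercase) and BUCKETING_COLS (uppercase).
        List<String> dataColumns = List.of("row_id", "segment_id");
        try {
            indexOfColumn(dataColumns, "SEGMENT_ID", "bucket_table");
        }
        catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With the values from the reproduction scenario, the exact-match lookup finds `segment_id` but not `SEGMENT_ID`, which is the mismatch this pull request addresses.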

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive
* Ensure that bucketing and sort column names correspond to table column names. ({issue}`issuenumber`)

findepi · 2023-03-30T07:36:55Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveBucketProperty.java

+        List<String> bucketColumnNames = storageDescriptor.getBucketCols().stream()
+                // Ensure that the names used for the bucket columns are the same as the names used for the table columns
+                .map(bucketColumnName -> dataColumns.stream().filter(column -> column.getName().equalsIgnoreCase(bucketColumnName))
+                        .findFirst()


there should be exactly one such

also, linear search over column list is OK provided that there are only few bucketing columns, which is a reasonable assumption (otherwise we would build a set of data column names). add a code comment.

however, i wonder whether we have to do this validation here at all
i think it should be sufficient to lowercase the column names.

storageDescriptor.getBucketCols().stream()
        .map(name -> name.toLowerCase(ENGLISH))
        .collect(toImmList())

I followed your suggestion and lowercased the bucketing and sorting column names so that they match the data column names.
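The lowercasing approach discussed in this thread can be sketched in isolation as follows. This is a self-contained approximation, not the merged code: `BucketColumnNormalizer` and `normalize` are hypothetical names, a plain `List<String>` stands in for the result of `storageDescriptor.getBucketCols()`, and the JDK collector `Collectors.toUnmodifiableList()` is used here only to avoid a dependency outside the standard library.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical illustration only; not the code merged in this pull request.
class BucketColumnNormalizer {
    // Lowercases bucketing (or sorting) column names as delivered by the metastore
    // so that they match the lowercase data column names.
    static List<String> normalize(List<String> metastoreNames) {
        return metastoreNames.stream()
                // A fixed locale keeps the result independent of the JVM default locale.
                .map(name -> name.toLowerCase(Locale.ENGLISH))
                .collect(Collectors.toUnmodifiableList());
    }

    public static void main(String[] args) {
        List<String> dataColumns = List.of("row_id", "segment_id");
        List<String> bucketColumns = normalize(List.of("SEGMENT_ID"));
        // After normalization every bucketing column is found among the data columns.
        System.out.println(dataColumns.containsAll(bucketColumns));
    }
}
```

Passing an explicit locale matters for names such as `SEGMENT_ID`: under a Turkish default locale, a locale-sensitive `toLowerCase()` maps `I` to a dotless `ı`, which would no longer match `segment_id`.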

findepi · 2023-03-30T07:37:19Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveBucketProperty.java

@@ -52,7 +53,7 @@ public HiveBucketProperty(
        this.sortedBy = ImmutableList.copyOf(requireNonNull(sortedBy, "sortedBy is null"));
    }

-    public static Optional<HiveBucketProperty> fromStorageDescriptor(Map<String, String> tableParameters, StorageDescriptor storageDescriptor, String tablePartitionName)
+    public static Optional<HiveBucketProperty> fromStorageDescriptor(Map<String, String> tableParameters, StorageDescriptor storageDescriptor, String tablePartitionName, List<Column> dataColumns)


is it legal to bucket on a partitioning column? (i know it doesn't make sense)


findinpath · 2023-03-30T11:20:35Z

CI hit #14637

In the metastore, the bucketing and sorting column names can differ in case from their corresponding table column names. This change ensures that, even when the metastore delivers a table with such inconsistencies, Trino lowercases the bucketing and sort column names so that they correspond to the data column names.

findinpath · 2023-04-01T22:17:07Z

CI hit #12818

cla-bot bot added the cla-signed label Mar 29, 2023

findinpath mentioned this pull request Mar 29, 2023

findinpath Ensure that bucketing and sort column names correspond to table column names findinpath/trino#8

Closed

findinpath force-pushed the findinpath/hive-bucketing-case-insensitive branch from f2b874b to e4f3034 Compare March 29, 2023 21:34

github-actions bot added hive Hive connector tests:hive labels Mar 30, 2023

findinpath requested review from electrum, findepi and ebyhr March 30, 2023 06:09

findinpath self-assigned this Mar 30, 2023

findepi reviewed Mar 30, 2023

View reviewed changes

findinpath force-pushed the findinpath/hive-bucketing-case-insensitive branch from e4f3034 to 7e043a7 Compare March 31, 2023 16:03

findinpath requested a review from findepi March 31, 2023 16:05

findepi approved these changes Apr 3, 2023

View reviewed changes

findepi merged commit f61f5e5 into trinodb:master Apr 3, 2023

findepi mentioned this pull request Apr 3, 2023

Release notes for 412 #16798

Closed

findinpath added this to the 412 milestone Apr 3, 2023

colebow mentioned this pull request Apr 4, 2023

Add Trino 412 release notes #16881

Merged
