Make hive column matches not case-sensitive #11327

Merged Aug 15, 2024 (1 commit)

Collaborator

@revans2 revans2 commented Aug 14, 2024

This fixes #11318

I added two tests. The partitioning test passes without these changes, but I wanted to be sure that we were doing the right thing.

I didn't add tests for the case where Spark is made case-sensitive (spark.sql.caseSensitive = true), because the query fails when Spark plans it, on both the CPU and the GPU, before the GPU code ever runs. I can add tests for that if we really want to verify it.

Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>
Collaborator Author

revans2 commented Aug 14, 2024

build

@sameerz sameerz added the bug Something isn't working label Aug 14, 2024
val distinctFields = distinctColumns.map(a => tableSchema.apply(a.name))
// In hive column names are case-insensitive but the default tableSchema lookup is
// case-sensitive
val fieldMap = CaseInsensitiveMap(tableSchema.map(f => (f.name, f)).toMap)
Collaborator

What happens when spark.sql.caseSensitive is set to true?

Collaborator Author

It turns out that Hive is always case-insensitive, so even if I try to create a table with two columns whose names differ only by case, I get an error from Hive.

scala> spark.conf.set("spark.sql.caseSensitive", true)

scala> spark.sql("""create table testcase_text(id int, nAme string, Name string)""").collect
24/08/15 15:38:30 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead.
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: Duplicate column name name in the table definition.
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:110)
  at org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:244)
  at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94)

If I use a column name with the wrong case in a query while spark.sql.caseSensitive is true, Spark raises an error during logical planning, before the GPU code ever runs.
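
For readers following the thread, here is a minimal sketch of the two behaviors discussed: resolving column names against a schema without regard to case, and detecting names that collide once case is ignored (the reason Hive rejects `nAme` and `Name` in one table). It is plain Scala with no Spark dependency. `Field`, `CaseInsensitiveLookup`, and its methods are hypothetical names for illustration only; they are not the code in this PR, which uses Spark's `CaseInsensitiveMap` over the real table schema.

```scala
import java.util.Locale

// Hypothetical stand-in for a schema field; only the name matters for this sketch.
final case class Field(name: String, dataType: String)

object CaseInsensitiveLookup {
  // Normalize a column name so that names differing only by case compare equal.
  private def normalize(name: String): String = name.toLowerCase(Locale.ROOT)

  // Names that collide once case is ignored, e.g. "nAme" and "Name" both map to "name".
  def duplicateNames(schema: Seq[Field]): Set[String] =
    schema.groupBy(f => normalize(f.name)).collect {
      case (key, fields) if fields.size > 1 => key
    }.toSet

  // Build a lookup table keyed by the normalized column name.
  def fieldMap(schema: Seq[Field]): Map[String, Field] =
    schema.map(f => normalize(f.name) -> f).toMap

  // Resolve each requested column name against the schema, ignoring case.
  // Unknown names resolve to None instead of throwing.
  def resolve(schema: Seq[Field], requested: Seq[String]): Seq[Option[Field]] = {
    val byName = fieldMap(schema)
    requested.map(n => byName.get(normalize(n)))
  }
}
```

With a schema of `id` and `name`, `CaseInsensitiveLookup.resolve(schema, Seq("NAME"))` finds the `name` field, whereas a plain case-sensitive lookup keyed on the original names would miss it. Returning `Option` rather than throwing is a choice made for this sketch only.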

Contributor

@jlowe jlowe left a comment

LGTM. It might be good to add a caseSensitive test that verifies the failure (a follow-up issue is fine), so that we catch it if Spark changes this behavior and makes that setup work, since we would also need to change at that point.

@revans2 revans2 merged commit 25be396 into NVIDIA:branch-24.10 Aug 15, 2024
44 checks passed
Labels
bug Something isn't working
Development

Successfully merging this pull request may close these issues.

[BUG] GPU query is case sensitive on Hive text table's column name
4 participants