Make hive column matches not case-sensitive #11327

revans2 · 2024-08-14T16:07:42Z

This fixes #11318

I added in two tests. The partitioning test passes without these changes, but I wanted to be sure that we were doing the right thing.

I didn't add tests for the case where Spark is made case-sensitive (spark.sql.caseSensitive = true), because the query fails when Spark plans it, on both the CPU and the GPU, before the GPU code ever runs. But I can add tests for that if we really want to verify it.

Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>

revans2 · 2024-08-14T16:07:50Z

build

razajafri · 2024-08-14T21:56:46Z

sql-plugin/src/main/scala/org/apache/spark/sql/hive/rapids/GpuHiveTableScanExec.scala

-    val distinctFields  = distinctColumns.map(a => tableSchema.apply(a.name))
+    // In hive column names are case-insensitive but the default tableSchema lookup is
+    // case-sensitive
+    val fieldMap = CaseInsensitiveMap(tableSchema.map(f => (f.name, f)).toMap)
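
To illustrate the idea behind the change above, here is a minimal sketch in plain Scala with no Spark dependency. It is not the code from this PR: `Field`, `CaseInsensitiveLookup`, `buildFieldMap`, and `lookup` are hypothetical names, and a lower-cased key map stands in for Spark's `CaseInsensitiveMap`.

```scala
import java.util.Locale

// Hypothetical stand-in for Spark's StructField; only the name matters here.
final case class Field(name: String, dataType: String)

object CaseInsensitiveLookup {
  // Key every field by its lower-cased name so that lookups ignore case.
  def buildFieldMap(schema: Seq[Field]): Map[String, Field] =
    schema.map(f => f.name.toLowerCase(Locale.ROOT) -> f).toMap

  // Lower-case the requested name as well; None means no column matched.
  def lookup(fieldMap: Map[String, Field], name: String): Option[Field] =
    fieldMap.get(name.toLowerCase(Locale.ROOT))
}
```

With a schema of `Seq(Field("id", "int"), Field("Name", "string"))`, a lookup of `"NAME"` or `"name"` finds the `Name` field, whereas an exact-match lookup keyed on `"Name"` would miss both.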


What happens when spark.sql.caseSensitive is set to true?

Turns out that Hive is always case-insensitive, so even if I try to create a table with two columns whose names differ only by case, I get an error from Hive.

scala> spark.conf.set("spark.sql.caseSensitive", true)
scala> spark.sql("""create table testcase_text(id int, nAme string, Name string)""").collect
24/08/15 15:38:30 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead.
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: Duplicate column name name in the table definition.
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:110)
  at org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:244)
  at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94)
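
The "Duplicate column name name" error is consistent with column names being compared after case folding. A rough sketch of that kind of check in plain Scala follows; it is an illustration only, not Hive's actual implementation, and `DuplicateColumns` and `caseInsensitiveDuplicates` are hypothetical names.

```scala
import java.util.Locale

object DuplicateColumns {
  // Group names by their lower-cased form; any group with more than
  // one member collides when names are compared case-insensitively.
  def caseInsensitiveDuplicates(names: Seq[String]): Map[String, Seq[String]] =
    names
      .groupBy(_.toLowerCase(Locale.ROOT))
      .filter { case (_, group) => group.size > 1 }
}
```

For the columns in the statement above (`id`, `nAme`, `Name`), the only colliding group is the one keyed by `name`, which contains `nAme` and `Name`.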

If I use a column name with the wrong case in the query when spark.sql.caseSensitive is true, Spark reports an error in the logical planning phase, before the GPU code ever runs.
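
As a rough model of why the case-sensitive query fails before the plugin is involved, the sketch below resolves a requested column name against a table's columns using either an exact or a case-insensitive comparison. This is a simplified illustration, not Spark's analyzer: `ResolverSketch`, `resolverFor`, `resolve`, and the error text are all hypothetical.

```scala
object ResolverSketch {
  // A resolver decides whether two column names refer to the same column.
  type Resolver = (String, String) => Boolean

  def resolverFor(caseSensitive: Boolean): Resolver =
    if (caseSensitive) (a, b) => a == b
    else (a, b) => a.equalsIgnoreCase(b)

  // Left carries a (made-up) error message; Right carries the column as stored.
  def resolve(
      columns: Seq[String],
      requested: String,
      caseSensitive: Boolean): Either[String, String] = {
    val matches = resolverFor(caseSensitive)
    columns.find(matches(_, requested)).toRight(s"cannot resolve column '$requested'")
  }
}
```

With stored columns `id` and `name`, resolving `Name` succeeds under the case-insensitive resolver and fails under the case-sensitive one, so in this model the failure happens during name resolution, before any execution code runs.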

jlowe

LGTM. It might be good to add a caseSensitive test that verifies the failure (a follow-up issue is fine), so that we catch it if Spark changes the behavior and makes that setup work, since we'd need to change our code at that point too.

Make hive column matches not case-sensitive

846828e

Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>

sameerz added the bug label Aug 14, 2024

razajafri reviewed Aug 14, 2024

jlowe approved these changes Aug 15, 2024

revans2 merged commit 25be396 into NVIDIA:branch-24.10 Aug 15, 2024
44 checks passed

revans2 mentioned this pull request Aug 15, 2024

[FEA] Add tests for case-sensitive spark when reading hive tables #11332

Open
