
[BUG] GPU query is case sensitive on Hive text table's column name #11318

Closed
viadea opened this issue Aug 12, 2024 · 1 comment · Fixed by #11327
Assignees
Labels
bug Something isn't working

Comments

Collaborator

viadea commented Aug 12, 2024

Describe the bug
GPU query is case sensitive on Hive text table.

Steps/Code to reproduce bug
Spark-SQL:

create table testcase_text(id int, nAme string);
insert into testcase_text values(1,'Tom');
select name from testcase_text;

The GPU run fails with:

java.lang.IllegalArgumentException: name does not exist. Available: id, nAme
	at org.apache.spark.sql.types.StructType.$anonfun$apply$1(StructType.scala:282)
	at scala.collection.immutable.Map$Map2.getOrElse(Map.scala:236)
	at org.apache.spark.sql.types.StructType.apply(StructType.scala:281)
	at org.apache.spark.sql.hive.rapids.GpuHiveTableScanExec.$anonfun$getRequestedOutputDataSchema$3(GpuHiveTableScanExec.scala:236)
	at scala.collection.immutable.List.map(List.scala:293)
	at org.apache.spark.sql.hive.rapids.GpuHiveTableScanExec.getRequestedOutputDataSchema(GpuHiveTableScanExec.scala:236)
	at org.apache.spark.sql.hive.rapids.GpuHiveTableScanExec.inputRDD$lzycompute(GpuHiveTableScanExec.scala:337)
	at org.apache.spark.sql.hive.rapids.GpuHiveTableScanExec.inputRDD(GpuHiveTableScanExec.scala:323)
	at org.apache.spark.sql.hive.rapids.GpuHiveTableScanExec.internalDoExecuteColumnar(GpuHiveTableScanExec.scala:359)
	at com.nvidia.spark.rapids.GpuExec.doExecuteColumnar(GpuExec.scala:396)
	at com.nvidia.spark.rapids.GpuExec.doExecuteColumnar$(GpuExec.scala:394)
	at org.apache.spark.sql.hive.rapids.GpuHiveTableScanExec.doExecuteColumnar(GpuHiveTableScanExec.scala:76)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:221)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:232)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:229)
	at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:217)
	at com.nvidia.spark.rapids.GpuColumnarToRowExec.doExecute(GpuColumnarToRowExec.scala:365)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:194)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:232)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:229)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:190)
	at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:340)
	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:421)
	at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:451)
	at org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:76)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$2(SparkSQLDriver.scala:76)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:76)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:396)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:516)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1$adapted(SparkSQLCLIDriver.scala:510)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:510)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:298)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:973)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1061)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1070)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Note 1: Hive Parquet tables work fine on GPU.
Note 2: Regardless of whether spark.sql.caseSensitive is set to true or false, this issue always occurs on GPU.
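The stack trace points at an exact-match field lookup: StructType.apply resolves the requested name via Map.getOrElse and throws when the exact string is absent. A minimal model of that behavior in plain Scala (a hypothetical ExactMatchLookup object with an ordinary Map standing in for Spark's StructType, not the actual plugin code):

```scala
object ExactMatchLookup {
  // Toy stand-in for the table's StructType: field name -> type.
  // Field names keep the casing declared in the Hive DDL ("id", "nAme").
  val schema = Map("id" -> "int", "nAme" -> "string")

  // Mirrors StructType.apply in the stack trace: an exact-match (and thus
  // case-sensitive) lookup that throws when the requested name is absent.
  def lookup(name: String): String =
    schema.getOrElse(name, throw new IllegalArgumentException(
      s"$name does not exist. Available: ${schema.keys.mkString(", ")}"))

  def main(args: Array[String]): Unit = {
    println(lookup("nAme")) // exact casing succeeds
    // lookup("name") throws IllegalArgumentException:
    //   name does not exist. Available: id, nAme
    // which is the same shape of failure the GPU scan hits.
  }
}
```

Because the analyzer hands GpuHiveTableScanExec the column name as written in the query while the schema keeps the DDL casing, the exact-match lookup fails whenever the two disagree.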

Expected behavior
The CPU run works fine by default and honors spark.sql.caseSensitive:

spark-sql> set spark.rapids.sql.enabled=false;
spark.rapids.sql.enabled	false
Time taken: 0.016 seconds, Fetched 1 row(s)
spark-sql> select name from testcase_text;
Tom
Time taken: 2.382 seconds, Fetched 1 row(s)

spark-sql> set spark.sql.caseSensitive=true;
spark.sql.caseSensitive	true
Time taken: 0.014 seconds, Fetched 1 row(s)
spark-sql> select name from testcase_text;
Error in query: Column 'name' does not exist. Did you mean one of the following? [spark_catalog.default.testcase_text.id, spark_catalog.default.testcase_text.nAme]; line 1 pos 7;
'Project ['name]
+- SubqueryAlias spark_catalog.default.testcase_text
   +- HiveTableRelation [`default`.`testcase_text`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#23, nAme#24], Partition Cols: []]

Environment details (please complete the following information)
Dataproc 2.1

@viadea viadea added bug Something isn't working ? - Needs Triage Need team to review and classify labels Aug 12, 2024
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Aug 13, 2024
Collaborator

revans2 commented Aug 13, 2024

So this is not the most straightforward thing. It looks like Hive is always case-insensitive for column name matches, except I have found a few places where people complain that partition-by columns are case-sensitive if the file system they are running on is case-sensitive (everything except some macOS file systems).

So for the sake of simplicity I am just going to make all of the column name matches case-insensitive.
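The proposed direction can be sketched in plain Scala: resolve each requested column against the schema's field names while ignoring case, returning the field's declared casing. This is a hypothetical CaseInsensitiveResolve helper illustrating the idea, not the actual change that landed in the plugin:

```scala
object CaseInsensitiveResolve {
  // Resolve `requested` against the schema's field names, ignoring case,
  // and return the field's declared (original-case) name if found.
  // If two fields differ only by case, the first declared one wins here;
  // a real implementation would need to decide how to handle that ambiguity.
  def resolve(requested: String, schemaFields: Seq[String]): Option[String] =
    schemaFields.find(_.equalsIgnoreCase(requested))

  def main(args: Array[String]): Unit = {
    val fields = Seq("id", "nAme")
    println(resolve("name", fields))    // Some(nAme)
    println(resolve("missing", fields)) // None
  }
}
```

With a lookup like this, `select name from testcase_text` would resolve to the declared column `nAme` regardless of the casing used in the query, matching Hive's case-insensitive behavior.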
