
[BUG] GPU query is case sensitive on Hive text table's column name #11318

Closed
viadea opened this issue Aug 12, 2024 · 1 comment · Fixed by #11327
Assignees
Labels
bug Something isn't working

Comments

Collaborator

viadea commented Aug 12, 2024

Describe the bug
GPU query is case sensitive on Hive text table.

Steps/Code to reproduce bug
Spark-SQL:

create table testcase_text(id int, nAme string);
insert into testcase_text values(1,'Tom');
select name from testcase_text;

The GPU run fails with:

java.lang.IllegalArgumentException: name does not exist. Available: id, nAme
	at org.apache.spark.sql.types.StructType.$anonfun$apply$1(StructType.scala:282)
	at scala.collection.immutable.Map$Map2.getOrElse(Map.scala:236)
	at org.apache.spark.sql.types.StructType.apply(StructType.scala:281)
	at org.apache.spark.sql.hive.rapids.GpuHiveTableScanExec.$anonfun$getRequestedOutputDataSchema$3(GpuHiveTableScanExec.scala:236)
	at scala.collection.immutable.List.map(List.scala:293)
	at org.apache.spark.sql.hive.rapids.GpuHiveTableScanExec.getRequestedOutputDataSchema(GpuHiveTableScanExec.scala:236)
	at org.apache.spark.sql.hive.rapids.GpuHiveTableScanExec.inputRDD$lzycompute(GpuHiveTableScanExec.scala:337)
	at org.apache.spark.sql.hive.rapids.GpuHiveTableScanExec.inputRDD(GpuHiveTableScanExec.scala:323)
	at org.apache.spark.sql.hive.rapids.GpuHiveTableScanExec.internalDoExecuteColumnar(GpuHiveTableScanExec.scala:359)
	at com.nvidia.spark.rapids.GpuExec.doExecuteColumnar(GpuExec.scala:396)
	at com.nvidia.spark.rapids.GpuExec.doExecuteColumnar$(GpuExec.scala:394)
	at org.apache.spark.sql.hive.rapids.GpuHiveTableScanExec.doExecuteColumnar(GpuHiveTableScanExec.scala:76)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:221)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:232)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:229)
	at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:217)
	at com.nvidia.spark.rapids.GpuColumnarToRowExec.doExecute(GpuColumnarToRowExec.scala:365)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:194)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:232)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:229)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:190)
	at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:340)
	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:421)
	at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:451)
	at org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:76)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$2(SparkSQLDriver.scala:76)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:76)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:396)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:516)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1$adapted(SparkSQLCLIDriver.scala:510)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:510)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:298)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:973)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1061)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1070)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Note 1: Hive Parquet tables work fine on GPU.
Note 2: Regardless of whether spark.sql.caseSensitive is set to true or false, this issue always occurs on GPU.
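The stack trace points at an exact-match field lookup: StructType.apply resolves the requested name via Map.getOrElse and throws when the exact string is absent. A minimal model of that behavior in plain Scala (a hypothetical ExactMatchLookup object with an ordinary Map standing in for Spark's StructType, not the actual plugin code):

```scala
object ExactMatchLookup {
  // Toy stand-in for the table's StructType: field name -> type.
  // Field names keep the casing declared in the Hive DDL ("id", "nAme").
  val schema = Map("id" -> "int", "nAme" -> "string")

  // Mirrors StructType.apply in the stack trace: an exact-match (and thus
  // case-sensitive) lookup that throws when the requested name is absent.
  def lookup(name: String): String =
    schema.getOrElse(name, throw new IllegalArgumentException(
      s"$name does not exist. Available: ${schema.keys.mkString(", ")}"))

  def main(args: Array[String]): Unit = {
    println(lookup("nAme")) // exact casing succeeds
    // lookup("name") throws IllegalArgumentException:
    //   name does not exist. Available: id, nAme
    // which is the same shape of failure the GPU scan hits.
  }
}
```

Because the analyzer hands GpuHiveTableScanExec the column name as written in the query while the schema keeps the DDL casing, the exact-match lookup fails whenever the two disagree.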

Expected behavior
The CPU run works fine by default and honors spark.sql.caseSensitive:

spark-sql> set spark.rapids.sql.enabled=false;
spark.rapids.sql.enabled	false
Time taken: 0.016 seconds, Fetched 1 row(s)
spark-sql> select name from testcase_text;
Tom
Time taken: 2.382 seconds, Fetched 1 row(s)

spark-sql> set spark.sql.caseSensitive=true;
spark.sql.caseSensitive	true
Time taken: 0.014 seconds, Fetched 1 row(s)
spark-sql> select name from testcase_text;
Error in query: Column 'name' does not exist. Did you mean one of the following? [spark_catalog.default.testcase_text.id, spark_catalog.default.testcase_text.nAme]; line 1 pos 7;
'Project ['name]
+- SubqueryAlias spark_catalog.default.testcase_text
   +- HiveTableRelation [`default`.`testcase_text`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#23, nAme#24], Partition Cols: []]

Environment details (please complete the following information)
Dataproc 2.1

@viadea viadea added bug Something isn't working ? - Needs Triage Need team to review and classify labels Aug 12, 2024
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Aug 13, 2024
Collaborator

revans2 commented Aug 13, 2024

So this is not the most straightforward thing. It looks like Hive is always case-insensitive for column name matches, except I have found a few places where people complain that partition-by columns are case-sensitive if the file system they are running on is case-sensitive (everything except some macOS file systems).

So for the sake of simplicity I am just going to make all of the column name matches case-insensitive.
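The proposed direction can be sketched in plain Scala: resolve each requested column against the schema's field names while ignoring case, returning the field's declared casing. This is a hypothetical CaseInsensitiveResolve helper illustrating the idea, not the actual change that landed in the plugin:

```scala
object CaseInsensitiveResolve {
  // Resolve `requested` against the schema's field names, ignoring case,
  // and return the field's declared (original-case) name if found.
  // If two fields differ only by case, the first declared one wins here;
  // a real implementation would need to decide how to handle that ambiguity.
  def resolve(requested: String, schemaFields: Seq[String]): Option[String] =
    schemaFields.find(_.equalsIgnoreCase(requested))

  def main(args: Array[String]): Unit = {
    val fields = Seq("id", "nAme")
    println(resolve("name", fields))    // Some(nAme)
    println(resolve("missing", fields)) // None
  }
}
```

With a lookup like this, `select name from testcase_text` would resolve to the declared column `nAme` regardless of the casing used in the query, matching Hive's case-insensitive behavior.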
