When querying a Hive partitioned table whose DDL declares a column as bigint while the underlying Parquet data stores it as int, the query fails with a column type mismatch.
GPU Spark errors:
ai.rapids.cudf.CudfException: cuDF failure at: ../src/join/hash_join.cu:391: Mismatch in joining column data types
CPU Spark errors:
Caused by: org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file file:///home/xxx/data/hive/testbigint/2022/part-00010-3d360f88-2fd7-4a22-b72e-ac055bb2c955-c000.snappy.parquet. Column: [b], Expected: bigint, Found: INT32
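The same CPU-side failure can be reproduced without Hive at all, since it comes from Spark's Parquet reader refusing to widen an INT32 column to a bigint read schema. A minimal sketch (the `/tmp` path is illustrative):

```scala
// Write an int column, then read it back with a bigint read schema.
Seq(("Adam", 1000)).toDF("a", "b")
  .write.mode("overwrite").parquet("/tmp/int_as_bigint")

// Declaring b as bigint here mirrors the Hive DDL; on CPU Spark this read
// should fail with "Parquet column cannot be converted ... Expected: bigint, Found: INT32".
spark.read.schema("a string, b bigint")
  .parquet("/tmp/int_as_bigint")
  .show()
```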
How to repro:
Hive CLI:
drop table testbigint;
create external table testbigint(a string, b bigint) PARTITIONED BY (k string)
STORED AS PARQUET LOCATION '/home/xxx/data/hive/testbigint';
create external table testbigint_dim(b string)
STORED AS PARQUET LOCATION '/home/xxx/data/hive/testbigint_dim';
Spark shell:
import spark.implicits._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val data = Seq(
Row("Adam",1000),
Row("Bob",2000),
Row("Cathy",9999)
)
val schema = StructType( Array(
StructField("a", StringType,true),
StructField("b", IntegerType,true)
))
val df = spark.createDataFrame(spark.sparkContext.parallelize(data),schema)
df.write.format("parquet").mode("overwrite").save("/home/xxx/data/hive/testbigint/2022")
val data2 = Seq(
Row("Adam",1000L),
Row("Bob",2000L),
Row("Cathy",9999L)
)
val schema2 = StructType( Array(
StructField("a", StringType,true),
StructField("b", LongType,true)
))
val df2 = spark.createDataFrame(spark.sparkContext.parallelize(data2),schema2)
df2.write.format("parquet").mode("overwrite").save("/home/xxx/data/hive/testbigint/2021")
val data3 = Seq(Row("1000"),Row("2000"))
val schema3 = StructType( Array(StructField("b", StringType,true)))
val df3 = spark.createDataFrame(spark.sparkContext.parallelize(data3),schema3)
df3.write.format("parquet").mode("overwrite").save("/home/xxx/data/hive/testbigint_dim/")
Hive CLI:
ALTER TABLE testbigint ADD PARTITION (k='2022') LOCATION '/home/xxx/data/hive/testbigint/2022';
ALTER TABLE testbigint ADD PARTITION (k='2021') LOCATION '/home/xxx/data/hive/testbigint/2021';
Spark shell:
spark.conf.set("spark.rapids.sql.enabled",true)
spark.sql("""select count(*) from testbigint f, testbigint_dim d where f.b=d.b """).show
spark.conf.set("spark.rapids.sql.enabled",false)
spark.sql("""select count(*) from testbigint f, testbigint_dim d where f.b=d.b """).show
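To confirm the per-partition schema drift that triggers the mismatch, each partition directory can be read directly, bypassing the Hive metastore schema. A quick diagnostic sketch in spark-shell:

```scala
// The 2022 partition was written with IntegerType, so b shows as integer here,
// even though the Hive DDL declares it bigint.
spark.read.parquet("/home/xxx/data/hive/testbigint/2022").printSchema()

// The 2021 partition was written with LongType, so b shows as long (bigint),
// matching the table DDL.
spark.read.parquet("/home/xxx/data/hive/testbigint/2021").printSchema()
```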
Marking this as a P1 because the RAPIDS Accelerator should have generated an error when trying to convert the types loaded from the Parquet file into the types expected by the specified Spark read schema in the query plan. We need to figure out why the type checks that occur when converting a cudf Table into a Spark ColumnarBatch did not catch this.
I'll pick up this issue since it appeals to me, but we are on holiday until May 5th, so please take it over if it is urgent or if someone else is also interested.
sameerz changed the title from "[FEA] More detailed logs to show which parquet file and which data type has mismatch." to "[BUG] More detailed logs to show which parquet file and which data type has mismatch." on May 10, 2022
Fixes #5200 (#5445)
This PR adds a schema check that follows Spark's own checking process. Converters that downcast INT32 to Byte/Short/Date are added in this PR as well. Note: this PR uses some deprecated parquet-mr APIs in order to accommodate Spark 3.1.
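The kind of check being described can be sketched at the Spark type level. The helper below is hypothetical (not the PR's actual code) and simply reports columns whose file type differs from the expected read type, in the spirit of the CPU error message above:

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: compare the expected (table DDL) schema against the
// schema actually found in a file, and describe each mismatched column.
// Widening conversions such as int -> bigint would need an explicit converter;
// the PR adds such converters for INT32 -> Byte/Short/Date.
def schemaMismatches(expected: StructType, actual: StructType): Seq[String] =
  expected.fields.zip(actual.fields).collect {
    case (e, a) if e.name == a.name && e.dataType != a.dataType =>
      s"Column: [${e.name}], Expected: ${e.dataType.simpleString}, Found: ${a.dataType.simpleString}"
  }
```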
Signed-off-by: sperlingxx <lovedreamf@gmail.com>