When running the XGBoost training step in the example notebook with PCBS (ParquetCachedBatchSerializer) enabled, it fails with a java.lang.AssertionError:
WARN TaskSetManager: Lost task 0.0 in stage 13.0 (TID 27) (192.192.192.2 executor 0): java.lang.AssertionError: assertion failed: User-defined types in Catalyst schema should have already been expanded:
{
"type" : "struct",
...
...
at scala.Predef$.assert(Predef.scala:223)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter.<init>(ParquetRowConverter.scala:158)
at org.apache.spark.sql.execution.datasources.parquet.rapids.shims.v2.ShimParquetRowConverter.<init>(ShimVectorizedColumnReader.scala:45)
at org.apache.spark.sql.execution.datasources.parquet.rapids.shims.v2.ParquetRecordMaterializer.<init>(ParquetMaterializer.scala:47)
at com.nvidia.spark.rapids.shims.v2.ParquetCachedBatchSerializer$CachedBatchIteratorConsumer$$anon$3.$anonfun$convertCachedBatchToInternalRowIter$1(ParquetCachedBatchSerializer.scala:759)
at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
at com.nvidia.spark.rapids.shims.v2.ParquetCachedBatchSerializer.withResource(ParquetCachedBatchSerializer.scala:262)
at com.nvidia.spark.rapids.shims.v2.ParquetCachedBatchSerializer$CachedBatchIteratorConsumer$$anon$3.convertCachedBatchToInternalRowIter(ParquetCachedBatchSerializer.scala:743)
at com.nvidia.spark.rapids.shims.v2.ParquetCachedBatchSerializer$CachedBatchIteratorConsumer$$anon$3.hasNext(ParquetCachedBatchSerializer.scala:723)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
at com.nvidia.spark.rapids.RowToColumnarIterator.hasNext(GpuRowToColumnarExec.scala:602)
at com.nvidia.spark.rapids.GpuBaseLimitExec$$anon$1.hasNext(limit.scala:68)
at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.partNextBatch(GpuShuffleExchangeExecBase.scala:287)
at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.hasNext(GpuShuffleExchangeExecBase.scala:304)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Test env:
Standalone Spark cluster
Spark 3.1.1
22.02 snapshot rapids-spark and cudf jars
xgboost4j_3.0-1.4.2-0.2.0.jar
xgboost4j-spark_3.0-1.4.2-0.2.0.jar
Make sure to enable PCBS:
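A sketch of the setting used here, assuming the shim class that appears in the stack trace above (the exact serializer class name can differ by plugin version). spark.sql.cache.serializer is a static conf, so it has to be set before the SparkSession is created, e.g. via --conf on launch or in spark-defaults.conf; the equivalent builder form:

// Enable the RAPIDS plugin and PCBS (class name taken from the stack trace above;
// it may differ in other plugin versions). Must be set before the session starts.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("mortgage-gpu")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.sql.cache.serializer",
    "com.nvidia.spark.rapids.shims.v2.ParquetCachedBatchSerializer")
  .getOrCreate()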
Workaround
If we comment out the PCBS setting, it works fine.
How to reproduce
Run this notebook:
https://github.com/NVIDIA/spark-rapids-examples/blob/branch-22.02/examples/Spark-ETL%2BXGBoost/mortgage/notebooks/scala/mortgage-gpu.ipynb
It will fail at the training step with the java.lang.AssertionError shown above.
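For reference, roughly what the notebook's training cell does; this is a sketch, not the notebook's exact code: the path, label/feature column names, and parameter values are placeholders, and it assumes the NVIDIA xgboost4j-spark fork's setFeaturesCols API and the `spark` session from the setup above. The assertion in the stack trace is thrown while PCBS converts the cached batches back to rows during training.

// Sketch of the failing training step (placeholder path, columns, and params).
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

val trainDf = spark.read.parquet("/data/mortgage/train")   // placeholder path

val labelName = "delinquency_12"                            // assumed label column
val featureNames = trainDf.columns.filter(_ != labelName).toSeq

val params: Map[String, Any] = Map(
  "tree_method" -> "gpu_hist",
  "num_round" -> 100,
  "num_workers" -> 1
)

val model = new XGBoostClassifier(params)
  .setLabelCol(labelName)
  .setFeaturesCols(featureNames)   // GPU-oriented API in the NVIDIA fork
  .fit(trainDf.cache())            // .cache() is what brings PCBS into play; fails here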
Note: this is not customer related. Just my own finding.