[BUG] When running xgboost training, if PCBS is enabled, it fails with java.lang.AssertionError #4806

Closed
viadea opened this issue Feb 16, 2022 · 1 comment · Fixed by #4955

viadea commented Feb 16, 2022

When running the xgboost training step in the example notebook with PCBS enabled, it fails with java.lang.AssertionError:

WARN TaskSetManager: Lost task 0.0 in stage 13.0 (TID 27) (192.192.192.2 executor 0): java.lang.AssertionError: assertion failed: User-defined types in Catalyst schema should have already been expanded:
{
  "type" : "struct",
...
...

	at scala.Predef$.assert(Predef.scala:223)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter.<init>(ParquetRowConverter.scala:158)
	at org.apache.spark.sql.execution.datasources.parquet.rapids.shims.v2.ShimParquetRowConverter.<init>(ShimVectorizedColumnReader.scala:45)
	at org.apache.spark.sql.execution.datasources.parquet.rapids.shims.v2.ParquetRecordMaterializer.<init>(ParquetMaterializer.scala:47)
	at com.nvidia.spark.rapids.shims.v2.ParquetCachedBatchSerializer$CachedBatchIteratorConsumer$$anon$3.$anonfun$convertCachedBatchToInternalRowIter$1(ParquetCachedBatchSerializer.scala:759)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at com.nvidia.spark.rapids.shims.v2.ParquetCachedBatchSerializer.withResource(ParquetCachedBatchSerializer.scala:262)
	at com.nvidia.spark.rapids.shims.v2.ParquetCachedBatchSerializer$CachedBatchIteratorConsumer$$anon$3.convertCachedBatchToInternalRowIter(ParquetCachedBatchSerializer.scala:743)
	at com.nvidia.spark.rapids.shims.v2.ParquetCachedBatchSerializer$CachedBatchIteratorConsumer$$anon$3.hasNext(ParquetCachedBatchSerializer.scala:723)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
	at com.nvidia.spark.rapids.RowToColumnarIterator.hasNext(GpuRowToColumnarExec.scala:602)
	at com.nvidia.spark.rapids.GpuBaseLimitExec$$anon$1.hasNext(limit.scala:68)
	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.partNextBatch(GpuShuffleExchangeExecBase.scala:287)
	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.hasNext(GpuShuffleExchangeExecBase.scala:304)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
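
Per the trace, the assertion is raised inside Spark's own ParquetRowConverter (ParquetRowConverter.scala:158), which requires that any user-defined types in the Catalyst schema, such as Spark ML's VectorUDT, already be expanded to their underlying SQL types; PCBS reaches that converter through its shim when converting cached Parquet batches back to internal rows.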

Test env:
Standalone Spark cluster
Spark 3.1.1
22.02 snapshot rapids-spark and cudf jars
xgboost4j_3.0-1.4.2-0.2.0.jar
xgboost4j-spark_3.0-1.4.2-0.2.0.jar

Make sure to enable PCBS:

spark.sql.cache.serializer com.nvidia.spark.ParquetCachedBatchSerializer
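
For reference, the same setting can also be applied when building the session. A minimal sketch, assuming a standalone master URL and app name (both placeholders; only the two config values below come from the actual setup):

import org.apache.spark.sql.SparkSession

// Placeholders: the master URL and app name are illustrative only.
val spark = SparkSession.builder()
  .master("spark://<standalone-master>:7077")
  .appName("mortgage-gpu-pcbs")
  // Standard RAPIDS Accelerator plugin setting.
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  // Enable PCBS for DataFrame/Dataset caching.
  .config("spark.sql.cache.serializer", "com.nvidia.spark.ParquetCachedBatchSerializer")
  .getOrCreate()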

Workaround
If we comment out the PCBS setting, it works fine.

How to reproduce
Run this notebook:
https://github.com/NVIDIA/spark-rapids-examples/blob/branch-22.02/examples/Spark-ETL%2BXGBoost/mortgage/notebooks/scala/mortgage-gpu.ipynb

It will fail when training:

// Start training
println("\n------ Training ------")
val (xgbClassificationModel, _) = Benchmark.time("train") {
  xgbClassifier.fit(trainSet)
}

Note: this is not customer-related; just my own finding.

@viadea added the bug and "? - Needs Triage" labels on Feb 16, 2022
@razajafri self-assigned this on Feb 16, 2022
@jlowe added the P0 (must have for release) label and removed "? - Needs Triage" on Feb 22, 2022
razajafri commented

It fails on the next line, where you are actually calling cache:

println("\n------ Transforming ------")
val (results, _) = Benchmark.time("transform") {
  val ret = xgbClassificationModel.transform(transSet).cache()
  ret.foreachPartition((_: Iterator[_]) => ())
  ret
}
results.select("orig_channel", labelColName, "rawPrediction", "probability", "prediction").show(10)
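
The transform output contains ML vector columns (rawPrediction, probability), which are user-defined types (VectorUDT) in the Catalyst schema, so caching it routes those UDT columns through PCBS. A minimal sketch of a standalone repro under that assumption (the data and column names here are made up, not from the notebook):

import org.apache.spark.ml.linalg.Vectors
import spark.implicits._ // assumes a SparkSession named `spark` with PCBS enabled

// Any DataFrame with a VectorUDT column should do; the values are arbitrary.
val df = Seq(
  (0, Vectors.dense(0.1, 0.9)),
  (1, Vectors.dense(0.8, 0.2))
).toDF("label", "probability")

df.cache()
df.count() // materializes the cached Parquet batches
df.show()  // reading the cache back converts batches to rows, where the assert should fire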
