[BUG] When running xgboost training, if PCBS is enabled, it fails with java.lang.AssertionError #4806

Closed
viadea opened this issue Feb 16, 2022 · 1 comment · Fixed by #4955

viadea commented Feb 16, 2022

When running the xgboost training step in the example notebook with PCBS enabled, it fails with java.lang.AssertionError:

WARN TaskSetManager: Lost task 0.0 in stage 13.0 (TID 27) (192.192.192.2 executor 0): java.lang.AssertionError: assertion failed: User-defined types in Catalyst schema should have already been expanded:
{
  "type" : "struct",
...
...

	at scala.Predef$.assert(Predef.scala:223)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter.<init>(ParquetRowConverter.scala:158)
	at org.apache.spark.sql.execution.datasources.parquet.rapids.shims.v2.ShimParquetRowConverter.<init>(ShimVectorizedColumnReader.scala:45)
	at org.apache.spark.sql.execution.datasources.parquet.rapids.shims.v2.ParquetRecordMaterializer.<init>(ParquetMaterializer.scala:47)
	at com.nvidia.spark.rapids.shims.v2.ParquetCachedBatchSerializer$CachedBatchIteratorConsumer$$anon$3.$anonfun$convertCachedBatchToInternalRowIter$1(ParquetCachedBatchSerializer.scala:759)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at com.nvidia.spark.rapids.shims.v2.ParquetCachedBatchSerializer.withResource(ParquetCachedBatchSerializer.scala:262)
	at com.nvidia.spark.rapids.shims.v2.ParquetCachedBatchSerializer$CachedBatchIteratorConsumer$$anon$3.convertCachedBatchToInternalRowIter(ParquetCachedBatchSerializer.scala:743)
	at com.nvidia.spark.rapids.shims.v2.ParquetCachedBatchSerializer$CachedBatchIteratorConsumer$$anon$3.hasNext(ParquetCachedBatchSerializer.scala:723)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
	at com.nvidia.spark.rapids.RowToColumnarIterator.hasNext(GpuRowToColumnarExec.scala:602)
	at com.nvidia.spark.rapids.GpuBaseLimitExec$$anon$1.hasNext(limit.scala:68)
	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.partNextBatch(GpuShuffleExchangeExecBase.scala:287)
	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.hasNext(GpuShuffleExchangeExecBase.scala:304)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
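
Per the trace, the assertion is raised inside Spark's own ParquetRowConverter (ParquetRowConverter.scala:158), which requires that any user-defined types in the Catalyst schema, such as Spark ML's VectorUDT, already be expanded to their underlying SQL types; PCBS reaches that converter through its shim when converting cached Parquet batches back to internal rows.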

Test env:
Standalone Spark cluster
Spark 3.1.1
22.02 snapshot rapids-spark and cudf jars
xgboost4j_3.0-1.4.2-0.2.0.jar
xgboost4j-spark_3.0-1.4.2-0.2.0.jar

Make sure to enable PCBS:

spark.sql.cache.serializer com.nvidia.spark.ParquetCachedBatchSerializer
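
For reference, the same setting can also be applied when building the session. A minimal sketch, assuming a standalone master URL and app name (both placeholders; only the two config values below come from the actual setup):

import org.apache.spark.sql.SparkSession

// Placeholders: the master URL and app name are illustrative only.
val spark = SparkSession.builder()
  .master("spark://<standalone-master>:7077")
  .appName("mortgage-gpu-pcbs")
  // Standard RAPIDS Accelerator plugin setting.
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  // Enable PCBS for DataFrame/Dataset caching.
  .config("spark.sql.cache.serializer", "com.nvidia.spark.ParquetCachedBatchSerializer")
  .getOrCreate()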

Workaround
If we comment out the PCBS setting, it works fine.

How to reproduce
Run this notebook:
https://github.com/NVIDIA/spark-rapids-examples/blob/branch-22.02/examples/Spark-ETL%2BXGBoost/mortgage/notebooks/scala/mortgage-gpu.ipynb

It will fail when training:

// Start training
println("\n------ Training ------")
val (xgbClassificationModel, _) = Benchmark.time("train") {
  xgbClassifier.fit(trainSet)
}

Note: this is not customer-related; just my own finding.

@viadea added the bug and "? - Needs Triage" labels on Feb 16, 2022
@razajafri self-assigned this on Feb 16, 2022
@jlowe added the P0 (must have for release) label and removed "? - Needs Triage" on Feb 22, 2022
razajafri commented

It fails on the next line, where you are actually calling cache:

println("\n------ Transforming ------")
val (results, _) = Benchmark.time("transform") {
  val ret = xgbClassificationModel.transform(transSet).cache()
  ret.foreachPartition((_: Iterator[_]) => ())
  ret
}
results.select("orig_channel", labelColName, "rawPrediction", "probability", "prediction").show(10)
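
The transform output contains ML vector columns (rawPrediction, probability), which are user-defined types (VectorUDT) in the Catalyst schema, so caching it routes those UDT columns through PCBS. A minimal sketch of a standalone repro under that assumption (the data and column names here are made up, not from the notebook):

import org.apache.spark.ml.linalg.Vectors
import spark.implicits._ // assumes a SparkSession named `spark` with PCBS enabled

// Any DataFrame with a VectorUDT column should do; the values are arbitrary.
val df = Seq(
  (0, Vectors.dense(0.1, 0.9)),
  (1, Vectors.dense(0.8, 0.2))
).toDF("label", "probability")

df.cache()
df.count() // materializes the cached Parquet batches
df.show()  // reading the cache back converts batches to rows, where the assert should fire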
