xgboost4j-spark-gpu train failed on multi-GPU node with EXCLUSIVE_PROCESS mode #11119

Open
yinqingh opened this issue Dec 19, 2024 · 3 comments

@yinqingh

Environment

  • OS: Ubuntu 22.04.2 LTS on OCI
  • Spark version: 3.5.0
  • XGBoost4j-spark: xgboost4j-spark-gpu_2.12-3.0.0-SNAPSHOT.jar
  • rapids-4-spark: rapids-4-spark_2.12-24.12.0-SNAPSHOT-cuda12.jar
  • GPU: 4× L40S
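
For context, jobs like this are typically submitted with one GPU per executor and one GPU per task. Below is a minimal sketch of such a setup, assuming standard spark-rapids GPU scheduling; the discovery-script path, parameter values, and column names are illustrative, not the exact submission used here:

  import org.apache.spark.sql.SparkSession
  import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

  // Assumed GPU-scheduling configuration; values are illustrative.
  val spark = SparkSession.builder()
    .appName("xgboost4j-spark-gpu-repro")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")      // rapids-4-spark
    .config("spark.executor.resource.gpu.amount", "1")          // one GPU per executor
    .config("spark.task.resource.gpu.amount", "1")              // one GPU per task
    .config("spark.executor.resource.gpu.discoveryScript",
      "/opt/sparkRapidsPlugin/getGpusResources.sh")             // hypothetical path
    .getOrCreate()

  // Hypothetical training call; this follows the 2.x-style Map constructor
  // and may differ slightly in the 3.0-SNAPSHOT estimator API.
  val model = new XGBoostClassifier(Map(
    "device" -> "cuda",
    "objective" -> "binary:logistic",
    "num_round" -> 100
  )).setFeaturesCol("features").setLabelCol("label")
    .fit(spark.read.parquet("/path/to/train"))                  // placeholder input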

Failure logs

  1. Failed with EXCLUSIVE_PROCESS mode set on all GPUs:
cudaErrorDevicesUnavailable: CUDA-capable device(s) is/are busy or unavailable
24/12/19 03:23:24 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 2) (l40s.compute.sparkdev.oraclevcn.com executor 1): ml.dmlc.xgboost4j.java.XGBoostError: [03:23:24] /workspace/jvm-packages/xgboost4j/src/native/xgboost4j-gpu.cu:331: [03:23:24] /workspace/src/common/device_vector.cu:23: Memory allocation error on worker 3: [03:23:24] /workspace/src/common/common.cu:16: /workspace/src/common/device_vector.cuh: 290: cudaErrorDevicesUnavailable: CUDA-capable device(s) is/are busy or unavailable
Stack trace:
  [bt] (0) /tmp/libxgboost4j8204609160539679832.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x6c) [0x7fdbd6f35d2c]
  [bt] (1) /tmp/libxgboost4j8204609160539679832.so(dh::ThrowOnCudaError(cudaError, char const*, int)+0x4c6) [0x7fdbd7636fb6]
  [bt] (2) /tmp/libxgboost4j8204609160539679832.so(thrust::THRUST_200601_500_600_700_800_900_NS::detail::vector_base<float, dh::detail::XGBDefaultDeviceAllocatorImpl<float> >::append(unsigned long)+0x15e) [0x7fdbd766edbe]
  [bt] (3) /tmp/libxgboost4j8204609160539679832.so(void xgboost::jni::CopyMetaInfo<float>(xgboost::Json*, thrust::THRUST_200601_500_600_700_800_900_NS::device_vector<float, dh::detail::XGBDefaultDeviceAllocatorImpl<float> >*, CUstream_st*)+0x31b) [0x7fdbd7baef9b]
  [bt] (4) /tmp/libxgboost4j8204609160539679832.so(xgboost::jni::DataIteratorProxy::StageMetaInfo(xgboost::Json)+0x231) [0x7fdbd7bb0621]
  [bt] (5) /tmp/libxgboost4j8204609160539679832.so(xgboost::jni::DataIteratorProxy::StageData(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0xb0) [0x7fdbd7bb0ee0]
  [bt] (6) /tmp/libxgboost4j8204609160539679832.so(xgboost::jni::DataIteratorProxy::PullIterFromJVM()+0x195) [0x7fdbd7bb15d5]
  [bt] (7) /tmp/libxgboost4j8204609160539679832.so(xgboost::jni::(anonymous namespace)::Next(void*)+0x67) [0x7fdbd7ba99f7]
  [bt] (8) /tmp/libxgboost4j8204609160539679832.so(xgboost::data::IterativeDMatrix::IterativeDMatrix(void*, void*, std::shared_ptr<xgboost::DMatrix>, void (*)(void*), int (*)(void*), float, int, int, long)+0x22f) [0x7fdbd7258c5f]


- Free memory: 26.6166GB
- Requested memory: 17.3242KB

Stack trace:
  [bt] (0) /tmp/libxgboost4j8204609160539679832.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x6c) [0x7fdbd6f35d2c]
  [bt] (1) /tmp/libxgboost4j8204609160539679832.so(dh::detail::ThrowOOMError(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long)+0x493) [0x7fdbd7637cd3]
  [bt] (2) /tmp/libxgboost4j8204609160539679832.so(thrust::THRUST_200601_500_600_700_800_900_NS::detail::vector_base<float, dh::detail::XGBDefaultDeviceAllocatorImpl<float> >::append(unsigned long)+0x2a7) [0x7fdbd766ef07]
  [bt] (3) /tmp/libxgboost4j8204609160539679832.so(void xgboost::jni::CopyMetaInfo<float>(xgboost::Json*, thrust::THRUST_200601_500_600_700_800_900_NS::device_vector<float, dh::detail::XGBDefaultDeviceAllocatorImpl<float> >*, CUstream_st*)+0x31b) [0x7fdbd7baef9b]
  [bt] (4) /tmp/libxgboost4j8204609160539679832.so(xgboost::jni::DataIteratorProxy::StageMetaInfo(xgboost::Json)+0x231) [0x7fdbd7bb0621]
  [bt] (5) /tmp/libxgboost4j8204609160539679832.so(xgboost::jni::DataIteratorProxy::StageData(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0xb0) [0x7fdbd7bb0ee0]
  [bt] (6) /tmp/libxgboost4j8204609160539679832.so(xgboost::jni::DataIteratorProxy::PullIterFromJVM()+0x195) [0x7fdbd7bb15d5]
  [bt] (7) /tmp/libxgboost4j8204609160539679832.so(xgboost::jni::(anonymous namespace)::Next(void*)+0x67) [0x7fdbd7ba99f7]
  [bt] (8) /tmp/libxgboost4j8204609160539679832.so(xgboost::data::IterativeDMatrix::IterativeDMatrix(void*, void*, std::shared_ptr<xgboost::DMatrix>, void (*)(void*), int (*)(void*), float, int, int, long)+0x22f) [0x7fdbd7258c5f]


Stack trace:
  [bt] (0) /tmp/libxgboost4j8204609160539679832.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x6c) [0x7fdbd6f35d2c]
  [bt] (1) /tmp/libxgboost4j8204609160539679832.so(+0x3f5ea3) [0x7fdbd6df5ea3]
  [bt] (2) /tmp/libxgboost4j8204609160539679832.so(xgboost::data::IterativeDMatrix::IterativeDMatrix(void*, void*, std::shared_ptr<xgboost::DMatrix>, void (*)(void*), int (*)(void*), float, int, int, long)+0x22f) [0x7fdbd7258c5f]
  [bt] (3) /tmp/libxgboost4j8204609160539679832.so(xgboost::DMatrix* xgboost::DMatrix::Create<void*, void*, void (void*), int (void*)>(void*, void*, std::shared_ptr<xgboost::DMatrix>, void (*)(void*), int (*)(void*), float, int, int, long)+0x81) [0x7fdbd71e6871]
  [bt] (4) /tmp/libxgboost4j8204609160539679832.so(XGQuantileDMatrixCreateFromCallback+0x3af) [0x7fdbd6e3645f]
  [bt] (5) /tmp/libxgboost4j8204609160539679832.so(XGQuantileDMatrixCreateFromCallbackImpl+0x2bb) [0x7fdbd7ba977b]
  [bt] (6) /tmp/libxgboost4j8204609160539679832.so(Java_ml_dmlc_xgboost4j_java_XGBoostJNI_XGQuantileDMatrixCreateFromCallback+0x93) [0x7fdbd7b9abe3]
  [bt] (7) [0x7ff0350183e7]


	at ml.dmlc.xgboost4j.java.XGBoostJNI.checkCall(XGBoostJNI.java:48)
	at ml.dmlc.xgboost4j.java.QuantileDMatrix.<init>(QuantileDMatrix.java:69)
	at ml.dmlc.xgboost4j.java.QuantileDMatrix.<init>(QuantileDMatrix.java:38)
	at ml.dmlc.xgboost4j.scala.QuantileDMatrix.<init>(QuantileDMatrix.scala:36)
	at ml.dmlc.xgboost4j.scala.spark.GpuXGBoostPlugin.$anonfun$buildRddWatches$7(GpuXGBoostPlugin.scala:144)
	at scala.Option.getOrElse(Option.scala:189)
	at ml.dmlc.xgboost4j.scala.spark.GpuXGBoostPlugin.ml$dmlc$xgboost4j$scala$spark$GpuXGBoostPlugin$$buildQuantileDMatrix$1(GpuXGBoostPlugin.scala:144)
	at ml.dmlc.xgboost4j.scala.spark.GpuXGBoostPlugin$$anon$2.next(GpuXGBoostPlugin.scala:167)
	at ml.dmlc.xgboost4j.scala.spark.GpuXGBoostPlugin$$anon$2.next(GpuXGBoostPlugin.scala:164)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$train$2(XGBoost.scala:252)
	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2(RDDBarrier.scala:51)
	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2$adapted(RDDBarrier.scala:51)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
  2. Passed with the following GPU settings (set via nvidia-smi, as sketched below):
    • gpu0: DEFAULT mode
    • gpu1: EXCLUSIVE_PROCESS
    • gpu2: EXCLUSIVE_PROCESS
    • gpu3: EXCLUSIVE_PROCESS

Observed that the processes running on GPUs 1, 2, and 3 were also accessing GPU 0:
[screenshot: nvidia-smi output, processes section]
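
For reference, the per-GPU compute modes in the passing configuration can be set with nvidia-smi; the device indices below are assumed to match the list above, and the commands require administrator privileges:

  sudo nvidia-smi -i 0 -c DEFAULT
  sudo nvidia-smi -i 1 -c EXCLUSIVE_PROCESS
  sudo nvidia-smi -i 2 -c EXCLUSIVE_PROCESS
  sudo nvidia-smi -i 3 -c EXCLUSIVE_PROCESS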

@yinqingh
Author

cc @wbo4958 @NvTimLiu

@trivialfis
Member

Hi, could you please share the use case for exclusive mode with a Spark cluster? It seems quite difficult to work around if only a single process is allowed to access the GPU.

@yinqingh
Author

yinqingh commented Dec 20, 2024

No real use case, actually. I hadn't previously realized that it only works with DEFAULT mode.

Another strange behavior: processes 3981172, 3981171, and 3981167 (presumably the Spark executor processes) first ran on GPUs 1, 2, and 3, and then all three processes were accessing GPU 0 instead. Not sure whether this is expected behavior. You can see the processes section in the screenshot.

I also tried setting GPU 1 to DEFAULT mode, and the processes still tried to access different GPUs.
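
One plausible explanation (an assumption on my part, not something confirmed by these logs): CUDA places a context on device 0 for any call made before a process explicitly selects its assigned device, so an executor pinned to GPU 1 can still end up with a context on GPU 0; under EXCLUSIVE_PROCESS, the second process to touch GPU 0 is then rejected with cudaErrorDevicesUnavailable. A minimal Scala sketch for checking which GPU Spark actually assigned to a task:

  import org.apache.spark.TaskContext

  // Runs inside a task: with spark.task.resource.gpu.amount=1, the scheduler
  // assigns exactly one GPU address to each task.
  val tc = TaskContext.get()
  val assignedGpu = tc.resources()("gpu").addresses.head  // e.g. "1"
  println(s"Task ${tc.partitionId()} was assigned GPU $assignedGpu")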
