18/06/28 15:09:51|INFO|XGBoostSpark:Rabit returns with exit code 3
Exception in thread "main" ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.ml$dmlc$xgboost4j$scala$spark$XGBoost$$postTrackerReturnProcessing(XGBoost.scala:408)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4.apply(XGBoost.scala:358)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4.apply(XGBoost.scala:339)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:338)
at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.train(XGBoostEstimator.scala:139)
at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.train(XGBoostEstimator.scala:36)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:96)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainWithDataFrame(XGBoost.scala:195)
at ml.dmlc.xgboost4j.scala.spark.XGBoostUtils.testBinaryClass(XGBoostUtils.scala:115)
at ml.dmlc.spark.SparkMain.main(SparkMain.java:14)
What have you tried?
I wanted to see the real exception, so I added printStackTrace to the waitFor function in RabitTracker.scala:
    private def waitFor(atMost: Duration): Int = {
      // request the completion Future from the tracker actor
      Try(Await.result(handler ? RabitTrackerHandler.RequestCompletionFuture, askTimeout.duration)
        .asInstanceOf[Future[Int]]) match {
        case Success(futureCompleted) =>
          // wait for all workers to complete synchronously.
          val statusCode = Try(Await.result(futureCompleted, atMost)) match {
            case Success(n) if n == numWorkers => IRabitTracker.TrackerStatus.SUCCESS.getStatusCode
            case Success(n) if n < numWorkers => IRabitTracker.TrackerStatus.TIMEOUT.getStatusCode
            case Failure(e) => IRabitTracker.TrackerStatus.FAILURE.getStatusCode
          }
          system.shutdown()
          statusCode
        case Failure(ex: Throwable) =>
          ex.printStackTrace()
          if (!system.isTerminated) {
            system.shutdown()
          }
          IRabitTracker.TrackerStatus.FAILURE.getStatusCode
      }
    }
This printed the following stack trace:
akka.pattern.AskTimeoutException: Recipient[Actor[akka://RabitTracker/user/Handler#-1269920980]] had already been terminated.
at akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:132)
at akka.pattern.AskableActorRef$.$qmark$extension(AskSupport.scala:144)
at ml.dmlc.xgboost4j.scala.rabit.RabitTracker$$anonfun$5.apply(RabitTracker.scala:158)
at ml.dmlc.xgboost4j.scala.rabit.RabitTracker$$anonfun$5.apply(RabitTracker.scala:159)
at scala.util.Try$.apply(Try.scala:192)
at ml.dmlc.xgboost4j.scala.rabit.RabitTracker.waitFor(RabitTracker.scala:158)
at ml.dmlc.xgboost4j.scala.rabit.RabitTracker.waitFor(RabitTracker.scala:192)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4$$anonfun$2.apply$mcI$sp(XGBoost.scala:356)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4$$anonfun$2.apply(XGBoost.scala:356)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4$$anonfun$2.apply(XGBoost.scala:356)
at org.apache.spark.SparkParallelismTracker.safeExecute(SparkParallelismTracker.scala:82)
at org.apache.spark.SparkParallelismTracker.execute(SparkParallelismTracker.scala:108)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4.apply(XGBoost.scala:356)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4.apply(XGBoost.scala:339)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:338)
at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.train(XGBoostEstimator.scala:139)
at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.train(XGBoostEstimator.scala:36)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:96)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainWithDataFrame(XGBoost.scala:195)
at ml.dmlc.xgboost4j.scala.spark.XGBoostUtils.testBinaryClass(XGBoostUtils.scala:115)
at ml.dmlc.spark.SparkMain.main(SparkMain.java:14)
Based on this exception, I tried to prevent it by commenting out one line in the handleRabitWorkerMessage function of RabitTrackerHandler.scala, so that the tracker handler no longer stops its own actor ref once it has received the shutdown message from every worker. With that change, training works.
If I set the tracker conf to "python", it also works.
My question is: why does this problem happen, and why does it seem to work fine in a Linux environment?
The modified case in handleRabitWorkerMessage:
    case WorkerShutdown(rank, _, _) =>
      assert(rank >= 0, "Invalid rank.")
      assert(!shutdownWorkers.contains(rank))
      shutdownWorkers.add(rank)
      log.info(s"Received shutdown signal from $rank")
      if (shutdownWorkers.size == numWorkers) {
        promisedShutdownWorkers.success(shutdownWorkers.size)
        println(s"Do not stop self handler ${self}")
        // context.stop(self)
      }
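One possible reading of the trace, which I have not confirmed against the xgboost source, is an ordering race: if every worker reports shutdown before waitFor asks the handler for the completion future, the handler has already stopped itself and the ask fails. The sketch below models only that ordering with the Scala standard library. ToyHandler, requestCompletionFuture, workerShutdown, and RaceSketch are hypothetical names for illustration, not xgboost or Akka APIs, and the `stopped` flag merely stands in for context.stop(self).

```scala
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration._
import scala.util.{Success, Try}

// Hypothetical stand-in for the tracker handler; NOT the real RabitTrackerHandler.
final class ToyHandler(numWorkers: Int) {
  private val shutdownWorkers = scala.collection.mutable.Set.empty[Int]
  private val promisedShutdownWorkers = Promise[Int]()
  private var stopped = false

  // Mimics asking the actor for its completion future:
  // the request fails once the handler has stopped itself.
  def requestCompletionFuture(): Future[Int] = synchronized {
    if (stopped) throw new IllegalStateException("handler had already been terminated")
    promisedShutdownWorkers.future
  }

  // Mimics receiving a WorkerShutdown message from one worker.
  def workerShutdown(rank: Int): Unit = synchronized {
    shutdownWorkers.add(rank)
    if (shutdownWorkers.size == numWorkers) {
      promisedShutdownWorkers.success(shutdownWorkers.size)
      stopped = true // assumption: plays the role of context.stop(self)
    }
  }
}

object RaceSketch {
  // Late request: all workers finish before the waiter asks for the future.
  def askAfterShutdown(n: Int): Try[Int] = {
    val h = new ToyHandler(n)
    (0 until n).foreach(h.workerShutdown)
    Try(h.requestCompletionFuture()).map(f => Await.result(f, 1.second))
  }

  // Early request: the future is obtained before any worker can finish.
  def askBeforeShutdown(n: Int): Try[Int] = {
    val h = new ToyHandler(n)
    val f = h.requestCompletionFuture()
    (0 until n).foreach(h.workerShutdown)
    Try(Await.result(f, 1.second))
  }

  def main(args: Array[String]): Unit = {
    println(askAfterShutdown(2))  // a Failure wrapping the IllegalStateException
    println(askBeforeShutdown(2)) // Success(2)
  }
}
```

If this reading is right, the outcome would depend on timing (for example, how quickly a small local job finishes relative to the call to waitFor), which might explain why the same code behaves differently on another machine or operating system. That is a guess, not a confirmed diagnosis.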
Environment info
Operating System:
Windows 7
Compiler:
scala 2.11.8, Spark 2.11
Package used (python/R/jvm/C++):
jvm packages
xgboost version used:
release-0.72
Steps to reproduce
1. Import jvm_packages into an IDEA project.
2. Add test code that runs on Spark in local mode.
3. Run the code. It produces the log shown at the top of this report.