XGBoost error code 255 #181
Comments
@albertodema I don't see any explicit errors emitted except the error status. I would recommend asking on https://github.com/dmlc/xgboost/issues |
@tovbinm thanks for your input, but they will likely ask me how TransmogrifAI is calling their module, with which parameters, etc. |
@albertodema I think it might be related to this issue: dmlc/xgboost#2449, since when we do cross validation we train multiple models in parallel. So I tried setting the parallelism to 1, and the error still happens sometimes. So my bet is that there is some race condition, which I am not sure how to track down yet. @CodingCat might have some ideas? |
which version of xgb are you using? |
The latest - |
Ok..we are supposed to have fixed this issue in 0.81...and I can actually run cross validation without any issue...can you provide a way to reproduce consistently? |
Here is the code (use commit d0785f0; the input file is here, passed as the args(0) parameter):
import com.salesforce.op._
import com.salesforce.op.features.FeatureBuilder
import com.salesforce.op.features.types._
import com.salesforce.op.readers.DataReaders
import com.salesforce.op.stages.impl.classification.BinaryClassificationModelsToTry.{ OpXGBoostClassifier}
import com.salesforce.op.stages.impl.classification._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, LogManager}
/**
 * A minimal Titanic Survival example with TransmogrifAI
 */
object OpTitanicMini {
  case class Passenger
  (
    id: Long,
    survived: Double,
    pClass: Option[Long],
    name: Option[String],
    sex: Option[String],
    age: Option[Double],
    sibSp: Option[Long],
    parCh: Option[Long],
    ticket: Option[String],
    fare: Option[Double],
    cabin: Option[String],
    embarked: Option[String]
  )
  def main(args: Array[String]): Unit = {
    LogManager.getLogger("com.salesforce.op").setLevel(Level.ERROR)
    implicit val spark = SparkSession.builder.config(new SparkConf()).getOrCreate()
    import spark.implicits._
    // Read Titanic data as a DataFrame
    val pathToData = Option(args(0))
    val passengersData = DataReaders.Simple.csvCase[Passenger](pathToData, key = _.id.toString).readDataset().toDF()
    // Automated feature engineering
    val (survived, features) = FeatureBuilder.fromDataFrame[RealNN](passengersData, response = "survived")
    val passengerId = features.find(_.name == "id").map(_.asInstanceOf[FeatureLike[Integral]]).get
    val featureVector = features.transmogrify()
    // Automated feature selection
    val checkedFeatures = survived.sanityCheck(featureVector, checkSample = 1.0, removeBadFeatures = true)
    // Automated model selection
    val prediction = BinaryClassificationModelSelector
      .withCrossValidation(modelTypesToUse = Seq(OpXGBoostClassifier))
      .setInput(survived, checkedFeatures).getOutput()
    val model = new OpWorkflow().setInputDataset(passengersData).setResultFeatures(passengerId, checkedFeatures, prediction).train()
    println("Model summary:\n" + model.summaryPretty())
  }
} |
@albertodema I will start looking into this...where did you run this, a laptop or a cluster? |
@CodingCat on a laptop with IntelliJ first, then inside a Docker container. I also tried launching Spark in single-core mode, but the error still occurs. |
Here is how to reproduce. I train 10 XGBoost models in parallel and it fails:
val sparse = RandomVector.sparse(RandomReal.uniform[Real](), 1000).take(10000)
val labels = RandomBinary(0.5).withProbabilityOfEmpty(0.0).take(10000).map(b => b.toDouble.toRealNN(0.0))
val sample = sparse.zip(labels).toSeq
val (data, features, label) = TestFeatureBuilder(sample)
(1 to 10).par.map { _ =>
  val x = new XGBoostClassifier().setLabelCol(label.name).setFeaturesCol(features.name)
  x.set(x.trackerConf, TrackerConf(0L, "scala"))
  val xm = x.fit(data)
  val xtransformed = xm.transform(data)
  xtransformed.show()
}
Error:
|
so it only happens with parallel model training? |
With parallel execution it is constantly reproducible. Sometimes it also comes up when training multiple models sequentially, but it's rather rare. |
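Editor's sketch, not part of the original thread: one way to check whether overlapping fits are the trigger is to force every fit onto a single thread. The version below uses only the Scala standard library; `fitModel` is a hypothetical stand-in for the real `x.fit(data)` call, and the one-hour timeout is an arbitrary assumption. Note that, per the comment above, sequential training only makes the failure rarer and does not eliminate it.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object SequentialFits {
  // Hypothetical stand-in for `x.fit(data)`; replace with the real training call.
  def fitModel(id: Int): String = s"model-$id"

  // Runs every fit on one dedicated thread, so no two fits can overlap.
  def fitAll(ids: Seq[Int]): Seq[String] = {
    val pool = Executors.newSingleThreadExecutor()
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      val futures = ids.map(id => Future(fitModel(id)))
      // Arbitrary upper bound on total training time (an assumption).
      Await.result(Future.sequence(futures), 1.hour)
    } finally pool.shutdown()
  }

  def main(args: Array[String]): Unit =
    println(fitAll(1 to 10).mkString(", "))
}
```

If the calling code does not already use futures, replacing `(1 to 10).par.map` with a plain `(1 to 10).map` gives the same sequential behaviour with less machinery.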
Is this question still being followed up? I also encountered the same problem. |
Yes, we are aware of the problem, but we were unable to track down the reason for it yet. Perhaps you want to look into it? @zhenchuan this would be a very valuable contribution :) |
Is it possibly related to dmlc/xgboost#4054? I had instances where it would sometimes work and sometimes wouldn't (within TransmogrifAI). So I went to just vanilla xgboost-spark and found the same thing (in both straight model training and cross-validation). Training would fail, and then there would be an issue with dead letters. |
@timsetsfire thanks. I will give it a try. Also xgboost |
The same error also persists with XGBoost 0.82. Here is another error of the same type: dmlc/xgboost#3418. @CodingCat any suggestions on how to overcome it? |
Are you actually using the Scala version of the Rabit tracker? |
Yes, it fails with the Scala tracker (the Python implementation of the Rabit tracker on Databricks works great). |
ah......scala tracker.....out of maintenance for a while...... |
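Editor's sketch, not part of the original thread: given the two comments above, one experiment is to request the Python tracker instead of the Scala one. The snippet reuses the `x.set(x.trackerConf, TrackerConf(...))` call from the reproduction earlier in the thread and changes only the implementation string. It assumes the xgboost4j-spark 0.8x API used there and, as far as I know, needs a Python interpreter available on the driver. The object name and the column-name parameters are placeholders.

```scala
import ml.dmlc.xgboost4j.scala.spark.{TrackerConf, XGBoostClassifier}

object PythonTrackerExample {
  // labelCol and featuresCol are placeholders for your own column names.
  def classifier(labelCol: String, featuresCol: String): XGBoostClassifier = {
    val x = new XGBoostClassifier().setLabelCol(labelCol).setFeaturesCol(featuresCol)
    // Same call as in the reproduction, with "python" in place of "scala".
    x.set(x.trackerConf, TrackerConf(0L, "python"))
  }
}
```

The first `TrackerConf` argument (the worker connection timeout) is kept at `0L`, the same value as in the reproduction.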
Thanks @wsuchy and @CodingCat |
We will update the project with the upcoming 0.83 (once available). |
Hello, I also encountered the same problem. I am using Spark 2.3.2 and xgboost4j-spark 0.90, and model training fails with ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed.
My code is as follows:
val (response,feature) = FeatureBuilder.fromDataFrame[RealNN](frame,label)
println(s"response = ${response}")
val features = feature.dropWhile{case x=>x.name==id}
println("============== opFeatures ==============")
features.foreach(println(_))
val transmogrifyFeature = features.transmogrify()
val checkedFeature = response.sanityCheck(transmogrifyFeature,removeBadFeatures = true)
val prediction = BinaryClassificationModelSelector.withTrainValidationSplit(
  modelTypesToUse = Seq(OpXGBoostClassifier)
).setInput(response, checkedFeature).getOutput()
val evaluator = Evaluators.BinaryClassification().setLabelCol(label).setPredictionCol(prediction)
val workflow = new OpWorkflow().setInputDataset(frame,(row: Row)=>row.get(0).toString).setResultFeatures(prediction)
println("============training===========")
val model = workflow.train()
println(s"Model Summary:\n ${model.summaryPretty()}")
Thank you for reading; I look forward to your reply. |
@zhenchuan which TransmogrifAI version are you using? |
@tovbinm Hello, I am using version 0.60 |
@shenzgang The XGBoost fix for this issue comes with PR #402. You can either build TransmogrifAI locally by pulling the repo and checking out the branch, or wait until we release the next version of TransmogrifAI. Perhaps @gerashegalov or @leahmcguire can comment on when exactly that will be. |
@tovbinm and when I use OpNaiveBayes for binary classification with cross-validation, the following exceptions are thrown: |
I think this might have been fixed in PR #404. Try using the TransmogrifAI 0.6.1 release. |
Ok, thanks! I'll keep following transmogrifai! |
TransmogrifAI 0.6.1 was released 2 weeks ago. Are you asking when we will release with the updated spark version? |
Hi I am trying to use the new XGBoost support in master (latest commit d0785f0) but I am facing the following issue:
Here is the code (binary classification of the Titanic dataset = passengersData; the target column is Survived).
The log and the error are attached:
logxg.txt