
key not found #406

Closed
tanjiaxin opened this issue Oct 24, 2018 · 21 comments
@tanjiaxin

tanjiaxin commented Oct 24, 2018

Hello, I'm trying to use LightGBM on a standalone-mode Spark cluster.
I have some data on HDFS. LightGBMClassifier works fine when I use part of the data to train the model, but when I use all of the data I get the error below.
error.log
I also tried running cross-validation with the same subset of data; sometimes it hits the same error and sometimes it completes successfully.
My environment:
Spark 2.3.1, Python 3.6.5
adclick-Copy1.zip
Above is the notebook file I used to submit the application.
Could you please help me figure out what the problem is?

@imatiach-msft
Contributor

@tanjiaxin Sorry about the issue you are having. I believe this has been fixed here:
#399
See the related issue:
#397
The error message was the same.
You can use the latest private build here:
#404
The Maven package is uploaded; use

--packages
com.microsoft.ml.spark:mmlspark_2.11:0.14.dev13+1.g58a2027c
and --repositories
https://mmlspark.azureedge.net/maven
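
For reference, a minimal PySpark sketch of pulling in that build (the app name is just an example; equivalently, pass --packages and --repositories to spark-submit):

from pyspark.sql import SparkSession

# Resolve the mmlspark private build from the custom Maven repository.
spark = (SparkSession.builder
         .appName("lightgbm-train")  # hypothetical app name
         .config("spark.jars.packages",
                 "com.microsoft.ml.spark:mmlspark_2.11:0.14.dev13+1.g58a2027c")
         .config("spark.jars.repositories",
                 "https://mmlspark.azureedge.net/maven")
         .getOrCreate())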

The fix should be in the next v0.15 release.

@tanjiaxin
Author

@imatiach-msft thanks for your help, I will try it again.

@tanjiaxin
Author

tanjiaxin commented Oct 25, 2018

@imatiach-msft
I have tried your private build and another error came up. I have also read #379,
but that error should also have been fixed in your private build, as you mentioned in #404.
Here is my error log:
error_connection_refused.log
Is there any other problem that could raise this error?

@imatiach-msft
Contributor

@tanjiaxin Could you try disabling autoscale? It is currently not supported; I wonder if that is causing the errors.

@tanjiaxin
Author

tanjiaxin commented Oct 25, 2018

@imatiach-msft I will change the setting and run it again.
I have looked for the autoscale setting in Spark, but I'm working on a standalone cluster, where the only related setting I could find is "spark.dynamicAllocation.enabled", and that is disabled by default.
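
As a quick sanity check (assuming an active SparkSession named spark), you can confirm the effective value at runtime:

# Returns "false" unless dynamic allocation has been explicitly enabled.
print(spark.conf.get("spark.dynamicAllocation.enabled", "false"))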

@imatiach-msft
Contributor

@tanjiaxin OK, that is the autoscale setting, so it is disabled. It sounds like you are encountering another issue then. It is not clear from the logs what the issue is, because the connection-refused error is a red herring; there should be another error on one of the workers that is the real exception.

@tanjiaxin
Author

@imatiach-msft I'm going to check all my Spark logs. Thanks for your patience.

@tanjiaxin
Author

@imatiach-msft I did find an OOM error on one node; the logs are below:
stderr.log
stdout.log
hs_err_pid21665.log
It says "There is insufficient memory for the Java Runtime Environment to continue."
I believe this is not LightGBM's problem.
This time I used 6 nodes with 8 cores and 13 GB of memory each; my data size is about 6 GB. Do you have any suggestions for this situation?

imatiach-msft self-assigned this Oct 26, 2018
@imatiach-msft
Contributor

hi @tanjiaxin, sorry, this is an issue with LightGBM - the dataset on each partition is replicated in native memory (so the native LightGBM code can run), so at minimum LightGBM takes about 2X the dataset size to train.
You could try two things:
1.) increase the memory of the cluster
2.) use incremental training with LightGBM: split up your dataset, run LightGBM on the first split, save the native learner, and then retrain on the next split passing in the LightGBM learner param (see the sketch below)
Sorry about the inconvenience.
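
A minimal sketch of that incremental loop, assuming train is the training DataFrame and spark is the active SparkSession. This is an illustration, not the exact API at this version: the carry-over parameter, modelString, is shown later in this thread, the import path and saveNativeModel's exact signature may differ across mmlspark releases, and the HDFS paths are hypothetical.

from mmlspark import LightGBMClassifier  # import path may differ by mmlspark version

chunks = train.randomSplit([0.25, 0.25, 0.25, 0.25], seed=42)  # split the training data
model_string = None
for i, chunk in enumerate(chunks):
    params = dict(learningRate=0.05, numIterations=100, numLeaves=26)
    if model_string is not None:
        params["modelString"] = model_string  # continue from the previous booster
    model = LightGBMClassifier(**params).fit(chunk)
    path = "hdfs:///tmp/lightgbm_model/step%d" % i  # hypothetical output path
    model.saveNativeModel(path)  # exact signature may vary; some versions also take the SparkSession
    # Read the saved native model text back into a single string for the next round.
    model_string = "\n".join(spark.sparkContext.textFile(path).collect())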

@imatiach-msft
Contributor

@tanjiaxin see related conversation here:
#390
copy-pasting for reference:
I wouldn't rule out a memory leak in the native code, but I do delete the arrays used to create the native LightGBM dataset here for training:
https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/LightGBMUtils.scala#L318
https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/LightGBMUtils.scala#L359
and here for prediction:
https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/LightGBMBooster.scala#L67
https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/LightGBMBooster.scala#L106
here for label col:
https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/TrainUtils.scala#L62
and here to free learner:
https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/TrainUtils.scala#L111
and here to free dataset after training is done:
https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/TrainUtils.scala#L118
If I missed something somewhere, that could be an issue, but I'm not sure what else it could be; these are the only native constructs created during training.

@tanjiaxin
Author

@imatiach-msft I will try it, thanks for your help.

@tanjiaxin
Author

tanjiaxin commented Oct 29, 2018

@imatiach-msft I have found an incremental training example: https://gist.github.com/goraj/6df8f22a49534e042804a299d81bf2d6
but I can't figure out where I should pass the init_model in mmlspark's LightGBM.
Could you please tell me how, or give me an example?
I have found a way to pass the model; the code is below:
model = LightGBMClassifier(learningRate=0.05, numIterations=100, numLeaves=26).fit(train, params={'model': model})
is that correct?

@imatiach-msft
Contributor

@tanjiaxin assuming you are using pyspark based on the example above, you can use modelString (from this source in scala):
https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/LightGBMParams.scala#L109
The python code should look like:
myModelString = <get the model string from the previous model, e.g. saved to a file by saveNativeModel on the model from the previous training round>
model = LightGBMClassifier(learningRate=0.05, numIterations=100, numLeaves=26, modelString=myModelString).fit(train)

Otherwise, you can always rescale the cluster to a larger size which should handle the full dataset.

@tanjiaxin
Author

tanjiaxin commented Oct 30, 2018

Hi @imatiach-msft, I have got an error: "Model file doesn't specify the number of classes".
In my understanding, num_class should be set when I use objective="multiclass", but I'm using objective="binary".
Also, I can't figure out how to set num_class in mmlspark's LightGBM.
logs:
java.lang.Exception: Booster LoadFromString call failed in LightGBM with error: Model file doesn't specify the number of classes
    at com.microsoft.ml.spark.LightGBMUtils$.validate(LightGBMUtils.scala:26)
    at com.microsoft.ml.spark.LightGBMUtils$.getBoosterPtrFromModelString(LightGBMUtils.scala:53)
    at com.microsoft.ml.spark.TrainUtils$.translate(TrainUtils.scala:75)
    at com.microsoft.ml.spark.TrainUtils$.trainLightGBM(TrainUtils.scala:211)
    at com.microsoft.ml.spark.LightGBMClassifier$$anonfun$1.apply(LightGBMClassifier.scala:58)
    at com.microsoft.ml.spark.LightGBMClassifier$$anonfun$1.apply(LightGBMClassifier.scala:58)
    at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:188)
    at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:185)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

@imatiach-msft
Contributor

@tanjiaxin can you please send me the model file? Usually I get this error when the file is invalid (e.g. a blank string). You shouldn't have to set num_class.

@tanjiaxin
Author

tanjiaxin commented Oct 30, 2018

@imatiach-msft Sorry for the trouble, I think it's my fault: the model was saved as a directory, so of course I couldn't get num_class.

part-00000-02b3b4dd-d082-45e0-8463-55bed1d177e2-c000.zip
Above is the model I saved. I still can't load it; I get the same error. Could you try to load it?

@tanjiaxin
Author

@imatiach-msft I still can't fix the problem. I have viewed the model file, and it does have a line saying "num_class=1". I create the LightGBMClassifier instance with:
lgb = LightGBMClassifier(learningRate=0.05, numIterations=100, modelString="hdfs://hz-ecom-wordspark-01:8020/RD_data/lightgbm_model/step1/part-00000-02b3b4dd-d082-45e0-8463-55bed1d177e2-c000.txt", objective="binary", numLeaves=26)
but when I try to fit, the error comes up.

@imatiach-msft
Contributor

@tanjiaxin sorry, I must have confused you: the model string is the actual string contents, not the file path. You would have to read the file and then pass its string contents to the learner. That is probably why you are getting the error. Also, if you prefer, maybe we can try to resolve this over a Skype call? You can email mmlspark-support@microsoft.com and I can invite you to a meeting.
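
A minimal sketch of that, assuming an active SparkSession named spark; the path is the one from the earlier comment, and the saved part file is read back and joined into one string before being passed as modelString:

from mmlspark import LightGBMClassifier  # import path may differ by mmlspark version

# Read the saved native model text from HDFS and pass its *contents*, not the path.
model_path = ("hdfs://hz-ecom-wordspark-01:8020/RD_data/lightgbm_model/step1/"
              "part-00000-02b3b4dd-d082-45e0-8463-55bed1d177e2-c000.txt")
myModelString = "\n".join(spark.sparkContext.textFile(model_path).collect())
lgb = LightGBMClassifier(learningRate=0.05, numIterations=100, numLeaves=26,
                         objective="binary", modelString=myModelString)
model = lgb.fit(train)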

@tanjiaxin
Author

@imatiach-msft I have solved the problem following your answer, but I have a doubt: what is the difference between training the model on the whole dataset and incremental training on parts of the dataset?
Recently I have heard the view that these kinds of GBM models are designed to be trained on the whole dataset; now that I'm using incremental training with LightGBM, could this cause some side effects?

@imatiach-msft
Contributor

@tanjiaxin I think the post here from the main developer of LightGBM might be relevant; it would apply to any partial dataset. There might be other reasons that accuracy could drop as well.

@tanjiaxin
Author

@imatiach-msft Thanks very much for your help and patience. I have incrementally trained a model on the whole dataset.
