
key not found #406

Closed
tanjiaxin opened this issue Oct 24, 2018 · 21 comments
@tanjiaxin

tanjiaxin commented Oct 24, 2018

Hello, I'm trying to use LightGBM on a standalone-mode Spark cluster.
I have some data on HDFS. LightGBMClassifier works fine when I use part of the data to train the model, but when I use all of the data I get the error below.
error.log
I also tried running cross-validation with the same subset of data; sometimes it hits the same error and sometimes it completes successfully.
My environment:
Spark 2.3.1, Python 3.6.5
adclick-Copy1.zip
Above is the notebook file I used to submit the application.
Could you please help me figure out what the problem is?

@imatiach-msft
Contributor

@tanjiaxin Sorry about the issue you are having. I believe this has been fixed here:
#399
See the related issue:
#397
The error message was the same.
You can use the latest private build here:
#404
The Maven package is uploaded; use

--packages
com.microsoft.ml.spark:mmlspark_2.11:0.14.dev13+1.g58a2027c
and --repositories
https://mmlspark.azureedge.net/maven
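
For reference, a minimal PySpark sketch of pulling in that build (the app name is just an example; equivalently, pass --packages and --repositories to spark-submit):

from pyspark.sql import SparkSession

# Resolve the mmlspark private build from the custom Maven repository.
spark = (SparkSession.builder
         .appName("lightgbm-train")  # hypothetical app name
         .config("spark.jars.packages",
                 "com.microsoft.ml.spark:mmlspark_2.11:0.14.dev13+1.g58a2027c")
         .config("spark.jars.repositories",
                 "https://mmlspark.azureedge.net/maven")
         .getOrCreate())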

The fix should be in the next v0.15 release.

@tanjiaxin
Author

@imatiach-msft thanks for your help, I will try it again.

@tanjiaxin
Author

tanjiaxin commented Oct 25, 2018

@imatiach-msft
I have tried your private build and another error came up. I have also read #379,
but that error should also have been fixed in your private build, as you mentioned in #404.
Here is my error log:
error_connection_refused.log
Is there any other problem that could raise this error?

@imatiach-msft
Contributor

@tanjiaxin Could you try disabling autoscale? It is currently not supported; I wonder if that is causing the errors.

@tanjiaxin
Author

tanjiaxin commented Oct 25, 2018

@imatiach-msft I will change the setting and run it again.
I have looked for the autoscale setting in Spark, but I'm working on a standalone cluster, where the only related setting I could find is "spark.dynamicAllocation.enabled", and that is disabled by default.
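
As a quick sanity check (assuming an active SparkSession named spark), you can confirm the effective value at runtime:

# Returns "false" unless dynamic allocation has been explicitly enabled.
print(spark.conf.get("spark.dynamicAllocation.enabled", "false"))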

@imatiach-msft
Contributor

@tanjiaxin OK, that is the autoscale setting, so it is disabled. It sounds like you are encountering another issue then. It is not clear from the logs what the issue is, because the connection-refused error is a red herring; there should be another error on one of the workers that is the real exception.

@tanjiaxin
Author

@imatiach-msft I'm going to check all my Spark logs. Thanks for your patience.

@tanjiaxin
Author

@imatiach-msft I did find an OOM error on one node; the logs are below:
stderr.log
stdout.log
hs_err_pid21665.log
It says "There is insufficient memory for the Java Runtime Environment to continue."
I believe this is not LightGBM's problem.
This time I used 6 nodes with 8 cores and 13 GB of memory each; my data size is about 6 GB. Do you have any suggestions for this situation?

imatiach-msft self-assigned this Oct 26, 2018
@imatiach-msft
Contributor

hi @tanjiaxin, sorry, this is an issue with LightGBM - the dataset on each partition is replicated in native memory (so the native LightGBM code can run), so at minimum LightGBM takes about 2X the dataset size to train.
You could try two things:
1.) increase the memory of the cluster
2.) use incremental training with LightGBM: split up your dataset, run LightGBM on the first split, save the native learner, and then retrain on the next split passing in the LightGBM learner param (see the sketch below)
Sorry about the inconvenience.
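
A minimal sketch of that incremental loop, assuming train is the training DataFrame and spark is the active SparkSession. This is an illustration, not the exact API at this version: the carry-over parameter, modelString, is shown later in this thread, the import path and saveNativeModel's exact signature may differ across mmlspark releases, and the HDFS paths are hypothetical.

from mmlspark import LightGBMClassifier  # import path may differ by mmlspark version

chunks = train.randomSplit([0.25, 0.25, 0.25, 0.25], seed=42)  # split the training data
model_string = None
for i, chunk in enumerate(chunks):
    params = dict(learningRate=0.05, numIterations=100, numLeaves=26)
    if model_string is not None:
        params["modelString"] = model_string  # continue from the previous booster
    model = LightGBMClassifier(**params).fit(chunk)
    path = "hdfs:///tmp/lightgbm_model/step%d" % i  # hypothetical output path
    model.saveNativeModel(path)  # exact signature may vary; some versions also take the SparkSession
    # Read the saved native model text back into a single string for the next round.
    model_string = "\n".join(spark.sparkContext.textFile(path).collect())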

@imatiach-msft
Contributor

@tanjiaxin see related conversation here:
#390
copy-pasting for reference:
I wouldn't rule out a memory leak in the native code, but I do delete the arrays used to create the native LightGBM dataset here for training:
https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/LightGBMUtils.scala#L318
https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/LightGBMUtils.scala#L359
and here for prediction:
https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/LightGBMBooster.scala#L67
https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/LightGBMBooster.scala#L106
here for label col:
https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/TrainUtils.scala#L62
and here to free learner:
https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/TrainUtils.scala#L111
and here to free dataset after training is done:
https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/TrainUtils.scala#L118
If I missed something somewhere, that could be an issue, but I'm not sure what else it could be; these are the only native constructs created during training.

@tanjiaxin
Author

@imatiach-msft I will try it, thanks for your help.

@tanjiaxin
Author

tanjiaxin commented Oct 29, 2018

@imatiach-msft I have found an incremental training example: https://gist.github.com/goraj/6df8f22a49534e042804a299d81bf2d6
but I can't figure out where I should pass the init_model in mmlspark's LightGBM.
Could you please tell me how, or give me an example?
I have found a way to pass the model; the code is below:
model = LightGBMClassifier(learningRate=0.05, numIterations=100, numLeaves=26).fit(train, params={'model': model})
is that correct?

@imatiach-msft
Contributor

@tanjiaxin assuming you are using pyspark based on the example above, you can use modelString (from this source in scala):
https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/LightGBMParams.scala#L109
The python code should look like:
myModelString = <get the model string from the previous model, e.g. saved to a file by saveNativeModel on the model from the previous training round>
model = LightGBMClassifier(learningRate=0.05, numIterations=100, numLeaves=26, modelString=myModelString).fit(train)

Otherwise, you can always rescale the cluster to a larger size which should handle the full dataset.

@tanjiaxin
Author

tanjiaxin commented Oct 30, 2018

Hi @imatiach-msft, I have got an error: "Model file doesn't specify the number of classes".
In my understanding, num_class should be set when I use objective="multiclass", but I'm using objective="binary".
Also, I can't figure out how to set num_class in mmlspark's LightGBM.
logs:
java.lang.Exception: Booster LoadFromString call failed in LightGBM with error: Model file doesn't specify the number of classes
    at com.microsoft.ml.spark.LightGBMUtils$.validate(LightGBMUtils.scala:26)
    at com.microsoft.ml.spark.LightGBMUtils$.getBoosterPtrFromModelString(LightGBMUtils.scala:53)
    at com.microsoft.ml.spark.TrainUtils$.translate(TrainUtils.scala:75)
    at com.microsoft.ml.spark.TrainUtils$.trainLightGBM(TrainUtils.scala:211)
    at com.microsoft.ml.spark.LightGBMClassifier$$anonfun$1.apply(LightGBMClassifier.scala:58)
    at com.microsoft.ml.spark.LightGBMClassifier$$anonfun$1.apply(LightGBMClassifier.scala:58)
    at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:188)
    at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:185)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

@imatiach-msft
Contributor

@tanjiaxin can you please send me the model file? Usually I get this error when the file is invalid (e.g. a blank string). You shouldn't have to set num_class.

@tanjiaxin
Author

tanjiaxin commented Oct 30, 2018

@imatiach-msft Sorry for the trouble, I think it's my fault: the model was saved as a directory, so of course I couldn't get num_class.

part-00000-02b3b4dd-d082-45e0-8463-55bed1d177e2-c000.zip
Above is the model I saved. I still can't load it; I get the same error. Could you try to load it?

@tanjiaxin
Author

@imatiach-msft I still can't fix the problem. I have viewed the model file, and it does have a line saying "num_class=1". I create the LightGBMClassifier instance with:
lgb = LightGBMClassifier(learningRate=0.05, numIterations=100, modelString="hdfs://hz-ecom-wordspark-01:8020/RD_data/lightgbm_model/step1/part-00000-02b3b4dd-d082-45e0-8463-55bed1d177e2-c000.txt", objective="binary", numLeaves=26)
but when I try to fit, the error comes up.

@imatiach-msft
Contributor

@tanjiaxin sorry, I must have confused you: the model string is the actual string contents, not the file path. You would have to read the file and then pass its string contents to the learner. That is probably why you are getting the error. Also, if you prefer, maybe we can try to resolve this over a Skype call? You can email mmlspark-support@microsoft.com and I can invite you to a meeting.
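
A minimal sketch of that, assuming an active SparkSession named spark; the path is the one from the earlier comment, and the saved part file is read back and joined into one string before being passed as modelString:

from mmlspark import LightGBMClassifier  # import path may differ by mmlspark version

# Read the saved native model text from HDFS and pass its *contents*, not the path.
model_path = ("hdfs://hz-ecom-wordspark-01:8020/RD_data/lightgbm_model/step1/"
              "part-00000-02b3b4dd-d082-45e0-8463-55bed1d177e2-c000.txt")
myModelString = "\n".join(spark.sparkContext.textFile(model_path).collect())
lgb = LightGBMClassifier(learningRate=0.05, numIterations=100, numLeaves=26,
                         objective="binary", modelString=myModelString)
model = lgb.fit(train)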

@tanjiaxin
Author

@imatiach-msft I have solved the problem following your answer, but I have a doubt: what is the difference between training the model on the whole dataset and incremental training on parts of the dataset?
Recently I have heard the view that these kinds of GBM models are designed to be trained on the whole dataset; now that I'm using incremental training with LightGBM, could this cause some side effects?

@imatiach-msft
Contributor

@tanjiaxin I think the post here from the main developer of LightGBM might be relevant; it would apply to any partial dataset. There might be other reasons that accuracy could drop as well.

@tanjiaxin
Author

@imatiach-msft Thanks very much for your help and patience. I have incrementally trained a model on the whole dataset.
