
LightGBM: NullPointerException #405

Closed

alois-bissuel opened this issue Oct 23, 2018 · 11 comments
@alois-bissuel

When training on a very small dataset, I get a Java NullPointerException after quite some time, during the first reduce job.

From the call stack, it seems that it is related to the network (see TrainUtils.scala#208).

I can't rule out a problem caused by my fix for mmlspark#390.

Any ideas or further tests I could run?

Thanks in advance.

P.S. Here is the associated call stack
java.lang.NullPointerException
	at com.microsoft.ml.spark.TrainUtils$.trainLightGBM(TrainUtils.scala:208)
	at com.microsoft.ml.spark.LightGBMClassifier$$anonfun$1.apply(LightGBMClassifier.scala:60)
	at com.microsoft.ml.spark.LightGBMClassifier$$anonfun$1.apply(LightGBMClassifier.scala:60)
	at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:202)
	at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:199)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

@imatiach-msft
Contributor

@alois-bissuel Sorry about the trouble you are having. I wonder if it could be the `nodes.split(",").length`, but it should be easy to tell from this logged line whether that is the issue:
https://github.com/Azure/mmlspark/blob/7eed833156f507357c8b43aefcf77ddfab0f2be9/src/lightgbm/src/main/scala/TrainUtils.scala#L182
Could you please send me the full YARN logs/stderr/stdout at mmlspark-support@microsoft.com, or post a link to them here on GitHub?
It could also be that your DLL does not match the source exactly, although it looks like this file hasn't been modified since Aug 27, so that is probably not the case:
https://github.com/Azure/mmlspark/commits/7eed833156f507357c8b43aefcf77ddfab0f2be9/src/lightgbm/src/main/scala/TrainUtils.scala

Looking forward to your response and to resolving this mystery.
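For what it's worth, the `nodes.split(",")` suspicion above is easy to reproduce in isolation: if the reply the worker read from the driver came back as null, calling split on it throws exactly this kind of NullPointerException. A minimal, hypothetical sketch (not the mmlspark code; `parseNodeCount` is an invented name):

```java
// Hypothetical repro of the suspected failure mode: the worker parses the
// node list received from the driver, and if that read returned null,
// splitting it throws an NPE like the one in the stack trace above.
public class NullNodesRepro {
    static int parseNodeCount(String nodes) {
        return nodes.split(",").length; // throws NullPointerException when nodes == null
    }

    public static void main(String[] args) {
        // A populated node list parses fine:
        System.out.println(parseNodeCount("10.224.121.20:12448,10.224.121.20:12449")); // prints 2

        // A null reply (what the workers logged) blows up:
        try {
            parseNodeCount(null);
        } catch (NullPointerException e) {
            System.out.println("NPE reproduced");
        }
    }
}
```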

@alois-bissuel
Author

Thanks for the quick answer. I have sent you the files at the email address you provided.

For the record, it seems that there are indeed some network issues:
18/10/23 17:05:04 INFO spark.LightGBMClassifier: LightGBM worker connecting to host: 10.224.121.20 and port: 37789
18/10/23 17:05:04 INFO spark.LightGBMClassifier: LightGBM worker connecting to host: 10.224.121.20 and port: 37789
18/10/23 17:05:04 INFO spark.LightGBMClassifier: LightGBM worker connecting to host: 10.224.121.20 and port: 37789
18/10/23 17:05:04 INFO spark.LightGBMClassifier: LightGBM worker connecting to host: 10.224.121.20 and port: 37789
18/10/23 17:05:04 INFO spark.LightGBMClassifier: LightGBM worker connecting to host: 10.224.121.20 and port: 37789
18/10/23 17:07:51 INFO spark.LightGBMClassifier: LightGBM worker got nodes for network init: null
18/10/23 17:07:51 INFO spark.LightGBMClassifier: LightGBM worker got nodes for network init: null
18/10/23 17:07:51 INFO spark.LightGBMClassifier: LightGBM worker got nodes for network init: null
18/10/23 17:07:51 INFO spark.LightGBMClassifier: LightGBM worker got nodes for network init: null
18/10/23 17:07:51 INFO spark.LightGBMClassifier: LightGBM worker got nodes for network init: null
18/10/23 17:07:51 INFO spark.LightGBMClassifier: LightGBM worker listening on: 12448
18/10/23 17:07:51 INFO spark.LightGBMClassifier: LightGBM worker listening on: 12449
18/10/23 17:07:51 INFO spark.LightGBMClassifier: LightGBM worker listening on: 12450
18/10/23 17:07:51 INFO spark.LightGBMClassifier: LightGBM worker listening on: 12452
18/10/23 17:07:51 INFO spark.LightGBMClassifier: LightGBM worker listening on: 12451

@imatiach-msft
Contributor

@alois-bissuel thanks, this line seems very suspicious:

got nodes for network init: null

Clearly something went wrong: the workers should have received the list of available nodes from the driver. I'm not sure yet how this is possible.

@alois-bissuel
Author

OK. Any other checks I could make?

@imatiach-msft
Contributor

also, I see this line:
https://github.com/Azure/mmlspark/blob/e02045508aa107a79f32ae88209b9584fb202f03/src/lightgbm/src/main/scala/LightGBMUtils.scala#L80
but I never see this line:
https://github.com/Azure/mmlspark/blob/e02045508aa107a79f32ae88209b9584fb202f03/src/lightgbm/src/main/scala/LightGBMUtils.scala#L90
which I think is the problem. Somehow the driver is not writing back to all connections, yet the workers are reading a reply that turns out to be null. Very strange.
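The handshake being described — each worker connects to the driver, sends its own host:port, and the driver is supposed to write the comma-joined list of all nodes back on every connection — can be sketched as follows. This is a hypothetical illustration, not the mmlspark implementation (names like `HandshakeSketch` are invented), but it shows where a null comes from on the worker side: if the driver closes a connection without writing, the worker's readLine() returns null.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;

public class HandshakeSketch {
    // Driver side: accept one connection per worker, read each worker's
    // "host:port" line, then write the comma-joined list back on every
    // socket. If a socket were closed before the write, the corresponding
    // worker would read null -- matching the "network init: null" logs.
    static void driver(ServerSocket server, int numWorkers) throws Exception {
        List<Socket> sockets = new ArrayList<>();
        StringJoiner allNodes = new StringJoiner(",");
        for (int i = 0; i < numWorkers; i++) {
            Socket s = server.accept();
            BufferedReader in =
                new BufferedReader(new InputStreamReader(s.getInputStream()));
            allNodes.add(in.readLine()); // worker announces its host:port
            sockets.add(s);
        }
        for (Socket s : sockets) {
            // The reply must reach every worker before the socket is closed.
            new PrintWriter(s.getOutputStream(), true).println(allNodes.toString());
            s.close();
        }
    }

    // Worker side: send our own address, then block until the driver replies.
    // readLine() returns null if the driver closed the socket without writing.
    static String worker(int port, String self) throws Exception {
        try (Socket s = new Socket("127.0.0.1", port)) {
            new PrintWriter(s.getOutputStream(), true).println(self);
            return new BufferedReader(
                new InputStreamReader(s.getInputStream())).readLine();
        }
    }
}
```

Under this reading, the symptom in the logs corresponds to the worker-side readLine() observing end-of-stream rather than the node list.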

@imatiach-msft
Contributor

@alois-bissuel Let me update the 2.2.1 release to the latest code and create a new build for you, so you can pick up the new changes.

@imatiach-msft
Contributor

@alois-bissuel I started a new build here: #391, after updating to latest master. I'm not sure that will fix your issue, though, because there seems to be a problem with socket communication between the workers and the driver: a worker is receiving something the driver didn't send. Maybe I need to add more debug logging there next.

@alois-bissuel
Author

OK, it looks like the network error is solved. Thanks a lot! I will get back to issue #390, as there are now other errors.

For the record, now the network lines look like the following:
18/10/26 14:28:28 INFO spark.LightGBMClassifier: send current worker info to driver: 10.225.4.20:12419
18/10/26 14:28:49 INFO spark.LightGBMClassifier: LightGBM worker got nodes for network init: 10.225.4.20:12415,10.225.4.20:12417,10.225.4.20:12419,10.225.4.20:12416,10.225.4.20:12418,10.224.221.11:12407,10.224.221.11:12406,10.224.221.11:12408,10.224.221.11:12405,10.224.24.24:12410,10.224.24.24:12413,10.224.24.24:12412,10.224.24.24:12411,10.224.24.24:12414,10.224.221.11:12409,10.224.242.19:12420,10.224.242.19:12422,10.224.242.19:12421,10.224.242.19:12423,10.224.242.19:12424

@sully90

sully90 commented Mar 8, 2019

I'm also getting this issue with mmlspark:v0.16 and LightGBM 2.2.2. Our Spark cluster runs on CentOS 7, so I had to build LightGBM manually with SWIG (due to microsoft/LightGBM#1945). The library works when using Spark locally, but fails with more than one node. Here is the stack trace:

[Stage 3:>                                                        (0 + 12) / 16]2019-03-08 07:35:23 WARN  TaskSetManager:66 - Lost task 3.0 in stage 3.0 (TID 243, 178.63.65.13, executor 1): java.lang.NullPointerException
at com.microsoft.ml.spark.TrainUtils$.trainLightGBM(TrainUtils.scala:218)
at com.microsoft.ml.spark.LightGBMClassifier$$anonfun$3.apply(LightGBMClassifier.scala:83)
at com.microsoft.ml.spark.LightGBMClassifier$$anonfun$3.apply(LightGBMClassifier.scala:83)
at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:188)
at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:185)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

And the Spark worker logs show the same errors as @alois-bissuel's:

2019-03-08 07:35:22 INFO  LightGBMClassifier:192 - LightGBM worker got nodes for network init: null
2019-03-08 07:35:22 INFO  LightGBMClassifier:192 - LightGBM worker got nodes for network init: null
2019-03-08 07:35:22 INFO  LightGBMClassifier:192 - LightGBM worker got nodes for network init: null
2019-03-08 07:35:22 INFO  LightGBMClassifier:192 - LightGBM worker got nodes for network init: null
2019-03-08 07:35:22 INFO  LightGBMClassifier:192 - LightGBM worker got nodes for network init: null

Any help would be greatly appreciated!
Thanks.

@x52xiao

x52xiao commented Nov 18, 2019

Hello. I'm using 0.18.1 but I'm still seeing exactly the same issue. Please advise. @imatiach-msft
LightGBMClassifier.apply(TrainUtils.scala:355): LightGBM worker got nodes for network init: null
[Executor task launch worker for task 12] Executor.logError(Logging.scala:91): Exception in task 0.0 in stage 12.0 (TID 12)
java.lang.NullPointerException
at com.microsoft.ml.spark.lightgbm.TrainUtils$.networkInit(TrainUtils.scala:367)
at com.microsoft.ml.spark.lightgbm.TrainUtils$.networkInit(TrainUtils.scala:374)
at com.microsoft.ml.spark.lightgbm.TrainUtils$.networkInit(TrainUtils.scala:374)
at com.microsoft.ml.spark.lightgbm.TrainUtils$.networkInit(TrainUtils.scala:374)
at com.microsoft.ml.spark.lightgbm.TrainUtils$.trainLightGBM(TrainUtils.scala:410)
at com.microsoft.ml.spark.lightgbm.LightGBMBase$$anonfun$6.apply(LightGBMBase.scala:85)
at com.microsoft.ml.spark.lightgbm.LightGBMBase$$anonfun$6.apply(LightGBMBase.scala:85)
at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:188)
at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:185)
