
LightGBM: NullPointerException #405

Closed

alois-bissuel opened this issue Oct 23, 2018 · 11 comments
@alois-bissuel

When training on a very small dataset, I get a Java NullPointerException after quite some time, during the first reduce job.

From the call stack, it seems that it is related to the network (see TrainUtils.scala#208).

I can't rule out a problem caused by my fix for mmlspark#390.

Any ideas or further tests I could run?

Thanks in advance.

P.S. Here is the associated call stack
java.lang.NullPointerException
	at com.microsoft.ml.spark.TrainUtils$.trainLightGBM(TrainUtils.scala:208)
	at com.microsoft.ml.spark.LightGBMClassifier$$anonfun$1.apply(LightGBMClassifier.scala:60)
	at com.microsoft.ml.spark.LightGBMClassifier$$anonfun$1.apply(LightGBMClassifier.scala:60)
	at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:202)
	at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:199)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

@imatiach-msft
Contributor

@alois-bissuel Sorry about the trouble you are having. I wonder if it could be the `nodes.split(",").length`, but it should be easy to tell from this logged line whether that is the issue:
https://github.com/Azure/mmlspark/blob/7eed833156f507357c8b43aefcf77ddfab0f2be9/src/lightgbm/src/main/scala/TrainUtils.scala#L182
Could you please send me the full YARN logs/stderr/stdout at mmlspark-support@microsoft.com, or post a link to them here on GitHub?
It could also be that your DLL does not match the source exactly, although it looks like this file hasn't been modified since Aug 27, so that is probably not the case:
https://github.com/Azure/mmlspark/commits/7eed833156f507357c8b43aefcf77ddfab0f2be9/src/lightgbm/src/main/scala/TrainUtils.scala

Looking forward to your response and to resolving this mystery.
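For what it's worth, the `nodes.split(",")` suspicion above is easy to reproduce in isolation: if the reply the worker read from the driver came back as null, calling split on it throws exactly this kind of NullPointerException. A minimal, hypothetical sketch (not the mmlspark code; `parseNodeCount` is an invented name):

```java
// Hypothetical repro of the suspected failure mode: the worker parses the
// node list received from the driver, and if that read returned null,
// splitting it throws an NPE like the one in the stack trace above.
public class NullNodesRepro {
    static int parseNodeCount(String nodes) {
        return nodes.split(",").length; // throws NullPointerException when nodes == null
    }

    public static void main(String[] args) {
        // A populated node list parses fine:
        System.out.println(parseNodeCount("10.224.121.20:12448,10.224.121.20:12449")); // prints 2

        // A null reply (what the workers logged) blows up:
        try {
            parseNodeCount(null);
        } catch (NullPointerException e) {
            System.out.println("NPE reproduced");
        }
    }
}
```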

@alois-bissuel
Author

Thanks for the quick answer. I have sent you the files at the email address you provided.

For the record, it seems that there are indeed some network issues:
18/10/23 17:05:04 INFO spark.LightGBMClassifier: LightGBM worker connecting to host: 10.224.121.20 and port: 37789
18/10/23 17:05:04 INFO spark.LightGBMClassifier: LightGBM worker connecting to host: 10.224.121.20 and port: 37789
18/10/23 17:05:04 INFO spark.LightGBMClassifier: LightGBM worker connecting to host: 10.224.121.20 and port: 37789
18/10/23 17:05:04 INFO spark.LightGBMClassifier: LightGBM worker connecting to host: 10.224.121.20 and port: 37789
18/10/23 17:05:04 INFO spark.LightGBMClassifier: LightGBM worker connecting to host: 10.224.121.20 and port: 37789
18/10/23 17:07:51 INFO spark.LightGBMClassifier: LightGBM worker got nodes for network init: null
18/10/23 17:07:51 INFO spark.LightGBMClassifier: LightGBM worker got nodes for network init: null
18/10/23 17:07:51 INFO spark.LightGBMClassifier: LightGBM worker got nodes for network init: null
18/10/23 17:07:51 INFO spark.LightGBMClassifier: LightGBM worker got nodes for network init: null
18/10/23 17:07:51 INFO spark.LightGBMClassifier: LightGBM worker got nodes for network init: null
18/10/23 17:07:51 INFO spark.LightGBMClassifier: LightGBM worker listening on: 12448
18/10/23 17:07:51 INFO spark.LightGBMClassifier: LightGBM worker listening on: 12449
18/10/23 17:07:51 INFO spark.LightGBMClassifier: LightGBM worker listening on: 12450
18/10/23 17:07:51 INFO spark.LightGBMClassifier: LightGBM worker listening on: 12452
18/10/23 17:07:51 INFO spark.LightGBMClassifier: LightGBM worker listening on: 12451

@imatiach-msft
Contributor

@alois-bissuel thanks, this line seems very suspicious:

got nodes for network init: null

Clearly something went wrong: the workers should have received the list of available nodes from the driver. I'm not sure yet how this is possible.

@alois-bissuel
Author

OK. Any other checks I could make?

@imatiach-msft
Contributor

also, I see this line:
https://github.com/Azure/mmlspark/blob/e02045508aa107a79f32ae88209b9584fb202f03/src/lightgbm/src/main/scala/LightGBMUtils.scala#L80
but I never see this line:
https://github.com/Azure/mmlspark/blob/e02045508aa107a79f32ae88209b9584fb202f03/src/lightgbm/src/main/scala/LightGBMUtils.scala#L90
which I think is the problem. Somehow the driver is not writing back to all connections, yet the workers are reading a reply that turns out to be null. Very strange.
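The handshake being described — each worker connects to the driver, sends its own host:port, and the driver is supposed to write the comma-joined list of all nodes back on every connection — can be sketched as follows. This is a hypothetical illustration, not the mmlspark implementation (names like `HandshakeSketch` are invented), but it shows where a null comes from on the worker side: if the driver closes a connection without writing, the worker's readLine() returns null.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;

public class HandshakeSketch {
    // Driver side: accept one connection per worker, read each worker's
    // "host:port" line, then write the comma-joined list back on every
    // socket. If a socket were closed before the write, the corresponding
    // worker would read null -- matching the "network init: null" logs.
    static void driver(ServerSocket server, int numWorkers) throws Exception {
        List<Socket> sockets = new ArrayList<>();
        StringJoiner allNodes = new StringJoiner(",");
        for (int i = 0; i < numWorkers; i++) {
            Socket s = server.accept();
            BufferedReader in =
                new BufferedReader(new InputStreamReader(s.getInputStream()));
            allNodes.add(in.readLine()); // worker announces its host:port
            sockets.add(s);
        }
        for (Socket s : sockets) {
            // The reply must reach every worker before the socket is closed.
            new PrintWriter(s.getOutputStream(), true).println(allNodes.toString());
            s.close();
        }
    }

    // Worker side: send our own address, then block until the driver replies.
    // readLine() returns null if the driver closed the socket without writing.
    static String worker(int port, String self) throws Exception {
        try (Socket s = new Socket("127.0.0.1", port)) {
            new PrintWriter(s.getOutputStream(), true).println(self);
            return new BufferedReader(
                new InputStreamReader(s.getInputStream())).readLine();
        }
    }
}
```

Under this reading, the symptom in the logs corresponds to the worker-side readLine() observing end-of-stream rather than the node list.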

@imatiach-msft
Contributor

@alois-bissuel Let me update the 2.2.1 release to the latest code and create a new build for you, so you can pick up the new changes.

@imatiach-msft
Contributor

@alois-bissuel I started a new build here: #391, after updating to latest master. I'm not sure that will fix your issue, though, because there seems to be a problem with socket communication between the workers and the driver: a worker is receiving something the driver didn't send. Maybe I need to add more debug logging there next.

@alois-bissuel
Author

OK, it looks like the network error is solved. Thanks a lot! I will get back to issue #390, as there are now other errors.

For the record, now the network lines look like the following:
18/10/26 14:28:28 INFO spark.LightGBMClassifier: send current worker info to driver: 10.225.4.20:12419
18/10/26 14:28:49 INFO spark.LightGBMClassifier: LightGBM worker got nodes for network init: 10.225.4.20:12415,10.225.4.20:12417,10.225.4.20:12419,10.225.4.20:12416,10.225.4.20:12418,10.224.221.11:12407,10.224.221.11:12406,10.224.221.11:12408,10.224.221.11:12405,10.224.24.24:12410,10.224.24.24:12413,10.224.24.24:12412,10.224.24.24:12411,10.224.24.24:12414,10.224.221.11:12409,10.224.242.19:12420,10.224.242.19:12422,10.224.242.19:12421,10.224.242.19:12423,10.224.242.19:12424

@sully90

sully90 commented Mar 8, 2019

I'm also getting this issue with mmlspark:v0.16 and LightGBM 2.2.2. Our Spark cluster runs on CentOS 7, so I had to build LightGBM manually with SWIG (due to microsoft/LightGBM#1945). The library works when using Spark locally, but fails with more than one node. Here is the stack trace:

[Stage 3:>                                                        (0 + 12) / 16]2019-03-08 07:35:23 WARN  TaskSetManager:66 - Lost task 3.0 in stage 3.0 (TID 243, 178.63.65.13, executor 1): java.lang.NullPointerException
at com.microsoft.ml.spark.TrainUtils$.trainLightGBM(TrainUtils.scala:218)
at com.microsoft.ml.spark.LightGBMClassifier$$anonfun$3.apply(LightGBMClassifier.scala:83)
at com.microsoft.ml.spark.LightGBMClassifier$$anonfun$3.apply(LightGBMClassifier.scala:83)
at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:188)
at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:185)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

And the Spark worker logs show the same errors as @alois-bissuel's:

2019-03-08 07:35:22 INFO  LightGBMClassifier:192 - LightGBM worker got nodes for network init: null
2019-03-08 07:35:22 INFO  LightGBMClassifier:192 - LightGBM worker got nodes for network init: null
2019-03-08 07:35:22 INFO  LightGBMClassifier:192 - LightGBM worker got nodes for network init: null
2019-03-08 07:35:22 INFO  LightGBMClassifier:192 - LightGBM worker got nodes for network init: null
2019-03-08 07:35:22 INFO  LightGBMClassifier:192 - LightGBM worker got nodes for network init: null

Any help would be greatly appreciated!
Thanks.

@x52xiao

x52xiao commented Nov 18, 2019

Hello. I'm using 0.18.1 but I'm still seeing exactly the same issue. Please advise. @imatiach-msft
LightGBMClassifier.apply(TrainUtils.scala:355): LightGBM worker got nodes for network init: null
[Executor task launch worker for task 12] Executor.logError(Logging.scala:91): Exception in task 0.0 in stage 12.0 (TID 12)
java.lang.NullPointerException
at com.microsoft.ml.spark.lightgbm.TrainUtils$.networkInit(TrainUtils.scala:367)
at com.microsoft.ml.spark.lightgbm.TrainUtils$.networkInit(TrainUtils.scala:374)
at com.microsoft.ml.spark.lightgbm.TrainUtils$.networkInit(TrainUtils.scala:374)
at com.microsoft.ml.spark.lightgbm.TrainUtils$.networkInit(TrainUtils.scala:374)
at com.microsoft.ml.spark.lightgbm.TrainUtils$.trainLightGBM(TrainUtils.scala:410)
at com.microsoft.ml.spark.lightgbm.LightGBMBase$$anonfun$6.apply(LightGBMBase.scala:85)
at com.microsoft.ml.spark.lightgbm.LightGBMBase$$anonfun$6.apply(LightGBMBase.scala:85)
at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:188)
at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:185)
