LightGBM: NullPointerException #405
@alois-bissuel Sorry about the trouble you are having. I wonder if it could be the `nodes.split(",").length`, but then it should be easy to see based on this logged line if that is the issue. Looking forward to getting your response and resolving this mystery.
Thanks for the quick answer. I have sent you the files at the email address you provided. For the record, it seems that there are indeed some network issues:
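To illustrate the hypothesis above (this is a hedged sketch with a hypothetical helper, not mmlspark's actual code): if the worker never receives the comma-separated node list from the driver, the string it tries to split is `null`, and `nodes.split(",").length` throws exactly the kind of NullPointerException reported here.

```java
// Hypothetical reproduction of the suspected failure mode: calling
// split(",") on a node-list string that was never received (i.e. null).
public class NodeListParser {
    // Counts worker nodes in a comma-separated list,
    // e.g. "10.0.0.1:12400,10.0.0.2:12400" -> 2.
    static int countNodes(String nodes) {
        return nodes.split(",").length; // throws NullPointerException when nodes == null
    }

    public static void main(String[] args) {
        System.out.println(countNodes("10.0.0.1:12400,10.0.0.2:12400")); // prints 2
        try {
            countNodes(null);
        } catch (NullPointerException e) {
            System.out.println("NPE: node list was never received");
        }
    }
}
```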
@alois-bissuel thanks, this line seems very suspicious:
Clearly something went wrong: the workers should have received the available nodes from the driver. I'm not really sure how this is possible yet.
OK. Any other checks I could make?
@alois-bissuel one thing is strange: in your worker logs I don't see this line:
Also, I see this line:
@alois-bissuel let me update the 2.2.1 release to latest and create a new build for you, so you can get the new changes |
@alois-bissuel I started a new build here: #391, after updating to latest master. I'm not sure that will fix your issue, though, because it looks like there is some problem with the socket communication between worker and driver, where the worker is receiving something the driver didn't send. Maybe I need to add more debugging there next.
OK, it looks like the network error is solved. Thanks a lot! I will get back to issue #390, as there are now other errors. For the record, the network lines now look like the following:
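The suspected driver/worker handshake can be sketched with a minimal in-process socket exchange (hedged: hypothetical class and method names, not mmlspark's actual implementation). It shows how a driver that closes the connection without writing leaves the worker with a `null` from `BufferedReader.readLine()`, which would later trigger an NPE when the worker tries to parse the node list.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

// Minimal sketch of a driver sending the aggregated node list to a worker
// over a socket. If the driver closes without sending (simulating the
// network problem in this thread), readLine() returns null on the worker side.
public class HandshakeSketch {
    public static String runHandshake(boolean driverSends) throws IOException {
        try (ServerSocket server = new ServerSocket(0)) {
            Thread driver = new Thread(() -> {
                try (Socket s = new Socket("localhost", server.getLocalPort())) {
                    if (driverSends) {
                        PrintWriter out = new PrintWriter(s.getOutputStream(), true);
                        out.println("10.0.0.1:12400,10.0.0.2:12400");
                    } // else: close the connection without writing anything
                } catch (IOException ignored) {
                }
            });
            driver.start();
            try (Socket worker = server.accept();
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(worker.getInputStream()))) {
                return in.readLine(); // null if the driver sent nothing
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(runHandshake(true));  // prints the node list
        System.out.println(runHandshake(false)); // prints null
    }
}
```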
I'm also getting this issue with mmlspark:v0.16 and LightGBM 2.2.2. Our Spark cluster runs on CentOS 7, so I had to build LightGBM manually with SWIG (due to microsoft/LightGBM#1945). The library works when using Spark locally, but fails with more than one node. Here is the stack trace:
And the Spark worker logs show the same errors as @alois-bissuel's:
Any help would be greatly appreciated!
Hello. I'm using 0.18.1 but I'm still seeing exactly the same issue. Please advise. @imatiach-msft
When training on a very small dataset, I get a Java NullPointerException after quite some time, during the first reduce job.
From the call stack, it seems to be related to the network (see TrainUtils.scala#208).
I can't rule out a problem related to my workaround for mmlspark#390.
Any ideas, or further tests I could run?
Thanks in advance.
P.S. Here is the associated call stack:
java.lang.NullPointerException
	at com.microsoft.ml.spark.TrainUtils$.trainLightGBM(TrainUtils.scala:208)
	at com.microsoft.ml.spark.LightGBMClassifier$$anonfun$1.apply(LightGBMClassifier.scala:60)
	at com.microsoft.ml.spark.LightGBMClassifier$$anonfun$1.apply(LightGBMClassifier.scala:60)
	at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:202)
	at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:199)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
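The trace above surfaces the NPE deep inside the Spark reduce task rather than at the point where the node list went missing. As a hedged sketch (hypothetical helper, not mmlspark's API), validating the received string up front would fail fast with a descriptive message instead of a bare NullPointerException:

```java
import java.util.Objects;

// Hypothetical defensive variant: check the node list as soon as it is
// received, so a failed driver/worker handshake produces a clear error
// message instead of an NPE much later inside the training closure.
public class NodeListValidator {
    static String requireNodes(String nodes) {
        Objects.requireNonNull(nodes,
                "Worker never received the node list from the driver; "
                        + "check driver/worker network connectivity");
        if (nodes.trim().isEmpty()) {
            throw new IllegalArgumentException("Received an empty node list");
        }
        return nodes;
    }

    public static void main(String[] args) {
        System.out.println(requireNodes("10.0.0.1:12400,10.0.0.2:12400"));
    }
}
```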