LightGBM stuck at "reduce at LightGBMClassifier.scala:150" #1053

Open
OldDreamHunter opened this issue May 20, 2021 · 11 comments

@OldDreamHunter

OldDreamHunter commented May 20, 2021

I have already seen issue #542, but the answer there does not solve my problem.

My dataset is nearly 72 GB with 145 columns. My Spark config is:
spark-submit \
--master yarn \
--deploy-mode client \
--executor-memory 15g \
--driver-memory 15g \
--executor-cores 1 \
--num-executors 20 \
--packages com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1 \
--conf spark.default.parallelism=5000 \
--conf spark.sql.shuffle.partitions=5000 \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.memory.storageFraction=0.3 \
--conf spark.executor.memoryOverhead=15g \
--conf spark.driver.maxResultSize=10g \

If I reduce the dataset size to 24 GB, I can train the model in 40 minutes. But if I increase the dataset to 72 GB, training gets stuck at "reduce at LightGBMClassifier.scala:150" and reports several failures: "ExecutorLostFailure (executor 9 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 128370 ms", "java.lang.Exception: Dataset create call failed in LightGBM with error: Socket recv error, code: 104", and "java.net.ConnectException: Connection refused".
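
As a rough sanity check on the configuration above, the per-executor memory budget can be worked out directly. This is only a sketch: the figures come from the spark-submit flags in this post, while the even spread of data across executors is an assumption that may not hold in practice, and the function name `memory_budget` is made up for illustration.

```python
# Figures taken from the spark-submit command in this post.
NUM_EXECUTORS = 20
EXECUTOR_MEMORY_GB = 15   # --executor-memory (JVM heap)
MEMORY_OVERHEAD_GB = 15   # spark.executor.memoryOverhead (non-heap allowance)
DATASET_GB = 72


def memory_budget(dataset_gb, num_executors, heap_gb, overhead_gb):
    """Return the per-executor data share and container sizes in GB.

    Assumes the data is spread evenly across executors, which is an
    idealisation; skewed partitions would make some executors heavier.
    """
    container_gb = heap_gb + overhead_gb
    return {
        "data_per_executor_gb": dataset_gb / num_executors,
        "container_gb": container_gb,
        "cluster_total_gb": num_executors * container_gb,
    }


budget = memory_budget(
    DATASET_GB, NUM_EXECUTORS, EXECUTOR_MEMORY_GB, MEMORY_OVERHEAD_GB
)
print(budget)
```

Native libraries such as LightGBM allocate outside the JVM heap, so on YARN their working memory counts against the whole container (heap plus overhead) rather than the heap alone. The on-disk size is also not necessarily the in-memory size, so the data share computed here is a lower-bound style estimate, not a measurement.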

AB#1188553

@welcome

welcome bot commented May 20, 2021

👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it.

@imatiach-msft
Contributor

Hi @OldDreamHunter, sorry about the trouble you are having. Have you tried increasing the socket timeout? It is defined here:
https://github.com/Azure/mmlspark/blob/master/src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMParams.scala#L47
What are the parameters to lightgbm?
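
For readers who want to try the timeout suggestion from PySpark, here is a minimal sketch. The `timeout` keyword is the one that appears in a later comment in this thread; the helper names `with_socket_timeout` and `build_classifier` are hypothetical, and the import path is an assumption for the mmlspark Python package that may differ between versions. The value below is only an example, so check the linked LightGBMParams.scala for the unit and default.

```python
def with_socket_timeout(params, timeout):
    """Return a copy of `params` with the LightGBM socket timeout set.

    The value is passed through unchanged; check LightGBMParams.scala
    for the unit and the default before picking a number.
    """
    if timeout <= 0:
        raise ValueError("timeout must be positive")
    updated = dict(params)  # copy so the caller's dict is not mutated
    updated["timeout"] = float(timeout)
    return updated


def build_classifier(params):
    """Construct the estimator lazily so this file loads without Spark.

    The import path is an assumption; adjust it to your installed version.
    """
    from mmlspark.lightgbm import LightGBMClassifier  # assumed import path
    return LightGBMClassifier(**params)


base_params = {
    "objective": "binary",
    "featuresCol": "features",
    "labelCol": "label",
}
params = with_socket_timeout(base_params, 120000)  # example value only
print(params)
```

On a cluster with the package installed, `build_classifier(params)` would then return the estimator to fit.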

@OldDreamHunter
Author

Thanks for your reply, @imatiach-msft. I have not increased the socket timeout yet and will try it. The parameters of my model are as follows.

lgb = LightGBMClassifier(
    objective="binary",
    boostingType='gbdt',
    isUnbalance=True,
    featuresCol='features',
    labelCol='label',
    maxBin=64,
    earlyStoppingRound=100,
    learningRate=0.5,
    maxDepth=6,
    numLeaves=48,
    lambdaL1=0.8,
    lambdaL2=45.0,
    baggingFraction=0.7,
    featureFraction=0.7,
    numIterations=200)

@OldDreamHunter
Author

Hi @imatiach-msft, I have increased the timeout and changed the parallelism type to "voting_parallel", but the job still failed at "reduce at LightGBMBase.scala:230" with the failure reason: "Job aborted due to stage failure: Task 8 in stage 4.0 failed 4 times, most recent failure: Lost task 8.3 in stage 4.0 (TID 6027, pro-dchadoop-195-81, executor 22): java.net.ConnectException: Connection refused (Connection refused)"

lgb = LightGBMClassifier(
    boostingType='gbdt',
    isUnbalance=True,
    featuresCol='features',
    labelCol='label',
    maxBin=64,
    earlyStoppingRound=100,
    learningRate=0.5,
    maxDepth=5,
    numLeaves=32,
    lambdaL1=7.0,
    lambdaL2=7.0,
    baggingFraction=0.7,
    featureFraction=0.7,
    numIterations=200,
    parallelism='voting_parallel',
    timeout=120000.0)

@imatiach-msft
Contributor

imatiach-msft commented May 26, 2021

@OldDreamHunter I think that is a red herring; the real error is on one of the other nodes. Can you send all of the unique task error messages? Please ignore the "Connection refused" error.
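
One way to collect the unique task error messages is to deduplicate them after masking the numbers that vary between tasks. The sketch below is an illustration under assumptions: `unique_task_errors` is a hypothetical helper, the input is a list of error strings copied from the Spark UI or from `yarn logs -applicationId <id>`, and the second ExecutorLostFailure line in the sample is invented to show the grouping.

```python
import re
from collections import Counter


def unique_task_errors(lines, ignore=("Connection refused",)):
    """Count distinct error messages, skipping known red herrings.

    Digits are masked so that messages differing only in task IDs,
    executor numbers, or timings collapse into one entry.
    """
    counts = Counter()
    for line in lines:
        message = line.strip()
        if not message or any(marker in message for marker in ignore):
            continue
        counts[re.sub(r"\d+", "N", message)] += 1
    return counts


sample = [
    # First, third and fourth lines are quoted from this thread.
    "ExecutorLostFailure (executor 9 exited caused by one of the running "
    "tasks) Reason: Executor heartbeat timed out after 128370 ms",
    # Hypothetical second occurrence, added only to show the grouping.
    "ExecutorLostFailure (executor 12 exited caused by one of the running "
    "tasks) Reason: Executor heartbeat timed out after 130001 ms",
    "java.lang.Exception: Dataset create call failed in LightGBM with "
    "error: Socket recv error, code: 104",
    "java.net.ConnectException: Connection refused",
]

errors = unique_task_errors(sample)
for message, count in errors.most_common():
    print(count, message)
```

Masking every digit run is a blunt normalisation; it also rewrites error codes such as 104, so keep the raw lines around when reporting back.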

@imatiach-msft
Contributor

You can also try setting useBarrierExecutionMode=True; I think it might give a better error message.

@imatiach-msft
Contributor

I would only use voting_parallel if you have a high number of features; see the guide:
https://lightgbm.readthedocs.io/en/latest/Parallel-Learning-Guide.html
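
The recommendation table in the linked guide can be written down as a small decision function. This is a sketch of that table as I understand it, not part of any library: `recommend_parallelism` is a hypothetical name, and what counts as "large" data or "many" features is left to the caller because the guide gives no hard thresholds.

```python
def recommend_parallelism(large_data: bool, many_features: bool) -> str:
    """Encode the guide's table of recommended parallel algorithms.

    small data  + any feature count -> feature parallel
    large data  + few features      -> data parallel
    large data  + many features     -> voting parallel
    """
    if large_data and many_features:
        return "voting_parallel"
    if large_data:
        return "data_parallel"
    return "feature_parallel"


# The dataset in this thread: 72 GB with 145 columns. Treating 145
# columns as "not many" is a judgment call, not something the guide states.
choice = recommend_parallelism(large_data=True, many_features=False)
print(choice)
```

Note that the `parallelism` parameter discussed in this thread may not accept every value returned here; confirm the allowed values in the parameter documentation for your version.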

@icankeep

icankeep commented Jun 1, 2021

Same problem.
Everything works well when I reduce the amount of training data.

@Simon-LLong

Same problem. Voting Parallel works fine, but accuracy is very low. Much of the data is skipped.

@imatiach-msft
Contributor

@Simon-LLong sorry about the problems you are encountering. Indeed, Voting Parallel can give lower accuracy, but it offers a much better speedup and lower memory usage.

Can you also please try the new mode:
useSingleDatasetMode = True
numThreads = num cores - 1
These two PRs should resolve this:

#1222
#1282

In performance testing we saw a big speedup with the new single dataset mode and numThreads set to num cores - 1 (as well as lower memory usage).
The two PRs above will be available in 0.9.5, or you can get them with the latest build right now.
In 0.9.5 these params will be set by default, but in earlier versions, such as the currently released 0.9.4, you can set them directly.
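
A small sketch of how these two settings might be derived in PySpark code. It rests on two assumptions that the thread does not confirm: that "num cores" means cores per executor, and that a floor of one thread is wanted so single-core executors (as in the original post) do not end up with 0. The helper name `single_dataset_params` is hypothetical.

```python
def single_dataset_params(executor_cores):
    """Suggest useSingleDatasetMode / numThreads keyword arguments.

    Assumes "num cores" means cores per executor. The floor of 1 is an
    added assumption so that executor_cores=1 does not produce 0.
    """
    if executor_cores < 1:
        raise ValueError("executor_cores must be at least 1")
    return {
        "useSingleDatasetMode": True,
        "numThreads": max(1, executor_cores - 1),
    }


eight_core = single_dataset_params(8)
one_core = single_dataset_params(1)  # --executor-cores 1, as in the original post
print(eight_core, one_core)
```

The returned dict can be merged into the keyword arguments passed to LightGBMClassifier on versions that expose these parameters.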

For more information on the new single dataset mode please see the PR description:
#1066

This new mode was created after extensive internal benchmarking.

I have some ideas on how a streaming mode could also be added to distributed LightGBM, where data is streamed into the native histogram-binned representation. That representation should use a small fraction of the memory the full Spark dataset takes when everything is loaded in memory. It might be a little slower to set up, but it should vastly reduce memory usage. This is something I will be looking into in the near future.
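
To give a feel for why a binned representation could be much smaller, here is a back-of-the-envelope comparison. It is a sketch under stated assumptions: the row count is hypothetical (the thread does not give one), values are assumed to be stored densely as 8-byte floats before binning, and each bin index is assumed to take one byte, which is enough for the maxBin=64 used earlier in this thread. The real in-memory layout may differ.

```python
def dense_size_gb(rows, cols, bytes_per_value):
    """Size of a dense rows x cols matrix in GiB."""
    return rows * cols * bytes_per_value / 1024**3


ROWS = 60_000_000  # hypothetical row count, not from the thread
COLS = 145         # column count from the original post

raw_gb = dense_size_gb(ROWS, COLS, 8)     # 8-byte floating-point values
binned_gb = dense_size_gb(ROWS, COLS, 1)  # 1-byte bin index per value
ratio = raw_gb / binned_gb

print(f"raw: {raw_gb:.1f} GiB, binned: {binned_gb:.1f} GiB, ratio: {ratio:.0f}x")
```

Under these assumptions the saving is simply the ratio of bytes per value, independent of the row count.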

@nitinmnsn

The docs say: "numThreads (int) – Number of threads for LightGBM. For the best speed, set this to the number of real CPU cores."
Is this the number of cores on my executor node, the number of cores in my executor, or the number of cores on my cluster?
