Lightgbm - mysterious OOM problems #1124
👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it.
Hi @trillville, sorry to hear of your problems. I see you're using the latest branch, but have also tried other recent branches, so perhaps it is not the result of a recent change. Just a few follow-up questions:

It appears pretty clear from the message that this OOM (exit code 137) occurs within a node. Unfortunately, on the JVM side the execution is fairly monolithic, with most logic happening inside the native C++ implementation of LightGBM. Aside from the stack trace, were there any output logs that might indicate what the native code was doing? I'd be surprised if it was anything other than the dataset preparation step, as it converts the data from the JVM into the native format, but in the end I'm just guessing. In the remainder I'm going to proceed on the assumption that the guess is correct.

So just for laughs, would you mind repartitioning to 16 explicitly yourself, just to see if it's a simple dataset task imbalance issue? The reason I think this might be important is that the intermediate memory buffers constructed on the Scala side before passing into native code can get pretty large; the current Spark wrapper of LightGBM does not construct a LightGBM dataset in a streaming fashion. That said, I might expect 128 GB per task to easily be enough. But I wonder then, is the input data partitioned so it is evenly spread among those 16 workers? The wrapper does contain some repartition logic, but it is somewhat conservative in what it attempts, only ever reducing the number of partitions (as seen here).

Failing that, another thing that comes to mind is a relatively recent PR #1066, which you should have access to on that branch. It changes how the datasets are prepared and in some cases reduces the amount of memory used by the nodes. It is the
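For reference, a minimal PySpark sketch of the explicit repartitioning suggested above. It assumes `train` is the DataFrame being fed to the classifier (the variable name is taken from the snippet later in this issue); treat it as an illustration, not the project's recommended recipe.

```python
# Hedged illustration of the suggestion above: explicitly repartition the
# training DataFrame so rows are spread evenly across the 16 tasks before
# LightGBM's dataset preparation step runs.
train = train.repartition(16)

# Optional sanity check on the resulting partition count.
print(train.rdd.getNumPartitions())  # expect 16
```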
@trillville yes, could you please try setting the new parameter:
Thank you both for the suggestions! In my case
thanks again - everything is working for me :)
I am consistently getting errors like this at the reduce step while trying to train a lightgbm model:
dataset rows: 208,840,700
dataset features: 110
size: ~150GB
training code/params:
cluster config:
3x n2-highmem-16 workers (16 vCPUs + 128 GB memory each)
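As a rough back-of-envelope check of that sizing (assuming the feature matrix is materialized as dense 8-byte doubles before being handed to native LightGBM, which is an assumption rather than a confirmed detail of the wrapper):

```python
# Back-of-envelope estimate of the dense in-memory footprint of the feature
# matrix, using the row/feature counts reported above. The 8-byte-double
# density is an assumption; actual usage depends on the wrapper's buffers.
rows = 208_840_700
features = 110
bytes_per_value = 8

total_gb = rows * features * bytes_per_value / 1024**3
per_worker_gb = total_gb / 3  # 3 workers, assuming an even spread

print(f"~{total_gb:.0f} GB total, ~{per_worker_gb:.0f} GB per worker")
# ~171 GB total, ~57 GB per worker
```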
**Stacktrace**
When I do:
train = train.sample(withReplacement=False, fraction=0.25)
the job runs successfully. I'm kinda guessing that I could fix it by throwing more resources at the problem, but I would think my current cluster should be totally overkill given the dataset size. So far I've tried:
- useBarrierExecutionMode
- maxBin
- setting numTasks to a small number (3)
- the LightGBMClassifier specification

I am on Spark 3.0 and using com.microsoft.ml.spark:mmlspark:1.0.0-rc3-148-87ec5f74-SNAPSHOT
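Since the training code itself was not captured above, here is a hypothetical sketch of where the parameters listed above are typically set on MMLSpark's LightGBMClassifier in PySpark. The column names and the specific values are assumptions for illustration, not the configuration actually used in this issue:

```python
from mmlspark.lightgbm import LightGBMClassifier

# Hypothetical configuration showing where the parameters mentioned above go.
# Column names and values are placeholders, not the issue's actual settings.
classifier = LightGBMClassifier(
    featuresCol="features",
    labelCol="label",
    useBarrierExecutionMode=True,  # run LightGBM tasks in a barrier stage
    maxBin=63,                     # fewer histogram bins, smaller native dataset
    numTasks=3,                    # cap the number of LightGBM tasks
)

model = classifier.fit(train)  # `train` is the DataFrame described above
```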
Thank you!