Running fit method on LGBMRegressor kills Jupyter Kernel #4301
Thanks for using LightGBM! We will look at this as soon as possible. Are you able to try this example while monitoring memory usage, maybe with a tool like…
Hi @jameslamb, thank you for the feedback. In fact, I've checked top/htop and didn't spot anything suspicious. Moreover, when I have a memory issue, the system usually throws an unambiguous MemoryError without killing the Jupyter kernel; that doesn't seem to be the case this time. We have a hypothesis that the issue is caused by the large number of duplicated rows in the dataset: there are about 15k duplicates, while the total number of rows is around 16k. Anyway, I've tried to add a try/except block and catch LightGBMError, but with no luck. Best regards, thank you.
Ok, I pulled this today and was able to reproduce the issue. This is all I see in my Jupyter logs, and no output is printed in the notebook itself.

I reproduced this with the following code, run in a notebook created from the steps at https://github.com/jameslamb/lightgbm-dask-testing/blob/ededb8491c999fa4d9babfbb6749c8e399da2b76/README.md:

```python
import lightgbm
import numpy as np
import pandas as pd
import requests
import zipfile
from io import BytesIO

data_url = "https://github.com/microsoft/LightGBM/files/6508547/weird.zip"
zipdata = BytesIO()
zipdata.write(requests.get(data_url, headers={"Accept": "application/octet-stream"}).content)
zip_contents = zipfile.ZipFile(zipdata)
data_file = zip_contents.open('weird.pkl')
test = pd.read_pickle(data_file)

lightgbm.LGBMRegressor().fit(test.drop(columns=['y']), test['y'])
```

I installed LightGBM v3.2.1 from source:

```shell
git fetch --tags
git checkout v3.2.1
cd python-package
python setup.py install
```

We will look into this! Here are a few theories I tested:

"Maybe the error is happening in Dataset construction" --> probably not

I tried constructing a `Dataset` directly:

```python
ds = lightgbm.Dataset(
    data=test.drop(columns=['y']),
    label=test['y']
)
ds.construct()
```

"Maybe the error is specific to regression" --> no

I also tried switching the problem to classification, to see if we could narrow the problem down to code paths related to regression specifically. That ended up killing the kernel exactly like the original example code did.

```python
lightgbm.LGBMClassifier(verbose=1).fit(
    test.drop(columns=['y']),
    test['y'] > np.median(test['y'])
)
```

"Maybe the error is specific to GBDT boosting" --> no

I tried switching to random forest boosting; that also ended up killing the kernel exactly like the original example code did.

```python
lightgbm.LGBMRegressor(verbose=1, boosting='rf', bagging_freq=1, bagging_fraction=0.5).fit(
    test.drop(columns=['y']),
    test['y']
)
```

Tonight or tomorrow, I'll try adding more debugging log statements to the library to see if we can narrow this down.
I just tried these examples on latest…
Hi @jameslamb, thank you for your thorough analysis. Some observations from my side: I've cut the dataset down, and while there are still a lot of duplicates, there is no crash. So this is very strange behavior, but indeed I also have a strong suspicion of a "split problem".
Hi @jameslamb, I've also noticed that the issue is not specific to Jupyter Notebook. I put everything in a `Weird.py` file:

```python
import pandas as pd

test = pd.read_pickle('weird.pkl')
```

Then I launched it via the command line, and my program ended up with only "before" being printed out. Thanks!
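That behavior is consistent with a hard crash inside native (C++) code rather than a Python exception: an abort at the C level kills the process outright, so a `try/except` never runs and only output flushed before the crash appears. A minimal stdlib-only sketch of this effect, using `os.abort()` as a stand-in for a crash in a native library (not LightGBM itself):

```python
import subprocess
import sys

# Run a child process that aborts inside a try/except block. The except
# clause never executes, because abort() terminates the process at the C
# level, bypassing Python exception handling entirely.
code = """
import os
try:
    print("before", flush=True)
    os.abort()          # stand-in for a crash inside native library code
    print("after", flush=True)
except Exception:
    print("caught", flush=True)
"""
result = subprocess.run([sys.executable, "-c", code], capture_output=True, text=True)
print(result.stdout)        # only "before" is printed
print(result.returncode)    # nonzero: the process was killed, not cleanly exited
```

This is the same signature as the report above: "before" prints, nothing else does, and no exception is catchable from Python.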
I've tried a few more things and have narrowed this down a bit further, but I'm still not sure exactly where the problem is. I created a branch with a LOT more logging on my fork. You can see the logs I've added at https://github.com/microsoft/LightGBM/compare/master...jameslamb:louder-logs?expand=1.

Given this script (notice I've added `verbose=1`):

```python
# test.py
import zipfile
from io import BytesIO

import lightgbm
import pandas as pd
import requests

data_url = "https://github.com/microsoft/LightGBM/files/6508547/weird.zip"
zipdata = BytesIO()
zipdata.write(requests.get(data_url, headers={"Accept": "application/octet-stream"}).content)
zip_contents = zipfile.ZipFile(zipdata)
data_file = zip_contents.open("weird.pkl")
test = pd.read_pickle(data_file)

lightgbm.LGBMRegressor(verbose=1).fit(test.drop(columns=["y"]), test["y"])
```

and LightGBM built from that branch I linked to above, I see several different outcomes when running it.

Outcome 1: best split results in a leaf node with 0 records

Sometimes, the script ends with this error,
which comes from
full log
Outcome 2: training succeeds

Sometimes, training succeeds without error.

full log
Outcome 3: Unable to initialize a
This comes from LightGBM/src/treelearner/serial_tree_learner.cpp Lines 28 to 72 in da3465c
full log
Ok, I added some additional fine-grained logs to…

Tomorrow (if no one else gets to it sooner), I can try adding more logs to narrow it down further.
Ok, I have some more specific logs. Sometimes training fails here: Lines 644 to 645 in da3465c
full logs
And sometimes it fails at Line 682 in da3465c
full logs
@shiyu1994 could you look at the discussion in this issue and let me know if you have any ideas for things to test? I'm not sure what to test next.
@jameslamb The error happens in…
Actually, though the problem occurs in

Lines 146 to 152 in 346f883

here we intend to find a value between the two numbers `upper_bounds[i] = -0.00112051336219505895` and `lower_bounds[i + 1] = -0.00112051336219505852`, so that the two numbers will be divided into two different bins. Note that both `upper_bounds[i]` and `lower_bounds[i + 1]` come from the distinct feature values of a feature.

But unfortunately, due to the discrete nature of floating-point numbers, the computed boundary cannot always separate two such close values, which is how empty bins arise.
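The floating-point limitation can be demonstrated with a short stdlib-only sketch: when two doubles are adjacent, there is no representable value strictly between them, so any computed "midpoint" collapses onto one of the endpoints instead of separating them into different bins:

```python
import math

a = 1.0
b = math.nextafter(a, math.inf)   # the very next representable double after a
mid = (a + b) / 2.0

# There is no double strictly between two adjacent doubles, so the
# midpoint rounds back to one of the endpoints.
print(mid == a or mid == b)   # True
```

The two bin boundaries quoted above differ by only a couple of ulps, so the same collapse can happen there.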
The code above assumes by default that if `distinct_values[i] > bin_upper_bound_[i_bin]`, then we must have `distinct_values[i] <= bin_upper_bound_[i_bin + 1]`; in other words, that by incrementing `i_bin` by 1 we can always reach the correct bin for this `distinct_values[i]`. But as mentioned above, since it is possible to get some empty bins, this assumption is wrong, so we may get a wrong count in `cnt_in_bin`.
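To illustrate the miscount, here is a hypothetical stdlib-only sketch (not LightGBM's actual C++ code): when a bin is empty, advancing the bin index by exactly one step drops the value into the wrong bin, while a `while` loop that skips past empty bins counts correctly:

```python
def count_in_bin_buggy(distinct_values, counts, bin_upper_bound):
    # Advances the bin index one step at a time, assuming every value
    # falls into the very next bin -- wrong when a bin is empty.
    cnt_in_bin = [0] * len(bin_upper_bound)
    i_bin = 0
    for v, c in zip(distinct_values, counts):
        if v > bin_upper_bound[i_bin]:
            i_bin += 1
        cnt_in_bin[i_bin] += c
    return cnt_in_bin

def count_in_bin_fixed(distinct_values, counts, bin_upper_bound):
    # Skips over any empty bins until the correct bin is found.
    cnt_in_bin = [0] * len(bin_upper_bound)
    i_bin = 0
    for v, c in zip(distinct_values, counts):
        while v > bin_upper_bound[i_bin] and i_bin < len(bin_upper_bound) - 1:
            i_bin += 1
        cnt_in_bin[i_bin] += c
    return cnt_in_bin

# Bin 1 is empty: no distinct value falls in (0.5, 1.5].
values = [0.0, 2.0]
counts = [10, 5]
bounds = [0.5, 1.5, 3.0]
print(count_in_bin_buggy(values, counts, bounds))  # [10, 5, 0] -- 5 counted in the wrong bin
print(count_in_bin_fixed(values, counts, bounds))  # [10, 0, 5]
```

The loop bounds and variable names here are illustrative; only the one-step-vs-while distinction mirrors the bug described above.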
So far, it might seem the problem is not very serious, because

Lines 506 to 507 in 346f883
But if we wrongly calculate `most_freq_bin_`, even if it differs only by 1, trouble may follow, because the sparse rate is used to estimate the number of feature bins other than `most_freq_bin_` (`estimate_total_entries`) in the whole training data:

Lines 696 to 735 in 346f883
which further decides the type of the integers (`uint16_t`, `uint32_t`, or `uint64_t`) used in the row pointer of the sparse data representation of `MultiValSparseBin`. (Note that in the sparse representation, we treat `most_freq_bin_` as missing to minimize the memory cost.)
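The danger of picking an integer width from an underestimate can be sketched in a few lines of Python (a hypothetical illustration using modular arithmetic to mimic fixed-width unsigned integers, not LightGBM's actual code):

```python
def row_ptr_modulus(estimated_total_entries):
    # Pick the narrowest unsigned type that fits the *estimate*,
    # mirroring a uint16/uint32/uint64 choice.
    if estimated_total_entries < 2**16:
        return 2**16
    elif estimated_total_entries < 2**32:
        return 2**32
    return 2**64

modulus = row_ptr_modulus(60_000)   # estimate says 16 bits are enough
actual_total = 70_000               # the real number of entries is larger
stored = actual_total % modulus     # what a uint16 row pointer would hold
print(stored)  # 4464 -- the count silently wrapped around
```

A wrapped row pointer then indexes the wrong memory, which is consistent with a hard crash rather than a Python-level error.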
In the example above, the…

This is a very rare case: it only happens when there are two extremely close feature values, and these two values happen to be near the most frequent bin. But it is still worth noting for the robustness of LightGBM. Thanks @VadimLopatin for reporting this!

The fix replaces

Line 414 in 346f883

with

```cpp
while (distinct_values[i] > bin_upper_bound_[i_bin] && i_bin < num_bin_ - 1) {
```

Sorry for the long description, but I think it is valuable to make the problem clear for future reference.
@VadimLopatin thank you again for the bug report and reproducible example. This is now fixed on this project's…
Hello @jameslamb, @shiyu1994, many thanks for your analysis and quick fix!
Visual Studio Code (1.79.2, undefined, desktop)

```
07:58:29.872 [info] Dispose Kernel process 16359.
07:59:34.256 [info] Dispose Kernel process 16419.
07:59:55.179 [info] Dispose Kernel process 16499.
08:02:32.629 [info] Dispose Kernel process 16720.
08:04:04.331 [info] Dispose Kernel process 16853.
```
@dhirajpatra There is nothing we can do with just a dump of a ton of logs like that. I'm not even sure how you drew the conclusion that your issue is the same as this one. I'm going to lock this issue. If you'd like some help, please open a new issue at https://github.com/microsoft/LightGBM/issues, including a minimal, reproducible example with enough information for someone to help you. That would include:
Hello team,

I'm facing weird behavior of LGBMRegressor: sometimes it causes the Jupyter kernel to die on certain datasets. I was able to minimize the test case. The code itself is pretty simple:

```python
import pandas as pd
import numpy as np
import lightgbm

test = pd.read_pickle('weird.pkl')
lightgbm.LGBMRegressor().fit(test.drop(columns=['y']), test['y'])
```

In order to reproduce, you will need to run it on this specific dataset: weird.zip

Versions: pandas 1.1.5, numpy 1.19.5, lightgbm 3.2.1. I was able to reproduce on both Windows and Ubuntu.

Command(s) used to install LightGBM: `pip install lightgbm`

Could you please advise how it would be possible to tackle this?

Thank you!