Trees with linear models at leaves #3299
Conversation
… dataset having incompatible parameters.
@StrikerRUS @guolinke IMO this pull introduced a bug or unexpected behavior in the class method SizesInByte. It no longer returns data_.size() but AlignedSize(data_.size()). This is BAD if you are using it for allocation or for copying from a desired location in the data. In the past, SizesInByte was NEVER BIGGER than num_data_, but NOW in some cases it is. There should be some way of getting the original data_.size(). SizesInByte is used in FeatureGroupSizesInByte and should be the same as FeatureGroupData's get_data().size(), not the AlignedSize.
@ChipKerchner It seems that the AlignedSize(data_.size()) in SizesInByte was not introduced by this PR, but by #3415.
@shiyu1994 @guolinke
When will this go to pip, please? Really interested to give this a try!
@shiyu1994 - what was the reason for this? Does it support custom functions?
The next formal release is being tracked in #3872; you can subscribe to that for updates. We cannot give an exact date at this time, but you can see a list there of the work that still needs to be done.
For now, please feel free to install from the nightly wheels: https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html
@StrikerRUS I had only looked at the pip install from git described on the pip website itself, which broke because of "Windows build issues". I will give the wheels a spin!! Also, @jameslamb, I recognised your name but wasn't sure where from. I've just realised!! recent-developments-in-lightgbm. Excellent video, so thanks. Hopefully we get another one with 4.0 😉🤞 Thanks both/all 👏
Hi @btrotta, thanks for your work! Is it possible to access the coefficients and offsets of the linear models at each leaf? I did not find this information in e.g. the
Hi @spiralulam, thanks for using LightGBM. If you dump the tree model with
Thanks for the answer. Does that mean I basically have to parse a string to obtain this information using
Just checked the method that dumps a tree to JSON. Unfortunately, information about linear leaves is not handled there. So currently, parsing the model text file seems to be the only solution. Sorry for the inconvenience. This would be a very useful feature when using linear trees, and I believe we should provide direct access to linear model coefficients through the C++, Python, and R APIs.
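As a stopgap until direct API access exists, something like the following could pull the linear-leaf fields out of the text dump. This is a sketch only: the field names leaf_const, leaf_coeff, and leaf_features are assumptions about the text model format and should be checked against your own model_to_string() output.

```python
import lightgbm as lgb
import numpy as np

# Train a tiny linear-tree model (assumes a build that includes this PR).
X = np.random.rand(500, 4)
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * np.random.rand(500)
bst = lgb.train({"objective": "regression", "linear_tree": True, "verbose": -1},
                lgb.Dataset(X, label=y), num_boost_round=5)

# Scan the text dump for per-leaf linear-model fields.
# NOTE: "leaf_const", "leaf_coeff" and "leaf_features" are assumed names;
# verify them against the actual output of model_to_string().
for line in bst.model_to_string().splitlines():
    if line.startswith(("leaf_const=", "leaf_coeff=", "leaf_features=")):
        key, _, values = line.partition("=")
        print(key, values.split())
```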
That would be awesome, indeed.
Hi @btrotta, thanks for your work! Does the code allow using a subset of the features for the tree splits and a totally different subset to estimate the linear model at each leaf?
@cc22226 Thanks for using LightGBM. Currently, the linear models at the leaves consider all the numerical (non-categorical) features, and there's no parameter to control which features are used in the linear models and which are used in the splits. I think it would sometimes be nice to have these two sets of features separated. Maybe we can leave that as a feature request.
Thanks for your response.
Cesar
This pull request has been automatically locked because there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues, including a reference to this one.
Implements boosting for trees with linear models at the leaves (sometimes called M5 trees). This is a hybrid between traditional tree boosting and the model proposed in the paper Gradient Boosting with Piece-Wise Linear Regression Trees by Shi, Li, and Li (https://arxiv.org/pdf/1802.05640.pdf), which is mentioned in #1315. In this PR, the tree structure is created by finding the best split in the normal way, but then we calculate a linear model on each leaf. In contrast, in the paper, the splits are chosen by calculating the linear models for each potential split point, which is much more computationally intensive and would require more significant code changes in LightGBM. (The paper above actually mentions M5 trees, in Appendix D, but only tests existing slow implementations, which give poor results compared to their code. I think with the better implementation from this PR, M5 trees would come close to the performance of the fully-linear approach.)
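To make the idea concrete, here is a rough Python illustration (not the PR's actual C++ implementation) of the approach described above: grow a tree in the normal way, then replace each leaf's constant output with a least-squares linear model fitted to the samples in that leaf.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((1000, 3))
y = np.sin(3 * X[:, 0]) + X[:, 1]

# Step 1: find the tree structure with ordinary constant-leaf splitting.
tree = DecisionTreeRegressor(max_leaf_nodes=8).fit(X, y)
leaf_ids = tree.apply(X)

# Step 2: fit a linear model (with intercept) on the samples in each leaf.
leaf_models = {}
for leaf in np.unique(leaf_ids):
    mask = leaf_ids == leaf
    A = np.hstack([X[mask], np.ones((mask.sum(), 1))])
    coef, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
    leaf_models[leaf] = coef

# Predict: route each sample to its leaf, then apply that leaf's linear model.
A_all = np.hstack([X, np.ones((len(X), 1))])
pred = np.array([A_all[i] @ leaf_models[leaf_ids[i]] for i in range(len(X))])
print("MSE:", float(np.mean((pred - y) ** 2)))
```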
The running time of the linear-leaf model is around 10-20% more than traditional tree boosting (depending on the dataset, etc.), but it converges faster, so overall it gives a small improvement in training time (and can also achieve slightly better accuracy). Memory use is higher since we need to store the full feature data.
Regularisation can be controlled with the parameter linear_lambda; this is important because the linear-leaf model is more prone to overfitting than the traditional tree-boosting model. It's also important to scale the data before training so that all features have similar mean and standard deviation. The code uses parts of the Eigen library, licensed under the MPL2.
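For reference, a minimal usage sketch (assuming the parameter names linear_tree and linear_lambda, as in released versions), with the features standardised first as recommended above:

```python
import lightgbm as lgb
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features on very different scales.
X = rng.normal(size=(2000, 5)) * [1, 10, 100, 0.1, 1]
y = X[:, 0] + 0.02 * X[:, 2] + rng.normal(scale=0.1, size=2000)

# Standardise so all features have similar mean and standard deviation.
X_scaled = StandardScaler().fit_transform(X)

params = {"objective": "regression", "linear_tree": True,
          "linear_lambda": 0.1, "verbose": -1}
bst = lgb.train(params, lgb.Dataset(X_scaled, label=y), num_boost_round=50)
```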
I have only implemented this for Python. I think getting it working for R would require some changes to the data-loading interface, but I'm not very familiar with R, so maybe someone else would like to take that on.
Here is a test script to measure performance on the SUSY physics data (https://archive.ics.uci.edu/ml/datasets/SUSY).
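The original script is not reproduced in this extract; a rough sketch of such a comparison might look like the following (assuming SUSY.csv as downloaded from the UCI page, with the class label in the first column, and a hypothetical holdout split).

```python
import lightgbm as lgb
import pandas as pd

# SUSY.csv: ~5M rows, class label in column 0, 18 features after it.
data = pd.read_csv("SUSY.csv", header=None)
X, y = data.iloc[:, 1:].values, data.iloc[:, 0].values

# Simple holdout split; the split used in the original script is not shown here.
n_train = 4_000_000
dtrain = lgb.Dataset(X[:n_train], label=y[:n_train])
dvalid = lgb.Dataset(X[n_train:], label=y[n_train:], reference=dtrain)

# Compare traditional trees against linear-leaf trees on the same data.
for linear in (False, True):
    params = {"objective": "binary", "metric": "auc", "linear_tree": linear}
    bst = lgb.train(params, dtrain, num_boost_round=100, valid_sets=[dvalid])
```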
(The original PR attached the script output and two training graphs: the full training graph, and the same graph zoomed in on the y-axis to show the difference in convergence.)