
[Discussion] accuracy improvements #2790

Open
guolinke opened this issue Feb 21, 2020 · 10 comments

@guolinke
Collaborator

This is a call for accuracy improvements for tree-based algorithms, including but not limited to:

  • Better regularization methods that improve generalization abilities.
  • Better base tree learners, such as piece-wise linear trees, to improve the effectiveness of the base learner
  • Better objective functions for special tasks/metrics

If you have any ideas, you can discuss them with us here and open the corresponding issues/pull requests if needed.

@MaxHalford

There's an idea I've encountered in the literature that seems promising; it's called path smoothing.

Trees overfit because they are too deep and thus their leaves have too few observations in them. When a leaf has too few observations, the distribution of the target in that leaf is biased and unreliable. The common way to solve this is to restrict the maximum depth of the tree or to apply pruning.

Another less common way is to do path smoothing: for a given observation, calculate a weighted average over the nodes on the path from the root to the leaf. Each node thus contributes to the weighted average. Different weighting schemes are possible; they just have to capture the fact that a node's contribution should increase with its number of observations and its depth. What's nice is that this way, even if a leaf has few observations, its parents will correct the target distribution by smoothing it out. You can think of it as a Bayesian or hierarchical prior.
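
To make the idea concrete, here is a minimal, purely illustrative sketch of path smoothing at prediction time (the function and data layout are my own invention, not anything in LightGBM): each node on the root-to-leaf path carries its mean target value, sample count, and depth, and the prediction is a weighted average of those means.

```python
# Purely illustrative path-smoothing sketch (not LightGBM code).
# Each entry on the root-to-leaf path carries the node's mean target value,
# its number of training observations, and its depth.

def smooth_path_prediction(path, weight_fn):
    """Blend the per-node means along a root-to-leaf path.

    path      : list of (node_mean, n_samples, depth) tuples, root first
    weight_fn : maps (n_samples, depth) to a non-negative weight
    """
    weights = [weight_fn(n, d) for _, n, d in path]
    total = sum(weights)
    if total == 0:
        return path[0][0]  # fall back to the root mean
    return sum(w * m for w, (m, _, _) in zip(weights, path)) / total


# Example: weight nodes by their sample counts, so a sparsely populated
# leaf is pulled toward its better-estimated ancestors.
path = [(0.10, 1000, 0), (0.25, 40, 1), (0.90, 2, 2)]
print(smooth_path_prediction(path, lambda n, d: n))  # ~0.107, not 0.90
```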

@btrotta
Collaborator

btrotta commented Mar 14, 2020

@MaxHalford sounds interesting, do you have a reference for this?

@MaxHalford

Here are some papers which I have managed to find:

On a side note, I've been working on a scheme of my own that's giving me very good results and is easy to understand. What I do is make each node along the path contribute with a weight of n_samples * (1 + beta) ** depth, where beta is a positive number. Intuitively, the contribution of a node increases with the number of samples and the depth. The beta parameter controls how much the depth matters. If a leaf contains no samples, then its contribution to the weighted average is 0. In that case the other nodes higher up in the tree will be used to estimate the target value.

When I use this scheme, I can increase the maximum depth as much as I want: I never overfit. The path averaging acts as a regularizer.
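
As a purely illustrative sketch, plugging that weighting into the path-smoothing snippet above only changes the weight function (the helper name and the example beta are hypothetical):

```python
# Hypothetical weight function for the scheme above: the contribution grows
# with the node's sample count and, through beta, with its depth.
def depth_boosted_weight(n_samples, depth, beta=0.5):
    return n_samples * (1 + beta) ** depth

# Reusing smooth_path_prediction from the earlier sketch:
path = [(0.10, 1000, 0), (0.25, 40, 1), (0.90, 2, 2)]
print(smooth_path_prediction(path, depth_boosted_weight))
```

A leaf containing no samples gets weight 0, so the nodes higher up the path determine the prediction, which is exactly the regularizing behaviour described above.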

@zkurtz
Contributor

zkurtz commented May 8, 2020

Hierarchically clustered observations (such as data that involve repeated measurements of individual subjects) present a challenge for GBMs. The worst case is when all other predictors are constant within-subject. Including the subject ID as a predictor leads to all other predictors being ignored, making it impossible to predict on a new subject. Even if you exclude the subject ID from the prediction features, the GBM can overfit on the other features, identifying subjects implicitly with deep trees, effectively overfitting to individuals and ignoring the population-level correlations that are needed to generalize to new subjects. Here's a very nice discussion of the issue with links to numerous recent proposals that could be adapted for LightGBM.

UPDATE: This looks promising and already builds on LightGBM.

@wmiller01

I think the concept of a "soft" decision tree is worth implementing. Soft decision trees learn using recursive partitioning, but instead of generating hard split points, each node probabilistically assigns its children to the left or right path. As a result, soft trees learn smooth/continuous functions of the data as opposed to jagged ones.

The problem this solves is that ordinary decision trees with hard splits generate most of their splits in the dense areas of the input space. When an observation from a sparser part of the space is fed into the model, it typically falls into a very wide region with a constant prediction. Probabilistic splitting results in a continuous function that does not have a constant prediction in sparse space, and therefore is likely to generalize better. The literature also suggests that soft trees generate fewer splits to achieve comparable levels of accuracy, leading to more efficient models.

Here are some resources I've found on this concept.
https://arxiv.org/abs/1711.09784
https://www.cmpe.boun.edu.tr/~ethem/files/papers/icpr2012_softtree.pdf
http://www.cs.cornell.edu/~oirsoy/softtree.html
https://towardsdatascience.com/building-a-decision-tree-in-tensorflow-742438cb483e
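
For illustration only, here is a rough sketch of the soft-split idea (not the exact formulation of any of the papers above; the sigmoid gating and the gamma parameter are assumptions): each internal node routes an observation left with a probability given by a sigmoid of its signed distance to the threshold, and the prediction is the probability-weighted average of all leaf values.

```python
import math

# Illustrative soft decision tree: instead of a hard comparison x[f] <= t,
# each internal node emits a probability of going left via a sigmoid, and
# every leaf contributes to the output in proportion to the probability of
# reaching it.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_predict(node, x, gamma=5.0):
    """node is {'value': v} for a leaf, or
    {'feature': f, 'threshold': t, 'left': ..., 'right': ...} otherwise."""
    if 'value' in node:
        return node['value']
    p_left = sigmoid(gamma * (node['threshold'] - x[node['feature']]))
    return (p_left * soft_predict(node['left'], x, gamma)
            + (1.0 - p_left) * soft_predict(node['right'], x, gamma))

tree = {
    'feature': 0, 'threshold': 0.5,
    'left': {'value': 1.0},
    'right': {'value': -1.0},
}
# Near the threshold the prediction varies smoothly instead of jumping.
for x0 in (0.2, 0.45, 0.5, 0.55, 0.8):
    print(x0, round(soft_predict(tree, [x0]), 3))
```

The gamma parameter controls how sharp each split is; as it grows, the soft tree approaches an ordinary hard-split tree.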

@xuyxu

xuyxu commented Jun 24, 2020

Hi, @wmiller01, I have been playing with soft decision trees (SDT) for a while, and I don't think it is a good idea to include it in LightGBM or any other scalable GBDT system.
Pros:

  • SDT is more powerful than traditional trees like CART. For example, using 10 SDTs in the vanilla gradient boosting algorithm leads to better performance than LightGBM / XGBoost with hundreds of base decision trees (n_estimators * n_classes).

Cons:

  • SDT looks more like a neural network with a tree structure. Its training and inference costs are quite large. For example, the number of leaf nodes in an SDT increases exponentially with the tree depth.
  • Soft splitting in SDT (e.g., via logistic regression) leads to huge maintenance costs, as it is totally different from the splitting criterion in GBDT, which uses gradients and Hessians.

@guolinke
Collaborator Author

guolinke commented Aug 6, 2020

For the SDT, I agree with @AaronX121.
Besides efficiency, I think the learning of SDT is also very different from that of DT.
A DT is learned by greedy feature selection, while an SDT is learned more like a neural network, by back-propagation.
Greedy feature selection is a great strength of GBDT. With it, GBDT is very robust on all kinds of tabular data.
Replacing it with SDT may not outperform the original GBDT, just as NNs cannot outperform GBDT on many tabular datasets.

@deadsoul44

..., just as NNs cannot outperform GBDT on many tabular datasets.

Off topic, but is there any reference showing that NNs cannot outperform GBDT on many tabular datasets? Thank you.

@onacrame

Hierarchically clustered observations (such as data that involve repeated measurements of individual subjects) present a challenge for GBMs. The worst case is when all other predictors are constant within-subject. Including the subject ID as a predictor leads to all other predictors being ignored, making it impossible to predict on a new subject. Even if you exclude the subject ID from the prediction features, the GBM can overfit on the other features, identifying subjects implicitly with deep trees, effectively overfitting to individuals and ignoring the population-level correlations that are needed to generalize to new subjects. Here's a very nice discussion of the issue with links to numerous recent proposals that could be adapted for LightGBM.

UPDATE: This looks promising and already builds on LightGBM.

This looks like an excellent package, and it would be good to see similar functionality within LightGBM itself. It appears capable of dealing with spatial data as well as panel data. Panel data has always been one of the weaknesses of trees due to within-group correlation, and this seems to deal with that explicitly.

@shiyu1994
Collaborator

@onacrame Added that to the feature requests & voting hub (#2302)
