[Discussion] accuracy improvements #2790
Comments
There's an idea I've encountered in the literature that seems promising: it's called path smoothing. Trees overfit because they are too deep and their leaves contain too few observations. When a leaf has too few observations, the distribution of the target in that leaf is biased and unreliable. The common way to address this is to restrict the maximal depth of the tree, or to apply pruning. Another, less common, way is path smoothing: for a given observation, compute a weighted average over the nodes on the path from the root to the leaf. Each node thus contributes to the weighted average. Different weighting schemes are possible; they just have to capture the fact that a node's contribution should increase with its number of observations and its depth. What's nice is that this way, even if a leaf has few observations, its parents will correct the target distribution by smoothing it out. You can think of it as a Bayesian or hierarchical prior.
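(For illustration only, here is a minimal sketch of one possible path-smoothing scheme, not anything implemented in LightGBM: each node's value is recursively shrunk towards its parent, with a hypothetical `smoothing` strength controlling the shrinkage.)

```python
def smoothed_leaf_value(path_values, path_counts, smoothing=10.0):
    """Path smoothing sketch: shrink each node's value towards its parent.

    path_values: raw target means of the nodes from root to leaf.
    path_counts: number of training observations in each of those nodes.
    smoothing:   hypothetical shrinkage strength (larger = smoother).
    """
    smoothed = path_values[0]  # the root is taken at face value
    for value, n in zip(path_values[1:], path_counts[1:]):
        # A node with few observations trusts its parent more; a node with
        # many observations keeps most of its own estimate.
        smoothed = (n * value + smoothing * smoothed) / (n + smoothing)
    return smoothed

# A leaf holding only 3 observations is pulled back towards its ancestors.
print(smoothed_leaf_value(path_values=[0.40, 0.55, 0.90],
                          path_counts=[1000, 120, 3]))  # ~0.62 instead of 0.90
```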
@MaxHalford sounds interesting, do you have a reference for this?
Here are some papers which I have managed to find:
On a side-note, I've been working on a scheme of my own that's giving me very reasonable results and is easy to understand. What I do is make each node along the path contribute with a weight. When I use this scheme, I can virtually increase the maximum depth as much as I want: I never overfit. The path averaging acts as a regularizer.
Hierarchically clustered observations (such as data that involve repeated measurements of individual subjects) present a challenge for GBMs. The worst case is when all other predictors are constant within-subject. Including the subject ID as a predictor leads to all other predictors being ignored, making it impossible to predict on a new subject. Even if you exclude the subject ID from the prediction features, the GBM can overfit on the other features, identifying subjects implicitly with deep trees, effectively overfitting to individuals and ignoring the population-level correlations that are needed to generalize to new subjects. Here's a very nice discussion of the issue with links to numerous recent proposals that could be adapted for LightGBM. UPDATE: This looks promising and already builds on LightGBM.
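(Not the linked proposal itself, but a small sketch of one common mitigation on synthetic data: group the validation folds by subject ID so the model is always scored on unseen subjects, which exposes any implicit memorisation of individuals. The data and column layout here are made up for illustration.)

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import GroupKFold, cross_val_score

# Synthetic repeated-measures data: 50 subjects, 20 measurements each.
rng = np.random.default_rng(0)
n_subjects, n_repeats = 50, 20
subject_id = np.repeat(np.arange(n_subjects), n_repeats)
X = rng.normal(size=(n_subjects * n_repeats, 5))
subject_effect = rng.normal(size=n_subjects)[subject_id]  # within-subject correlation
y = X[:, 0] + 0.5 * X[:, 1] + subject_effect + rng.normal(scale=0.1, size=len(subject_id))

model = lgb.LGBMRegressor(n_estimators=200, num_leaves=31)

# Grouping the folds by subject means every validation subject is unseen,
# so a model that has implicitly memorised individuals scores poorly here.
scores = cross_val_score(model, X, y, cv=GroupKFold(n_splits=5),
                         groups=subject_id, scoring="r2")
print(scores.mean())
```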
I think the concept of a "soft" decision tree is worth implementing. Soft decision trees learn using recursive partitioning, but instead of generating hard split points, each node probabilistically routes observations to its left or right child. As a result, soft trees learn smooth/continuous functions of the data as opposed to jagged ones. The problem this solves is that ordinary decision trees with hard splits generate most of their splits in the dense areas of the input space. When an observation from a sparser part of the space is fed into the model, it typically falls into a very wide region with a constant prediction. Probabilistic splitting results in a continuous function that does not have a constant prediction in sparse regions, and is therefore likely to generalize better. The literature also suggests that soft trees generate fewer splits to achieve comparable levels of accuracy, leading to more efficient models. Here are some resources I've found on this concept.
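(For illustration, here is a minimal numpy sketch of soft routing in a tiny depth-2 tree over one feature; the structure, thresholds, and `temperature` knob are all made-up assumptions, not a LightGBM feature.)

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_tree_predict(x, splits, leaf_values, temperature=1.0):
    """Predict with a tiny depth-2 soft tree on a single feature.

    splits:      thresholds at the root and its two internal children.
    leaf_values: outputs of the four leaves, left to right.
    temperature: hypothetical sharpness; large values approach a hard tree.
    """
    root, left_child, right_child = splits
    p_right_root = sigmoid(temperature * (x - root))
    p_right_left = sigmoid(temperature * (x - left_child))
    p_right_right = sigmoid(temperature * (x - right_child))

    # Each leaf's weight is the product of routing probabilities along its path,
    # so the prediction is a smooth mixture rather than a piecewise constant.
    w = np.stack([(1 - p_right_root) * (1 - p_right_left),
                  (1 - p_right_root) * p_right_left,
                  p_right_root * (1 - p_right_right),
                  p_right_root * p_right_right])
    return np.tensordot(np.asarray(leaf_values), w, axes=1)

xs = np.linspace(-3, 3, 7)
print(soft_tree_predict(xs, splits=(0.0, -1.0, 1.0),
                        leaf_values=[0.0, 0.3, 0.7, 1.0], temperature=2.0))
```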
Hi, @wmiller01, I have been playing with soft decision trees (SDTs) for a while, and I don't think it is a good idea to include them in LightGBM or any other scalable GBDT system.
Cons:
For the SDT, I agree with @AaronX121.
Off topic, but is there any reference showing that NNs cannot outperform GBDTs on many tabular datasets? Thank you.
This looks like an excellent package, and it would be good to see similar functionality within LightGBM itself. It looks like it's capable of dealing with spatial data as well as panel data. Panel data has always been one of the weaknesses of trees due to within-group correlation, and this seems to deal with that explicitly.
This is a call for accuracy improvements for tree-based algorithms, including but not limited to:
If you have any ideas, you can discuss them with us here, and open the corresponding issues/pull requests if needed.