[R-package] Using forcedsplits parameter causes wild inaccuracies and crashes #4591

Closed
Sinnombre opened this issue Sep 3, 2021 · 14 comments

Comments

@Sinnombre

Sinnombre commented Sep 3, 2021

Description

Maybe I don't understand the function of this parameter but I am having a great deal of trouble using it.

I'm working with forecasting sales using LightGBM in R. In the data I have (which unfortunately I am unable to share), the overwhelming majority of items sell 0-1 per week, with about 0.3% outliers with weekly sales averaging >20, some going into the 1000s. I observed that separating the data into three training runs, for high, medium and low performers, resulted in substantially better accuracy. From my understanding of LightGBM, training three separate models based on one feature like this should be equivalent to forcing the first two steps of the decision trees to split based on that feature, so I looked into this and found the forcedsplits_filename parameter.

However, whenever I use forcedsplits_filename, I get a huge number of warnings, frequent crashes and, even when it works, incredibly inaccurate results.

I've reproduced the crash with the example code below. The error message is:

Error in lgb.call(fun_name = "LGBM_BoosterUpdateOneIter_R", ret = NULL, :
[LightGBM] [Fatal] Check failed: (best_split_info.right_count) > (0) at treelearner/serial_tree_learner.cpp, line 663 .

I have determined that the crashes only occur when I include a categorical feature with high cardinality which tracks closely to the parameter I'm forcing splits on (specifically the item ID), so my theory is that certain splits of this feature contain no data with values above the threshold, causing a tree split taking both the ID feature and the forced split feature into account to result in a branch with no samples.

Searching this forum I've found several other issues around this error, but those seem to have been resolved with the latest version and were unconnected to forcedsplits.

Reproducible example

library(data.table)
library(lightgbm)

set.seed(1)
ID = c(rep(1,5),rep(2,5),rep(3,5),rep(4,5),rep(5,5))
val = c(runif(15),runif(5)*10,runif(5)*100)
data = data.table(ID,val)
data[,mean_val := mean(val),by='ID']

data_valid = data[c(1,6,11,16,21)]  # yeah my validation set is a subset of the training set. It's a demo ;)

features = c("ID","mean_val")

train_ds = lgb.Dataset(data.matrix(data[,features, with = FALSE]), 
                        label = data[["val"]])
lgb.Dataset.set.categorical(train_ds, c("ID"))

valid_ds = lgb.Dataset(data.matrix(data_valid[,features, with = FALSE]), 
                       label = data_valid[["val"]])
lgb.Dataset.set.categorical(valid_ds, c("ID"))

i = which(features == 'mean_val') - 1

string = paste0('{
  "feature": ',i,',
  "threshold": 5,
  "right": {
    "feature": ',i,',
    "threshold": 30
  }
}')

write(string,file="split_test.json")

params = list(objective = "tweedie",
              metric = "rmse",
              boosting = "goss",
              learning_rate =0.01,
              num_leaves = 20,
              min_data_in_leaf = 5
              ,forcedsplits_filename = "split_test.json"
              )

fit <- lgb.train(params, train_ds, num_boost_round = 20, 
                 eval_freq = 5, early_stopping_rounds = 5, 
                 valids = list(valid = valid_ds), verbose = 1)

OUTPUTS:

[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000321 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 12
[LightGBM] [Info] Number of data points in the train set: 25, number of used features: 2
[LightGBM] [Info] Using GOSS
[LightGBM] [Info] Start training from score 2.343786
Error in lgb.call(fun_name = "LGBM_BoosterUpdateOneIter_R", ret = NULL,  : 
  [LightGBM] [Fatal] Check failed: (best_split_info.right_count) > (0) at treelearner/serial_tree_learner.cpp, line 663 .
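The theory above, that some combinations of ID and forced-split branch contain no rows, can be checked on the demo data without training anything. The following is a minimal base R sketch that rebuilds the demo data with the same seed and tabulates rows per ID and per branch of the forced splits. The name `branch` is introduced here for illustration, and sending values equal to a threshold to the left branch is an assumption about how the thresholds are applied.

```r
set.seed(1)
ID  <- rep(1:5, each = 5)
val <- c(runif(15), runif(5) * 10, runif(5) * 100)
mean_val <- ave(val, ID)  # per-ID mean, same as the data.table step

# Assign each row to the branch implied by the forced splits
# (threshold 5 at the root, threshold 30 on the right child).
branch <- cut(mean_val, breaks = c(-Inf, 5, 30, Inf),
              labels = c("<=5", "(5,30]", ">30"))

# Rows per (ID, branch); zeros mark combinations with no data.
counts <- table(ID = ID, branch = branch)
print(counts)
```

Because mean_val is constant within an ID, each ID lands entirely in one branch, so every other (ID, branch) cell is empty. That is consistent with the theory that a split on ID inside a forced branch can leave one side without samples, though it does not by itself prove that this is what triggers the check failure.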

Environment info

R version 4.0.4 (2021-02-15)
R Studio version 1.4.1717
lightgbm version 3.2.1

Additional Comments

In addition to the crash, in my main code (the data for which I cannot share) I also get frequent instances of the warnings:

[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements
[LightGBM] [Warning] 'Forced Split' will be ignored since the gain getting worse.

Even with the forced split, I'm not sure how it's finding best gains of -inf? I'm pretty sure the other two warnings follow from this issue.

And finally, even when the crash doesn't occur, I find results which are orders of magnitude off. I have not been able to replicate this with a simple example but in one run I found:

fit <- lgb.train(params, train_ds, num_boost_round = 5, 
                 eval_freq = 1, early_stopping_rounds = 2, 
                 valids = list(valid = valid_ds), verbose = 1)

-- WITH forcedsplits_filename parameter

[LightGBM] [Info] Total Bins 64084
[LightGBM] [Info] Number of data points in the train set: 850569, number of used features: 27
[LightGBM] [Info] Using GOSS
[LightGBM] [Info] Start training from score 3.452928
[1] "[1]:  valid's rmse:7.28253e+10"
[1] "[2]:  valid's rmse:7.1417e+10"
[1] "[3]:  valid's rmse:7.00028e+10"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[4]:  valid's rmse:7.00028e+10"
[LightGBM] [Warning] 'Forced Split' will be ignored since the gain getting worse.
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[5]:  valid's rmse:7.1417e+10"
[1] "Test Set Error = 194807540182.268"

-- WITHOUT forcedsplits_filename parameter

[LightGBM] [Info] Total Bins 64084
[LightGBM] [Info] Number of data points in the train set: 850569, number of used features: 27
[LightGBM] [Info] Using GOSS
[LightGBM] [Info] Start training from score 3.452928
[1] "[1]:  valid's rmse:106.121"
[1] "[2]:  valid's rmse:105.897"
[1] "[3]:  valid's rmse:105.676"
[1] "[4]:  valid's rmse:105.458"
[1] "[5]:  valid's rmse:105.242"
[1] "Test Set Error = 31.1652060115114"

I don't understand why the errors are so huge. The largest label in the training data is 1729 and none are negative, so even if every leaf returned 1729 the worst RMSE should be somewhat less than that; how can any leaf in a decision tree have a value 8 orders of magnitude higher than any actual label? And why is this happening when I simply add a single forced split?
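One factor that may be relevant to the question above, though it is not confirmed as the cause in this thread: LightGBM documents its tweedie objective as using a log link, so leaf values live on the log scale and a prediction is exp() of the summed raw score. The sketch below is plain arithmetic under that assumption, using only the numbers reported above. It does not reproduce the run.

```r
# Largest label and the first validation RMSE reported above.
max_label     <- 1729
reported_rmse <- 7.28253e10

# If every prediction stayed within [0, max_label], no single error could
# exceed max_label, so the RMSE could not exceed it either.
rmse_upper_bound <- max_label

# Under a log link, prediction = exp(raw_score). Raw scores that would
# produce the largest label, and a prediction as large as the reported RMSE:
raw_for_max_label <- log(max_label)
raw_for_reported  <- log(reported_rmse)

cat("raw score for the largest label:", raw_for_max_label, "\n")
cat("raw score for a 7.28e10 prediction:", raw_for_reported, "\n")
cat("ratio of the two raw scores:", raw_for_reported / raw_for_max_label, "\n")
```

On the raw scale the two values differ by a small factor rather than by many orders of magnitude, so under this assumption a few badly estimated leaf values could be enough to produce predictions far above any label.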

@jameslamb jameslamb changed the title Using forcedsplits parameter causes wild inaccuracies and crashes [R-package] Using forcedsplits parameter causes wild inaccuracies and crashes Sep 4, 2021
@jameslamb
Collaborator

Thanks very much for using {lightgbm} and for the thorough write-up. I'll look into this shortly.

I've edited the formatting of your original post to make it a bit easier to read. If you are new to GitHub, please consider reading through https://docs.github.com/en/github/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax to learn how to use GitHub-flavored markdown to format posts here.

@jameslamb
Collaborator

@Sinnombre can you please try updating to the latest version of {lightgbm} on master of this repo?

git clone --recursive https://github.com/microsoft/LightGBM.git
cd LightGBM
sh build-cran-package.sh
R CMD INSTALL lightgbm_3.2.1.99.tar.gz

I tried your reproducible example (thanks very much for providing that!!) and found that for {lightgbm} 3.2.1 installed from CRAN, I also get the error

[Fatal] Check failed: (best_split_info.right_count) > (0)

When I use the version of the R package built from master, I do not get such an error. Training succeeds.

There have been a lot of fixes to LightGBM since 3.2.1 was released in April. I recommend subscribing to #4310 to be notified when release 3.3.0 comes out. We apologize for the inconvenience.


[LightGBM] [Warning] No further splits with positive gain, best gain: -inf

This warning is raised whenever LightGBM stops growing a tree before other tree-specific stopping conditions like num_leaves and max_depth are encountered. It is usually harmless, but it's there to tell you that the training parameters that you've chosen might not be well matched to your training data.

For example, with the R package on master I see this warning a lot using the reproducible code you've provided. That's because you've provided num_leaves = 20, min_data_in_leaf = 5 for a dataset with only 25 samples.
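The mismatch between those parameters and the dataset size can be made concrete with a little arithmetic. This is a sketch using only the values from the reproducible example; the variable `max_possible_leaves` is introduced here for illustration.

```r
n_samples        <- 25
min_data_in_leaf <- 5
num_leaves       <- 20

# Every leaf needs at least min_data_in_leaf rows, so a tree on this data
# can never have more leaves than this, whatever num_leaves says.
max_possible_leaves <- floor(n_samples / min_data_in_leaf)

cat("requested num_leaves:", num_leaves, "\n")
cat("largest possible number of leaves:", max_possible_leaves, "\n")
```

Since at most 5 leaves are possible and 20 were requested, tree growth has to stop early on every iteration, which is when the warning is printed.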


how can any leaf in a decision tree have a value 8 orders of magnitude higher than any actual label?

Since you weren't able to produce a reproducible example for this it's difficult for me to say with confidence what is happening, unfortunately. But could you please try running the code that produced that result using the R package built from latest master? Since {lightgbm} 3.2.1, we've factored out some custom code that was used to pass data between the R package and C++ side, and which could occasionally lead to issues similar to this.

@Sinnombre
Author

Hi James thanks for your quick reply. I installed the latest version from master and while it did fix the simple scenario, I still get crashes with the larger test case. It turns out I can share the data though (since it's anonymized), so please see attached zip. It works fine when line 60 (the ForcedSplits parameter) is commented out, but with it the crash still occurs. drive link: https://drive.google.com/file/d/1JsP7uEx09d2JQxiST6byk9YxdnKNOAsY/view?usp=sharing

Also, in cases that work I frequently get the message:
[LightGBM] [Warning] 'Forced Split' will be ignored since the gain getting worse.
Is there a way to turn this off (e.g. turn off the ability to ignore the specified splits, and either early stop when this happens or keep trying to reduce gain while maintaining the forced split)? This warning seems to be saying "Hey I know you set this parameter to avoid a specific type of overfitting that's highly problematic to your use case, but I found that if I ignore you I can overfit myself really well, isn't that great?"

@jameslamb
Collaborator

jameslamb commented Sep 8, 2021

I still get crashes with the larger test case. It turns out I can share the data though (since it's anonymized), so please see attached zip

Please also provide the exact code you're using to train on this data if you'd like me to test it.

Is there a way to turn this off

There is not a parameter you can use to suppress this warning.

It comes from this point in the source code

// gain with split is worse than without split
if (std::isnan(current_gain) || current_gain <= min_gain_shift) {
  output->gain = kMinScore;
  Log::Warning(
      "'Forced Split' will be ignored since the gain getting worse.");
  return;
}

That is part of a method eventually called in SerialTreeLearner::ForceSplits()

leaf_histogram_array[left_inner_feature_index].GatherInfoForThreshold(

Which is called at the beginning of training for each tree.

int init_splits = ForceSplits(tree_ptr, &left_leaf, &right_leaf, &cur_depth);

So I believe (@shiyu1994 or @StrikerRUS please correct me if I'm wrong) that it would be more accurate to say that that warning means

this split you've asked to add would lead to a worse fit to the training data than just not adding any splits at all, so it's skipped

If I'm right about that, it means that the forcedsplits feature allows you to force LightGBM to prefer a specific split as long as that split improves the fit at all, even if it does not improve the fit as much as other candidate splits.

@Sinnombre
Author

I believe the code file on the google drive works entirely on its own (with the two common libraries) does it not?

I would also like clarification on your last point there; it seems improbable to me that, given the number of features and the fact that 'improvement' splits keep being found for tens of thousands of iterations without the forced splits, adding the force split would result in NO candidates that improve gain at all after only a couple iterations. I can definitely see it not being optimal, or even being the case that simply taking out that initial forced split would improve the gain, but that's kinda the point; 'improved gain' in this case is likely coming from overfitting. Requiring a forced split is basically saying 'treat these as separate problems based on this feature', presumably the user has a reason for doing so? I guess I see two use cases for ForcedSplits, either telling the learner 'hey I have insight into the features and I think you will get the best results trying this first,' or telling it 'hey I know my data is biased so you will overfit if you don't do this.'

Anyway thanks again for looking into this, and your insight into how the learner works!

@jameslamb
Collaborator

jameslamb commented Sep 8, 2021

adding the force split would result in NO candidates that improve gain at all after only a couple iterations

Ah I wasn't clear, this is not what I mean or what I think the code is doing. I think it's very possible to see this warning and behavior in situations where forced_split --followed_by--> some_other_split would result in an improvement in the training loss (compared to not adding both of those splits).

I believe LightGBM is saying "if the tree stopped growing after this forced split (its nodes became leaf nodes), would the gain compared to a tree which stopped growing before this split be greater than min_gain_to_split (which defaults to 0.0)?"

There isn't a pruning process in LightGBM where all combinations of splits are tried and then LightGBM picks the best complete sequence. Splits are added one at a time, based on which split provides the best gain. (https://lightgbm.readthedocs.io/en/latest/Features.html#leaf-wise-best-first-tree-growth)

My interpretation of the code path generating that warning above is that tree growth in LightGBM works like this:

  • without forced splits: "if there are any splits satisfying min_gain_to_split and min_data_in_leaf and min_sum_hessian_in_leaf, use the one with the largest gain"
  • with forced splits: "if the forced split for this point in the tree offers a gain greater than min_gain_to_split, use it. Otherwise, use the split with the largest gain as long as it also satisfies min_gain_to_split, min_data_in_leaf, and min_sum_hessian_in_leaf"

If I'm right about that (let's see if another maintainer confirms that, I'm not as knowledgeable as some others here 😬 ), then I think there's definitely an opportunity to improve the documentation on this!
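The two rules above can be written down as a small selection function. This is only a sketch of the interpretation given in this comment, not LightGBM's code: `choose_split`, its arguments, and the example gains are all hypothetical, and the `ok` column stands in for the min_data_in_leaf and min_sum_hessian_in_leaf checks.

```r
# Hypothetical helper, not LightGBM code: pick the next split for one leaf.
# `forced` is NULL or list(name, gain); `candidates` is a data.frame with
# columns name, gain, and ok (TRUE if the leaf-size constraints are met).
choose_split <- function(forced, candidates, min_gain_to_split = 0) {
  if (!is.null(forced)) {
    if (!is.nan(forced$gain) && forced$gain > min_gain_to_split) {
      return(forced$name)  # forced split wins even if others gain more
    }
    message("'Forced Split' will be ignored since the gain getting worse.")
  }
  usable <- candidates[candidates$ok & candidates$gain > min_gain_to_split, ]
  if (nrow(usable) == 0) return(NA_character_)  # stop growing this leaf
  usable$name[which.max(usable$gain)]
}

candidates <- data.frame(name = c("ID", "mean_val"),
                         gain = c(4.0, 1.5),
                         ok   = c(TRUE, TRUE),
                         stringsAsFactors = FALSE)

no_forced   <- choose_split(NULL, candidates)
forced_used <- choose_split(list(name = "mean_val<=5", gain = 0.2), candidates)
forced_skip <- choose_split(list(name = "mean_val<=5", gain = -1), candidates)
print(c(no_forced, forced_used, forced_skip))
```

In the second call the forced split is chosen although its gain (0.2) is far below the best candidate (4.0); in the third call its gain is not positive, so the sketch emits the warning text and falls back to the best candidate.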

Requiring a forced split is basically saying 'treat these as separate problems based on this feature'

Totally makes sense to me! But I think that the fact that LightGBM uses leaf-wise growth (what XGBoost refers to as lossguided) means that this might not be a reliable approach to fitting multiple models at once. That's because even if all your forced splits are added to each tree, LightGBM's tree growth decisions after that will still be based on adding splits which provide the best overall gain.

Imagine, for example, that you have a forced split which always sends 90% of samples to the left of the first split and 10% to the right. For a wide range of loss functions and depending on the distribution of the target, I think LightGBM is going to tend to prefer splits on the left side, because they'll offer a larger total gain. And as a result, tree growth might hit tree-specific stopping conditions like max_depth or num_leaves after having mostly "worked on" the problem on the left side.
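The 90%/10% scenario can be illustrated with a toy gain calculation. This sketch assumes squared-error loss with no regularization, where the gain of a split is computed from sums of residuals; `leaf_split_gain` and the residual values are made up for illustration and are not taken from LightGBM.

```r
# L2 gain of splitting one leaf's residuals r by a logical mask
# (sum-of-residuals form with unit hessians, regularization omitted).
leaf_split_gain <- function(r, mask) {
  sum(r[mask])^2 / sum(mask) +
    sum(r[!mask])^2 / sum(!mask) -
    sum(r)^2 / length(r)
}

# Both sides of the forced split have the same structure (half the rows
# at -1, half at +1), but the left side has nine times as many rows.
left_r  <- rep(c(-1, 1), times = 45)  # 90 rows
right_r <- rep(c(-1, 1), times = 5)   # 10 rows

gain_left  <- leaf_split_gain(left_r,  left_r  < 0)
gain_right <- leaf_split_gain(right_r, right_r < 0)
print(c(left = gain_left, right = gain_right))
```

The per-row improvement is identical on both sides, yet the total gain on the left is nine times larger, so a best-first learner with a limited leaf budget would keep choosing splits on the left.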

If you want to try to train a LightGBM model to work on two problems, you might find that you have greater control by writing a custom objective function. You can see the following for an example of how to do this in the R package.

custom_multiclass_obj <- function(preds, dtrain) {

If you just want to control overfitting generally, you can try some of the other suggestions at https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html#deal-with-over-fitting

I believe the code file on the google drive works entirely on its own (with the two common libraries) does it not?

Oh sorry! I just expected that file to contain data. I'll take a look later tonight or tomorrow and see if I can reproduce the crash you ran into.

Is it ok for me to re-post the code you've provided here? We have a strong preference for posting code in plaintext (not links to external services) so it's usable by others who find this issue from search engines in the future.

@Sinnombre
Author

Sinnombre commented Sep 8, 2021

Yeah it's fine to repost the code. Thanks for the detailed explanation, that does make sense. The part that confuses me though is that forced splits are the first splits the tree makes. So either my forced split to two leafs reduces the gain vs. a tree consisting of just one leaf, in which case I should never get the warning, or my forced split results in a two-leaf tree with worse gain than the initial one leaf tree, in which case I should always get the warning. But in practice I don't get the warning for the first several iterations, then it starts showing up somewhere down the line. If my understanding of what you were saying is correct I don't see how this makes sense?

Also, at least for me, it's not a problem that the learner focuses more on the more populated side of the tree; that's exactly what I would want, to use the time and memory budget I allot it to optimize the most impactful sections.

@jameslamb
Collaborator

...either my forced split to two leafs reduces the gain vs. a tree consisting of just one leaf, in which case I should never get the warning, or my forced split results in a two-leaf tree with worse gain than the initial one leaf tree, in which case I should always get the warning

I think you're missing an important point that I didn't include in previous posts because it's implicit in the use of LightGBM. Because LightGBM is a gradient boosting library, you can't safely assume that a specific split's gain will be the same across all iterations.

Each additional tree is fit to explain the errors of the model up to that point (something like "residuals between the true value of the target and the predicted value you'd get from the model in its current state"). If you haven't seen it, XGBoost's docs have an excellent tutorial on how the boosting process works: https://xgboost.readthedocs.io/en/latest/tutorials/model.html#.

That is how you can get the behavior of "I don't see this warning for the first few iterations but then it shows up in later iterations".
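A toy boosting loop shows how the gain of one fixed split changes between iterations. This is a sketch under strong simplifying assumptions that differ from the run discussed here: squared-error loss, a learning rate of 1, and trees that contain only the forced split. The data and the `split_gain` helper are made up for illustration.

```r
# L2 gain of splitting residuals r by a logical mask `left`
# (unit hessians, regularization omitted).
split_gain <- function(r, left) {
  sum(r[left])^2 / sum(left) +
    sum(r[!left])^2 / sum(!left) -
    sum(r)^2 / length(r)
}

y    <- c(1, 2, 3, 10, 12, 14)
x    <- c(1, 2, 3, 4, 5, 6)
left <- x <= 3  # the "forced" split, identical in every iteration

pred  <- rep(mean(y), length(y))  # start from the mean of the labels
gains <- numeric(3)
for (iter in 1:3) {
  r <- y - pred                      # residuals the next tree is fit to
  gains[iter] <- split_gain(r, left) # gain of the forced split right now
  # Add a two-leaf tree on the forced split with learning rate 1.
  pred <- pred + ifelse(left, mean(r[left]), mean(r[!left]))
}
print(gains)
```

The split has a large gain in the first iteration and none afterwards, because the first tree already removed the difference between the two sides. With a small learning rate the decline would be gradual rather than immediate, which matches seeing the warning only after several iterations.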

Also, I want to be sure it's clear...I'm not saying that using a single forced split means all your trees will be either one leaf (0 splits) or two leaves (1 split). Just explaining that in the way LightGBM grows trees, it adds splits one at a time.

it's not a problem that the learner focuses more on the more populated side of the tree; that's exactly what I would want, to use the time and memory budget I allot it to optimize the most impactful sections.

If you aren't trying to achieve the behavior of "train one model which performs similarly well on different parts of my training data's distribution" and just want to produce a model that provides the most accurate predictions of the target overall, then you shouldn't use forced splits at all.

@jameslamb
Collaborator

Ok @Sinnombre , I was able to reproduce the errors you saw running the code you provided in #4591 (comment). Thanks very very much for that!

Specifically, running your provided code with your provided data, training regularly failed with the following error

[LightGBM] [Fatal] Check failed: (best_split_info.left_count) > (0) at treelearner/serial_tree_learner.cpp, line 653 .

I ran this using R 4.1.0 on my Mac, with {lightgbm} installed from latest master.

It looks like you've uncovered a pretty challenging bug in LightGBM! And I suspect it affects LightGBM's core library, not only the R package. I was able to create a reproducible example for it in R using only the built-in ChickWeight dataset. Please see #4601 for that example, and follow that issue to be notified of activity towards changing it (or contribute a fix, if you're comfortable writing C++!).

For now, if you want to use forced splits the only reliable way I've found to avoid the error you hit is to set feature_fraction = 1.0. I don't think this should be too problematic, since the training data you've provided only uses around 25 features.
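As a sketch, the workaround amounts to one extra entry in the parameter list. The values below are copied from the small reproducible example earlier in this thread, not from the larger test case, so treat everything except feature_fraction as a placeholder.

```r
# Parameter list from the demo, plus the workaround described above.
params <- list(
  objective = "tweedie",
  metric = "rmse",
  boosting = "goss",
  learning_rate = 0.01,
  num_leaves = 20,
  min_data_in_leaf = 5,
  forcedsplits_filename = "split_test.json",
  feature_fraction = 1.0  # workaround: use every feature in every tree
)
str(params)
```

This list would be passed to lgb.train() exactly as in the original example.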

@jameslamb
Collaborator

By the way (unrelated to this issue), I noticed this in your sample code:

train_dt = data.table(read.csv("TrainingData.csv"))

I think you'll find that it's much faster to use data.table::fread() instead.

train_dt = data.table::fread("TrainingData.csv")

@Sinnombre
Author

Dang sounds like a fix will be a while out then. Thanks for looking into it, and for answering my other questions!
Upgrading to the master version and setting feature_fraction = 1.0 did fix the other issues (inaccuracy and excess warnings), though unfortunately I'm still not getting comparable results to just splitting the data into multiple models.

If you aren't trying to achieve the behavior of "train one model which performs similarly well on different parts of my training data's distribution" and just want to produce a model that provides the most accurate predictions of the target overall, then you shouldn't use forced splits at all.

The issue with this is the bias. The training data has hundreds of thousands of examples with sales between 0-5 per week, and a few dozen with weekly sales in the thousands. If I weight towards the high-performing items, the model massively overestimates the sales of the low performers. If I don't weight aggressively towards the high numbers, then the model massively underestimates sales of the high performers; this resulted in a very good RMSE, but when I compare sum(predictions) to sum(labels) it's like 70%, due to the underestimation of the high sellers. My goal with all this is to split the data into effectively separate trees, trained on just high or low performing data, but preserving the settings so that total training time and model size were constrained by one set of parameters.
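The gap between a good-looking RMSE and a poor total can be shown with a few made-up numbers. This sketch uses invented labels and predictions, not the data from this issue; `total_ratio` is a name introduced here for the sum(predictions) / sum(labels) comparison described above.

```r
# Invented data: 1000 low sellers predicted perfectly, 2 high sellers
# predicted at a quarter of their true sales.
labels <- c(rep(1, 1000), rep(2000, 2))
preds  <- c(rep(1, 1000), rep(500, 2))

rmse        <- sqrt(mean((preds - labels)^2))
total_ratio <- sum(preds) / sum(labels)

cat("RMSE:", rmse, "\n")
cat("sum(predictions) / sum(labels):", total_ratio, "\n")
```

The RMSE is small compared with the high sellers' volumes because only two rows contribute to it, while the total is off by more than half because those two rows carry most of the sales.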

@jameslamb
Collaborator

Makes sense, makes sense. That type of use-case was in fact even mentioned as the reason for adding the forced_splits feature to LightGBM originally, back in #1310.

Forcing splits can help explicitly define the subgroups you care about at the very top of every decision tree so that subsequent splits will help optimize directly within that subgroup. This, combined with setting appropriate weights on the data, can help ensure that the loss isn't significantly worse on one subgroup than another. It will reduce the need to train separate models on each subgroup.

But I think the key point there is "combined with setting appropriate weights". As I think you're seeing, it can be difficult to set up the right combination of weights + tree growth parameters like num_leaves to be able to achieve the same performance from one model in a situation like this as you'd get from training separate models on the different populations.

@jameslamb
Collaborator

Going to close this, as it seems there are no remaining open questions.

Thanks very much for raising this issue and providing some sample code, as it helped us to identify a bug (#4601)!

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023