Quantile LightGBM - inconsistent deciles #3447

Closed

TomekPro opened this issue Oct 9, 2020 · 14 comments

@TomekPro commented Oct 9, 2020

Hi, I'm fitting quantile LightGBM models, defined as follows:

QuantileEstimator(lgb.LGBMRegressor(n_jobs=-1,
                                    seed=1234,
                                    learning_rate=0.1,
                                    reg_sqrt=True,
                                    objective='quantile',
                                    n_estimators=100))

Where QuantileEstimator is just:

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, RegressorMixin, clone

class QuantileEstimator(BaseEstimator, RegressorMixin):
    """Fit one clone of `model` per decile (alpha = 0.1, ..., 0.9)."""

    def __init__(self, model):
        self.alphas = [round(x, 1) for x in np.arange(0.1, 1, 0.1)]
        self.model_factory = []
        self.model = model
        super().__init__()

    def fit(self, X, y=None):
        # Reset so refitting does not append to a previous fit's models.
        self.model_factory = []
        for a in self.alphas:
            model_i = clone(self.model)
            model_i = model_i.set_params(**{'alpha': a})
            model_fitted = model_i.fit(X, y)
            self.model_factory.append(model_fitted)
        return self

    def predict(self, X):
        # One column of predictions per quantile level.
        predictions = pd.DataFrame()
        for a, m in zip(self.alphas, self.model_factory):
            predictions["y_pred_" + str(a)] = m.predict(X)
        return predictions

Sometimes the prediction for a given decile is lower than the prediction for the previous decile on the same row.
[screenshot: prediction rows where a higher decile has a lower value than the previous one]

Any idea why that's happening? I understand that these are different models, but since I set the same seed I would expect the results for the different deciles to be monotonic.
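
For reference, a minimal way to flag the offending rows, assuming `pred` is the DataFrame returned by `predict(X)` above, with columns ordered by alpha:

import numpy as np

# A row is consistent if every decile prediction is >= the previous one.
vals = pred.to_numpy()
is_monotone = (np.diff(vals, axis=1) >= 0).all(axis=1)
print(f"{(~is_monotone).sum()} of {len(pred)} rows have crossing deciles")
print(pred[~is_monotone])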

@guolinke (Collaborator)

I think it may be a bug; I will investigate when I have time.

@TomekPro (Author)

Thanks, please let me know if you need any additional info. Maybe the distribution of the underlying data (y) could be a clue?
[screenshot: histogram of the target variable y]

@guolinke (Collaborator)

@TomekPro could you provide data for reproducing this? A small dataset or a randomly generated one would be best.

@StrikerRUS (Collaborator)

Gently pinging @TomekPro for an example of the data.

@mburaksayici commented Feb 9, 2021

Hi, I'm having the same problem, but I didn't think it was a bug. Seeing this topic, it seems worth questioning that.
I calculate the 10th, 30th, 50th, 70th and 90th quantiles for a time series regression problem. At some points, p30-p50-p70 get mixed up. Until I saw @TomekPro's suggestion, I thought it was due to the randomness of the training process, because when I ensemble-average the same model with the same features 3-4 times, the quantiles become ordered.
My data is not normally distributed, so that would make sense, yes.

[screenshot: predicted quantiles with p30/p50/p70 crossing]

Diving into the theory to see whether it's a bug: I have a feature, among the 10 most important, that can separate the data into two roughly normal distributions (open to discussion, but they are closer to normal). However, LightGBM doesn't guarantee that this feature is used at the top of the trees, so it doesn't guarantee that the final output is normal. I'm pointing this out because I wonder whether, even if the target is not normally distributed, the trees can find normally distributed subsets of it, which might mitigate the problem of a non-normally distributed target. Is that the case for trees?

[screenshot: the two roughly normal subgroup distributions]

I sometimes end up with catastrophically bad quantiles such as these:

[screenshot: severely crossed quantile predictions]

In general, the quantiles get mixed up. For the mixing of p30-p50-p70, @TomekPro's suggestion seems plausible; however, in some cases p50 falls outside the p10-p90 range. Sometimes p50 gives more accurate results, but the p10-p90 interval doesn't capture the real value at all.

[screenshot: p50 falling outside the p10-p90 interval]

Again, I'm not sure whether it's a bug or not. Maybe I'm modelling something wrong, or it's just randomness, since we don't fit an actual normal distribution with (mean, std) parameters (NGBoost does that) but estimate the quantiles directly, or it's from using lots of features, etc. I'll note again that when I run the same quantile with the same model/features several times and simply average the results, I get more consistent quantiles.
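
For concreteness, a minimal sketch of that averaging workaround; the `make_model` factory here is a hypothetical stand-in for constructing an `lgb.LGBMRegressor` with `objective='quantile'` and the given alpha and seed:

import numpy as np

def averaged_quantile_predict(make_model, X_train, y_train, X_test, alpha, k=4):
    # Average k differently-seeded fits for one quantile level.
    preds = []
    for seed in range(k):
        model = make_model(alpha=alpha, seed=seed)  # hypothetical factory
        model.fit(X_train, y_train)
        preds.append(model.predict(X_test))
    return np.mean(preds, axis=0)

Averaging reduces the variance that causes most crossings, but it still does not guarantee monotonicity across alphas.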

@shiyu1994 (Collaborator)

I think with GBDT, the quantile objective does not guarantee that the prediction values for an instance (data point) are monotone in alpha, just as we cannot guarantee that, if we increase the label of one instance in the training data and retrain the model, the prediction for that instance will increase. The data partitions in the trees can change, and the leaf values can be very different.

Here's an example where the predicted deciles are not monotone.

import lightgbm as lgb
from sklearn.base import BaseEstimator, RegressorMixin, clone
import pandas as pd
import numpy as np

class QuantileEstimator(BaseEstimator, RegressorMixin):
    """Fit one clone of `model` per decile (alpha = 0.1, ..., 0.9)."""

    def __init__(self, model):
        self.alphas = [round(x, 1) for x in np.arange(0.1, 1.0, 0.1)]
        self.model_factory = []
        self.model = model
        super().__init__()

    def fit(self, X, y=None):
        for a in self.alphas:
            model_i = clone(self.model)
            model_i = model_i.set_params(**{'alpha': a})
            model_fitted = model_i.fit(X, y)
            self.model_factory.append(model_fitted)
        return self

    def predict(self, X):
        predictions = pd.DataFrame()
        for a, m in zip(self.alphas, self.model_factory):
            predictions["y_pred_" + str(a)] = m.predict(X, raw_score=True)
        return predictions

qet = QuantileEstimator(lgb.LGBMRegressor(n_jobs=1,
                                          seed=1234,
                                          learning_rate=0.1,
                                          reg_sqrt=True,
                                          objective='quantile',
                                          n_estimators=2,
                                          min_data_in_leaf=1,
                                          num_leaves=3,
                                          boost_from_average=True,
                                          verbose=2))

np.random.seed(2)
X = np.random.rand(10, 20)
y = np.random.rand(10)
qet.fit(X, y)
pred = qet.predict(X)

# Print any row where a higher decile has a lower raw prediction.
for i in range(pred.shape[0]):
    for j in range(pred.shape[1] - 1):
        if pred.iloc[i, j] > pred.iloc[i, j + 1]:
            print(pred.iloc[i, :])
            print(i)

And the output is

y_pred_0.1    0.504354
y_pred_0.2    0.660557
y_pred_0.3    0.729929
y_pred_0.4    0.725802
y_pred_0.5    0.742658
y_pred_0.6    0.802490
y_pred_0.7    0.891505
y_pred_0.8    0.929288
y_pred_0.9    0.962373
Name: 1, dtype: float64
1
y_pred_0.1    0.504354
y_pred_0.2    0.662241
y_pred_0.3    0.726090
y_pred_0.4    0.710099
y_pred_0.5    0.732608
y_pred_0.6    0.802490
y_pred_0.7    0.891505
y_pred_0.8    0.929288
y_pred_0.9    0.962373
Name: 5, dtype: float64
5

I've strictly followed the computation of GBDT with the quantile objective and manually calculated the result. It is consistent with LightGBM's tree output.

So in general, I don't think this is a bug. It would be better if @TomekPro could provide the data, so that we can check the tree structures in the model. Your help is really appreciated.

Restricting the deciles to be monotone is an interesting question and can be taken as a feature request.
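
As a stopgap, a common post-hoc workaround (not a LightGBM feature; just a sketch of the per-row rearrangement idea of Chernozhukov et al., applied here to the `pred` DataFrame from the example above) is to sort each row's quantile predictions:

import numpy as np
import pandas as pd

# Rearrangement: sort each row so predictions are non-decreasing in alpha.
pred_monotone = pd.DataFrame(np.sort(pred.to_numpy(), axis=1),
                             index=pred.index, columns=pred.columns)

This removes crossings by construction, at the cost of the columns no longer corresponding to a single fitted model each.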

If no further evidence shows that this is really a bug, I think this issue should be closed.

@TomekPro (Author)

@shiyu1994 I'm sorry, but I cannot provide the data as it is confidential. However, you can simulate it from the distribution I posted above. I suspect this case is most common for such distributions, with a large group of outliers.

@shiyu1994 (Collaborator) commented Mar 17, 2021

OK, thanks for the information. I can try to build another example with labels similar to your distribution. But in general, the output for a data point with the quantile objective is not guaranteed to be monotone in the alpha hyperparameter, as the example above shows.

Is the monotonicity essential for your application?

@TomekPro (Author)

In the end I didn't use LightGBM, but in general I believe this is a really common use case: you want not only a point estimate but also some kind of confidence interval for your prediction.

@shiyu1994 (Collaborator)

I think implementing such a constraint requires fitting the models for different alphas together. We would need to pass all the alphas to the boosting process, maintain boosting models for the different alphas simultaneously, and set appropriate restrictions so that the prediction value for each data point grows monotonically with alpha.
This is a complicated task and requires a new boosting algorithm design. We may investigate it in the future.
For now, let me put this into the feature request & voting hub.

@Bougeant

Not sure how to vote for this feature request to be prioritised, but it is obviously critical in most applications that the quantiles be ranked correctly.

@lorentzenchr (Contributor)

My 5 cents: I would also not consider this a bug.

What most people do not consider: if different quantiles clash/are not monotone, then the uncertainty of that prediction is veeeeery likely high, i.e. the crossing is within the estimation uncertainty.

It is much easier to enforce such monotonicity constraints with linear models (as R's quantreg does). For tree-based models, one would need the same tree split points to make such a constraint possible, very much like quantile regression forests do.

@RektPunk (Contributor) commented Mar 8, 2023

Hey, I also suffered from the problem of quantiles not maintaining monotonicity between levels, known as the crossing problem. To solve it, I suggested a method written up in the issue. I hope you check it out, and feel free to discuss it.

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed.
To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues
including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 15, 2023