Quantile LightGBM - inconsistent deciles #3447
Comments
I think it may be a bug; I will investigate it when I have time.
@TomekPro could you provide data for reproducing this? A small dataset or a randomly generated dataset would be better.
Gently ping @TomekPro for an example of the data.
Hi, I'm having the same problem but didn't think it was a bug; seeing this topic, I think it's worth questioning that. Diving into the theory to see whether it's a bug: I have a feature, among the ten most important, that separates the data into two roughly normal distributions (open to discussion, but it's close to normal). However, LightGBM doesn't guarantee that this feature is used at the top of the trees, so it doesn't guarantee that the final output is normal. I'm pointing this out because I wonder whether, even if the target is not normally distributed, the trees can find normally distributed subparts of it and thereby resolve the problem of a non-normally distributed target. Is that the case for trees?

I sometimes end up with catastrophically mixed-up quantiles. For the mix of p30-p50-p70, @TomekPro's suggestion seems plausible; however, in some cases p50 falls outside the p10-p90 range. Sometimes p50 gives more accurate results, but the p10-p90 interval doesn't capture the real value at all. Again, I'm not sure whether it's a bug or not; maybe I'm modelling something wrong, or it's just randomness (we don't fit an actual normal distribution with (mean, std) parameters, as NGBoost does, but estimate the quantiles), or I'm using lots of features, etc. I'll also note that when I run the same quantile with the same model/features and simply average the results, I get more consistent results.
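One reading of that averaging observation, as a hedged sketch only: the comment does not give details, so fitting the same quantile objective with several random seeds and averaging the raw predictions is an assumption here.

```python
import numpy as np
import lightgbm as lgb

# Hypothetical sketch: fit the same quantile objective with several seeds and
# average the predictions to get a more stable estimate for that quantile.
def averaged_quantile_prediction(X_train, y_train, X_test, alpha, seeds=(1, 2, 3, 4, 5)):
    preds = []
    for s in seeds:
        m = lgb.LGBMRegressor(objective="quantile", alpha=alpha, seed=s)
        m.fit(X_train, y_train)
        preds.append(m.predict(X_test))
    return np.mean(preds, axis=0)
```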
I think with GBDT, monotonicity across independently fitted quantile models is not guaranteed. Here's an example where the predicted deciles are not monotone:

import lightgbm as lgb
from sklearn.base import *
import pandas as pd
import numpy as np
class QuantileEstimator(BaseEstimator, RegressorMixin):
    def __init__(self, model):
        """Fit one clone of `model` per decile (alpha = 0.1, ..., 0.9)."""
        self.alphas = [round(x, 1) for x in np.arange(0.1, 1.0, 0.1)]
        self.model_factory = []
        self.model = model
        super().__init__()

    def fit(self, X, y=None):
        for a in self.alphas:
            model_i = clone(self.model)
            model_i = model_i.set_params(**{'alpha': a})
            model_fitted = model_i.fit(X, y)
            self.model_factory.append(model_fitted)
        return self

    def predict(self, X):
        predictions = pd.DataFrame()
        for a, m in zip(self.alphas, self.model_factory):
            predictions["y_pred_" + str(a)] = m.predict(X, raw_score=True)
        return predictions

qet = QuantileEstimator(lgb.LGBMRegressor(n_jobs=1,
seed=1234,
learning_rate=0.1,
reg_sqrt=True,
objective = 'quantile',
n_estimators=2,
min_data_in_leaf=1,
num_leaves=3,
boost_from_average=True,
verbose=2))
np.random.seed(2)
X = np.random.rand(10, 20)
y = np.random.rand(10)
qet.fit(X, y)
pred = qet.predict(X)
for i in range(pred.shape[0]):
    for j in range(pred.shape[1] - 1):
        if pred.iloc[i, j] > pred.iloc[i, j + 1]:
            print(pred.iloc[i, :])
            print(i)

The output prints the rows (and their indices) where the predicted deciles are not monotone.
I've strictly followed the computation of GBDT for this example, so in general I don't think this is a bug. It would be better if @TomekPro could provide the data, so that we can check the tree structures in the model. Your help is really appreciated. Restricting the deciles to be monotone is an interesting question and can be taken as a feature request. If no further evidence shows that this is really a bug, I think this issue should be closed.
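For reference, a minimal sketch (not from the original comment) of how the per-decile tree structures could be inspected, assuming the `qet` estimator fitted in the example above and a LightGBM version that provides `Booster.trees_to_dataframe()`:

```python
# Dump each decile model's trees and show split features, thresholds, and values.
for a, m in zip(qet.alphas, qet.model_factory):
    tree_df = m.booster_.trees_to_dataframe()
    print(f"alpha={a}: {tree_df['tree_index'].nunique()} trees")
    print(tree_df[["tree_index", "split_feature", "threshold", "value"]].head())
```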
@shiyu1994 I'm sorry, but I cannot provide the data as it is confidential; you can, however, simulate it from the distribution I posted above. I suspect that this case is most common for distributions like this, with a large group of outliers.
Ok, thanks for your information. I can try to build another example with labels similar to your distribution. But in general, the output for a data point is not guaranteed to be monotone across the independently trained quantile models. Is monotonicity essential for your application?
In the end I haven't used LightGBM, but in general I believe this is a really common case: you want not only a point estimate but also some kind of confidence interval for your prediction.
I think implementing such a constraint would require fitting the models with the different alpha values together rather than independently.
Not sure how to vote for this feature request to be prioritised, but it is obviously critical in most applications that the quantiles be ordered correctly.
My 5 cents, and something most people do not consider: if different quantiles clash / are not monotone, then the uncertainty of that prediction is very likely high, i.e. the crossing is within the estimation uncertainty. It is much easier to enforce such monotonicity constraints with linear models (as R's quantreg does). For tree-based models, one would need the same tree split points to make such a constraint possible, very much like quantile regression forests do.
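Not mentioned in the thread, but a common post-hoc workaround for quantile crossing is rearrangement: sort the predicted quantiles within each row so the output is monotone by construction. A minimal sketch, assuming `pred` is the DataFrame returned by `QuantileEstimator.predict` in the example above:

```python
import numpy as np

# Post-hoc rearrangement: sort each row of the (n_samples, n_quantiles)
# prediction matrix so that p10 <= p20 <= ... <= p90 holds by construction.
pred_monotone = np.sort(pred.to_numpy(), axis=1)

# Every row is now non-decreasing across quantiles.
assert (np.diff(pred_monotone, axis=1) >= 0).all()
```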
Hey, I also suffered from the problem of monotonicity not being maintained between quantiles, known as the crossing problem.
This issue has been automatically locked since there has not been any recent activity since it was closed. |
Hi, I'm fitting quantile LightGBM models, defined as follows:
Where QuantileEstimator is just:
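(The QuantileEstimator class is reproduced in the example earlier in this thread.) A minimal sketch of the per-decile setup it implements, with X and y standing in for the confidential training data:

```python
import numpy as np
import lightgbm as lgb
from sklearn.base import clone

# One independent quantile model per decile (alpha = 0.1, ..., 0.9).
alphas = [round(a, 1) for a in np.arange(0.1, 1.0, 0.1)]
base = lgb.LGBMRegressor(objective="quantile", seed=1234)
models = {a: clone(base).set_params(alpha=a).fit(X, y) for a in alphas}
```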
Sometimes the next decile for an individual prediction has a lower value than the previous one.
Any idea why that's happening? I understand that these are different models, but since I set the same seed I would still expect the results for the different deciles to be monotonic.