[question] is there a way to define the weights of features during training? #4931

abedshantti opened this issue Jan 6, 2022 · 4 comments

abedshantti commented Jan 6, 2022

I have a dataset with some noisy variables and I would like to use LightGBM in a way that minimises the impact of those features while keeping them in the dataset. I know that the feature_importance attribute outputs the feature importances after training, but is there a way to penalise a feature's importance during prediction, or even to have LightGBM construct the boosting trees while focusing on the main features? Ideally I am looking for a vector like [1, 1, 1, 0.1, 0.1, 0.1], where the three 1s stand for features I would like the model to focus on, and the 0.1s are features I would still like the model to consider, but to a much lesser extent.

@jameslamb (Collaborator)

Thanks for using LightGBM!

LightGBM doesn't currently support something like this directly.

There is an existing feature request for it though! See the discussion in #4605.

btrotta (Collaborator) commented Feb 12, 2022

@abedshantti The feature_contri parameter allows you to weight the features: a higher weight means LightGBM is more likely to split on that feature in a tree. https://lightgbm.readthedocs.io/en/latest/Parameters.html#feature_contri

@alejandrogomez97

Exactly, this is done with the feature_contri parameter. You can build the weight vector like this:

# start with weight 1 for every column of the training DataFrame
var_weights = [1] * len(x_train.columns)
# then shrink the weights of the noisy columns
var_weights[list(x_train.columns).index('noisy variable 1')] = 0.25
var_weights[list(x_train.columns).index('noisy variable 2')] = 0.4

Then pass it in the lgb.train parameters as feature_contri=var_weights.
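
For completeness, here is a minimal end-to-end sketch of that idea. The toy data, column names, and weight values below are made up for illustration; feature_contri itself is the real LightGBM parameter.

import lightgbm as lgb
import numpy as np
import pandas as pd

# toy data: two informative columns and two noisy ones (illustrative only)
rng = np.random.default_rng(0)
x_train = pd.DataFrame({
    'signal_1': rng.normal(size=500),
    'signal_2': rng.normal(size=500),
    'noisy_1': rng.normal(size=500),
    'noisy_2': rng.normal(size=500),
})
y_train = (x_train['signal_1'] + 0.5 * x_train['signal_2'] > 0).astype(int)

# weight 1 for every feature, then shrink the noisy ones
var_weights = [1.0] * len(x_train.columns)
var_weights[list(x_train.columns).index('noisy_1')] = 0.25
var_weights[list(x_train.columns).index('noisy_2')] = 0.4

params = {
    'objective': 'binary',
    'feature_contri': var_weights,  # penalise split gain on the noisy features
    'verbose': -1,
}
booster = lgb.train(params, lgb.Dataset(x_train, label=y_train), num_boost_round=50)
print(dict(zip(x_train.columns, booster.feature_importance(importance_type='gain'))))

The gain-based importances printed at the end should show the noisy columns contributing less than they would with equal weights.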

However, this seems to be a complicated part of ML.

Apparently, with this parameter the split gain is recalculated as gain[i] = max(0, feature_contri[i]) * gain[i]. So I would expect the algorithm to split on 'noisy variable 1' only when it reduces the entropy so much that it still produces the best possible split even after its gain is multiplied by a factor between 0 and 1. This way you eliminate the weaker splits on that variable and keep only the stronger ones, splitting on other variables instead whenever possible.
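
A made-up numeric example of that rescaling:

# made-up split gains at one node, before applying feature_contri
raw_gain = {'strong feature': 3.0, 'noisy variable 1': 3.5}
feature_contri = {'strong feature': 1.0, 'noisy variable 1': 0.25}

# effective gain used to choose the split: max(0, contri) * raw gain
effective_gain = {f: max(0.0, feature_contri[f]) * g for f, g in raw_gain.items()}
print(effective_gain)  # {'strong feature': 3.0, 'noisy variable 1': 0.875}

# the noisy feature wins on raw gain (3.5 > 3.0), but after down-weighting it
# loses, so the tree splits on the strong feature instead; the noisy feature
# would need a raw gain above 3.0 / 0.25 = 12.0 to still win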

In your case, it seems this could solve your problem with noisy variables, because you expect the stronger splits to depend less on the noise than the weaker ones do.

Nevertheless, there are other cases where we might need to reduce a variable's importance. For instance, if we know that our training set is somehow biased with respect to a variable, we might want to reduce that variable's importance so the model doesn't rely on it too heavily and we don't see big differences between training performance and future predictions. In that case it is not so obvious that the stronger splits are the ones we want to keep. What can we do about this?

@yuanqingye

I think there are multiple ways to increase or decrease a feature's importance. For example, when doing feature sampling, you could increase the chance of selecting some features while decreasing it for others. The current approach of weighting a variable's gain when splitting is another way to do this. Thanks to the package team for their work.
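
To illustrate the sampling idea: as far as I know, LightGBM's feature_fraction picks the per-tree feature subset uniformly at random, so the snippet below is only a toy sketch of what weighted feature sampling could look like, not an existing option.

import numpy as np

# toy sketch of weighted feature sampling (not a LightGBM parameter)
rng = np.random.default_rng(0)
features = ['signal_1', 'signal_2', 'signal_3', 'noisy_1', 'noisy_2']
weights = np.array([1.0, 1.0, 1.0, 0.1, 0.1])
probs = weights / weights.sum()

# pick 3 distinct features for one tree; the noisy ones are rarely chosen
subset = rng.choice(features, size=3, replace=False, p=probs)
print(subset)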
