[question] is there a way to define the weights of features during training? #4931

abedshantti opened this issue Jan 6, 2022 · 4 comments

abedshantti commented Jan 6, 2022

I have a dataset with some noisy variables and I would like to use LightGBM in a way that minimises the impact of those features while keeping them in the dataset. I know that the feature_importance attribute outputs the feature importances after training, but is there a way to penalise a feature's importance during prediction, or even to have LightGBM construct the boosting trees while focusing on the main features? Ideally I am looking for a vector like [1, 1, 1, 0.1, 0.1, 0.1], where the three 1s stand for features I would like the model to focus on, and the 0.1s are features I would still like the model to consider, but to a much lesser extent.

@jameslamb (Collaborator)

Thanks for using LightGBM!

LightGBM doesn't currently support something like this directly.

There is an existing feature request for it though! See the discussion in #4605.

btrotta (Collaborator) commented Feb 12, 2022

@abedshantti The feature_contri parameter allows you to weight the features: a higher weight means LightGBM is more likely to split on that feature in a tree. https://lightgbm.readthedocs.io/en/latest/Parameters.html#feature_contri

@alejandrogomez97

Exactly, this is done with the feature_contri parameter. You can build the weight vector like this:

# start with weight 1 for every column of the training DataFrame
var_weights = [1] * len(x_train.columns)
# then shrink the weights of the noisy columns
var_weights[list(x_train.columns).index('noisy variable 1')] = 0.25
var_weights[list(x_train.columns).index('noisy variable 2')] = 0.4

Then pass it in the lgb.train parameters as feature_contri=var_weights.
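
For completeness, here is a minimal end-to-end sketch of that idea. The toy data, column names, and weight values below are made up for illustration; feature_contri itself is the real LightGBM parameter.

import lightgbm as lgb
import numpy as np
import pandas as pd

# toy data: two informative columns and two noisy ones (illustrative only)
rng = np.random.default_rng(0)
x_train = pd.DataFrame({
    'signal_1': rng.normal(size=500),
    'signal_2': rng.normal(size=500),
    'noisy_1': rng.normal(size=500),
    'noisy_2': rng.normal(size=500),
})
y_train = (x_train['signal_1'] + 0.5 * x_train['signal_2'] > 0).astype(int)

# weight 1 for every feature, then shrink the noisy ones
var_weights = [1.0] * len(x_train.columns)
var_weights[list(x_train.columns).index('noisy_1')] = 0.25
var_weights[list(x_train.columns).index('noisy_2')] = 0.4

params = {
    'objective': 'binary',
    'feature_contri': var_weights,  # penalise split gain on the noisy features
    'verbose': -1,
}
booster = lgb.train(params, lgb.Dataset(x_train, label=y_train), num_boost_round=50)
print(dict(zip(x_train.columns, booster.feature_importance(importance_type='gain'))))

The gain-based importances printed at the end should show the noisy columns contributing less than they would with equal weights.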

However, this seems to be a complicated part of ML.

Apparently, with this parameter the split gain is recalculated as gain[i] = max(0, feature_contri[i]) * gain[i]. So I would expect the algorithm to split on 'noisy variable 1' only when it reduces the entropy so much that it still produces the best possible split even after its gain is multiplied by a factor between 0 and 1. This way you eliminate the weaker splits on that variable and keep only the stronger ones, splitting on other variables instead whenever possible.
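
A made-up numeric example of that rescaling:

# made-up split gains at one node, before applying feature_contri
raw_gain = {'strong feature': 3.0, 'noisy variable 1': 3.5}
feature_contri = {'strong feature': 1.0, 'noisy variable 1': 0.25}

# effective gain used to choose the split: max(0, contri) * raw gain
effective_gain = {f: max(0.0, feature_contri[f]) * g for f, g in raw_gain.items()}
print(effective_gain)  # {'strong feature': 3.0, 'noisy variable 1': 0.875}

# the noisy feature wins on raw gain (3.5 > 3.0), but after down-weighting it
# loses, so the tree splits on the strong feature instead; the noisy feature
# would need a raw gain above 3.0 / 0.25 = 12.0 to still win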

In your case, it seems this could solve your problem with noisy variables, because you expect the stronger splits to depend less on the noise than the weaker ones do.

Nevertheless, there are other cases where we might need to reduce a variable's importance. For instance, if we know that our training set is somehow biased with respect to a variable, we might want to reduce that variable's importance so the model doesn't rely on it too heavily and we don't see big differences between training performance and future predictions. In that case it is not so obvious that the stronger splits are the ones we want to keep. What can we do about this?

@yuanqingye

I think there are multiple ways to increase or decrease a feature's importance. For example, when doing feature sampling, you could increase the chance of selecting some features while decreasing it for others. The current approach of weighting a variable's gain when splitting is another way to do this. Thanks to the package team for their work.
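
To illustrate the sampling idea: as far as I know, LightGBM's feature_fraction picks the per-tree feature subset uniformly at random, so the snippet below is only a toy sketch of what weighted feature sampling could look like, not an existing option.

import numpy as np

# toy sketch of weighted feature sampling (not a LightGBM parameter)
rng = np.random.default_rng(0)
features = ['signal_1', 'signal_2', 'signal_3', 'noisy_1', 'noisy_2']
weights = np.array([1.0, 1.0, 1.0, 0.1, 0.1])
probs = weights / weights.sum()

# pick 3 distinct features for one tree; the noisy ones are rarely chosen
subset = rng.choice(features, size=3, replace=False, p=probs)
print(subset)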
