[question] is there a way to define the weights of features during training? #4931
Comments
Thanks for using LightGBM! LightGBM doesn't currently support something like this directly. There is an existing feature request for it though! See the discussion in #4605.
Exactly, this is the `feature_contri` parameter. You can build it like this: `var_weights = [1] * len(x_train.columns)`, and then pass it to the `lgb.train` parameters as `feature_contri=var_weights`.

However, this turns out to be a subtle part of ML. By doing this you are rescaling each split's gain as `gain[i] = max(0, feature_contri[i]) * gain[i]`, so I would expect the algorithm to split on 'noisy variable 1' only when it reduces the entropy by so much that 'noisy variable 1' still produces the best possible split even after its gain is multiplied by a factor between 0 and 1. This way you eliminate the weaker splits on this variable and keep only the stronger ones, splitting preferably on other variables. In your case, this could solve your problem with noisy variables, because you expect the stronger splits to depend less on noise than the weaker ones.

Nevertheless, there are other cases where we might need to reduce a variable's importance. For instance, if we know that our training set is somehow biased on a variable, we might want to reduce that variable's importance in the model, so the model depends less on it and we don't see big gaps between training performance and future predictions. In that case it is not so obvious that the stronger splits are the ones we want to keep. What can we do about this?
I think there are multiple ways to increase or decrease a feature's importance. For example, when doing feature sampling, you can increase the chance of selecting some features while decreasing others. The current approach of scaling a variable's gain at split time is another way to do this. Thanks for the package team's work.
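The feature-sampling alternative mentioned above can be sketched as weighted sampling without replacement: instead of penalizing a feature's gain, you simply offer it to the tree less often. This is a toy illustration with made-up weights, not something LightGBM exposes in this form:

```python
import random


def sample_feature_subset(weights, k, rng):
    """Draw k distinct feature indices, where a larger weight makes a
    feature more likely to be included in the candidate subset
    (weighted sampling without replacement)."""
    remaining = list(range(len(weights)))
    chosen = []
    for _ in range(k):
        pick = rng.choices(remaining,
                           weights=[weights[i] for i in remaining],
                           k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


rng = random.Random(0)
# Features 0-2 are trusted (weight 1.0), feature 3 is noisy (0.1).
counts = [0, 0, 0, 0]
for _ in range(2000):
    for i in sample_feature_subset([1.0, 1.0, 1.0, 0.1], 2, rng):
        counts[i] += 1
print(counts)  # the noisy feature is sampled far less often
```

The noisy feature still appears occasionally, so the model can use it when nothing else helps, which mirrors the "consider it, but at a much lower extent" goal from the question.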
I have a dataset with some noisy variables and I would like to use LightGBM in a way that minimises the impact of those features while keeping them in the dataset. I know that the `feature_importance` attribute outputs the feature importances after training, but is there a way to penalise the importance during prediction, or even to let LightGBM construct the boosting trees while focusing on the main features? Ideally I am looking for a vector like `[1, 1, 1, 0.1, 0.1, 0.1]`, where the three 1s stand for features I would like the model to focus on, and the 0.1s are the features that I would still like the model to consider, but to a much lower extent.
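For reference, here is a minimal sketch of how the exact vector from this question could be wired up through the `feature_contri` parameter discussed in the comments. The feature names and the regression objective are made-up placeholders, and the training call is left commented out since it assumes `lightgbm` is installed and `X_train`/`y_train` exist:

```python
# Hypothetical column order; the last three are the noisy features.
feature_names = ["f1", "f2", "f3", "noisy1", "noisy2", "noisy3"]

params = {
    "objective": "regression",  # placeholder objective
    # Each split gain is multiplied by the matching entry:
    # 1.0 keeps the gain as-is, 0.1 penalises splits on the noisy features.
    "feature_contri": [1.0, 1.0, 1.0, 0.1, 0.1, 0.1],
}

# The vector must have one entry per feature, in dataset column order.
assert len(params["feature_contri"]) == len(feature_names)

# import lightgbm as lgb
# booster = lgb.train(params, lgb.Dataset(X_train, y_train,
#                                         feature_name=feature_names))
```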