Probability measure for features #4605
Comments
Thanks for your interest in LightGBM and for taking the time to write up this feature request! No such functionality exists in LightGBM today, and I'm personally not convinced that it should. As a tree-based supervised learning framework, LightGBM already has feature selection built into it: features that don't have much explanatory power will be chosen for splits less often (or, in some cases, not at all). If you have some evidence or research you could point to suggesting that a feature like this can lead to better performance in tree-based frameworks like LightGBM, we'd love to see it. If your goal is to produce a model with as few features as possible given an acceptable level of performance, perhaps with the aim of removing unimportant features to make model deployment less expensive, there are a few approaches you could take.
If you have some other goal with this approach, let me know and I or one of the other maintainers here might be able to describe an alternative way to achieve that goal using LightGBM's existing API.
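As a minimal sketch of the built-in behaviour described above, here is how one might inspect which features LightGBM actually uses, via the public Python API; the synthetic data and parameter choices are illustrative only:

```python
import lightgbm as lgb
import numpy as np

# Synthetic data: only the first of 10 features is informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X[:, 0] + 0.1 * rng.normal(size=1000)

booster = lgb.train(
    {"objective": "regression", "verbose": -1},
    lgb.Dataset(X, label=y),
    num_boost_round=100,
)

# Features with little explanatory power are rarely (or never) chosen for splits.
print(booster.feature_importance(importance_type="split"))  # split counts per feature
print(booster.feature_importance(importance_type="gain"))   # total gain per feature
```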
Thank you for your response, James. My goal is not to reduce the cost of inference. While I agree with you that selecting features based on their gain serves the purpose of feature selection, I was curious about the case of a very small feature_fraction. I do not have any research suggesting it would be beneficial, but I was hoping to test it. Can I trouble you for a code pointer to where it would be best to implement this?
Yep, exactly this. You should be training for enough iterations that eventually all features make it into a few trees, even for a low value of feature_fraction. I also want to note that coming up with that sort of "feature weights" and aligning them with each iteration might be difficult. Recall that in gradient boosting, each tree is fit to something like the residuals of the model so far, so the relative importance (e.g. likelihood of being chosen for a split) of features is different at each iteration. This is different from non-boosting approaches like Random Forest, where each tree is independent of all others.
The logic of selecting a subset of features is handled in `src/treelearner/col_sampler.hpp` (lines 74 to 89, as of commit d517ba1).
But before you go and do work to try to add this, I want to set the right expectation: I personally am not convinced that adding such a feature would be worth the complexity, and I would not support adding it to LightGBM. I'm only one maintainer though, and others might have different opinions (@StrikerRUS @Laurae2 @btrotta @shiyu1994).
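For the record, the change being discussed essentially amounts to swapping the uniform per-iteration column draw in that sampler for a weighted draw without replacement. In Python terms (purely illustrative; this is not the C++ code being linked to, and `feature_measure` is the hypothetical weight vector from the request):

```python
import numpy as np

# Hypothetical per-feature weights: feature 0 is three times as likely to be kept.
feature_measure = np.array([3.0, 1.0, 1.0, 1.0, 1.0])
probs = feature_measure / feature_measure.sum()

rng = np.random.default_rng(42)
k = 2  # e.g. feature_fraction = 0.4 with 5 features
subset = rng.choice(len(probs), size=k, replace=False, p=probs)
print(subset)  # indices of the columns sampled for this iteration
```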
I think it is interesting, but for the opposite reason: not to increase but to reduce the probability of the most powerful features.
@parsiad Thanks for your proposal! To support a preference for some features, we already have the cost-effective gradient boosting penalty.
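A minimal sketch of the kind of configuration this refers to; the parameter names (`cegb_tradeoff`, `cegb_penalty_feature_coupled`) are taken from LightGBM's documented cost-efficient gradient boosting options as far as the parameter docs describe them, and the penalty values and data below are made up:

```python
import lightgbm as lgb
import numpy as np

# Synthetic data with 5 features, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ np.array([1.0, 0.5, 0.0, 2.0, 2.0]) + 0.1 * rng.normal(size=500)

params = {
    "objective": "regression",
    "verbose": -1,
    # Cost-efficient gradient boosting: per-feature costs discourage (but do not
    # forbid) splits on the penalized features.
    "cegb_tradeoff": 1.0,
    "cegb_penalty_feature_coupled": [0.0, 0.0, 0.0, 5.0, 5.0],  # penalize the last two features
}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)
print(booster.feature_importance(importance_type="split"))
```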
I agree this would be useful, for reasons similar to those @blindape2 described. I also have a case where some features are very informative in fitting but hard to reproduce when forecasting forward. I would rather the model not use these features, but if they are the only way of getting a good fit, I would like them to be used. My example is energy demand forecasting: I would prefer the model to use, say, day of year and weather to predict demand, but a very powerful feature is often the previous day's demand. Using the previous day's demand is fine if I'm only predicting a couple of days forward, but if I need to predict a year in advance it becomes harder to use. Ideally the model would train using weather and time of year, and only once its learning slows down would it add the previous day's demand feature. I imagine there are also other cases with correlated features where, due to domain knowledge, we would rather the model use one feature than the other.
@parsiad @Fish-Soup @blindape2 @jameslamb Thank you all for the discussion. A probability measure for features can be useful in some cases. We can add this to #2302 first.
@shiyu1994: I apologize for the late response. At first glance, it looks like cost-efficient gradient boosting will serve the needed purpose. I have not yet had a chance to try it on our data, as my collaborator and I became busy with another project. I'll update this thread as soon as we try it. If you prefer, please feel free to close this thread and I can revive it with an update in the future.
@parsiad Thanks. I'll close this issue for now. Please feel free to reopen it whenever you feel the need for further discussion.
Adding a reference to #4962 here.
I would also like to see support for feature weights. For my use case, I have many correlated features, and I would like to be able to specify lower sampling probabilities for clusters of similar features.
I wish to use feature sampling to initialise a model that is trained only on a subset of variables (or a single variable) for the first n trees, and then to train the following n+i trees on the variables not in the initial subset. The aim here is to gain as much information from the primary independent variables as possible; the subsequent trees will then explain the information that deviates from the primary model. I see this approach as having several advantages.
From what I can tell, all variables need to be in the training matrix for all train/retrain steps. Therefore, the only way to handle this is by passing all variables and weighting each column with {1, 0} at each step. Naturally, there are extensions to this kind of column sampling, e.g. adding a third step to learn interactions, using non-binary values for the columns, etc.
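To spell out the {1, 0} weighting idea, here is a tiny sketch with made-up feature names; a weight of 0 would mean the feature can never be sampled in that stage, and 1 that it is eligible (none of this is supported by LightGBM today):

```python
import numpy as np

feature_names = ["f_primary_1", "f_primary_2", "f_primary_3", "f_secondary_1", "f_secondary_2"]
primary = {"f_primary_1", "f_primary_2", "f_primary_3"}

# Stage 1: only the primary features are eligible (weight 1); the rest are masked out (weight 0).
stage1_weights = np.array([1.0 if name in primary else 0.0 for name in feature_names])
# Stage 2: flip the mask so the remaining features explain what stage 1 could not.
stage2_weights = 1.0 - stage1_weights

# Plugging a {1, 0} measure into the weighted draw sketched earlier in the thread:
# features with weight 0 can never be selected.
rng = np.random.default_rng(0)
subset = rng.choice(len(feature_names), size=2, replace=False, p=stage1_weights / stage1_weights.sum())
print([feature_names[i] for i in subset])  # always two of the primary features
```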
This is exactly the same as my requirement!! Looking forward to a solution.
Summary
Ability to specify a probability measure `feature_measure` over features corresponding to their relative importance. Then, use this measure to sample the features at each iteration. Please let me know if this functionality already exists.
Motivation
When `feature_fraction < 1`, a subset of features is selected at each iteration (e.g., `feature_fraction = 0.8` means that 80% of features will be randomly selected). However, some features are more important than others, and being able to encode this may result in better models.
Description
Description in pseudocode for clarity:
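A rough sketch of the requested behaviour (illustrative pseudocode only; `feature_measure` is the hypothetical parameter proposed above and is not part of LightGBM's API):

```python
import numpy as np

def boosting_with_feature_measure(num_iterations, feature_measure, feature_fraction, seed=0):
    """Per-iteration feature subsets drawn with probability proportional to feature_measure."""
    rng = np.random.default_rng(seed)
    probs = np.asarray(feature_measure, dtype=float)
    probs = probs / probs.sum()
    k = max(1, int(round(feature_fraction * len(probs))))
    for iteration in range(num_iterations):
        subset = rng.choice(len(probs), size=k, replace=False, p=probs)
        # In a real implementation, the tree for this iteration would only be allowed
        # to split on the features in `subset` (fitting the current residuals).
        yield iteration, sorted(int(i) for i in subset)

# Example: feature 0 is weighted 5x more heavily than the rest.
for iteration, subset in boosting_with_feature_measure(5, [5, 1, 1, 1, 1], feature_fraction=0.4):
    print(iteration, subset)
```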