
Probability measure for features #4605

Closed
parsiad opened this issue Sep 14, 2021 · 15 comments

Comments

@parsiad (Contributor) commented Sep 14, 2021

Summary

Ability to specify a probability measure feature_measure over features corresponding to their relative importance. Then, use this measure to sample the features at each iteration.

Please let me know if this functionality already exists.

Motivation

When feature_fraction < 1, a subset of features is selected at each iteration (e.g., feature_fraction = 0.8 means that 80% of features will be randomly selected). However, some features are more important than others and being able to encode this may result in better models.

Description

Description in pseudocode for clarity:

import numpy as np

for _ in range(num_iterations):
    n = int(feature_fraction * num_features)
    # Sample features without replacement, weighted by feature_measure.
    feature_indices = np.random.choice(num_features, size=n, replace=False, p=feature_measure)
    create_next_tree(feature_indices)
@jameslamb (Collaborator)

Thanks for your interest in LightGBM and for taking the time to write up this feature request!

No such functionality exists in LightGBM today, and I'm personally not convinced that it should. As a tree-based supervised learning framework, LightGBM already has feature selection built into it...features that don't have much explanatory power will be chosen for splits less often (or in some cases, not at all).

If you have some evidence or research you could point to that suggests that a feature like this can lead to better performance in tree-based frameworks like LightGBM, we'd love to see it.

If your goal is to produce a model with as few features as possible given an acceptable level of performance, maybe with the aim of removing unimportant features to make model deployment less expensive, you could:

  • try increasing the values of parameters like min_gain_to_split and min_data_in_leaf, which help to avoid creating some splits that only provide a small improvement
  • train a fairly shallow model (with more conservative values of num_iterations / num_leaves / max_depth), subset your training data to only features that were chosen for splits in that model, then train a new, deeper model

If you have some other goal with this approach, let me know and I or one of the other maintainers here might be able to describe an alternative way to achieve that goal using LightGBM's existing API.
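
A rough sketch of the second suggestion (train a shallow screening model, keep only the features it actually split on, then train a deeper model). The toy data and parameter values below are placeholders for illustration, not recommendations:

import lightgbm as lgb
import numpy as np

# Toy data: only the first two columns carry signal.
X = np.random.rand(5000, 100)
y = X[:, 0] + 2 * X[:, 1] + np.random.normal(scale=0.1, size=5000)

# Stage 1: a fairly shallow, short model used only for feature screening.
screen = lgb.train(
    {"objective": "regression", "num_leaves": 15, "max_depth": 4},
    lgb.Dataset(X, label=y),
    num_boost_round=50,
)

# Keep only the columns that were chosen for at least one split.
used = np.where(screen.feature_importance(importance_type="split") > 0)[0]

# Stage 2: train a new, deeper model on the reduced feature set.
final = lgb.train(
    {"objective": "regression", "num_leaves": 63},
    lgb.Dataset(X[:, used], label=y),
    num_boost_round=500,
)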

@parsiad (Contributor, Author) commented Sep 14, 2021

Thank you for your response, James.

My goal is not to reduce the cost of inference.

While I agree with you that selecting features based on their gain serves the purpose of feature selection, I was curious about the case of a very small feature_fraction, where it may be possible to get unlucky in the initial feature selection (my understanding is that feature_fraction selects a subset of the features at the beginning of tree construction, and features are then picked to split on based on gain). On the other hand, you could argue that increasing num_iterations would get around this.

I do not have any research suggesting it would be beneficial but was hoping to test it. Can I trouble you for a code pointer of where it would be best to implement this?

@jameslamb (Collaborator) commented Sep 14, 2021

you could argue that increasing num_iterations would get around this.

Yep, exactly this. You should be training for enough iterations that eventually all features make it into a few trees, even for a low value of feature_fraction. The value of specifying feature_fraction is to allow LightGBM to find useful "pockets" of feature combinations which might not otherwise be found in the presence of some features that are consistently chosen for the first or second split in trees.

I also want to note that coming up with those sorts of "feature weights" and aligning them with each iteration might be difficult. Recall that in gradient boosting, each tree is fit to something like the residuals of the model so far, so the relative importance (e.g. the likelihood of being chosen for a split) of each feature is different at each iteration. This is different from non-boosting approaches like Random Forest, where each tree is independent of all the others.

Can I trouble you for a code pointer of where it would be best to implement this?

The logic of selecting a subset of features is handled in the ColSampler class. Implementing something like this would probably involve updating ColSampler::ResetByTree(), whose definition can be found below.

void ResetByTree() {
  if (need_reset_bytree_) {
    std::memset(is_feature_used_.data(), 0,
                sizeof(int8_t) * is_feature_used_.size());
    used_feature_indices_ = random_.Sample(
        static_cast<int>(valid_feature_indices_.size()), used_cnt_bytree_);
    int omp_loop_size = static_cast<int>(used_feature_indices_.size());
    #pragma omp parallel for schedule(static, 512) if (omp_loop_size >= 1024)
    for (int i = 0; i < omp_loop_size; ++i) {
      int used_feature = valid_feature_indices_[used_feature_indices_[i]];
      int inner_feature_index = train_data_->InnerFeatureIndex(used_feature);
      is_feature_used_[inner_feature_index] = 1;
    }
  }
}

But before you go do work to try to add this, I want to set the right expectation. I personally am not convinced that adding such a feature would be worth the complexity, and would not support the addition of this feature to LightGBM.

I'm only one maintainer though, and others might have different opinions (@StrikerRUS @Laurae2 @btrotta @shiyu1994 ).

@jaguerrerod

I think this is interesting, but for the opposite reason: not to increase, but to reduce the probability of selecting the most powerful features.
Sometimes you have a feature (or features) that is very important in the training set and the model overuses it (not overfits on it, just overuses it).
If the problem isn't stationary and the test set is potentially different, the model's generalization power will drop if the relationship between these features and the target changes in the recent period.
This is a frequent case in stock price prediction, where the problem of very powerful predictors, and the need to reduce their presence in the model, is usually handled with 'feature neutralization' tricks.
A way of doing 'feature neutralization' in GBMs would be to reduce the probability of selecting a feature that has a lot of importance for the model.
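
For concreteness, a minimal sketch of the post-hoc flavour of the 'feature neutralization' trick mentioned above (this is not part of LightGBM; the function name and the proportion parameter are purely illustrative):

import numpy as np

def neutralize(predictions, exposures, proportion=1.0):
    # Remove (a proportion of) the linear exposure of the predictions
    # to the given feature matrix via a least-squares projection.
    exposures = np.column_stack([exposures, np.ones(len(exposures))])  # add an intercept
    coefficients = np.linalg.lstsq(exposures, predictions, rcond=None)[0]
    neutralized = predictions - proportion * (exposures @ coefficients)
    # Rescale so the neutralized scores keep roughly their original spread.
    return neutralized / np.std(neutralized)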

@shiyu1994 (Collaborator)

@parsiad Thanks for your proposal!

For supporting a preference over some features, we already have the cost-efficient gradient boosting (CEGB) penalty; see
https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html?highlight=cost-eff#cost-efficient-gradient-boosting
for reference.
Giving different probabilities over features is somewhat like defining a cost for each feature: we can assign a higher cost to the features we want to be selected less often.
So I think the CEGB penalty can achieve a similar effect to @parsiad's proposal. If I have misunderstood your motivation, please feel free to correct me.
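
As a rough illustration, a minimal sketch of discouraging one feature via the CEGB parameters (parameter names are taken from the LightGBM parameter docs; the toy data and penalty values are illustrative only, not tuned):

import lightgbm as lgb
import numpy as np

# Toy data: four features, with the first one deliberately penalized below.
X = np.random.rand(1000, 4)
y = X[:, 0] + 0.5 * X[:, 1] + np.random.normal(scale=0.1, size=1000)

params = {
    "objective": "regression",
    # One penalty per feature; a larger value makes the model pay more to
    # start using that feature, so it is selected less eagerly.
    "cegb_penalty_feature_coupled": [10.0, 0.0, 0.0, 0.0],
    "cegb_tradeoff": 1.0,
}

booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=50)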

@Fish-Soup

I agree this would be useful, for reasons similar to those @blindape2 described. I also have cases where some features are very informative during fitting but are hard to reproduce when forecasting forward. I end up in a situation where I would rather the model not use these features, but if they are the only way to get a good fit, I would like them to be used.

My example is energy demand forecasting: I would prefer the model to use, say, day of year and weather to predict demand. But often a very powerful feature is the previous day's demand. Using the previous day's demand is fine if I'm only predicting a couple of days ahead, but if I need to predict a year in advance it becomes much harder to use.

Ideally, I would have the model train using weather and time of year, and only once its learning slows down would it add the previous day's demand feature.

I imagine there are also other cases where we have correlated features and, based on domain knowledge, would rather the model use one feature over the other.

@shiyu1994 (Collaborator)

@parsiad @Fish-Soup @blindape2 @jameslamb Thank you all for the discussion. A probability measure for features can be useful in some cases. We can add this to #2302 first.
@parsiad Does cost-efficient gradient boosting meet the requirement in your application scenario?

@parsiad (Contributor, Author) commented Oct 12, 2021

@shiyu1994: I apologize for the late response. At first glance, it looks like cost-efficient gradient boosting will serve the needed purpose. I have not yet had a chance to try it on our data, as my collaborator and I became busy with another project. I'll update this thread as soon as we try it.

If you prefer, please feel free to close this thread and I can revive it with an update in the future.

@shiyu1994 (Collaborator)

@parsiad Thanks. I'll close this issue first. Please feel free to reopen it whenever you feel the need for further discussion.

@jameslamb (Collaborator)

Adding a reference to #4962 here

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 16, 2023
@microsoft microsoft unlocked this conversation Aug 18, 2023
@bradhilton

I would also like to see support for feature weights. For my use case, I have many correlated features and I would like to be able to specify lower sampling probabilities for clusters of similar features.

@jameslamb (Collaborator) commented Oct 6, 2023

Some additional resources and reasoning for supporting user control over feature selection probabilities have been added in #6129.

Let's please use this issue (#4605) as the main feature request for providing feature selection probabilities to LightGBM.

@joshdunnlime commented Jun 30, 2024

I wish to use feature sampling to initialise a model that is trained only on a subset of variables (or a single variable) for the first n trees, and then to train the following n+i trees on the variables not in the initial subset.

The aim here is to gain as much information from the primary independent variables as possible. The subsequent trees will then explain the information deviating from the primary model. I see this having these advantages:

  1. Better model explainability. In this case, the primary variables should account for >90% of the predictive power, and the secondary variables should model any deviation from this. A two-step training process would help split these apart and let me see clearly how the secondary variables adjust the primary predictions.
  2. Handling noise. The primary and secondary variables have different levels of noise in their measurements, so I can adjust the regularisation/learning rates/other hyperparameters for each stage accordingly.
  3. Favouring the primary variable. At prediction time, this variable is more reliable/guaranteed, so maximum predictive power should be assigned to it.

From what I can tell, all variables need to be in the training matrix for all train/retrain steps. Therefore, the only way to handle this is by passing all variables and weighting each column with {1, 0} at each step.

Naturally, there are extensions to this kind of column sampling, e.g. adding a third step to learn interactions, using non-binary values for the columns, etc.
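
One way to approximate this staged workflow with the existing Python API might be to train a first model on only the primary columns, then pass its predictions as init_score for a second model trained on the secondary columns. A minimal sketch under those assumptions (the column names, toy data, and parameter values are hypothetical):

import lightgbm as lgb
import numpy as np
import pandas as pd

# Toy data: 'primary' columns we trust at prediction time, a 'secondary' one we do not.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "day_of_year": rng.integers(1, 366, size=1000),
    "temperature": rng.normal(15, 8, size=1000),
    "prev_day_demand": rng.normal(100, 20, size=1000),
})
y = 0.1 * df["day_of_year"] - 2.0 * df["temperature"] + 0.5 * df["prev_day_demand"]

primary = ["day_of_year", "temperature"]
secondary = ["prev_day_demand"]

# Stage 1: fit on the primary features only.
stage1 = lgb.train(
    {"objective": "regression", "learning_rate": 0.1},
    lgb.Dataset(df[primary], label=y),
    num_boost_round=200,
)

# Stage 2: continue boosting on the secondary features, starting from the
# stage-1 predictions via init_score, so this model only explains what the
# primary features could not.
stage2 = lgb.train(
    {"objective": "regression", "learning_rate": 0.05},
    lgb.Dataset(df[secondary], label=y, init_score=stage1.predict(df[primary])),
    num_boost_round=100,
)

# The final prediction is the sum of the two stages' outputs.
prediction = stage1.predict(df[primary]) + stage2.predict(df[secondary])

With squared-error loss, boosting from an init_score in this way amounts to fitting the second model to the residuals of the first.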

@pengxiao-song

I wish to use feature sampling to initialise a model to be trained only on a subset of variables (or single variable) for the first n trees. Then, to train the following n+i trees on those variables not in the initial subset. [...]

This is exactly the same as my requirement!! Looking forward to a solution.
