
Probability measure for features #4605

Closed
parsiad opened this issue Sep 14, 2021 · 15 comments

Comments

@parsiad (Contributor) commented Sep 14, 2021

Summary

Ability to specify a probability measure feature_measure over features corresponding to their relative importance. Then, use this measure to sample the features at each iteration.

Please let me know if this functionality already exists.

Motivation

When feature_fraction < 1, a subset of features is selected at each iteration (e.g., feature_fraction = 0.8 means that 80% of features will be randomly selected). However, some features are more important than others and being able to encode this may result in better models.

Description

Description in pseudocode for clarity:

import numpy as np

for _ in range(num_iterations):
    n = int(feature_fraction * num_features)
    # Sample features without replacement, weighted by feature_measure.
    feature_indices = np.random.choice(num_features, size=n, replace=False, p=feature_measure)
    create_next_tree(feature_indices)
@jameslamb (Collaborator)

Thanks for your interest in LightGBM and for taking the time to write up this feature request!

No such functionality exists in LightGBM today, and I'm personally not convinced that it should. As a tree-based supervised learning framework, LightGBM already has feature selection built into it...features that don't have much explanatory power will be chosen for splits less often (or in some cases, not at all).

If you have some evidence or research you could point to that suggests that a feature like this can lead to better performance in tree-based frameworks like LightGBM, we'd love to see it.

If your goal is to produce a model with as few features as possible given an acceptable level of performance, maybe with the aim of removing unimportant features to make model deployment less expensive, you could:

  • try increasing the values of parameters like min_gain_to_split and min_data_in_leaf, which help to avoid creating some splits that only provide a small improvement
  • train a fairly shallow model (with more conservative values of num_iterations / num_leaves / max_depth), subset your training data to only features that were chosen for splits in that model, then train a new, deeper model

If you have some other goal with this approach, let me know and I or one of the other maintainers here might be able to describe an alternative way to achieve that goal using LightGBM's existing API.
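
A rough sketch of the second suggestion (train a shallow screening model, keep only the features it actually split on, then train a deeper model). The toy data and parameter values below are placeholders for illustration, not recommendations:

import lightgbm as lgb
import numpy as np

# Toy data: only the first two columns carry signal.
X = np.random.rand(5000, 100)
y = X[:, 0] + 2 * X[:, 1] + np.random.normal(scale=0.1, size=5000)

# Stage 1: a fairly shallow, short model used only for feature screening.
screen = lgb.train(
    {"objective": "regression", "num_leaves": 15, "max_depth": 4},
    lgb.Dataset(X, label=y),
    num_boost_round=50,
)

# Keep only the columns that were chosen for at least one split.
used = np.where(screen.feature_importance(importance_type="split") > 0)[0]

# Stage 2: train a new, deeper model on the reduced feature set.
final = lgb.train(
    {"objective": "regression", "num_leaves": 63},
    lgb.Dataset(X[:, used], label=y),
    num_boost_round=500,
)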

@parsiad (Contributor, Author) commented Sep 14, 2021

Thank you for your response, James.

My goal is not to reduce the cost of inference.

While I agree with you that selecting features based on their gain serves the purpose of feature selection, I was curious about the case of a very small feature_fraction, where it may be possible to get unlucky in the initial feature selection (my understanding is that feature_fraction selects a subset of the features at the beginning of tree construction, and features are then picked to split on based on gain). On the other hand, you could argue that increasing num_iterations would get around this.

I do not have any research suggesting it would be beneficial but was hoping to test it. Can I trouble you for a code pointer of where it would be best to implement this?

@jameslamb (Collaborator) commented Sep 14, 2021

you could argue that increasing num_iterations would get around this.

Yep, exactly this. You should be training for enough iterations that eventually all features make it into a few trees, even for a low value of feature_fraction. The value of specifying feature_fraction is to allow LightGBM to find useful "pockets" of feature combinations which might not otherwise be found in the presence of some features that are consistently chosen for the first or second split in trees.

I also want to note that coming up with those sorts of "feature weights" and aligning them with each iteration might be difficult. Recall that in gradient boosting, each tree is fit to something like the residuals of the model so far, so the relative importance (e.g. the likelihood of being chosen for a split) of each feature is different at each iteration. This is different from non-boosting approaches like Random Forest, where each tree is independent of all the others.

Can I trouble you for a code pointer of where it would be best to implement this?

The logic of selecting a subset of features is handled in the ColSampler class. Implementing something like this would probably involve updating ColSampler::ResetByTree(), whose definition can be found below.

void ResetByTree() {
  if (need_reset_bytree_) {
    std::memset(is_feature_used_.data(), 0,
                sizeof(int8_t) * is_feature_used_.size());
    used_feature_indices_ = random_.Sample(
        static_cast<int>(valid_feature_indices_.size()), used_cnt_bytree_);
    int omp_loop_size = static_cast<int>(used_feature_indices_.size());
    #pragma omp parallel for schedule(static, 512) if (omp_loop_size >= 1024)
    for (int i = 0; i < omp_loop_size; ++i) {
      int used_feature = valid_feature_indices_[used_feature_indices_[i]];
      int inner_feature_index = train_data_->InnerFeatureIndex(used_feature);
      is_feature_used_[inner_feature_index] = 1;
    }
  }
}

But before you go do work to try to add this, I want to set the right expectation. I personally am not convinced that adding such a feature would be worth the complexity, and would not support the addition of this feature to LightGBM.

I'm only one maintainer though, and others might have different opinions (@StrikerRUS @Laurae2 @btrotta @shiyu1994 ).

@jaguerrerod

I think this is interesting, but for the opposite reason: not to increase, but to reduce the probability of selecting the most powerful features.
Sometimes you have a feature (or features) that is very important in the training set and the model overuses it (not overfits on it, just overuses it).
If the problem isn't stationary and the test set is potentially different, the model's generalization power will drop if the relationship between these features and the target changes in the recent period.
This is a frequent case in stock price prediction, where the problem of very powerful predictors, and the need to reduce their presence in the model, is usually handled with 'feature neutralization' tricks.
A way of doing 'feature neutralization' in GBMs would be to reduce the probability of selecting a feature that has a lot of importance for the model.
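
For concreteness, a minimal sketch of the post-hoc flavour of the 'feature neutralization' trick mentioned above (this is not part of LightGBM; the function name and the proportion parameter are purely illustrative):

import numpy as np

def neutralize(predictions, exposures, proportion=1.0):
    # Remove (a proportion of) the linear exposure of the predictions
    # to the given feature matrix via a least-squares projection.
    exposures = np.column_stack([exposures, np.ones(len(exposures))])  # add an intercept
    coefficients = np.linalg.lstsq(exposures, predictions, rcond=None)[0]
    neutralized = predictions - proportion * (exposures @ coefficients)
    # Rescale so the neutralized scores keep roughly their original spread.
    return neutralized / np.std(neutralized)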

@shiyu1994 (Collaborator)

@parsiad Thanks for your proposal!

For supporting a preference over some features, we already have the cost-efficient gradient boosting (CEGB) penalty; see
https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html?highlight=cost-eff#cost-efficient-gradient-boosting
for reference.
Giving different probabilities over features is somewhat like defining a cost for each feature: we can assign a higher cost to the features we want to be selected less often.
So I think the CEGB penalty can achieve a similar effect to @parsiad's proposal. If I have misunderstood your motivation, please feel free to correct me.
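
As a rough illustration, a minimal sketch of discouraging one feature via the CEGB parameters (parameter names are taken from the LightGBM parameter docs; the toy data and penalty values are illustrative only, not tuned):

import lightgbm as lgb
import numpy as np

# Toy data: four features, with the first one deliberately penalized below.
X = np.random.rand(1000, 4)
y = X[:, 0] + 0.5 * X[:, 1] + np.random.normal(scale=0.1, size=1000)

params = {
    "objective": "regression",
    # One penalty per feature; a larger value makes the model pay more to
    # start using that feature, so it is selected less eagerly.
    "cegb_penalty_feature_coupled": [10.0, 0.0, 0.0, 0.0],
    "cegb_tradeoff": 1.0,
}

booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=50)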

@Fish-Soup

I agree this would be useful, for reasons similar to those @blindape2 described. I also have cases where some features are very informative during fitting but are hard to reproduce when forecasting forward. I end up in a situation where I would rather the model not use these features, but if they are the only way to get a good fit, I would like them to be used.

My example is energy demand forecasting: I would prefer the model to use, say, day of year and weather to predict demand. But often a very powerful feature is the previous day's demand. Using the previous day's demand is fine if I'm only predicting a couple of days ahead, but if I need to predict a year in advance it becomes much harder to use.

Ideally, I would have the model train using weather and time of year, and only once its learning slows down would it add the previous day's demand feature.

I imagine there are also other cases where we have correlated features and, based on domain knowledge, would rather the model use one feature over the other.

@shiyu1994 (Collaborator)

@parsiad @Fish-Soup @blindape2 @jameslamb Thank you all for the discussion. A probability measure for features can be useful in some cases. We can add this to #2302 first.
@parsiad Does cost-efficient gradient boosting meet the requirement in your application scenario?

@parsiad (Contributor, Author) commented Oct 12, 2021

@shiyu1994: I apologize for the late response. At first glance, it looks like cost-efficient gradient boosting will serve the needed purpose. I have not yet had a chance to try it on our data, as my collaborator and I became busy with another project. I'll update this thread as soon as we try it.

If you prefer, please feel free to close this thread and I can revive it with an update in the future.

@shiyu1994 (Collaborator)

@parsiad Thanks. I'll close this issue first. Please feel free to reopen it whenever you feel the need for further discussion.

@jameslamb (Collaborator)

Adding a reference to #4962 here

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 16, 2023
@microsoft microsoft unlocked this conversation Aug 18, 2023
@bradhilton

I would also like to see support for feature weights. For my use case, I have many correlated features and I would like to be able to specify lower sampling probabilities for clusters of similar features.

@jameslamb (Collaborator) commented Oct 6, 2023

Some additional resources and reasoning for supporting user control over feature selection probabilities have been added in #6129.

Let's please use this issue (#4605) as the main feature request for providing feature selection probabilities to LightGBM.

@joshdunnlime commented Jun 30, 2024

I wish to use feature sampling to initialise a model that is trained only on a subset of variables (or a single variable) for the first n trees, and then to train the following n+i trees on the variables not in the initial subset.

The aim here is to gain as much information from the primary independent variables as possible. The subsequent trees will then explain the information deviating from the primary model. I see this having these advantages:

  1. Better model explainability. In this case, the primary variables should account for >90% of the predictive power, and the secondary variables should model any deviation from this. A two-step training process would help split these apart and let me see clearly how the secondary variables adjust the primary predictions.
  2. Handling noise. The primary and secondary variables have different levels of noise in their measurements, so I can adjust the regularisation/learning rates/other hyperparameters for each stage accordingly.
  3. Favouring the primary variable. At prediction time, this variable is more reliable/guaranteed, so maximum predictive power should be assigned to it.

From what I can tell, all variables need to be in the training matrix for all train/retrain steps. Therefore, the only way to handle this is by passing all variables and weighting each column with {1, 0} at each step.

Naturally, there are extensions to this kind of column sampling, e.g. adding a third step to learn interactions, using non-binary values for the columns, etc.
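
One way to approximate this staged workflow with the existing Python API might be to train a first model on only the primary columns, then pass its predictions as init_score for a second model trained on the secondary columns. A minimal sketch under those assumptions (the column names, toy data, and parameter values are hypothetical):

import lightgbm as lgb
import numpy as np
import pandas as pd

# Toy data: 'primary' columns we trust at prediction time, a 'secondary' one we do not.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "day_of_year": rng.integers(1, 366, size=1000),
    "temperature": rng.normal(15, 8, size=1000),
    "prev_day_demand": rng.normal(100, 20, size=1000),
})
y = 0.1 * df["day_of_year"] - 2.0 * df["temperature"] + 0.5 * df["prev_day_demand"]

primary = ["day_of_year", "temperature"]
secondary = ["prev_day_demand"]

# Stage 1: fit on the primary features only.
stage1 = lgb.train(
    {"objective": "regression", "learning_rate": 0.1},
    lgb.Dataset(df[primary], label=y),
    num_boost_round=200,
)

# Stage 2: continue boosting on the secondary features, starting from the
# stage-1 predictions via init_score, so this model only explains what the
# primary features could not.
stage2 = lgb.train(
    {"objective": "regression", "learning_rate": 0.05},
    lgb.Dataset(df[secondary], label=y, init_score=stage1.predict(df[primary])),
    num_boost_round=100,
)

# The final prediction is the sum of the two stages' outputs.
prediction = stage1.predict(df[primary]) + stage2.predict(df[secondary])

With squared-error loss, boosting from an init_score in this way amounts to fitting the second model to the residuals of the first.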

@pengxiao-song

I wish to use feature sampling to initialise a model to be trained only on a subset of variables (or single variable) for the first n trees. Then, to train the following n+i trees on those variables not in the initial subset. [...]

This is exactly the same as my requirement!! Looking forward to a solution.
