
[FEA] Add support for computing feature_importances in RF #3531

Open
teju85 opened this issue Feb 19, 2021 · 14 comments
Labels
Algorithm API Change (for tracking changes to algorithms that might affect the API), CUDA / C++ (CUDA issue), Cython / Python (Cython or Python issue), doc (Documentation), feature request (New feature or request), improvement (Improvement / enhancement to an existing function), inactive-30d

Comments

@teju85
Member

teju85 commented Feb 19, 2021

Is your feature request related to a problem? Please describe.
The RF implementation should support computing the feature_importances_ property, just like it is exposed in sklearn.

Describe the solution you'd like

  1. By default, we should compute normalized feature_importances_ (i.e. the importances across all features sum to 1.0).
  2. The implementation done in sklearn is here. We have all of this information in our Node; we just need to keep accumulating each feature's importance while building the tree, as more nodes are added (see the sketch below).
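
For reference, a minimal sketch of the per-split accumulation sklearn performs. The node field names here are illustrative placeholders, not cuML's actual Node layout:

import numpy as np

def accumulate_importance(importances, node):
    # Each split node contributes its weighted impurity decrease to its feature:
    #   n * impurity - n_left * impurity_left - n_right * impurity_right
    left, right = node.left, node.right
    decrease = (node.n_samples * node.impurity
                - left.n_samples * left.impurity
                - right.n_samples * right.impurity)
    importances[node.split_feature] += decrease

# Once the tree is built, normalize so the importances sum to 1.0:
# importances /= importances.sum()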
@teju85 teju85 added feature request, doc, CUDA / C++, Cython / Python, Algorithm API Change, and improvement labels Feb 19, 2021
@JohnZed
Contributor

JohnZed commented Feb 19, 2021

Definitely agreed. Not sure we'll have enough bandwidth to get this into 0.19 (given the work going into the new backend), but it should be prioritized highly after that.

@teju85
Member Author

teju85 commented Feb 25, 2021

Here's one use-case that requires this attribute to be present: https://github.com/willb/fraud-notebooks/blob/develop/03-model-random-forest.ipynb

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@sooryaa-thiruloga

We are interested in using this feature in our use case too.

@beckernick
Member

This would also be useful for tools like Boruta, a popular feature selection library that's part of scikit-learn-contrib. There is a Boruta issue asking for support for cuML estimators.

@teju85
Member Author

teju85 commented Mar 18, 2022

Tagging @vinaydes and @venkywonka to see if we can have Venkat start on this.

@hafarooki

hafarooki commented Apr 16, 2022

This is probably not the most efficient implementation, but in case anyone else needs it:

import numpy as np


def calculate_importances(nodes, n_features):
    # nodes: one root-node dict per tree; each split node carries
    # "gain", "instance_count", "split_feature", and "children".
    importances = np.zeros((len(nodes), n_features))

    def accumulate_gains(node, gains):
        # Leaf nodes carry no split, hence no gain.
        if "gain" not in node:
            return

        samples = node["instance_count"]
        gain = node["gain"]
        feature = node["split_feature"]
        # Weight each split's gain by the number of samples reaching it.
        gains[feature] += gain * samples

        for child in node["children"]:
            accumulate_gains(child, gains)

    for i, root in enumerate(nodes):
        # Reset the accumulator for every tree so each tree's importances
        # are normalized independently before averaging.
        feature_gains = np.zeros(n_features)
        accumulate_gains(root, feature_gains)
        importances[i] = feature_gains / feature_gains.sum()

    # Average the normalized per-tree importances across the forest.
    return np.mean(importances, axis=0)

You can see the logic behind it here: https://towardsdatascience.com/the-mathematics-of-decision-trees-random-forest-and-feature-importance-in-scikit-learn-and-spark-f2861df67e3
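
A hypothetical usage sketch, assuming cuML's RandomForestClassifier exposes a get_json() method whose output is a JSON list of root-node dictionaries with the "gain" / "instance_count" / "split_feature" / "children" keys used above (the toy data and parameters are placeholders):

import json

import numpy as np
from cuml.ensemble import RandomForestClassifier

# Toy data; cuML generally expects float32 inputs.
X = np.random.rand(1000, 8).astype(np.float32)
y = (X[:, 0] > 0.5).astype(np.int32)

model = RandomForestClassifier(n_estimators=50).fit(X, y)

# Assumption: get_json() returns one root-node dict per tree.
roots = json.loads(model.get_json())
print(calculate_importances(roots, n_features=X.shape[1]))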

@beckernick
Member

beckernick commented Jun 29, 2022

Cross-linking an issue that asks for this feature and OOB support: #3361

@Wulin-Tan

This is an important issue worth a look.

@HybridNeos

Commenting to reiterate the usefulness of this feature. I was trying to follow https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html using cuML, but it is not currently possible.

@beckernick
Member

A user shared a workflow today for which cuML's RF was 20x faster than their prior CPU-based RF. They wanted to use feature importance for feature selection, but weren't able to do so.

@szeka94

szeka94 commented Jul 15, 2024

Yeah, I'm missing this too.

@Avertemp

Avertemp commented Aug 5, 2024

Same here. I switched to cuML for feature selection; this is a really needed feature.

@Zach-Sten

Same here. Using RF and need feature importance.
