catboost-plot-pimp-shapimp #77
Conversation
To summarize, this PR solves/enhances:
Hi @danielhomola. First, let me thank you for this great library. The changes in this PR would be very useful to me. Do you plan to merge? PS: here is a writeup about the limitations of gini as a feature importance metric.
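(The gini limitation mentioned above can be demonstrated in a few lines. This is a minimal illustrative sketch, not code from the PR; the dataset and model are arbitrary choices. A continuous feature made of pure noise still tends to receive a non-zero impurity-based importance, because continuous features offer many candidate split points.)

```python
# Sketch of the impurity ("gini") importance bias: append a pure-noise
# continuous column and observe that it still gets non-zero importance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=300, n_features=5,
                           n_informative=2, random_state=0)
# Last column is random noise carrying no signal about y.
X_noisy = np.hstack([X, rng.rand(300, 1)])

model = RandomForestClassifier(random_state=0).fit(X_noisy, y)
importances = model.feature_importances_
# The noise column's importance is typically > 0, which is why
# impurity importance alone can mislead feature selection.
print(importances)
```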
@brunofacca While the maintainers decide whether or not to merge it, you can have a look at the Boruta_shap package. It implements almost all the features of my PR (I also discussed with its author about fixing some issues and a possible merge with Boruta_py, which could be beneficial to avoid confusion and to come under the scikit-contrib flag). I hope it helps.
Thank you @ThomasBury, for this PR and for the tip. I'm actually testing the Boruta_shap package and it looks great, except that it still has low test coverage and a smaller number of contributors. Of course, that is likely to improve as the project matures.
Have the maintainers considered merging these changes? Has a decision been made? Thank you.
@brunofacca if you're still interested, I packaged 3 All Relevant FS methods here: https://github.com/ThomasBury/arfs. It's still incubating, but there are some (unit) tests and documentation (I suspect it has too many dependencies to be compliant with the scikit-learn contrib requirements). I'm supervising a master's thesis whose goal is to study the properties of those methods, so the package is likely to evolve over time.
That's a nice package @ThomasBury, I wish you all the best with it! I much prefer separating these ideas from the original implementation to keep things simple and closer to the Unix tooling philosophy. I'll close this PR now if you don't mind.
Ok, thanks for the kind words. Would you be interested in a PR in pure sklearn (only sklearn estimators, with native and permutation importance)? That would make the package more than Boruta: a sklearn-flavoured all-relevant FS package, instead of the opposite strategy of having different packages with a wrapper on top of them, if compliant with the sk-contrib requirements. Or perhaps just a PR with permutation importance (slower but more accurate, and still relevant for small/mid-size data sets)?
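(For readers unfamiliar with the permutation importance mentioned here: it is available natively in scikit-learn via `sklearn.inspection.permutation_importance`. The sketch below is illustrative only, not code from this PR; the dataset and model are arbitrary. Each feature is shuffled on held-out data and the resulting score drop is taken as that feature's importance, which avoids the impurity-importance bias at the cost of extra compute.)

```python
# Minimal sketch: ranking features with scikit-learn's native
# permutation importance (no hard dependencies beyond sklearn).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle each feature n_repeats times on the held-out split and
# record the mean accuracy drop per feature.
result = permutation_importance(model, X_te, y_te,
                                n_repeats=10, random_state=0)
ranking = result.importances_mean.argsort()[::-1]  # best feature first
```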
Thank you, @ThomasBury! That's a very nice library and it's likely to fill what I consider to be a gap in the current ML tooling: I've tried dozens of feature selection strategies (including those considered "state of the art") and none of them were effective for a high-dimensional dataset where even the most relevant features are quite noisy. Your strategy of running an XGBoost classifier on the entire data set ten times (in BoostARoota), for example, is a nice step towards a more effective feature selection strategy for data with a low signal-to-noise ratio. It would be very interesting to eventually see a comparison of classification performance with the subsets of features selected by each of those 3 strategies. I will also give them a try in the near future.
Hi Thomas, really sorry, I totally forgot about your question. I'd be happy to review a PR with permutation importance if
Hi @danielhomola, I submitted a PR:
With this, you pretty much have the same version as in the ARFS package, but without any hard dependencies. KR
Modifications are:
Thanks,
KR