
catboost-plot-pimp-shapimp #77

Closed · wants to merge 11 commits

Conversation

ThomasBury

The modifications are:

  • CatBoost checks (CatBoost doesn't allow changing the random seed after fitting) and dedicated cases for the CatBoost estimator
  • plot: boxplot of the variable importance history (confirmed, tentative and rejected features are colour-coded)
  • categorical feature support for boosting, plus a plot mimicking more or less the R one. One-hot encoding is known to lead to deep and unstable trees; LightGBM and CatBoost have native methods to handle categorical predictors, which require declaring which columns are categorical (LightGBM has an auto mode, but the columns must be integer encoded with the dtype set to category; see the sketch after this list)
  • permutation importance and SHAP importance (the impurity-based variable importance is biased towards high-cardinality and numerical predictors and is computed on the training set)
  • testing: modif_tests.py for classification and regression testing (including a random categorical predictor with large cardinality)
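For illustration, here is a minimal sketch (not the PR code) of the categorical preparation described above: integer-encode the categorical columns and give them a category dtype so that LightGBM's automatic categorical handling can pick them up. The toy frame and column names are made up.

```python
# Minimal sketch, assuming a pandas DataFrame with object-dtype categoricals.
import pandas as pd
import lightgbm as lgb

df = pd.DataFrame({
    "colour": ["red", "blue", "red", "green", "blue", "red"],  # categorical predictor
    "size":   [1.0, 2.5, 0.7, 3.1, 2.2, 0.9],                  # numerical predictor
})
y = [0, 1, 0, 1, 1, 0]

# Integer-encode the categorical columns, then set the dtype back to 'category'
# so LightGBM treats them natively (no one-hot encoding, hence shallower and
# more stable trees).
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].astype("category").cat.codes.astype("category")

model = lgb.LGBMClassifier(n_estimators=10).fit(df, y)
```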

thanks
KR

@ThomasBury
Author

To summarize, this PR solves/enhances:

  • Categorical features: they are detected and encoded. Tree-based models work better with integer encoding than with one-hot encoding, which leads to deep and unstable trees. If CatBoost is used, the categorical predictors (if any) are set up accordingly
  • Works with the CatBoost sklearn API
  • Allows using sample_weight, for applications such as Poisson regression or any other that requires weights
  • Three different feature importances: native, SHAP and permutation. The native importance is the least consistent (being biased towards numerical and high-cardinality categorical predictors) but the fastest of the three; see the sketch after this list
  • Using LightGBM as the default estimator speeds up the running time by an order of magnitude
  • Visualization similar to the R package
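As an illustration of the permutation option (a sketch, not the PR implementation), permutation importance can be computed on held-out data with scikit-learn, which avoids the bias of the impurity-based importance; the dataset and parameters below are arbitrary.

```python
# Minimal sketch: permutation importance for a LightGBM model on synthetic data.
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = lgb.LGBMClassifier().fit(X_train, y_train)

# Computed on the test split, unlike the impurity ("native") importance,
# which is derived from the training data.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)
```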

@brunofacca

brunofacca commented Aug 2, 2020

Hi @danielhomola. First, let me thank you for this great library.

The changes in this PR would be very useful to me. Do you plan to merge?

PS: here is a writeup about the limitations of Gini as a feature importance metric.

@ThomasBury
Author

@brunofacca While the maintainers consider whether or not to merge it, you can have a look at the Boruta_shap package. It implements almost all the features of my PR (I also discussed with its author fixing some issues and a possible merge with Boruta_py, which could be beneficial to avoid any confusion and to be under the scikit-learn contrib flag). I hope it helps.

@brunofacca

Thank you, @ThomasBury, for this PR and for the tip. I'm actually testing the Boruta_shap package and it looks great, except that it still has low test coverage and a smaller number of contributors. Of course, that is likely to improve as the project matures.

@brunofacca

Did the maintainers consider merging these changes? Is there a decision? Thank you.

@ThomasBury
Author

ThomasBury commented Nov 28, 2020

> Did the maintainers consider merging these changes? Is there a decision? Thank you.

@brunofacca if you're still interested, I packaged 3 all-relevant feature selection methods here: https://github.com/ThomasBury/arfs. It is still incubating, but there are some (unit) tests and the documentation is there (I guess there are too many dependencies to be compliant with the scikit-learn contrib requirements). I'm supervising a master's thesis whose goal is to study the properties of those methods, so the package is likely to evolve over time.

@danielhomola
Collaborator

That's a nice package, @ThomasBury, I wish you all the best with it! I much prefer separating these ideas from the original implementation to keep things simple and closer to the Unix tooling philosophy. I'll close this PR now if you don't mind.

@ThomasBury
Author

> That's a nice package, @ThomasBury, I wish you all the best with it! I much prefer separating these ideas from the original implementation to keep things simple and closer to the Unix tooling philosophy. I'll close this PR now if you don't mind.

OK, thanks for the kind words. Would you be interested in a PR in pure sklearn (so only sklearn estimators, with native and permutation importance)? That would make the package more than Boruta: a sklearn-flavoured all-relevant feature selection package, instead of the opposite strategy of having different packages and a wrapper on top of them, if compliant with the scikit-learn contrib requirements. Or perhaps just a PR with the permutation importance (slower but more accurate, and still relevant for small/mid-size data sets)?
If not, no worries (I prefer having your opinion before starting anything ^^)

@brunofacca

Thank you, @ThomasBury! That's a very nice library and it is likely to fill what I consider to be a gap in the current ML tooling: I've tried dozens of feature selection strategies (including those considered "state of the art") and none of them were effective for a high-dimensional dataset where even the most relevant features are quite noisy. Your strategy of running an XGBoost classifier on the entire data set ten times (in BoostARoota), for example, is a nice step towards a more effective feature selection strategy for data with a low signal-to-noise ratio.

It would be very interesting to eventually see a comparison of classification performance with the subsets of features selected by each of those three strategies. I will also give them a try in the near future.

@danielhomola
Collaborator

Hi Thomas, really sorry, I totally forgot about your question. I'd be happy to review a PR with permutation importance if:

  • we document it as a difference from the original paper in the readme
  • it's optional and easy to switch off
  • we at least have a notebook on some toy data (iris or whatever) comparing the original and your version and showing the difference.

@ThomasBury
Author

ThomasBury commented Aug 27, 2021

> Hi Thomas, really sorry, I totally forgot about your question. I'd be happy to review a PR with permutation importance if:
>
>   • we document it as a difference from the original paper in the readme
>   • it's optional and easy to switch off
>   • we at least have a notebook on some toy data (iris or whatever) comparing the original and your version and showing the difference.

Hi @danielhomola,

I submitted a PR:

  • optional permutation importance (pimp) via an argument
  • if shap is installed (soft dependency, try/except import), SHAP feature importance can be used; shap is not required for installing boruta
  • no hard dependencies added; everything is optional, with a compatibility check using try/except on import plus an error message (see the sketch after this list)
  • a chart like the original, only if matplotlib is installed (try/except import; matplotlib is not mandatory for installing boruta)
  • sample_weight support
  • categorical feature support (based on the LightGBM recommendation)
  • a notebook with examples, using genuine predictors and artificial ones to check whether the right ones are accepted/rejected
  • a note added to the readme
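For illustration, here is a minimal sketch of the soft-dependency pattern described in the list above (not the actual PR code); the flag and helper names are hypothetical.

```python
# Hypothetical sketch: shap and matplotlib are optional, imported in a
# try/except so they are never required to install boruta.
import numpy as np

try:
    import shap
    _HAS_SHAP = True
except ImportError:
    _HAS_SHAP = False

try:
    import matplotlib.pyplot as plt  # plotting is only attempted when _HAS_MPL is True
    _HAS_MPL = True
except ImportError:
    _HAS_MPL = False


def shap_importance(model, X):
    """Mean |SHAP| value per feature, if shap is available (hypothetical helper)."""
    if not _HAS_SHAP:
        raise ImportError("shap is not installed; install it to use SHAP importance")
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)
    if isinstance(shap_values, list):  # multi-class: one array per class
        shap_values = np.mean(np.abs(shap_values), axis=0)
    return np.abs(shap_values).mean(axis=0)
```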

With this, you pretty much have the same version as in the ARFS package, but without any hard dependencies.

KR
Thomas
