
transform_feature_names for decomposition and scaling #208

Open · jnothman wants to merge 8 commits into master from transform-weighted

Conversation

@jnothman (Contributor)

Present top linear weights as PCA2:=(x0*5.3+x1*77) etc.
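
For illustration, here is a minimal sketch of the naming convention, assuming terms are ordered by absolute weight; the PR's real implementation is make_tfn_weighted in eli5/sklearn/transform.py, and the helper below is only hypothetical:

import numpy as np

def weighted_names(weights, in_names, func_name, top=3):
    # weights: an (n_out, n_in) array such as PCA.components_
    # terms are sorted by |weight| here; the PR's exact ordering may differ
    names = []
    for i, row in enumerate(weights):
        order = np.argsort(-np.abs(row))[:top]
        terms = '+'.join('%s*%0.3g' % (in_names[j], row[j]) for j in order)
        names.append('%s%d:=(%s)' % (func_name, i, terms))
    return names

print(weighted_names(np.array([[5.3, 77.0]]), ['x0', 'x1'], 'PCA'))
# ['PCA0:=(x1*77+x0*5.3)']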

Issues:

  • others may feel this naming convention is too verbose. Suggestions welcome
  • make_tfn_weighted should probably have a doctest (but I'm not sure how to do this without modifying the global register of transform_feature_names implementations)
  • tests may be too slow, particularly those using hypothesis. Suggestions welcome.

@jnothman (Contributor, Author)

In merging with master, I see that there has been an attempt to test every estimator with transform_feature_names more thoroughly. Where multiple estimators are registered identically, I believe this is akin to testing every value in a for loop, and is excessive. I would rather keep the tests readable.

@codecov-io commented May 23, 2017

Codecov Report

Merging #208 into master will decrease coverage by 11.19%.
The diff coverage is 100%.

@@            Coverage Diff            @@
##           master    #208      +/-   ##
=========================================
- Coverage    97.4%   86.2%   -11.2%     
=========================================
  Files          41      41              
  Lines        2580    2617      +37     
  Branches      495     505      +10     
=========================================
- Hits         2513    2256     -257     
- Misses         35     329     +294     
  Partials       32      32
Impacted Files Coverage Δ
eli5/sklearn/transform.py 100% <100%> (ø) ⬆️
eli5/xgboost.py 4.09% <0%> (-95.33%) ⬇️
eli5/lightgbm.py 4.2% <0%> (-94.98%) ⬇️
eli5/_decision_path.py 43.47% <0%> (-56.53%) ⬇️
eli5/_feature_names.py 94.33% <0%> (-2.84%) ⬇️
eli5/sklearn/utils.py 88.05% <0%> (-1.8%) ⬇️
eli5/sklearn/explain_prediction.py 97.74% <0%> (-0.13%) ⬇️
eli5/sklearn/explain_weights.py 100% <0%> (ø) ⬆️
eli5/formatters/html.py 100% <0%> (ø) ⬆️

@kmike (Contributor) commented May 24, 2017

Hey @jnothman! I'll check this PR later today.

As for excessive testing, I like testing all (or most) estimators explicitly for two reasons:

  1. this way we declare explicitly what's supported and what's not supported;
  2. it helps to find bugs and wrong assumptions.

For example, RandomizedLogisticRegression has SelectorMixin as a metaclass, not as a base class, so #177 didn't work for it; I found this when adding all these estimators (which I had thought were supported) one by one. Another example is lightning support: when adding all estimators one by one I found several inconsistencies which required fixing in lightning: scikit-learn-contrib/lightning#95, scikit-learn-contrib/lightning#98.

I can see how, from a library author's (sklearn, etc.) point of view, these tests look unnecessary. But they require knowledge of the inner workings of the library, and knowledge of development plans; e.g. SelectorMixin is not documented and is not part of the public scikit-learn API. As a scikit-learn core developer you are comfortable relying on it, but as an outsider I prefer to be explicit about what eli5 supports, and to check one more time that the library interface is consistent and free of gotchas and hidden issues. In the case of #177 it was also a learning exercise for me: I was checking what the PR means for the user.

@jnothman (Contributor, Author) commented May 24, 2017 via email

@kmike (Contributor) commented May 24, 2017

A good catch about randomized l1!

Yeah, deprecations can be a problem. We "solved" it once by dropping sklearn 0.17 support, and it will likely cause problems again with 0.19. But to my taste it is the better problem to have: it is testing-only, and it is visible; if we don't test, the problem can surface in user applications and go undetected by our test suite. But yeah, explicit estimator support does increase the maintenance effort.

@jnothman (Contributor, Author)

I've not worked out how to fix mypy's complaint that

eli5/sklearn/transform.py:176: error: Argument 1 has incompatible type "make_tfn_weighted"; expected Callable[..., Any]

@kmike (Contributor) commented May 24, 2017

@jnothman yep, mypy still has a lot of rough edges - python/mypy#797. You can ignore such errors like this:

transform_feature_names.register(TfidfTransformer)(  # type: ignore
    make_tfn_weighted('idf_', func_name='TFIDF', show_idx=False))

@kmike (Contributor) commented May 29, 2017

Hey @jnothman,

I've tried it with LSI:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline

vec = CountVectorizer()
tfidf = TfidfTransformer()
svd = TruncatedSVD(n_components=50)
clf = LogisticRegressionCV()
pipe = make_pipeline(vec, tfidf, svd, clf)

eli5.explain_weights shows features like this now:

SVD2:=(TFIDF(keith*3.75)*0.362+TFIDF(caltech*3.99)*0.27+TFIDF(livesey*4.55)*0.223)

Unlike TfidfVectorizer, it shows idf weights, which is nice; I wonder if we should use idf_ as coef_scale by default for TfidfVectorizer.
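
A sketch of that idea (coef_scale is an existing explain_weights argument; using idf_ for it by default, as below, is just the suggestion under discussion and untested here):

import eli5
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ['cats purr', 'cats and dogs', 'dogs bark', 'birds sing']  # toy corpus
y = [1, 1, 0, 0]
vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(docs), y)
# hypothetical default being discussed: scale coefficients by idf_
expl = eli5.explain_weights(clf, vec=vec, coef_scale=vec.idf_)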

What I found confusing is the TFIDF(keith*3.75) notation: it reads like "take the count of the word "keith" in the document, multiply it by 3.75, then compute TF*IDF of that (?)", while in fact 3.75 is the IDF, so it is really TF(keith)*3.75. If the goal is not to show a math formula but to show the transformer name, what do you think about using some other notation, e.g. TFIDF[...] or something else? Ideas?

@jnothman (Contributor, Author) commented May 29, 2017 via email

@kmike (Contributor) commented Jun 21, 2017

FTR: tests are failing because of #217.

jnothman force-pushed the transform-weighted branch from 6ee3bd5 to 0e34608 on June 25, 2017, 21:44
@jnothman (Contributor, Author)

I'd really like to hear some feedback on this, as I feel linear transformation is a very common case, but it is admittedly hard to present nicely as a string.

@jnothman (Contributor, Author)

OTOH, if you feel like it's stretching the scope of eli5 too wide, I'm happy to make a separate project of it.

@kmike (Contributor) commented Jun 29, 2017

@jnothman this PR looks great, and I'm looking forward to using it myself. My only reservation is the TF(...) notation, which looks like a function is being applied. It seems the easiest way to fix that is to make formatting in make_tfn_weighted customizable, and to customize it for TfidfTransformer.
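
A minimal sketch of such customization; the template argument is hypothetical and not part of make_tfn_weighted:

def format_term(name, weight, template='{name}*{weight:0.3g}'):
    # 'template' is a hypothetical illustration, not actual eli5 API
    return template.format(name=name, weight=weight)

print(format_term('keith', 3.75))                                       # keith*3.75
print(format_term('keith', 3.75, template='TF({name})*{weight:0.3g}'))  # TF(keith)*3.75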

@jnothman (Contributor, Author) commented Jun 29, 2017 via email

@kmike (Contributor) commented Jun 29, 2017

@jnothman yes, sounds fine.

I tried <> as well as a few other options ({}, []); they all have disadvantages: {} looks like set notation, [] looks like we're showing a vector, and <> may be confused with greater/less-than signs. Still, <> looks the least ambiguous, and it is no worse than the alternatives:

SVD2:=<TFIDF<keith*3.75>*0.362+TFIDF<caltech*3.99>*0.27+TFIDF<livesey*4.55>*0.223>
SVD2:={TFIDF{keith*3.75}*0.362+TFIDF{caltech*3.99}*0.27+TFIDF{livesey*4.55}*0.223}
SVD2:=[TFIDF[keith*3.75]*0.362+TFIDF[caltech*3.99]*0.27+TFIDF[livesey*4.55]*0.223]
SVD2:=(TFIDF(keith*3.75)*0.362+TFIDF(caltech*3.99)*0.27+TFIDF(livesey*4.55)*0.223)

PCA2:={x0*5.3+x1*77}
PCA2:=<x0*5.3+x1*77>
PCA2:=[x0*5.3+x1*77]
PCA2:=(x0*5.3+x1*77)

Maybe spaces around + can help a bit with readability; what do you think?

SVD2:=<TFIDF<keith*3.75>*0.362 + TFIDF<caltech*3.99>*0.27 + TFIDF<livesey*4.55>*0.223>
PCA2:=<x0*5.3 + x1*77>

@kmike (Contributor) commented Jun 29, 2017

Does it make sense to keep () for the := notation?

SVD2:=(TFIDF<keith*3.75>*0.362 + TFIDF<caltech*3.99>*0.27 + TFIDF<livesey*4.55>*0.223)
PCA2:=(x0*5.3 + x1*77)

Also, should we indicate that some terms are missing (if they are missing), e.g. PCA2:=(x0*5.3 + x1*77 + …)?

@jnothman (Contributor, Author) commented Jun 29, 2017 via email

@kmike (Contributor) commented Jun 30, 2017

What do you think about the following?

  1. add spaces around + signs;
  2. add '...' if there are missing terms in a sum, e.g. PCA2:=(x0*5.3 + x1*77 + ...) (see the sketch after this list);
  3. keep () braces for everything except for tf*idf;
  4. change tf*idf formatting. Options: TFIDF/keith*3.75/, TF(keith)*3.75, TF<keith*3.5>, TFIDF[keith*3.75], TFIDF:(keith*3.75).

(4) is what I'm least certain about. Maybe we should just pick one of the options, even if it is not perfect, to move this forward; it can be fixed later if someone comes up with a better idea.
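
A minimal sketch of (2), assuming a plain top cutoff (the helper name and behaviour are hypothetical):

def join_terms(terms, top=3):
    # append '...' only when 'top' truncates the sum
    shown = ' + '.join(terms[:top])
    if len(terms) > top:
        shown += ' + ...'
    return '(%s)' % shown

print(join_terms(['x0*5.3', 'x1*77']))           # (x0*5.3 + x1*77)
print(join_terms(['a*1', 'b*2', 'c*3', 'd*4']))  # (a*1 + b*2 + c*3 + ...)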

@jnothman (Contributor, Author) commented Jul 9, 2017

TF(keith)*3.75 would make sense... but maybe it belongs in a different function and I've loaded too much onto this one.

@jnothman (Contributor, Author) commented Jul 9, 2017

With the ... idea, do you want to see this ellipsis only when top is the reason for limiting results, or even when threshold is limiting results?

@jnothman (Contributor, Author) commented Jul 9, 2017

I've not put in an ellipsis previously because I considered brevity to be a virtue in feature names.

@kmike (Contributor) commented Jul 9, 2017

@jnothman Hm, I was thinking about both top and threshold. We shouldn't show "+ …" if the remaining features are zero, though; and if you think of threshold as a way to filter out near-zero components, I can see how it differs from top.

PCA2:=(x0*0.325 + x1*0.221 + …) is a bit longer, but e.g. for

SVD2:=(TFIDF(keith)*3.75*0.362 + TFIDF(caltech)*3.99*0.27 + TFIDF(livesey)*4.55*0.223 + …)

it doesn't matter.

It bothers me a bit that these are display options, so they'd be better handled at the formatting level. For example, the rest of the SVD terms could be displayed on mouse hover in HTML. But that looks outside the scope of this pull request; it may require a structured representation of feature names instead of plain strings everywhere.

We had this issue with HashingVectorizer, where we needed to store the signs of possible feature names in addition to the names themselves; that's why formatters accept feature names that are lists of {'name': name, 'sign': sign} dicts.
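
For instance, one hashed feature's name in that structured form looks like:

# a list of possible (name, sign) alternatives for a single hashed feature
feature_name = [
    {'name': 'keith', 'sign': 1},
    {'name': 'caltech', 'sign': -1},
]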
