Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tfidf(s): remove normalization, improve docstring #76

Closed
jbesomi opened this issue Jul 13, 2020 · 4 comments
Closed

tfidf(s): remove normalization, improve docstring #76

jbesomi opened this issue Jul 13, 2020 · 4 comments
Labels
enhancement New feature or request

Comments

@jbesomi
Copy link
Owner

jbesomi commented Jul 13, 2020

The tfidf function, under-the-hoods makes use of the sklearn.feature_extraction.text.TfidfVectorizer.

By default, the TfidfVectorizer return a L2-normalized TF-IDF vector. In Texthero, we would like to avoid this hidden behavior, to let the user have more granular control over each action.

This task consists of removing (and testing) the normalization, as well as make it more clear in the docstring which tfidf-formula is used. For extra information, the chapter text feature extraction of the scikit-learn documentation might be useful.

Task:

  1. Remove L2-normalization.
  2. Test the new function (using the mathematical formula for each value)
  3. Improve the docstring making it clear which TF-IDF function is used

Implementation:

There are probably two ways of solve that:

  1. By using TfidfVectorizer and de-l2-normalizing the output
  2. Using pure NumPy. This would be nicer from a code perspective but it might be tricky as working with NumPy vectors is not always trivial. As the result should be a Sparse Matrix (see Support "Pandas Series Representation" #43 ) this mean we will have to deal with sparse matrix.

Extra:
After having implemented #43, we will add two new functions, L1 and L2, to normalize any "Pandas Representation Series"

@jbesomi jbesomi changed the title Imrpviving: tfidf(s) remove normalization, improve docstring Improving tfidf(s) remove normalization, improve docstring Jul 14, 2020
@jbesomi jbesomi pinned this issue Jul 14, 2020
@jbesomi jbesomi changed the title Improving tfidf(s) remove normalization, improve docstring tfidf(s): remove normalization, improve docstring Jul 14, 2020
@henrifroese
Copy link
Collaborator

Working on this with @mk2510 🐐

@jbesomi jbesomi added the enhancement New feature or request label Jul 15, 2020
@jbesomi jbesomi unpinned this issue Jul 15, 2020
@henrifroese
Copy link
Collaborator

Should the new tfidf function already return such a Representation Series?

Working on it, we currently implemented it so that it calculates the Representation Series correctly (using some of your code from #43 ) and then use a function representation_series_to_flat_series to make the output the same as tfidf currently produces. Thus, when #43 will be fully implemented later on, in tfidf only one line has to be changed to get the Representation Series output. But we could also already return the Representation Series output now.

@jbesomi
Copy link
Owner Author

jbesomi commented Jul 15, 2020

Working on it, we currently implemented it so that it calculates the Representation Series correctly (using some of your code from #43 ) and then use a function representation_series_to_flat_series to make the output the same as tfidf currently produces. Thus, when #43 will be fully implemented later on, in tfidf only one line has to be changed to get the Representation Series output. But we could also already return the Representation Series output now.

representation_series_to_flat_series: that's sounds great! (this function will be super useful anyway for users to move from the two representation). By the way, we might want to figure out two different XSeries names to differentiate between the two representations. (Also, for instance, pca will still return a Pandas Series where each cell is a list, right?)

Should the new tfidf function already return such a Representation Series?

What's your opinion? We might also want to do both tasks 1 and 2 in #84 before releasing a new version. This is fine as well. In this case, we can already return the "new" Pandas Representation but this means that many tests will fail ...

... have a look at the unit tests to check for the "new" Pandas Representation under #43 (it took me a long while to implement that, it might spare you a bit of time too!)

henrifroese added a commit to SummerOfCode-NoHate/texthero that referenced this issue Jul 15, 2020
Docstring now includes formula/explaination.

Normalization disabled.

Representation Series is already being handled (although output
is still like before).

Function representation_series_to_flat_series added.

Co-authored-by: Maximilian Krahn <maximilian.krahn@icloud.com>
henrifroese added a commit to SummerOfCode-NoHate/texthero that referenced this issue Jul 15, 2020
Docstring now includes formula/explaination.

Normalization disabled (the option "normalization=None" was "hidden" in the sklearn code, so that turned out to be an easy fix).

Representation Series is already being handled (although output is still like before, using representation_series_to_flat_series).

Function representation_series_to_flat_series added.

Unit tests are changed accordingly, also one with the explicit calculation using the formula.

Co-authored-by: Maximilian Krahn <maximilian.krahn@icloud.com>
@henrifroese
Copy link
Collaborator

We just opened a PR at #97 .

By the way, we might want to figure out two different XSeries names to differentiate between the two representations

Yes, that is very important to discuss, and probably should be handled later on in #85 2.

What's your opinion? We might also want to do both tasks 1 and 2 in #84 before releasing a new version.

I think it makes sense to still return a "old" output at the moment, so that's what we did currently. It probably makes sense to overhaul all the representation functions at the same time in #85 2. As the new tfidf already calculates with the representation series and just gives the old output, it will be easy to change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants