Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

count(s) and term_frequency(s) #92

Merged
merged 11 commits into from
Jul 22, 2020

Conversation

ishanarora04
Copy link
Contributor

Replace term frequency by Count and creates a new method term_frequency

@ishanarora04
Copy link
Contributor Author

@jbesomi Can I have a review ?

@vidyap-xgboost
Copy link
Contributor

@jbesomi Can I have a review ?

Seems like its failing the tests.sh. Have you downloaded the dev-dependencies and ran the tests.sh?

.....................................................................................................
======================================================================
FAIL: test_correct_index_26_term_frequency (tests.test_indexes.AbstractIndexTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/virtualenv/python3.6.7/lib/python3.6/site-packages/parameterized/parameterized.py", line 530, in standalone_func
    return func(*(a + p.args), **p.kwargs)
  File "/home/travis/build/jbesomi/texthero/tests/test_indexes.py", line 96, in test_correct_index
    self.assertTrue(result_s.index.equals(t_same_index.index))
AssertionError: False is not true
======================================================================
FAIL: test_incorrect_index_26_term_frequency (tests.test_indexes.AbstractIndexTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/virtualenv/python3.6.7/lib/python3.6/site-packages/parameterized/parameterized.py", line 530, in standalone_func
    return func(*(a + p.args), **p.kwargs)
  File "/home/travis/build/jbesomi/texthero/tests/test_indexes.py", line 103, in test_incorrect_index
    self.assertFalse(result_s.index.equals(t_different_index.index))
AssertionError: True is not false
----------------------------------------------------------------------

@jbesomi
Copy link
Owner

jbesomi commented Jul 15, 2020

Thank you, that looks a great start!

@henrifroese is working on #90, we should probably wait for his merge before we can continue with that. For a broader view, you can have a look there: #85

As you have some Javascript knowledge (and I assume also some web-development knowledge), would you be interested in helping out with #40 ? I can support you there. This is quite an interesting subject, as we will have to work with Sphinx, CSS, html and maybe even a bit of JS

Otherwise, if you are more interested in the software development, what about #65 ?

@ishanarora04
Copy link
Contributor Author

Hey @jbesomi Yes, I can work with both. Should I proceed with this PR? I have to add just a couple of test cases to fix this.

@jbesomi
Copy link
Owner

jbesomi commented Jul 15, 2020

For this PR, you will need to wait a bit. I'm making some important changes. I will let you know when is done (in an hour or so).

If you want to keep going in the meantime, may you have a look there and add ask for questions/add your opinion? #40

@jbesomi jbesomi marked this pull request as draft July 15, 2020 11:59
@jbesomi
Copy link
Owner

jbesomi commented Jul 15, 2020

Hey @ishanarora04, you can keep working on it! For you to know, once you are finished, I will try to uniform all three functions so that they have all the same arguments (min_df, max_df and max_features). You probably will have to redo the work on the new version of the files, it might be just faster!

@ishanarora04
Copy link
Contributor Author

@jbesomi Should term_frequency be part of test_indexes.py since we are generating a new series?

@ishanarora04 ishanarora04 marked this pull request as ready for review July 15, 2020 21:40
@jbesomi
Copy link
Owner

jbesomi commented Jul 15, 2020

Hey!

Yes it should be. What do you mean by "since we are generating a new series?"

Also, term_frequency is basically count normalized by the number of words in the document. You are doing something different right now... isn't?
term frequency formula

@ishanarora04 ishanarora04 force-pushed the ishan_count_term_frequency branch from 07487ca to 98d45a1 Compare July 18, 2020 05:48
@ishanarora04 ishanarora04 marked this pull request as ready for review July 18, 2020 05:53
@ishanarora04
Copy link
Contributor Author

Hey, Can I have a review here ?

@jbesomi
Copy link
Owner

jbesomi commented Jul 18, 2020

Hey @ishanarora04, thanks, amazing!

I'm out of town for the weekend, I will look into that tomorrow evening (ECT) or latest on Monday.

If you are interested in contributing more, this issue needs some help #65 👍

Thank you for your patience!

@ishanarora04
Copy link
Contributor Author

ishanarora04 commented Jul 20, 2020

This is up for review.

@jbesomi
Copy link
Owner

jbesomi commented Jul 20, 2020

Hey @ishanarora04, thank you! 🎉

When reviewing, I noticed that the tokenize function was splitting by _ (underscore). This is clearly not what we expected. I just fixed this issue in 448d40f .

@henrifroese and @mk2510 recently made some big improvements to the documentation and the preprocessing file. You can see the main changes there: #107

If you are okay with that, I will first merge PR #107 and then will merge yours, that way we can reduce merging conflicts.

Review for now (for efficiency and to avoid do the work twice, you might want to wait that we merge #107 before doing any changes):

  1. Congrats! That's super!
  2. new count: you added two parameters (min_df and max_df) but you don't add and explain them in the docstring.
  3. To return the features_names:s -- s after the semicolon

For involving you in the discussion:

Initially, preprocessing functions were receiving a Text Series and then scikit-learn default settings were used. scikit-learn by default lowercase and remove punctuation, that's why we added test such as *_punctuation_are_kept and *_not_lowercase. Now, tfidf and company receives as argument an already tokenized series. This means that unit-test are probably not strictly necessary anymore. It's up to you if you want to keep it, remove it or merge it in a single unit-test (👍 ). Independently of the decision you take, we will need to update the unit-tests in tfidf, count and term_frequency (on a separate PR)

@ishanarora04
Copy link
Contributor Author

Thanks for the review. Yes, we can wait for #107 to be merged. I will inculcate all suggestions

@ishanarora04
Copy link
Contributor Author

Incorporated the suggestions. Meanwhile, I can start working on #65

@jbesomi
Copy link
Owner

jbesomi commented Jul 20, 2020

Top! 🎉
We will just wait for #107, check there are no conflicts, and finally merge! :)

@jbesomi jbesomi merged commit 57e14f5 into jbesomi:master Jul 22, 2020
@jbesomi
Copy link
Owner

jbesomi commented Jul 22, 2020

Hey @ishanarora04. Thank you for your PR! Just merged 🎉 🎉

@ishanarora04
Copy link
Contributor Author

ishanarora04 commented Jul 22, 2020 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

count(s) and term_frequency(s)
3 participants