count(s) and term_frequency(s) #92

ishanarora04 · 2020-07-15T10:47:39Z

Replaces term_frequency with count and adds a new term_frequency method.

ishanarora04 · 2020-07-15T10:49:03Z

@jbesomi Can I have a review ?

vidyap-xgboost · 2020-07-15T11:00:18Z

@jbesomi Can I have a review ?

Seems like it's failing tests.sh. Have you installed the dev dependencies and run tests.sh?

.....................................................................................................
======================================================================
FAIL: test_correct_index_26_term_frequency (tests.test_indexes.AbstractIndexTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/virtualenv/python3.6.7/lib/python3.6/site-packages/parameterized/parameterized.py", line 530, in standalone_func
    return func(*(a + p.args), **p.kwargs)
  File "/home/travis/build/jbesomi/texthero/tests/test_indexes.py", line 96, in test_correct_index
    self.assertTrue(result_s.index.equals(t_same_index.index))
AssertionError: False is not true
======================================================================
FAIL: test_incorrect_index_26_term_frequency (tests.test_indexes.AbstractIndexTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/virtualenv/python3.6.7/lib/python3.6/site-packages/parameterized/parameterized.py", line 530, in standalone_func
    return func(*(a + p.args), **p.kwargs)
  File "/home/travis/build/jbesomi/texthero/tests/test_indexes.py", line 103, in test_incorrect_index
    self.assertFalse(result_s.index.equals(t_different_index.index))
AssertionError: True is not false
----------------------------------------------------------------------

jbesomi · 2020-07-15T11:05:09Z

Thank you, that looks like a great start!

@henrifroese is working on #90, we should probably wait for his merge before we can continue with that. For a broader view, you can have a look there: #85

As you have some JavaScript knowledge (and I assume also some web-development knowledge), would you be interested in helping out with #40? I can support you there. This is quite an interesting subject, as we will have to work with Sphinx, CSS, HTML and maybe even a bit of JS.

Otherwise, if you are more interested in software development, what about #65?

ishanarora04 · 2020-07-15T11:19:13Z

Hey @jbesomi, yes, I can work on both. Should I proceed with this PR? I just have to add a couple of test cases to fix this.

jbesomi · 2020-07-15T11:59:48Z

For this PR, you will need to wait a bit. I'm making some important changes. I will let you know when it is done (in an hour or so).

If you want to keep going in the meantime, could you have a look at #40 and ask questions or add your opinion there?

jbesomi · 2020-07-15T13:10:38Z

Hey @ishanarora04, you can keep working on it! Just so you know, once you are finished, I will try to unify all three functions so that they all take the same arguments (min_df, max_df and max_features). You will probably have to redo the work on the new version of the files; it might just be faster!
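
For readers unfamiliar with these three arguments, here is a minimal sketch of what they usually mean. It assumes the scikit-learn convention (an integer is an absolute number of documents, a float is a fraction of all documents, and max_features keeps the terms that are most frequent across the corpus). The helper name `select_vocabulary` is hypothetical and is not part of texthero.

```python
from collections import Counter


def select_vocabulary(docs, min_df=1, max_df=1.0, max_features=None):
    """Hypothetical helper: choose terms by document frequency.

    Assumes scikit-learn-like semantics; this is not texthero's code.
    """
    n_docs = len(docs)
    # An int is an absolute document count, a float is a fraction of documents.
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs

    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Corpus-wide counts, used only to rank terms for max_features.
    totals = Counter(term for doc in docs for term in doc)

    kept = [term for term in doc_freq if low <= doc_freq[term] <= high]
    # Most frequent terms first; ties are broken alphabetically.
    kept.sort(key=lambda term: (-totals[term], term))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)


docs = [["a", "b", "c"], ["a", "b"], ["a", "d", "d", "d"]]
frequent_terms = select_vocabulary(docs, min_df=2)    # terms in at least 2 documents
rare_terms = select_vocabulary(docs, max_df=0.5)      # terms in at most half of the documents
top_terms = select_vocabulary(docs, max_features=2)   # the 2 most frequent terms overall
```

Giving count, term_frequency and tfidf the same three arguments would let this kind of filtering be described once and behave the same way in each function.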

ishanarora04 · 2020-07-15T21:39:05Z

@jbesomi Should term_frequency be part of test_indexes.py since we are generating a new series?

jbesomi · 2020-07-15T22:27:06Z

Hey!

Yes it should be. What do you mean by "since we are generating a new series?"

Also, term_frequency is basically count normalized by the number of words in the document. You are doing something different right now, aren't you?
[image: term frequency formula]
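
A minimal sketch of the distinction described above, for one already tokenized document. It takes "number of words in the document" to mean the number of tokens; the function names mirror the PR title, but the bodies are illustrative assumptions, not the code under review.

```python
from collections import Counter


def count(tokens):
    """Raw number of occurrences of each term in one tokenized document."""
    return dict(Counter(tokens))


def term_frequency(tokens):
    """Count of each term divided by the number of tokens in the document."""
    total = len(tokens)
    if total == 0:
        # An empty document has no terms, so there is nothing to normalize.
        return {}
    return {term: n / total for term, n in Counter(tokens).items()}


doc = ["the", "cat", "sat", "on", "the", "mat"]
counts = count(doc)        # "the" occurs twice
tf = term_frequency(doc)   # "the" gets 2 / 6
```

With this definition the term frequencies of a non-empty document always sum to 1, which is a quick sanity check for an implementation.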

ishanarora04 · 2020-07-18T05:54:16Z

Hey, can I have a review here?

jbesomi · 2020-07-18T06:01:35Z

Hey @ishanarora04, thanks, amazing!

I'm out of town for the weekend. I will look into it tomorrow evening (ECT) or on Monday at the latest.

If you are interested in contributing more, issue #65 needs some help 👍

Thank you for your patience!

ishanarora04 · 2020-07-20T10:33:53Z

This is up for review.

jbesomi · 2020-07-20T11:52:56Z

Hey @ishanarora04, thank you! 🎉

When reviewing, I noticed that the tokenize function was splitting on _ (underscore). This is clearly not what we expected. I just fixed this issue in 448d40f.

@henrifroese and @mk2510 recently made some big improvements to the documentation and the preprocessing file. You can see the main changes there: #107

If you are okay with that, I will first merge PR #107 and then merge yours; that way we can reduce merge conflicts.

Review for now (for efficiency, and to avoid doing the work twice, you might want to wait until we merge #107 before making any changes):

Congrats! That's super!
new count: you added two parameters (min_df and max_df) but did not list and explain them in the docstring.
To return the features_names:s -- s after the semicolon

For involving you in the discussion:

Initially, preprocessing functions received a Text Series and scikit-learn's default settings were used. By default, scikit-learn lowercases the text and removes punctuation, which is why we added tests such as *_punctuation_are_kept and *_not_lowercase. Now, tfidf and company receive an already tokenized Series as argument. This means that these unit tests are probably not strictly necessary anymore. It's up to you whether to keep them, remove them, or merge them into a single unit test (👍). Whatever you decide, we will need to update the unit tests in tfidf, count and term_frequency (in a separate PR).
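
As an illustration of the "single unit test" option, here is a minimal sketch. The `count` function below is a hypothetical stand-in that counts tokens exactly as given; the real functions operate on a tokenized pandas Series, and the test name is made up for this example.

```python
import unittest
from collections import Counter


def count(tokenized_doc):
    """Hypothetical stand-in: counts tokens exactly as they are given."""
    return Counter(tokenized_doc)


class TestTokenizedInput(unittest.TestCase):
    def test_case_and_punctuation_are_kept(self):
        # One test covers both former checks: no lowercasing, no punctuation removal.
        result = count(["Hello", "hello", "!", "Hello"])
        self.assertEqual(result["Hello"], 2)
        self.assertEqual(result["hello"], 1)
        self.assertEqual(result["!"], 1)
```

Because the input is already tokenized, the function has no opportunity to alter the tokens, so a single combined check like this is probably enough.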

ishanarora04 · 2020-07-20T12:01:56Z

Thanks for the review. Yes, we can wait for #107 to be merged. I will incorporate all the suggestions.

ishanarora04 · 2020-07-20T14:59:57Z

Incorporated the suggestions. Meanwhile, I can start working on #65

jbesomi · 2020-07-20T16:48:30Z

Top! 🎉
We will just wait for #107, check there are no conflicts, and finally merge! :)

jbesomi · 2020-07-22T08:12:07Z

Hey @ishanarora04. Thank you for your PR! Just merged 🎉 🎉

ishanarora04 · 2020-07-22T09:01:50Z

Thanks


vercel bot deployed to Preview July 15, 2020 10:47 View deployment

jbesomi marked this pull request as draft July 15, 2020 11:59

ishanarora04 force-pushed the ishan_count_term_frequency branch from caee34a to 590bd94 Compare July 15, 2020 21:36

vercel bot deployed to Preview July 15, 2020 21:36 View deployment

ishanarora04 marked this pull request as ready for review July 15, 2020 21:40

vercel bot deployed to Preview July 15, 2020 21:45 View deployment

jbesomi marked this pull request as draft July 16, 2020 10:05

jbesomi linked an issue Jul 16, 2020 that may be closed by this pull request

count(s) and term_frequency(s) #61

Closed

vercel bot deployed to Preview July 17, 2020 21:40 View deployment

vercel bot deployed to Preview July 17, 2020 21:44 View deployment

ishanarora04 added 6 commits July 18, 2020 03:23

Formatted

964e5c0

Fixed Documentation

12215bf

Test Indexes

38be0a7

Fixes

7c0e526

Test Representation

3e76061

Test case refactoring

98d45a1

ishanarora04 force-pushed the ishan_count_term_frequency branch from 07487ca to 98d45a1 Compare July 18, 2020 05:48

vercel bot deployed to Preview July 18, 2020 05:48 View deployment

ishanarora04 marked this pull request as ready for review July 18, 2020 05:53

Reverted

3f20154

vercel bot deployed to Preview July 18, 2020 06:03 View deployment

Incorporate changes

8690fc6

vercel bot deployed to Preview July 20, 2020 12:37 View deployment

Test Representation

650fba5

vercel bot deployed to Preview July 20, 2020 12:39 View deployment

Max DF and min DF added

5c28e0b

vercel bot deployed to Preview July 20, 2020 12:41 View deployment

fix

df24516

vercel bot deployed to Preview July 20, 2020 12:46 View deployment

jbesomi merged commit 57e14f5 into jbesomi:master Jul 22, 2020
