Further optimizations to tfidf backend + rearchitecting #336

Merged: 6 commits, Oct 7, 2019

@osma osma commented Oct 4, 2019

This PR changes several aspects of corpus handling. The ultimate goal is to optimize the tfidf backend training process, but along the way a lot of related code got cleaned up and streamlined.

The general functionality for converting between document-oriented and subject-oriented corpora has been removed, as only the tfidf backend really requires it. The code to do that conversion now belongs to the tfidf backend only, and it has been rewritten to avoid tokenizing the same text multiple times.
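To illustrate the idea of tokenizing each text only once, here is a minimal sketch. It is not the actual Annif code: `tokenize`, `subject_term_counts`, and the `(text, subject_ids)` input shape are hypothetical names and assumptions made for this example. Each document is tokenized a single time, and the resulting term counts are reused for every subject the document is tagged with.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Hypothetical stand-in tokenizer: lowercased word tokens."""
    return re.findall(r"\w+", text.lower())


def subject_term_counts(documents):
    """Build per-subject term counts from (text, subject_ids) pairs.

    Each text is tokenized exactly once; its counts are then added to
    every subject it belongs to, instead of re-tokenizing per subject.
    """
    counts = defaultdict(Counter)
    for text, subject_ids in documents:
        doc_counts = Counter(tokenize(text))  # tokenized once per document
        for subject_id in subject_ids:
            counts[subject_id].update(doc_counts)  # reuse the same counts
    return dict(counts)


# Illustrative data only
docs = [
    ("Cats and dogs", ["animals", "pets"]),
    ("Dogs bark", ["animals"]),
]
result = subject_term_counts(docs)
```

The per-subject counts could then be fed to a tf-idf weighting step. With this shape, the cost of tokenization grows with the number of documents rather than with the number of document-subject pairs.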

There should be a significant improvement in training time: the new code is about twice as fast as the code it replaces, and about 5x as fast as the previous release.

TfidfTransformer, to avoid tokenizing the same text many times if it has
multiple subjects

codecov bot commented Oct 4, 2019

Codecov Report

Merging #336 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #336      +/-   ##
==========================================
- Coverage   99.52%   99.51%   -0.02%     
==========================================
  Files          56       55       -1     
  Lines        3166     3080      -86     
==========================================
- Hits         3151     3065      -86     
  Misses         15       15

| Impacted Files | Coverage Δ |
|---|---|
| tests/test_backend_tfidf.py | 100% <100%> (ø) |
| tests/test_suggestion.py | 100% <100%> (ø) |
| annif/corpus/document.py | 100% <100%> (ø) |
| tests/test_corpus.py | 100% <100%> (ø) |
| annif/corpus/combine.py | 100% <100%> (ø) |
| annif/corpus/skos.py | 95.45% <100%> (ø) |
| annif/backend/tfidf.py | 98.94% <100%> (+0.79%) |
| annif/corpus/subject.py | 100% <100%> (ø) |
| tests/conftest.py | 100% <100%> (ø) |
| annif/corpus/__init__.py | 100% <100%> (ø) |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8e90e24...3238047.

lgtm-com bot commented Oct 4, 2019

This pull request introduces 3 alerts when merging 3cc830b into 8e90e24 - view on LGTM.com

new alerts:

  • 3 for Unused import

lgtm-com bot commented Oct 7, 2019

This pull request introduces 4 alerts when merging fe4bdc5 into 8e90e24 - view on LGTM.com

new alerts:

  • 4 for Unused import

@osma osma merged commit 462165f into master Oct 7, 2019
@osma osma deleted the tfidf-optimizations branch October 7, 2019 12:01
@osma osma added this to the 0.43 milestone Oct 28, 2019