
Optimizations to tfidf backend training #335

Merged 10 commits from buffered-subject-conversion into master on Oct 4, 2019

Conversation

osma (Member) commented on Oct 4, 2019

This PR contains several optimizations meant to speed up the training of tfidf models, which is currently quite slow:

  1. When converting the corpus from document-oriented (one line/file per document) format to subject-oriented (one file per subject) format, delay the writes by buffering some data before writing it out. This drastically reduces the number of required I/O operations.
  2. Transform the corpus into the subject-oriented format before building the TfidfVectorizer, instead of doing it implicitly as part of the vectorization. I'm not sure why, but this is much faster.
  3. Add some caching to the Analyzer.is_valid_token method, which is called very frequently during analysis.
  4. Move the subject vectorizer creation from the Project class to the TfidfBackend class. Originally I envisioned that there would be other backends requiring a subject-oriented vectorizer, but now I don't think those are likely to appear. The LSI backend (#201, implemented in #219) was a total failure in terms of result quality.
  5. Switch from separate TfidfVectorizer.fit and .transform calls to the combined .fit_transform method. This means the corpus only has to be processed once, which halves the time spent on analyzing text.
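A minimal sketch of the buffered-write idea in item 1. The class and method names here are illustrative, not the actual Annif API; the point is that each subject file is opened once per buffer flush instead of once per document:

```python
import collections
import os

class BufferedSubjectWriter:
    """Hypothetical sketch: buffer texts per subject, flush in batches."""

    def __init__(self, datadir, buffer_size=100):
        self._datadir = datadir
        self._buffer_size = buffer_size
        self._buffers = collections.defaultdict(list)

    def _flush(self, subject_id):
        path = os.path.join(self._datadir, "{}.txt".format(subject_id))
        # One append-mode open per buffer_size texts instead of one per text
        with open(path, "a", encoding="utf-8") as subjfile:
            for text in self._buffers[subject_id]:
                print(text, file=subjfile)
        self._buffers[subject_id] = []

    def write(self, subject_id, text):
        self._buffers[subject_id].append(text)
        if len(self._buffers[subject_id]) >= self._buffer_size:
            self._flush(subject_id)

    def close(self):
        # Flush whatever is still buffered at the end of the conversion
        for subject_id in list(self._buffers):
            self._flush(subject_id)
```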

Overall these changes speed up the training of tfidf projects by at least 2x, perhaps more in some cases.
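One cheap way to get the caching described in item 3, assuming token validity depends only on the token string itself, is functools.lru_cache. The check below is a simplified stand-in for the real Analyzer.is_valid_token logic:

```python
import functools

@functools.lru_cache(maxsize=100000)
def is_valid_token(token):
    # Simplified stand-in check; real analyzers inspect length and
    # character classes. Repeated tokens in a corpus hit the cache
    # instead of re-running the check.
    return len(token) > 2 and token.isalpha()
```

Since natural-language corpora repeat the same tokens constantly, the cache hit rate is high and the per-token cost drops to a dictionary lookup.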

I have some ideas on how to further improve performance of the tfidf backend, but I'll leave those for another PR.
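For reference, item 5 amounts to collapsing two passes over the corpus into one; in scikit-learn the combined method is fit_transform (toy corpus below for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["cats and dogs", "dogs chase cats", "birds sing"]

vectorizer = TfidfVectorizer()
# Equivalent to vectorizer.fit(corpus) followed by
# vectorizer.transform(corpus), but each document is analyzed only once
veccorpus = vectorizer.fit_transform(corpus)
```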

osma added this to the 0.43 milestone on Oct 4, 2019
codecov bot commented on Oct 4, 2019

Codecov Report

Merging #335 into master will decrease coverage by 0.03%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #335      +/-   ##
==========================================
- Coverage   99.55%   99.52%   -0.03%     
==========================================
  Files          56       56              
  Lines        3150     3166      +16     
==========================================
+ Hits         3136     3151      +15     
- Misses         14       15       +1
| Impacted Files | Coverage Δ | |
|---|---|---|
| annif/project.py | 100% <ø> (ø) | ⬆️ |
| annif/backend/backend.py | 100% <ø> (ø) | ⬆️ |
| annif/corpus/convert.py | 100% <100%> (ø) | ⬆️ |
| tests/test_project.py | 100% <100%> (ø) | ⬆️ |
| annif/analyzer/analyzer.py | 100% <100%> (ø) | ⬆️ |
| annif/backend/tfidf.py | 98.14% <100%> (-1.86%) | ⬇️ |

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 244db95...0c6ee3b.

osma (Member, Author) commented on Oct 4, 2019

There are some complaints from the QA tools about the _generate_corpus_from_documents method, but I will ignore these for now. The corpus conversion will probably be dealt with in a subsequent PR, as I intend to integrate it into the tfidf backend code and make it more efficient.

osma merged commit 8e90e24 into master on Oct 4, 2019
lgtm-com bot commented on Oct 4, 2019

This pull request introduces 1 alert when merging 0c6ee3b into 244db95 - view on LGTM.com

new alerts:

  • 1 for Unused import

osma deleted the buffered-subject-conversion branch on December 13, 2019