
Optimizations to tfidf backend training #335

Merged 10 commits from buffered-subject-conversion into master on Oct 4, 2019

Conversation

osma (Member) commented on Oct 4, 2019

This PR contains several optimizations meant to speed up the training of tfidf models, which is currently quite slow:

  1. When converting the corpus from document-oriented (one line/file per document) format to subject-oriented (one file per subject) format, delay the writes by buffering some data before writing it out. This drastically reduces the number of required I/O operations.
  2. Transform the corpus into the subject-oriented format before building the TfidfVectorizer, instead of doing it implicitly as part of the vectorization. I'm not sure why, but this is much faster.
  3. Add some caching to the Analyzer.is_valid_token method, which is called very frequently during analysis.
  4. Move the subject vectorizer creation from the Project class to the TfidfBackend class. Originally I envisioned that there would be other backends requiring a subject-oriented vectorizer, but now I don't think those are likely to appear. The LSI backend (#201, implemented in #219) was a total failure in terms of result quality.
  5. Switch from separate TfidfVectorizer.fit and .transform calls to the combined .fit_transform method. This means the corpus only has to be processed once, which halves the time spent on analyzing text.
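A minimal sketch of the buffered-write idea in item 1. The class and method names here are illustrative, not the actual Annif API; the point is that each subject file is opened once per buffer flush instead of once per document:

```python
import collections
import os

class BufferedSubjectWriter:
    """Hypothetical sketch: buffer texts per subject, flush in batches."""

    def __init__(self, datadir, buffer_size=100):
        self._datadir = datadir
        self._buffer_size = buffer_size
        self._buffers = collections.defaultdict(list)

    def _flush(self, subject_id):
        path = os.path.join(self._datadir, "{}.txt".format(subject_id))
        # One append-mode open per buffer_size texts instead of one per text
        with open(path, "a", encoding="utf-8") as subjfile:
            for text in self._buffers[subject_id]:
                print(text, file=subjfile)
        self._buffers[subject_id] = []

    def write(self, subject_id, text):
        self._buffers[subject_id].append(text)
        if len(self._buffers[subject_id]) >= self._buffer_size:
            self._flush(subject_id)

    def close(self):
        # Flush whatever is still buffered at the end of the conversion
        for subject_id in list(self._buffers):
            self._flush(subject_id)
```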

Overall these changes speed up the training of tfidf projects by at least 2x, perhaps more in some cases.
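One cheap way to get the caching described in item 3, assuming token validity depends only on the token string itself, is functools.lru_cache. The check below is a simplified stand-in for the real Analyzer.is_valid_token logic:

```python
import functools

@functools.lru_cache(maxsize=100000)
def is_valid_token(token):
    # Simplified stand-in check; real analyzers inspect length and
    # character classes. Repeated tokens in a corpus hit the cache
    # instead of re-running the check.
    return len(token) > 2 and token.isalpha()
```

Since natural-language corpora repeat the same tokens constantly, the cache hit rate is high and the per-token cost drops to a dictionary lookup.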

I have some ideas on how to further improve performance of the tfidf backend, but I'll leave those for another PR.
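For reference, item 5 amounts to collapsing two passes over the corpus into one; in scikit-learn the combined method is fit_transform (toy corpus below for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["cats and dogs", "dogs chase cats", "birds sing"]

vectorizer = TfidfVectorizer()
# Equivalent to vectorizer.fit(corpus) followed by
# vectorizer.transform(corpus), but each document is analyzed only once
veccorpus = vectorizer.fit_transform(corpus)
```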

osma added this to the 0.43 milestone on Oct 4, 2019
codecov bot commented on Oct 4, 2019

Codecov Report

Merging #335 into master will decrease coverage by 0.03%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #335      +/-   ##
==========================================
- Coverage   99.55%   99.52%   -0.03%     
==========================================
  Files          56       56              
  Lines        3150     3166      +16     
==========================================
+ Hits         3136     3151      +15     
- Misses         14       15       +1
| Impacted Files | Coverage Δ | |
|---|---|---|
| annif/project.py | 100% <ø> (ø) | ⬆️ |
| annif/backend/backend.py | 100% <ø> (ø) | ⬆️ |
| annif/corpus/convert.py | 100% <100%> (ø) | ⬆️ |
| tests/test_project.py | 100% <100%> (ø) | ⬆️ |
| annif/analyzer/analyzer.py | 100% <100%> (ø) | ⬆️ |
| annif/backend/tfidf.py | 98.14% <100%> (-1.86%) | ⬇️ |

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 244db95...0c6ee3b.

osma (Member, Author) commented on Oct 4, 2019

There are some complaints from the QA tools about the _generate_corpus_from_documents method, but I will ignore these for now. The corpus conversion will probably be dealt with in a subsequent PR, as I intend to integrate it into the tfidf backend code and make it more efficient.

osma merged commit 8e90e24 into master on Oct 4, 2019
lgtm-com bot commented on Oct 4, 2019

This pull request introduces 1 alert when merging 0c6ee3b into 244db95 - view on LGTM.com

new alerts:

  • 1 for Unused import

osma deleted the buffered-subject-conversion branch on December 13, 2019