Further optimizations to tfidf backend + rearchitecting #336

osma · 2019-10-04T14:57:25Z

This PR changes quite a few things related to corpus handling. The ultimate goal is to optimize the tfidf backend training process, but along the way a lot of things got cleaned up and streamlined.

The general functionality for converting between document-oriented and subject-oriented corpora has been removed, as only the tfidf backend really requires it. The code to do that conversion now belongs to the tfidf backend only, and it has been rewritten to avoid tokenizing the same text multiple times.
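To illustrate the idea of tokenizing each text only once during the document-to-subject conversion, here is a minimal sketch. It is not the code from this PR: the function names, the `(text, subject_ids)` input shape and the whitespace tokenizer are all assumptions made for the example.

```python
from collections import defaultdict


def tokenize(text):
    # Placeholder tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def documents_to_subject_tokens(documents, tokenizer=tokenize):
    """Group tokens by subject, tokenizing each document exactly once.

    `documents` is an iterable of (text, subject_ids) pairs (hypothetical shape).
    """
    subject_tokens = defaultdict(list)
    for text, subject_ids in documents:
        tokens = tokenizer(text)  # tokenized once, reused for every subject
        for subject_id in subject_ids:
            subject_tokens[subject_id].extend(tokens)
    return dict(subject_tokens)


docs = [
    ("Cats and dogs", ["animals", "pets"]),
    ("Dogs bark", ["animals"]),
]
subject_tokens = documents_to_subject_tokens(docs)
```

With this shape, a document that has several subjects costs one tokenizer call instead of one call per subject, which is where the saving would come from.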

Training time should improve significantly: the new code is about twice as fast as the old code, or about 5x as fast as the previous release.

codecov · 2019-10-04T15:01:58Z

Codecov Report

Merging #336 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #336      +/-   ##
==========================================
- Coverage   99.52%   99.51%   -0.02%     
==========================================
  Files          56       55       -1     
  Lines        3166     3080      -86     
==========================================
- Hits         3151     3065      -86     
  Misses         15       15

| Impacted Files | Coverage Δ | |
|---|---|---|
| tests/test_backend_tfidf.py | `100% <100%> (ø)` | ⬆️ |
| tests/test_suggestion.py | `100% <100%> (ø)` | ⬆️ |
| annif/corpus/document.py | `100% <100%> (ø)` | ⬆️ |
| tests/test_corpus.py | `100% <100%> (ø)` | ⬆️ |
| annif/corpus/combine.py | `100% <100%> (ø)` | ⬆️ |
| annif/corpus/skos.py | `95.45% <100%> (ø)` | ⬆️ |
| annif/backend/tfidf.py | `98.94% <100%> (+0.79%)` | ⬆️ |
| annif/corpus/subject.py | `100% <100%> (ø)` | ⬆️ |
| tests/conftest.py | `100% <100%> (ø)` | ⬆️ |
| annif/corpus/__init__.py | `100% <100%> (ø)` | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8e90e24...3238047.

lgtm-com · 2019-10-04T15:25:23Z

This pull request introduces 3 alerts when merging 3cc830b into 8e90e24 - view on LGTM.com

new alerts:

3 for Unused import

lgtm-com · 2019-10-07T10:40:01Z

This pull request introduces 4 alerts when merging fe4bdc5 into 8e90e24 - view on LGTM.com

new alerts:

4 for Unused import

osma added 4 commits October 4, 2019 16:33

01d1c43 Remove dead code: conversion from SubjectCorpus to DocumentCorpus was only used by a unit test, not real code

987c6dd Move the conversion from document to subject corpus entirely inside TFIDFBackend since nothing else uses it

2cb42da Perform document to subject conversion in memory; remove stale SubjectDirectory class

3cc830b Tokenize text during conversion to subject corpus instead of within TfidfTransformer, to avoid tokenizing the same text many times if it has multiple subjects

osma added the enhancement label Oct 4, 2019

fe4bdc5 Spool large subject texts into files instead of keeping everything in memory

3238047 Cleanup unused imports
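
The spooling described in commit fe4bdc5 can be sketched with the standard library's `tempfile.SpooledTemporaryFile`, which keeps data in memory until it exceeds `max_size` and then moves it to a temporary file on disk. This is only an illustration under assumptions: the `SpooledSubjectText` class, its method names and the size threshold are hypothetical and not taken from the PR.

```python
import tempfile


class SpooledSubjectText:
    """Accumulate text for one subject; spill to disk past max_size (hypothetical helper)."""

    def __init__(self, max_size=1024 * 1024):
        # Data stays in memory until it grows beyond max_size, then rolls over to a file.
        self._buffer = tempfile.SpooledTemporaryFile(
            max_size=max_size, mode="w+", encoding="utf-8")

    def append(self, text):
        # One line per appended text.
        self._buffer.write(text + "\n")

    def read(self):
        # Rewind and return everything written so far.
        self._buffer.seek(0)
        return self._buffer.read()

    def close(self):
        self._buffer.close()


spool = SpooledSubjectText(max_size=16)  # tiny threshold so the example rolls over
spool.append("first document tokens")
spool.append("second document tokens")
text = spool.read()
spool.close()
```

Small subjects never touch the disk with this approach, while subjects with very large amounts of text no longer have to fit in memory all at once.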

osma merged commit 462165f into master Oct 7, 2019

osma deleted the tfidf-optimizations branch October 7, 2019 12:01

osma added this to the 0.43 milestone Oct 28, 2019
