Support for batch suggest operations for CLI commands #663

Merged
merged 35 commits into from
Feb 3, 2023

juhoinkinen
Member

@juhoinkinen juhoinkinen commented Jan 20, 2023

Adds support for passing multiple documents in a batch from the suggest, index, eval and optimize CLI commands to backends (as discussed in issue #579).

Makes annif suggest accept path(s) to file(s) to be indexed, in addition to stdin:

annif suggest yso-tfidf-en <document.txt              # just like before
annif suggest yso-tfidf-en document.txt               # the same, but from a named file
annif suggest yso-tfidf-en doc1.txt doc2.txt doc3.txt # many files
annif suggest yso-tfidf-en doc*.txt                   # similar to above, but using shell expansion
annif suggest yso-tfidf-en -                          # stdin with dash
annif suggest yso-tfidf-en doc1.txt -                 # mixing file and stdin with dash

The output for each file path input begins with the line Suggestions for <file/path>:

Suggestions for tests/corpora/archaeology/fulltext/440866.txt
<http://www.yso.fi/onto/yso/p6218>	riimukirjoitus	0.3213897943496704
<http://www.yso.fi/onto/yso/p6479>	viikingit	0.18659920990467072
<http://www.yso.fi/onto/yso/p12738>	viikinkiaika	0.18625082075595856
<http://www.yso.fi/onto/yso/p22768>	Kiinan muuri	0.15950888395309448
<http://www.yso.fi/onto/yso/p3973>	antiikki	0.13840530812740326
<http://www.yso.fi/onto/yso/p14588>	riimukivet	0.1362432837486267
<http://www.yso.fi/onto/yso/p14173>	kaivaukset	0.1201547235250473
<http://www.yso.fi/onto/yso/p5713>	hautalöydöt	0.11249098181724548
<http://www.yso.fi/onto/yso/p15031>	viikinkiretket	0.11039584875106812
<http://www.yso.fi/onto/yso/p5714>	muinaishaudat	0.10336380451917648
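
The input handling and output header described above could be sketched roughly as follows. This is only an illustration, not the code in this PR: read_texts and format_results are hypothetical names, and labelling stdin input as "-" is an assumption made here for simplicity.

```python
import sys


def read_texts(paths, stdin=None):
    """Yield (label, text) pairs; a path of "-" means standard input."""
    # Hypothetical helper: the stdin parameter exists only to ease testing.
    stdin = stdin if stdin is not None else sys.stdin
    for path in paths:
        if path == "-":
            yield "-", stdin.read()
        else:
            with open(path, encoding="utf-8") as handle:
                yield path, handle.read()


def format_results(label, hits):
    """Format one document's hits under a 'Suggestions for <path>' header."""
    lines = [f"Suggestions for {label}"]
    # Each hit is assumed to be a (uri, label, score) tuple.
    lines += [f"<{uri}>\t{name}\t{score}" for uri, name, score in hits]
    return "\n".join(lines)
```

Because read_texts is a generator, files are opened and read one at a time, so a long list of paths does not have to fit in memory at once.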

The documents are passed to the backend.py module as batches, i.e. lists of texts (the lists are produced by a generator, also in the case of the suggest command with files). backend.py defines a default _suggest_batch method that falls back on the regular, single-text _suggest method. This allows the actual backends to define their own _suggest_batch methods that operate on the whole document batch.
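
A minimal sketch of that fallback pattern, under the assumption that the base class simply loops over the batch. The class names are hypothetical stand-ins and the toy return values are not real suggestion results; only the _suggest and _suggest_batch method names come from the description above.

```python
class BaseBackend:
    """Hypothetical stand-in for a backend base class (not Annif's code)."""

    def _suggest(self, text):
        raise NotImplementedError

    def _suggest_batch(self, texts):
        # Default: fall back to the single-text method, one text at a time.
        return [self._suggest(text) for text in texts]


class WordCountBackend(BaseBackend):
    """Toy backend that implements only the single-text method."""

    def _suggest(self, text):
        return len(text.split())


class ShareOfBatchBackend(BaseBackend):
    """Toy backend that overrides the batch method to see the whole batch."""

    def _suggest_batch(self, texts):
        total = sum(len(text) for text in texts) or 1
        return [len(text) / total for text in texts]
```

Callers always go through _suggest_batch, so a backend can adopt real batch processing later without any change on the calling side.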

The implementation of document batching for the hyperopt CLI command is left out of this PR, as the hyperopt functionality is implemented in individual backends.

Note that there is no support for actual batch processing in any of the backends yet.

@juhoinkinen juhoinkinen added this to the Short term milestone Jan 20, 2023
@codecov

codecov bot commented Jan 20, 2023

Codecov Report

Base: 99.55% // Head: 89.44% // Decreases project coverage by -10.11% ⚠️

Coverage data is based on head (ec16a21) compared to base (ca4d61c).
Patch coverage: 97.94% of modified lines in pull request are covered.

Additional details and impacted files
@@             Coverage Diff             @@
##           master     #663       +/-   ##
===========================================
- Coverage   99.55%   89.44%   -10.11%     
===========================================
  Files          87       87               
  Lines        6017     6142      +125     
===========================================
- Hits         5990     5494      -496     
- Misses         27      648      +621     
Impacted Files Coverage Δ
tests/test_backend_fasttext.py 100.00% <ø> (ø)
tests/test_backend_nn_ensemble.py 6.77% <0.00%> (-93.23%) ⬇️
tests/test_backend_omikuji.py 5.15% <0.00%> (-94.85%) ⬇️
tests/test_backend_pav.py 100.00% <ø> (ø)
tests/test_backend_yake.py 6.94% <0.00%> (-93.06%) ⬇️
annif/backend/backend.py 100.00% <100.00%> (ø)
annif/backend/dummy.py 100.00% <100.00%> (ø)
annif/backend/ensemble.py 100.00% <100.00%> (ø)
annif/backend/pav.py 98.87% <100.00%> (ø)
annif/cli.py 99.70% <100.00%> (+0.01%) ⬆️
... and 28 more


…ommand as necessary

The test was not working as intended: the command got input as stdin
@juhoinkinen
Member Author

juhoinkinen commented Jan 23, 2023

Some open thoughts & questions:

  • Using a DocumentList object made the implementation for the index command a bit easier than using a plain list of texts. Instead of the DocumentList class there could be a DocumentBatch class just for this use, but would that give any benefits?
  • Should there be a limit on the number of docs that can be sent via REST batch-suggest? Or is it enough to be able to set a payload limit for the method in e.g. NGINX?
  • Maybe batch-suggest would be a better name for the method than suggest-batch.

@juhoinkinen juhoinkinen marked this pull request as ready for review January 27, 2023 07:41
@juhoinkinen juhoinkinen requested a review from osma January 27, 2023 07:41
Member

@osma osma left a comment

This is a very good starting point for batching. Also, it already provides some new functionality (possibility to pass many documents in suggest operations via CLI and REST) so I think it would be good to try to merge this first, even before any actual improved implementations in the backends have been developed.

What worries me here a little is the potentially large size of batches. On the REST side they are naturally a bit limited by request size (and maybe we need some other limit as well?), but on the CLI side, it's not uncommon to use index and eval on thousands of large documents at once. But I don't think it makes sense to process that many documents in a single operation on the backend level - more likely, processing something like 16 or 32 documents at once (let's call it a minibatch) would already give a performance boost, and any larger minibatch size would probably just increase memory overhead with diminishing returns.

It's good to use DocumentList here on the "outer" level, because it's generator based and thus scales naturally to even huge numbers of documents. But I don't think it should be passed directly to backends. Instead, some intermediate layer (probably project.py, or the layer above i.e. cli.py and rest.py) should chop this up into minibatches which are then passed to the backend methods, maybe just as simple lists of text strings. The results (hit_sets) from the backends would have the same size as the minibatch and these would then be assembled back to a single iterable - a generator would probably be a good idea here too, since the number of documents can be huge.
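
The minibatching layer suggested above could look roughly like this. It is a sketch under assumptions: minibatches and suggest_all are hypothetical names, and suggest_batch stands for any callable that takes a list of texts and returns one result per text.

```python
from itertools import islice


def minibatches(texts, size):
    """Lazily yield lists of at most `size` items from any iterable."""
    iterator = iter(texts)
    while batch := list(islice(iterator, size)):
        yield batch


def suggest_all(texts, suggest_batch, size=32):
    """Feed minibatches to `suggest_batch` and yield results one by one."""
    for batch in minibatches(texts, size):
        # Each call returns as many results as the minibatch has texts.
        yield from suggest_batch(batch)
```

Both functions are generators, so only one minibatch of texts and its results need to be held in memory at a time, however large the corpus is.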

I gave a few minor detailed comments on the code.

I won't comment on the naming in this round, let's see first how the code evolves.

annif/cli.py Outdated
@@ -345,35 +364,51 @@ def run_learn(project_id, paths, docs_limit, backend_param):

@cli.command("suggest")
@click.argument("project_id")
@click.argument("paths", type=click.Path(dir_okay=False, exists=True), nargs=-1)
Member

Thought: Should it be possible to pass a directory (containing .txt documents) as well?

@juhoinkinen
Member Author

I tried to utilize the new suggest_batch function of the project module for the eval CLI command, but it does not seem very straightforward.

First, the eval command uses imap_unordered, which expects an iterable as its second argument, but suggest_batch operates on a corpus object (which has the documents iterable only as a property). Also, I'm not sure whether imap_unordered could in any way feed multiple documents to the suggest_batch function: it has the chunksize parameter, but afaik even when that is set to a non-default value (!=1), it does not send multiple elements from the iterable to the function in one call.
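
The chunksize behaviour can be checked with a small experiment. The sketch below uses the thread-based multiprocessing.dummy.Pool, which shares the Pool API, only to keep the example self-contained; describe and batched are hypothetical helpers.

```python
from multiprocessing.dummy import Pool  # thread-based Pool with the same API


def describe(item):
    """Report what kind of argument the worker function received."""
    return type(item).__name__


def batched(items, size):
    """Split a list into consecutive sublists of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


texts = ["kissa", "koira", "laiva", "viikinki"]

with Pool(2) as pool:
    # chunksize only controls how tasks are grouped when handed to workers;
    # the function is still called once per element of the iterable.
    per_item = sorted(pool.imap_unordered(describe, texts, chunksize=2))
    # To give several documents to one call, the iterable itself
    # has to yield batches.
    per_batch = sorted(pool.imap_unordered(describe, batched(texts, 2)))
```

Here per_item contains four "str" entries while per_batch contains two "list" entries, so batching has to happen before the iterable is passed to imap_unordered.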

@juhoinkinen
Member Author

juhoinkinen commented Feb 1, 2023

I added BatchingDocumentCorpus to help get batches of documents, but I wonder could the doc_batches(batch_size) method be added already to the DocumentCorpus base class, as batching of documents is used quite many times. (TODO: add testing of the batching method.)

(Black v23.1.0 was just released, and it introduced some changes to the style, which raise complaints for the current Annif code. :( )

@osma
Member

osma commented Feb 1, 2023

I wonder could the doc_batches(batch_size) method be added already to the DocumentCorpus base class, as batching of documents is used quite many times.

Yes, this sounds like a great idea! I think it would simplify the code and enable batching in many places.

@@ -83,6 +83,30 @@ def open_doc_path(path, subject_index):
return docs


def open_text_documents(paths, docs_limit):
Member

This adds yet another utility function to the top of cli.py. Nothing wrong with that, but cli.py has grown very long so I think we could refactor this in a follow-up PR, moving the utility functions to a separate module such as cli_util.py. Then cli.py would just contain the Click-decorated functions that implement the CLI commands themselves.

Member

@osma osma left a comment

I already gave a comment about moving DOC_BATCH_SIZE inside DocumentCorpus since it's only needed there.

Apart from that, I think there's an opportunity for further simplification. Basically we don't need the single-text versions of suggest methods anymore, as they can be handled as a special case of suggest_batch. This may seem a bit radical but I think it's not very hard (well, except maybe fixing up all the tests):

  1. The old AnnifProject.suggest() method is now needed only in two places, in the cli.py run_suggest function (the stdin-only case) and in the rest.py suggest function. Convert these two to use the AnnifProject.suggest_batch() method instead - passing a list with one text.
  2. Remove the now unused AnnifProject.suggest() method as well as AnnifProject._suggest_with_backend() that it relies on (but nothing else needs it)
  3. Remove the now unused AnnifBackend.suggest() method.

It may also make sense to rename the remaining suggest_batch() methods to just suggest(), now that the shorter name is available. Though we still need both AnnifBackend._suggest and AnnifBackend._suggest_batch, because it's up to backends which variant they will implement.
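
The simplification proposed above, treating a single text as a batch of one, could be sketched like this. ProjectSketch and suggest_one are hypothetical names used for illustration and do not reflect the actual AnnifProject interface.

```python
class ProjectSketch:
    """Hypothetical project wrapper exposing only a batch-oriented suggest()."""

    def __init__(self, backend_suggest_batch):
        # Any callable that maps a list of texts to a list of results.
        self._backend_suggest_batch = backend_suggest_batch

    def suggest(self, texts):
        """Return one result per input text, in input order."""
        return self._backend_suggest_batch(list(texts))


def suggest_one(project, text):
    """Single-text caller (e.g. stdin or one REST request): a batch of one."""
    return project.suggest([text])[0]
```

With this shape, the stdin-only CLI case and the REST suggest function need no separate single-text code path in the project layer.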

@osma
Member

osma commented Feb 3, 2023

Oh by the way, I tested the annif eval command (yso-mllm-fi project, evaluated on kirjastonhoitaja test set) with the -j 4 option, before and after this PR. The evaluation results were the same. It took a few seconds longer with the code in this PR, probably because the parallel processing is done on larger batches and the final batches cannot be distributed between CPUs. But I think that's not too bad, and we should see performance gains in the future as we implement support for batching in individual backends.

@juhoinkinen
Member Author

Now some debug-level logs from project module are removed:
Before:

annif suggest tfidf-fi <kissa.txt -v DEBUG
debug: creating app with configuration annif.default_config.Config
debug: Reading configuration file projects.cfg in CFG format
debug: loading subjects from data/vocabs/yso/subjects.csv
debug: Suggesting subjects for text "kissa
debug: ..." (len=6)
debug: Backend tfidf: loading vectorizer from data/projects/tfidf-fi/vectorizer
debug: Backend tfidf: loading similarity index from data/projects/tfidf-fi/tfidf-index
debug: Backend tfidf: Suggesting subjects for text "kissa
debug: ..." (len=6)
debug: Got 100 hits from backend tfidf
debug: 100 hits from backend
<http://www.yso.fi/onto/yso/p19378>	kissa	0.9530429244041443
<http://www.yso.fi/onto/yso/p864>	kissaeläimet	0.5541983842849731

Now:

annif suggest tfidf-fi <kissa.txt -v DEBUG
debug: creating app with configuration annif.default_config.Config
debug: Reading configuration file projects.cfg in CFG format
debug: loading subjects from data/vocabs/yso/subjects.csv
debug: Backend tfidf: loading vectorizer from data/projects/tfidf-fi/vectorizer
debug: Backend tfidf: loading similarity index from data/projects/tfidf-fi/tfidf-index
debug: Backend tfidf: Suggesting subjects for text "kissa
debug: ..." (len=6)
<http://www.yso.fi/onto/yso/p19378>	kissa	0.9530429244041443
<http://www.yso.fi/onto/yso/p864>	kissaeläimet	0.5541983842849731

And in the case giving multiple files to annif suggest, the debug log has some duplication:

annif suggest tfidf-fi *.txt -v DEBUG
debug: creating app with configuration annif.default_config.Config
debug: Reading configuration file projects.cfg in CFG format
debug: loading subjects from data/vocabs/yso/subjects.csv
debug: Backend tfidf: loading vectorizer from data/projects/tfidf-fi/vectorizer
debug: Backend tfidf: loading similarity index from data/projects/tfidf-fi/tfidf-index
debug: Backend tfidf: Suggesting subjects for text "kissa
debug:  cat
debug: dog
debug: viiki..." (len=24)
debug: Backend tfidf: Suggesting subjects for text "kissa
debug: ..." (len=6)
debug: Backend tfidf: Suggesting subjects for text "koira
debug: ..." (len=6)
debug: Backend tfidf: Suggesting subjects for text "laiva
debug: ..." (len=6)
debug: Backend tfidf: Suggesting subjects for text "viikinki
debug: ..." (len=9)

@osma
Member

osma commented Feb 3, 2023

Now some debug-level logs from project module are removed

Right, because the debug information was printed only in the single text suggest methods. I don't think this is a big loss.

And in the case giving multiple files to annif suggest, the debug log has some duplication:

I don't see any exact duplicates - the messages are related to different input files, right?

Member

@osma osma left a comment

LGTM!

Maybe the PR title could be amended, as it also covers the eval command now (but only the CLI side)

@juhoinkinen juhoinkinen changed the title from "Support for batch suggest operations in suggest and index methods" to "Support for batch suggest operations for CLI commands" Feb 3, 2023
@juhoinkinen juhoinkinen modified the milestones: Short term, 0.61 Feb 3, 2023
@sonarqubecloud

sonarqubecloud bot commented Feb 3, 2023

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 6 (rating A)

Coverage: no coverage information
Duplication: 0.0%
