Support for batch suggest operations for CLI commands #663
Conversation
… plain list of documents
Codecov Report: Base 99.55% // Head 89.44% // project coverage decreases by 10.11%.
Additional details and impacted files:

@@ Coverage Diff @@
##           master     #663       +/-   ##
===========================================
- Coverage   99.55%   89.44%   -10.11%
===========================================
  Files          87       87
  Lines        6017     6142     +125
===========================================
- Hits         5990     5494     -496
- Misses         27      648     +621
…ommand as necessary

The test was not working as intended: the command got input as stdin.
Some open thoughts & questions:
This is a very good starting point for batching. Also, it already provides some new functionality (the possibility to pass many documents in suggest operations via CLI and REST), so I think it would be good to try to merge this first, even before any actual improved implementations in the backends have been developed.

What worries me here a little is the potentially large size of batches. On the REST side they are naturally somewhat limited by request size (and maybe we need some other limit as well?), but on the CLI side, it's not uncommon to use index and eval on thousands of large documents at once. I don't think it makes sense to process that many documents in a single operation on the backend level - more likely, processing something like 16 or 32 documents at once (let's call it a minibatch) would already give a performance boost, and any larger minibatch size would probably just increase memory overhead with diminishing returns.

It's good to use DocumentList here on the "outer" level, because it's generator based and thus scales naturally to even huge numbers of documents. But I don't think it should be passed directly to backends. Instead, some intermediate layer (probably project.py, or the layer above, i.e. cli.py and rest.py) should chop this up into minibatches which are then passed to the backend methods, maybe just as simple lists of text strings. The results (hit_sets) from the backends would have the same size as the minibatch, and these would then be assembled back into a single iterable - a generator would probably be a good idea here too, since the number of documents can be huge.
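A minimal sketch of this minibatching idea (the helper names and the default size of 32 are illustrative assumptions, not actual Annif code):

```python
from itertools import islice


def chunked(docs, minibatch_size=32):
    """Yield successive minibatches (plain lists) from any iterable of documents."""
    it = iter(docs)
    while True:
        minibatch = list(islice(it, minibatch_size))
        if not minibatch:
            break
        yield minibatch


def suggest_all(texts, backend_suggest_batch, minibatch_size=32):
    """Feed minibatches to a backend method and lazily flatten the
    per-minibatch hit sets back into a single stream of results."""
    for minibatch in chunked(texts, minibatch_size):
        yield from backend_suggest_batch(minibatch)
```

Because both functions are generators, even a very large document stream is never fully materialized; only one minibatch needs to be held in memory at a time.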
I gave a few minor detailed comments on the code.
I won't comment on the naming in this round, let's see first how the code evolves.
annif/cli.py
Outdated
@@ -345,35 +364,51 @@ def run_learn(project_id, paths, docs_limit, backend_param):

@cli.command("suggest")
@click.argument("project_id")
@click.argument("paths", type=click.Path(dir_okay=False, exists=True), nargs=-1)
Thought: Should it be possible to pass a directory (containing .txt documents) as well?
Also support mixing "-" and real file paths; in that case show hits in file-like output
I tried to utilize the new First, the eval command uses imap_unordered (which expects an iterable as the second argument), but
I added (Black v23.1.0 was just released, and it introduced some changes to the style, which raise complaints for the current Annif code. :( )
Yes, this sounds like a great idea! I think it would simplify the code and enable batching in many places.
@@ -83,6 +83,30 @@ def open_doc_path(path, subject_index):
    return docs


def open_text_documents(paths, docs_limit):
This adds yet another utility function to the top of cli.py. Nothing wrong with that, but cli.py has grown very long so I think we could refactor this in a follow-up PR, moving the utility functions to a separate module such as cli_util.py. Then cli.py would just contain the Click-decorated functions that implement the CLI commands themselves.
I already gave a comment about moving DOC_BATCH_SIZE inside DocumentCorpus since it's only needed there.
Apart from that, I think there's an opportunity for further simplification. Basically we don't need the single-text versions of suggest methods anymore, as they can be handled as a special case of suggest_batch. This may seem a bit radical but I think it's not very hard (well, except maybe fixing up all the tests):
- The old AnnifProject.suggest() method is now needed only in two places, in the cli.py run_suggest function (the stdin-only case) and in the rest.py suggest function. Convert these two to use the AnnifProject.suggest_batch() method instead - passing a list with one text.
- Remove the now unused AnnifProject.suggest() method as well as AnnifProject._suggest_with_backend() that it relies on (but nothing else needs it)
- Remove the now unused AnnifBackend.suggest() method.
It may also make sense to rename the remaining suggest_batch() methods to just suggest(), now that the shorter name is available. Though we still need both AnnifBackend._suggest and AnnifBackend._suggest_batch, because it's up to backends which variant they will implement.
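A rough sketch of the proposed end state (simplified, hypothetical classes; the real Annif classes carry much more machinery):

```python
class AnnifBackend:
    """Simplified stand-in for the backend base class."""

    def _suggest(self, text):
        # single-text variant; a backend may implement this...
        raise NotImplementedError

    def _suggest_batch(self, texts):
        # ...or override this batch variant; the default simply loops
        # over the single-text implementation
        return [self._suggest(text) for text in texts]

    def suggest_batch(self, texts):
        # the only public entry point; a single text is handled as a
        # special case by passing a one-element list
        return self._suggest_batch(texts)


class UppercaseBackend(AnnifBackend):
    """Toy backend that 'suggests' the uppercased input text."""

    def _suggest(self, text):
        return text.upper()
```

Under this scheme, the old single-text `suggest(text)` call sites would become `suggest_batch([text])`, while backends remain free to implement either `_suggest` or `_suggest_batch`.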
Oh by the way, I tested the
Now some debug-level logs from the project module are removed:

Now:

And in the case of giving multiple files to
Right, because the debug information was printed only in the single-text

I don't see any exact duplicates - the messages are related to different input files, right?
LGTM!
Maybe the PR title could be amended, as it also covers the eval command now (but only the CLI side).
Kudos, SonarCloud Quality Gate passed! 0 Bugs. No Coverage information.
Adds support for passing multiple documents in a batch from the suggest, index, eval and optimize CLI commands to backends (as discussed in issue #579).

Makes annif suggest accept path(s) to file(s) to be indexed, in addition to stdin. The output for each file path input begins with the line "Suggestions for <file/path>:".

The documents are passed as batches, i.e. lists of texts (the lists are produced by a generator also in the case of the suggest command with files), to the backend.py module, which defines the default _suggest_batch method that uses the regular, single-text _suggest method. This allows the actual backends to define their own _suggest_batch methods that can operate on the document batch.

The implementation of document batching for the hyperopt CLI command is left out of this PR, as the hyperopt functionality is implemented in individual backends.

Note that there is no support for actual batch processing in any backends yet.
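The per-file output described above could look roughly like this (a hypothetical helper, not the actual CLI code):

```python
def format_suggestions(paths, batched_results):
    """Interleave a 'Suggestions for <path>' header line with the hit list
    of each input file; batched_results yields one list of hits per document."""
    lines = []
    for path, hits in zip(paths, batched_results):
        lines.append(f"Suggestions for {path}:")
        lines.extend(f"\t{hit}" for hit in hits)
    return lines
```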