Batch suggest operation #579

osma · 2022-03-14T09:41:44Z

Currently all the suggest methods (CLI command, REST API method, project and backend methods) always take just one document at a time. This is inefficient for backends that could process many documents in parallel.

We should introduce a batch version of suggest (called e.g. suggest_batch or suggest_many unless someone has a better idea?) for each of these contexts. Individual backends can then choose to implement it when it gives a performance boost; otherwise, the batch is simply passed to the regular suggest method one document at a time. I believe that at least NN ensemble, SVC, fastText and MLLM backends could benefit from parallel suggest operations. Also, this would be very useful for the proposed XTransformer backend.
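
As a rough sketch of the fallback idea (all class and method names here are made up for illustration and are not existing Annif code), a base class could provide a default batch method that loops over the single-document method, and a backend would override it only when it can do better:

```python
from typing import Dict, List

# Hypothetical names throughout; this is not Annif's actual backend API.
Suggestion = Dict[str, float]  # subject identifier -> score


class Backend:
    """Base class: only the single-document method must be implemented."""

    def suggest(self, text: str) -> Suggestion:
        raise NotImplementedError

    def suggest_batch(self, texts: List[str]) -> List[Suggestion]:
        # Default fallback: process the batch one document at a time.
        return [self.suggest(text) for text in texts]


class WordCountBackend(Backend):
    """Toy backend without a batch implementation; it inherits the fallback."""

    def suggest(self, text: str) -> Suggestion:
        return {"length": float(len(text.split()))}


class VectorizedBackend(Backend):
    """Toy backend that overrides suggest_batch to handle all texts at once."""

    def suggest(self, text: str) -> Suggestion:
        # The single-document case is just a batch of one.
        return self.suggest_batch([text])[0]

    def suggest_batch(self, texts: List[str]) -> List[Suggestion]:
        # A real backend would make one parallel or vectorized model call here.
        return [{"length": float(len(t.split()))} for t in texts]
```

Callers would always use suggest_batch and would not need to know whether a given backend has a dedicated implementation.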

A note on scope: This issue is about implementing the scaffolding necessary for batching suggest operations, as well as using it in at least some (not necessarily all) operations that would benefit from it, e.g. eval, hyperopt, optimize, index. Changes to individual backends are out of scope, but separate issues for them should be opened after this basic scaffolding is in place.

juhoinkinen · 2023-01-13T10:18:36Z

I played around with this a bit, and some questions arose.

CLI usage

If a new suggest-batch CLI command is added, where should it load the documents from? From a directory (of text documents), or from any paths, maybe including already indexed TSV files?
Alternatively, maybe the existing suggest CLI command could be changed to use batched processing if it is given a path to documents instead of a stdin feed?

Where should the suggestions be output when using suggest-batch (or suggest) for multiple documents?

Opt 1: To <doc-filename>.annif files, as the index command does, but this would just duplicate the functionality of the index command (and seems reasonable only for directory input)
Opt 2: To stdout like the current suggest does, separating documents by first showing the document name and then on the following lines the subject suggestions for the document, e.g.:

tests/corpora/archaeology/fulltext/440866.txt
<http://www.yso.fi/onto/yso/p6218>	riimukirjoitus	0.3213897943496704
<http://www.yso.fi/onto/yso/p6479>	viikingit	0.18659920990467072
<http://www.yso.fi/onto/yso/p12738>	viikinkiaika	0.18625082075595856
<http://www.yso.fi/onto/yso/p22768>	Kiinan muuri	0.15950888395309448
<http://www.yso.fi/onto/yso/p3973>	antiikki	0.13840530812740326
<http://www.yso.fi/onto/yso/p14588>	riimukivet	0.1362432837486267
<http://www.yso.fi/onto/yso/p14173>	kaivaukset	0.1201547235250473
<http://www.yso.fi/onto/yso/p5713>	hautalöydöt	0.11249098181724548
<http://www.yso.fi/onto/yso/p15031>	viikinkiretket	0.11039584875106812
<http://www.yso.fi/onto/yso/p5714>	muinaishaudat	0.10336380451917648
tests/corpora/archaeology/fulltext/441563.txt
<http://www.yso.fi/onto/yso/p4625>	pronssikausi	0.33119136095046997
<http://www.yso.fi/onto/yso/p4622>	esihistoria	0.2926081418991089
<http://www.yso.fi/onto/yso/p1265>	arkeologia	0.24922890961170197
<http://www.yso.fi/onto/yso/p20096>	kansainvaellusaika	0.23529952764511108
<http://www.yso.fi/onto/yso/p9285>	neoliittinen kausi	0.23072052001953125
<http://www.yso.fi/onto/yso/p2558>	rautakausi	0.2238517701625824
<http://www.yso.fi/onto/yso/p4626>	varhaismetallikausi	0.2232591211795807
<http://www.yso.fi/onto/yso/p10849>	arkeologit	0.2182117998600006
<http://www.yso.fi/onto/yso/p7751>	kampakeraaminen kulttuuri	0.21752358973026276
<http://www.yso.fi/onto/yso/p14173>	kaivaukset	0.21643799543380737

The other CLI commands (eval, hyperopt, optimize, index) are intended to operate on multiple documents, so the output question does not concern them.

REST usage:

The best way to pass in the documents seems to be application/json encoding, as in the learn method, but then how should the parameters (language, limit, threshold) be passed? I don't see a way to pass them the same way as for the suggest method (which uses application/x-www-form-urlencoded). Maybe put the parameters as an object in the JSON together with the documents:
```
  [
    {
      "parameters": [
        {
          "language": "string",
          "limit": 10,
          "threshold": 0
        }
      ]
    },
    {
      "documents": [
        {
          "text": "A quick brown fox jumped over the lazy dog."
        }
      ]
    }
  ]
```

osma · 2023-01-13T14:39:12Z

CLI

> If a new suggest-batch CLI command is added, where should it load the documents from? From a directory (of text documents), or from any paths, maybe including already indexed TSV files?
> Alternatively, maybe the existing suggest CLI command could be changed to use batched processing if it is given a path to documents instead of a stdin feed?

My hunch would be to try to extend the current suggest CLI command instead of defining a new suggest-batch command. The current suggest command expects input from stdin; maybe we could change it so it works more like the cat command and other similar *nix tools, i.e. it could take one or more filenames as a parameter, but fall back to stdin if no file names are given. So you could do e.g.

annif suggest yso-tfidf-en <document.txt              # just like before
annif suggest yso-tfidf-en document.txt               # the same, but from a named file
annif suggest yso-tfidf-en doc1.txt doc2.txt doc3.txt # many files
annif suggest yso-tfidf-en doc*.txt                   # similar to above, but using shell expansion
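
A minimal sketch of that cat-like input handling, assuming a hypothetical helper called read_documents (the name and the use of "-" as the label for stdin are my assumptions, not existing Annif code):

```python
import sys
from typing import Iterator, List, TextIO, Tuple


def read_documents(
    paths: List[str], stdin: TextIO = sys.stdin
) -> Iterator[Tuple[str, str]]:
    """Yield (name, text) pairs; fall back to stdin when no paths are given."""
    if not paths:
        # No file names on the command line: behave like before and read stdin.
        yield "-", stdin.read()
        return
    for path in paths:
        # Shell expansion (doc*.txt) has already turned globs into plain paths.
        with open(path, encoding="utf-8") as handle:
            yield path, handle.read()
```

The command would then pass the resulting texts on as one batch, regardless of whether they came from named files or from stdin.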

> Opt 2: To stdout like the current suggest does, separating documents by first showing the document name and then on the following lines the subject suggestions for the document, e.g.:

I think this is the way to go. For easier grepping etc., I would perhaps add some kind of extra tag in addition to the filename, something like:

Suggestions for tests/corpora/archaeology/fulltext/440866.txt
<http://www.yso.fi/onto/yso/p6218>	riimukirjoitus	0.3213897943496704
<http://www.yso.fi/onto/yso/p6479>	viikingit	0.18659920990467072
<http://www.yso.fi/onto/yso/p12738>	viikinkiaika	0.18625082075595856
<http://www.yso.fi/onto/yso/p22768>	Kiinan muuri	0.15950888395309448
<http://www.yso.fi/onto/yso/p3973>	antiikki	0.13840530812740326
<http://www.yso.fi/onto/yso/p14588>	riimukivet	0.1362432837486267
<http://www.yso.fi/onto/yso/p14173>	kaivaukset	0.1201547235250473
<http://www.yso.fi/onto/yso/p5713>	hautalöydöt	0.11249098181724548
<http://www.yso.fi/onto/yso/p15031>	viikinkiretket	0.11039584875106812
<http://www.yso.fi/onto/yso/p5714>	muinaishaudat	0.10336380451917648

I think it would be logical to use this output format whenever named files are used (instead of stdin), even if there is just one file.
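
To make the proposed format concrete, here is a small sketch. The function name format_suggestions and the tuple layout are assumptions for illustration only; the header line is printed only when a file name is known, so stdin output stays as it is today:

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical result shape: (subject URI, label, score)
Hit = Tuple[str, str, float]


def format_suggestions(
    hits: Iterable[Hit], filename: Optional[str] = None
) -> List[str]:
    """Return output lines, with a greppable header when a file name is given."""
    lines = []
    if filename is not None:
        lines.append(f"Suggestions for {filename}")
    for uri, label, score in hits:
        # Tab-separated, matching the existing suggest output.
        lines.append(f"<{uri}>\t{label}\t{score}")
    return lines
```

With a fixed prefix like this, something like grep '^Suggestions for ' would list the processed files.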

> The other CLI commands (eval, hyperopt, optimize, index) are intended to operate on multiple documents, so the output question does not concern them.

Yes, and I think this is where we could expect the most benefits. For example, eval could potentially be much faster with some backends if it uses batched processing internally, even though nothing changes from the user's perspective: the command itself and its output remain the same.
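
Internally, a command like eval would mainly need a way to cut the document stream into chunks before handing them to the batch method. A possible sketch (the helper name batched and the batch size are assumptions, not existing Annif code):

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

This works on lazily read corpora too, since it never needs the whole document set in memory at once.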

REST

> The best way to pass in the documents seems to be application/json encoding

I agree that JSON encoding seems like a good choice here, but there are other options that perhaps shouldn't be dismissed outright:

It's possible to use old-fashioned application/x-www-form-urlencoded encoding, like the current suggest methods. There could be a field text or texts that is defined as an array in the OpenAPI spec (see Describing Request Body, section "Form Data"). In practice, this would mean that the values are repeated in the encoded body, like this: limit=10&threshold=0.2&text=doc1&text=doc2&text=doc3 (here doc1, doc2 and doc3 are placeholders for document text). For me, the main attraction of this would be that it may allow extending the current suggest method without defining a new suggest-batch method, though I suspect that the return data format would have to be different anyway, so maybe it would just create confusion.
It's also possible to use multipart requests where each document is a separate part, although I don't think we want to go there.
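
For the form-encoded option, the repeated-field encoding can be illustrated with the Python standard library alone. The field names are the ones from the example above; nothing here is tied to a particular server framework:

```python
from urllib.parse import parse_qs, urlencode

# doc1, doc2 and doc3 stand in for full document texts.
fields = {"limit": 10, "threshold": 0.2, "text": ["doc1", "doc2", "doc3"]}

# doseq=True repeats the key once per list element instead of encoding the list.
body = urlencode(fields, doseq=True)
print(body)  # limit=10&threshold=0.2&text=doc1&text=doc2&text=doc3

# Decoding gives every field as a list, so single and repeated values look alike.
decoded = parse_qs(body)
print(decoded["text"])  # ['doc1', 'doc2', 'doc3']
```

Note that after decoding all values are strings, so limit and threshold would still need to be converted on the receiving side.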

If we go for the JSON encoding, the parameters could also be given as URL parameters: POST /projects/yso-tfidf-en/suggest?limit=10&threshold=0.2, though I'm unsure if this is any better than just passing them in the JSON.

A little nitpick: why did you use an array here? I think a single object would be enough.

      "parameters": [
        {
          "language": "string",
          "limit": 10,
          "threshold": 0
        }
      ]
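
Something like the following is what I have in mind: one object with a "parameters" object and a "documents" array. This is only a sketch of a possible request body; the helper name build_batch_request and the exact shape are assumptions, not a settled API:

```python
import json
from typing import List, Optional


def build_batch_request(
    texts: List[str],
    language: Optional[str] = None,
    limit: int = 10,
    threshold: float = 0.0,
) -> str:
    """Serialize a hypothetical batch request with parameters as one object."""
    parameters = {"limit": limit, "threshold": threshold}
    if language is not None:
        parameters["language"] = language
    payload = {
        "parameters": parameters,  # a single object, not an array
        "documents": [{"text": text} for text in texts],
    }
    return json.dumps(payload)
```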

FWIW, I also checked the Maui Server API, but it doesn't have a batched version of suggest that we could copy.

juhoinkinen · 2023-06-26T13:48:00Z

The functionality this issue addressed was implemented by PRs #663 and #664.

Issues for implementing the batch functionality in individual backends have been opened and some of them have already been closed.

osma added the enhancement label Mar 14, 2022

osma added this to the Long term milestone Mar 14, 2022

osma mentioned this issue Mar 14, 2022

Add XTransformer backend #540

Closed

osma modified the milestones: Long term, Short term Sep 2, 2022

juhoinkinen self-assigned this Jan 7, 2023

juhoinkinen mentioned this issue Jan 20, 2023

Support for batch suggest operations for CLI commands #663

Merged

juhoinkinen mentioned this issue Jan 30, 2023

Add REST API method batch-suggest #664

Merged

juhoinkinen mentioned this issue Feb 21, 2023

Memory leak in NN ensemble backend #674

Closed

juhoinkinen modified the milestones: Short term, 0.61 Apr 4, 2023

juhoinkinen closed this as completed Jun 26, 2023
