
Improvement: the WikiCorpus class can now receive multiple tokenizing functions: a single function, a list, or a tuple. #3553

Closed
wants to merge 14 commits

Conversation

fabriciorsf

If you want to queue up a sequence of different custom tokenizers, the WikiCorpus class can now receive multiple tokenizing functions: a single function, a list, or a tuple.

For a usage example, see: https://github.com/LINE-PESC/gensim/blob/55a7454c274cb9802ceea38a9e5782dad735210d/gensim/test/test_corpora.py#L736

This makes it possible to run several combinations of tokenizers.
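A hypothetical sketch of the proposed usage (the list-of-tokenizers behavior is inferred from the linked test; the dump filename and my_tokenizer are placeholders):

from gensim.corpora import WikiCorpus
from gensim.corpora.wikicorpus import tokenize

def my_tokenizer(text, token_min_len, token_max_len, lower):
    # placeholder custom tokenizer following gensim's tokenizer_func signature
    return [t for t in text.split() if token_min_len <= len(t) <= token_max_len]

# with this PR, tokenizer_func may also be a list/tuple of functions,
# applied in sequence
wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2',
                  tokenizer_func=[tokenize, my_tokenizer])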

@fabriciorsf fabriciorsf reopened this Jul 23, 2024
@fabriciorsf fabriciorsf marked this pull request as ready for review July 23, 2024 13:52
@gojomo
Collaborator

gojomo commented Jul 26, 2024

I can see why some would want to stack a bunch of tokenizers to run in order.

But, this code's particular loop of (tokenize, re-join, repeat) is a mild anti-pattern in terms of efficiency - sure, you'll do it if all your tokenizer functions only take untokenized strings, but in any custom complex tokenizing code, you'd try to avoid extra re-joins. It's fine if it works but not necessarily a practice to implicitly endorse as routine/normal in the API. (And, note that it adds an unnecessary join/split overhead to the usual single-tokenizer case!)

If you need this functionality & are OK with such inefficiency, a few of the same code lines outside WikiCorpus can wrap N tokenizers into a single tokenizer_func to pass to WikiCorpus. E.g.:

tokenizers = [...]
def composite_tokenizer(text, token_min_len, token_max_len, lower):
    # apply each tokenizer in order, re-joining tokens into a string between steps
    for tokenizer in tokenizers:
        text = " ".join(tokenizer(text, token_min_len, token_max_len, lower))
    return text.split()
# then, pass this composite_tokenizer in as a single tokenizer_func
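For reference, a minimal sketch of plugging that in (tokenizer_func is WikiCorpus's existing parameter; the dump filename is a placeholder):

from gensim.corpora import WikiCorpus

# composite_tokenizer as defined above; the dump path is a placeholder
wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2',
                  tokenizer_func=composite_tokenizer)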

I'd rather not grow the WikiCorpus API/codebase to support something so easy to do outside it, only when needed.

@piskvorky
Owner

@gojomo is exactly right – why not do such preprocessing externally? Does Gensim need to know?

@fabriciorsf
Author

In my case, I need to queue two or more tokenizers in different orders and compare the results.
In addition to Gensim's tokenize, I also use external tokenizers such as TreebankWordTokenizer from nltk.tokenize.treebank, and the order in which they are executed makes a difference.
Calling the different tokenizers externally to Gensim would be more inefficient. This solution was designed to decouple Gensim from the external tokenizers.
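For illustration, one way to adapt NLTK's TreebankWordTokenizer to gensim's tokenizer_func signature (the length filtering shown is an assumption, mirroring gensim's default tokenize):

from nltk.tokenize.treebank import TreebankWordTokenizer

_treebank = TreebankWordTokenizer()

def treebank_tokenizer(text, token_min_len, token_max_len, lower):
    # match gensim's tokenizer_func signature: (text, token_min_len, token_max_len, lower)
    tokens = _treebank.tokenize(text.lower() if lower else text)
    # length filtering assumed here, mirroring gensim's default tokenize
    return [t for t in tokens if token_min_len <= len(t) <= token_max_len]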

@gojomo
Copy link
Collaborator

gojomo commented Jul 29, 2024

From my perspective, wrapping any set of arbitrary external tokenizers into something Gensim sees as a single function is even stronger decoupling, making Gensim oblivious even to the fact that the tokenizer is a composite of many steps.

I don't see how moving the "call-each-in-a-loop" logic inside a Gensim class would improve efficiency; the same steps happen in either implementation.
