
Improvement: the WikiCorpus class can now receive multiple tokenizing functions: a single function, a list, or a tuple. #3553

Closed
wants to merge 14 commits

Conversation

fabriciorsf

If you want to queue up a sequence of different custom tokenizers, the WikiCorpus class can now receive multiple tokenizing functions: a single function, a list, or a tuple.

For a usage example, see: https://github.com/LINE-PESC/gensim/blob/55a7454c274cb9802ceea38a9e5782dad735210d/gensim/test/test_corpora.py#L736

This makes it possible to run several combinations of tokenizers.
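A hypothetical sketch of the proposed usage (the list-of-tokenizers behavior is inferred from the linked test; the dump filename and my_tokenizer are placeholders):

from gensim.corpora import WikiCorpus
from gensim.corpora.wikicorpus import tokenize

def my_tokenizer(text, token_min_len, token_max_len, lower):
    # placeholder custom tokenizer following gensim's tokenizer_func signature
    return [t for t in text.split() if token_min_len <= len(t) <= token_max_len]

# with this PR, tokenizer_func may also be a list/tuple of functions,
# applied in sequence
wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2',
                  tokenizer_func=[tokenize, my_tokenizer])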

@fabriciorsf fabriciorsf reopened this Jul 23, 2024
@fabriciorsf fabriciorsf marked this pull request as ready for review July 23, 2024 13:52
@gojomo
Collaborator

gojomo commented Jul 26, 2024

I can see why some would want to stack a bunch of tokenizers to run in order.

But, this code's particular loop of (tokenize, re-join, repeat) is a mild anti-pattern in terms of efficiency - sure, you'll do it if all your tokenizer functions only take untokenized strings, but in any custom complex tokenizing code, you'd try to avoid extra re-joins. It's fine if it works but not necessarily a practice to implicitly endorse as routine/normal in the API. (And, note that it adds an unnecessary join/split overhead to the usual single-tokenizer case!)

If you need this functionality & are OK with such inefficiency, a few of the same code lines outside WikiCorpus can wrap N tokenizers into a single tokenizer_func to pass to WikiCorpus. E.g.:

tokenizers = [...]
def composite_tokenizer(text, token_min_len, token_max_len, lower):
    # apply each tokenizer in order, re-joining tokens into a string between steps
    for tokenizer in tokenizers:
        text = " ".join(tokenizer(text, token_min_len, token_max_len, lower))
    return text.split()
# then, pass this composite_tokenizer in as a single tokenizer_func
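For reference, a minimal sketch of plugging that in (tokenizer_func is WikiCorpus's existing parameter; the dump filename is a placeholder):

from gensim.corpora import WikiCorpus

# composite_tokenizer as defined above; the dump path is a placeholder
wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2',
                  tokenizer_func=composite_tokenizer)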

I'd rather not grow the WikiCorpus API/codebase to support something so easy to do outside it, only when needed.

@piskvorky
Owner

@gojomo is exactly right – why not do such preprocessing externally? Does Gensim need to know?

@fabriciorsf
Author

In my case, I need to queue two or more tokenizers in different orders and compare the results.
In addition to Gensim's tokenize, I also use external tokenizers such as TreebankWordTokenizer from nltk.tokenize.treebank, and the order in which they are executed makes a difference.
Calling the different tokenizers externally to Gensim would be more inefficient. This solution was designed to decouple Gensim from the external tokenizers.
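For illustration, one way to adapt NLTK's TreebankWordTokenizer to gensim's tokenizer_func signature (the length filtering shown is an assumption, mirroring gensim's default tokenize):

from nltk.tokenize.treebank import TreebankWordTokenizer

_treebank = TreebankWordTokenizer()

def treebank_tokenizer(text, token_min_len, token_max_len, lower):
    # match gensim's tokenizer_func signature: (text, token_min_len, token_max_len, lower)
    tokens = _treebank.tokenize(text.lower() if lower else text)
    # length filtering assumed here, mirroring gensim's default tokenize
    return [t for t in tokens if token_min_len <= len(t) <= token_max_len]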

@gojomo
Copy link
Collaborator

gojomo commented Jul 29, 2024

From my perspective, wrapping any set of arbitrary external tokenizers into something Gensim sees as a single function is even stronger decoupling, making Gensim oblivious even to the fact that the tokenizer is a composite of many steps.

I don't see how moving the "call-each-in-a-loop" logic inside a Gensim class would improve efficiency; the same steps happen in either implementation.
