Improvement: WikiCorpus class now can receive multiple tokenizing functions, that can be simple, list or tuple. #3553
…tokenizing functions.
…list of tokenizing functions.
…ctions, that can be simple, list or tuple. Addition of tests for the WikiCorpus class to be able to receive multiple tokenizing functions.
I can see why some would want to stack a bunch of tokenizers to run in order. But this code's particular loop of (tokenize, re-join, repeat) is a mild anti-pattern in terms of efficiency. Sure, you'll do it if all your tokenizer functions only take untokenized strings, but in any custom complex tokenizing code, you'd try to avoid extra re-joins. It's fine if it works, but it's not necessarily a practice to implicitly endorse as routine/normal in the API. (And note that it adds an unnecessary join/split overhead to the usual single-tokenizer case!) If you need this functionality and are OK with such inefficiency, a few of the same code lines outside
I'd rather not grow the
@gojomo is exactly right: why not do such preprocessing externally? Does Gensim need to know?
In my case, I need to queue two or more tokenizers in different orders and compare the results.
From my perspective, wrapping any set of arbitrary external tokenizers into something Gensim sees as a single function is even stronger decoupling, making Gensim oblivious even to the fact that the tokenizer is a composite of many steps. I don't see any way that moving the "call-each-in-a-loop" logic inside a Gensim class would improve efficiency; the same steps are happening in either implementation.
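For illustration, the external wrapping described above might look like the following minimal sketch. It is not code from this PR or from Gensim. `ComposedTokenizer`, `whitespace_tokenizer`, and `hyphen_splitter` are hypothetical names, and the `(text, token_min_len, token_max_len, lower)` call signature is assumed to match what `WikiCorpus` passes to its `tokenizer_func` argument. A class is used rather than a closure on the assumption that the callable may need to be pickled for worker processes.

```python
from typing import Callable, List, Sequence

# Assumed signature of a WikiCorpus-style tokenizer function.
Tokenizer = Callable[[str, int, int, bool], List[str]]


class ComposedTokenizer:
    """Hypothetical wrapper: runs several tokenizers in order as one callable."""

    def __init__(self, stages: Sequence[Tokenizer]):
        if not stages:
            raise ValueError("at least one tokenizer is required")
        self.stages = list(stages)

    def __call__(self, text, token_min_len, token_max_len, lower):
        tokens = self.stages[0](text, token_min_len, token_max_len, lower)
        for stage in self.stages[1:]:
            # Later stages only accept raw strings, so re-join first.
            # This is the join/split overhead discussed in the thread.
            tokens = stage(" ".join(tokens), token_min_len, token_max_len, lower)
        return tokens


# Two toy tokenizers, defined only for this example.
def whitespace_tokenizer(text, token_min_len, token_max_len, lower):
    if lower:
        text = text.lower()
    return [t for t in text.split() if token_min_len <= len(t) <= token_max_len]


def hyphen_splitter(text, token_min_len, token_max_len, lower):
    if lower:
        text = text.lower()
    parts = (p for t in text.split() for p in t.split("-"))
    return [p for p in parts if token_min_len <= len(p) <= token_max_len]


# Same stages, two different orders, compared on one sample string.
sample = "State-of-the-art NLP"
ws_then_hyphen = ComposedTokenizer([whitespace_tokenizer, hyphen_splitter])
hyphen_then_ws = ComposedTokenizer([hyphen_splitter, whitespace_tokenizer])

result_a = ws_then_hyphen(sample, 3, 15, True)
result_b = hyphen_then_ws(sample, 3, 15, True)

# Assumed usage with Gensim (not executed here):
#   WikiCorpus(dump_path, tokenizer_func=hyphen_then_ws)
```

With this approach a single-tokenizer setup pays no join/split cost, and comparing orderings only requires building another `ComposedTokenizer` with the stages rearranged.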
If you want to queue a sequence of different custom tokenizers, the WikiCorpus class can now receive multiple tokenizing functions, passed as a single function, a list, or a tuple.
To see a usage example: https://github.com/LINE-PESC/gensim/blob/55a7454c274cb9802ceea38a9e5782dad735210d/gensim/test/test_corpora.py#L736
This makes it possible to run several combinations of tokenizers.