Dev.ej/lexicon tokenizer #405
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #405      +/-   ##
==========================================
+ Coverage   93.89%   94.29%   +0.40%
==========================================
  Files          18       18
  Lines        2587     2664      +77
  Branches      580      598      +18
==========================================
+ Hits         2429     2512      +83
+ Misses         91       88       -3
+ Partials       67       64       -3

☔ View full report in Codecov by Sentry.
Some benchmarking notes: I tested g2p'ing the UDHR text (https://www.un.org/en/about-us/universal-declaration-of-human-rights), and it's actually about 1% faster on this branch than on main.

I created a degenerate case with a word surrounded by dozens of punctuation marks on each side, and in that case I can see a slow-down on this branch, which makes sense because I'm doing O(n^2) lookups, where n is the number of candidate tokens, and each lookup costs O(log k), where k is the lexicon size.

I'm inclined to say YAGNI on optimizing this: it's only a problem on artificial data. Alternatively, I could cap the number of non-alphabetic characters before or after alphabetic ones that I include in the lexicon lookups, as a defensive programming solution.
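For illustration only, here is a minimal sketch (not the PR's actual code) of what that defensive cap could look like: a sorted-lexicon lookup with O(log k) cost, plus a check that rejects candidate spans padded by too many non-alphabetic characters. The names (MAX_NONALPHA, in_lexicon, within_cap), the cap value, and the toy lexicon are all hypothetical.

```python
import bisect

MAX_NONALPHA = 4  # hypothetical cap on punctuation kept around a word

def in_lexicon(sorted_lexicon: list[str], candidate: str) -> bool:
    """O(log k) membership test on a sorted lexicon of size k."""
    i = bisect.bisect_left(sorted_lexicon, candidate.lower())
    return i < len(sorted_lexicon) and sorted_lexicon[i] == candidate.lower()

def within_cap(candidate: str) -> bool:
    """Reject spans with more than MAX_NONALPHA leading or trailing
    non-alphabetic characters, so a word buried in dozens of punctuation
    marks triggers far fewer lexicon lookups."""
    leading = 0
    while leading < len(candidate) and not candidate[leading].isalpha():
        leading += 1
    trailing = 0
    while trailing < len(candidate) and not candidate[-1 - trailing].isalpha():
        trailing += 1
    return leading <= MAX_NONALPHA and trailing <= MAX_NONALPHA

lexicon = sorted(["go", "let", "she'll", "stop", "we'll"])
print(in_lexicon(lexicon, "We'll"))    # True
print(within_cap("!!!!!!!!!!word"))    # False: too much leading punctuation
```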
Agreed. Thanks for looking into this though @joanise!
Just a couple small typos, but this looks good to me. Nice that you fit in a few optimizations as well. Thanks @joanise - tested and it works well on my machine.
While merge_if_same_label was more generic, we never reused it, and it was really hard to understand what it did.
Also:
- Resolve ensuing typing errors
- Add more typing declarations to make it all coherent
- Add a __all__ to g2p/__init__.py because otherwise mypy doesn't like that we import Token there without using it explicitly: it is indeed imported just so API users can import it, so this is logical (see the sketch below).
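As an illustration of that re-export pattern (the module path below is an assumption, not necessarily where Token actually lives in g2p):

```python
# g2p/__init__.py -- illustrative sketch only; g2p.shared_types is a
# hypothetical module path.
from g2p.shared_types import Token  # re-exported so API users can `from g2p import Token`

# Listing Token in __all__ tells mypy the import is a deliberate re-export,
# not an unused symbol.
__all__ = ["Token"]
```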
ba1962f to c3d73bf
PR Goal?
Introduce the lexicon tokenizer
Fixes?
Fixes #401
Feedback sought?
This is a draft PR, seeking early feedback on the algorithm I implemented.
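To make the idea concrete, here is a rough sketch of lexicon-aware tokenization on the example sentence from "How to test" below. It only illustrates the general approach (keep a span containing apostrophes whole when the lexicon knows it, otherwise fall back to splitting on non-alphabetic characters); it is not necessarily the algorithm as implemented in this PR, and the toy lexicon and function name are made up.

```python
import re

# Toy lexicon for illustration; the real English lexicon is much larger.
LEXICON = {"we'll", "she'll", "let", "go", "stop"}

def lexicon_tokenize(text: str) -> list[str]:
    """Keep spans like "we'll" whole when the lexicon contains them;
    otherwise split the span further on non-alphabetic characters."""
    tokens = []
    # Coarse pass: runs of letters/apostrophes vs. everything else.
    for chunk in re.findall(r"[A-Za-z']+|[^A-Za-z']+", text):
        if chunk.lower() in LEXICON:
            tokens.append(chunk)  # known word: one token, apostrophe and all
        else:
            # Unknown chunk: fall back to the default letter/non-letter split.
            tokens.extend(re.findall(r"[A-Za-z]+|[^A-Za-z]+", chunk))
    return tokens

print(lexicon_tokenize("We'll let go, she'll stop."))
# ["We'll", ' ', 'let', ' ', 'go', ', ', "she'll", ' ', 'stop', '.']
```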
Priority?
medium-high
Tests added?
yes
How to test?
Run
g2p convert "We'll let go, she'll stop." eng eng-ipa
and see the output: wil lɛt ɡoʊ, ʃil stɑp.
instead of: wi' lɛt ɡoʊ, ʃi' stɑp.
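The same check can presumably be done from the Python API; this assumes the existing make_g2p entry point and that it applies the same tokenization as the CLI:

```python
from g2p import make_g2p

transducer = make_g2p("eng", "eng-ipa")
print(transducer("We'll let go, she'll stop.").output_string)
# Expected on this branch: wil lɛt ɡoʊ, ʃil stɑp.
```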
Confidence?
moderate only.
Version change?
Probably yes, at least a patch bump, since this is a bug fix, maybe even a minor bump?