
Dev.ej/lexicon tokenizer #405

Merged
merged 7 commits into main from dev.ej/lexicon-tokenizer on Nov 12, 2024

Conversation

joanise
Collaborator

@joanise joanise commented Oct 31, 2024

PR Goal?

Introduce the lexicon tokenizer

Fixes?

Fixes #401

Feedback sought?

This is a draft PR, seeking early feedback on the algorithm I implemented.

Priority?

medium-high

Tests added?

yes

How to test?

Run `g2p convert "We'll let go, she'll stop." eng eng-ipa` and check that the output is `wil lɛt ɡoʊ, ʃil stɑp.` instead of `wi' lɛt ɡoʊ, ʃi' stɑp.`
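
For a scripted version of this check, here is a minimal smoke test that simply wraps the CLI command above (the command and expected output come from this description; the assertion itself is just an illustration, not part of the PR's test suite):

```python
import subprocess

# Run the exact CLI command from the "How to test?" section above.
result = subprocess.run(
    ["g2p", "convert", "We'll let go, she'll stop.", "eng", "eng-ipa"],
    capture_output=True,
    text=True,
    check=True,
)

# With the lexicon tokenizer, "We'll" should come out as "wil" rather than "wi'".
assert "wil" in result.stdout, result.stdout
```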

Confidence?

moderate only.

  • unit testing tells me it's fine
  • but I have not yet done any benchmarking to see if we need to optimize

Version change?

Probably yes, at least a patch bump, since this is a bug fix, maybe even a minor bump?

@joanise joanise marked this pull request as draft October 31, 2024 21:25
@joanise joanise requested review from dhdaines and roedoejet October 31, 2024 21:25
Contributor

github-actions bot commented Oct 31, 2024

CLI load time: 0:00.05
Pull Request HEAD: c3d73bfa8014c34c7dc004637465ca453f9e9cbf
Imports that take more than 0.1 s:
import time: self [us] | cumulative | imported package


codecov bot commented Oct 31, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.29%. Comparing base (1262cbb) to head (c3d73bf).
Report is 8 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #405      +/-   ##
==========================================
+ Coverage   93.89%   94.29%   +0.40%     
==========================================
  Files          18       18              
  Lines        2587     2664      +77     
  Branches      580      598      +18     
==========================================
+ Hits         2429     2512      +83     
+ Misses         91       88       -3     
+ Partials       67       64       -3     


@joanise joanise marked this pull request as ready for review November 1, 2024 19:19
Base automatically changed from dev.ej/misc to main November 1, 2024 19:26
@joanise
Collaborator Author

joanise commented Nov 4, 2024

Some benchmarking notes: I tested g2p'ing the UNDHR (https://www.un.org/en/about-us/universal-declaration-of-human-rights) text, and it's actually about 1% faster on this branch than on origin/main, so clearly this implementation is fine on "normal" text.

I created a degenerate case with a word surrounded by dozens of punctuation marks on each side, and in that case I can see a slow-down on this branch. That makes sense, because I'm doing O(n^2) lookups, where n is the number of candidate tokens, and each lookup costs O(log k), where k is the lexicon size. I'm inclined to say YAGNI on optimizing this: it's only a problem on artificial data. Alternatively, as a defensive-programming measure, I could cap the number of non-alphabetic characters before or after alphabetic ones that I include in the lexicon lookups.
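
For readers following the cost argument, here is a rough illustrative sketch (not the PR's actual code) of why the degenerate case is quadratic: every start position can spawn up to n candidate spans, and each candidate is checked against a sorted lexicon in O(log k):

```python
import bisect

# Toy lexicon, kept sorted so each membership test is O(log k) via binary search.
LEXICON = sorted(["we'll", "she'll", "let", "go", "stop"])

def in_lexicon(candidate: str) -> bool:
    """O(log k) membership test on the sorted toy lexicon."""
    i = bisect.bisect_left(LEXICON, candidate)
    return i < len(LEXICON) and LEXICON[i] == candidate

def longest_lexicon_match(text: str, start: int) -> int:
    """Return the end index of the longest candidate starting at `start` that is
    in the lexicon, falling back to a single character if nothing matches.
    Trying every end position for every start is what makes pathological inputs
    (a word buried in dozens of punctuation marks) O(n^2) overall."""
    best_end = start + 1
    for end in range(start + 1, len(text) + 1):
        if in_lexicon(text[start:end].lower()):
            best_end = end
    return best_end

# "we'll" stays a single token because it is in the lexicon,
# instead of being split at the apostrophe.
assert longest_lexicon_match("we'll let go", 0) == len("we'll")
```

The mitigation suggested above maps directly onto this sketch: capping how many non-alphabetic characters may appear in a candidate span bounds the inner loop, so pathological punctuation runs can no longer blow up the number of candidates.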

@roedoejet
Owner

I'm inclined to say YAGNI on optimizing this

Agreed. Thanks for looking into this though, @joanise!

Owner

@roedoejet roedoejet left a comment


Just a couple small typos, but this looks good to me. Nice that you fit in a few optimizations as well. Thanks @joanise - tested and it works well on my machine.

g2p/mappings/tokenizer.py (review comments, outdated and resolved)
g2p/mappings/utils.py (review comments, outdated and resolved)
joanise and others added 7 commits November 12, 2024 15:06
While merge_if_same_label was more generic, we never reused it, and it
was really hard to understand what it did.
Also:
 - resolve ensuing typing errors
 - Add more typing declarations to make it all coherent
 - Add an __all__ to g2p/__init__.py because otherwise mypy doesn't like
   that we import Token there without using it explicitly: it is indeed
   imported just so API users can import it, so this is logical.
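
As context for that last commit message, a minimal sketch of the kind of re-export it describes (the import path for Token is assumed here, not taken from the PR diff):

```python
# Hypothetical excerpt of g2p/__init__.py: listing Token in __all__ marks the
# import as an intentional re-export for API users, so mypy no longer flags it
# as unused even though this module never references Token itself.
from g2p.mappings.utils import Token  # assumed location; the real path may differ

__all__ = ["Token"]
```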
@joanise joanise force-pushed the dev.ej/lexicon-tokenizer branch from ba1962f to c3d73bf on November 12, 2024 20:11
@joanise joanise merged commit b3ee783 into main Nov 12, 2024
8 checks passed
@joanise joanise deleted the dev.ej/lexicon-tokenizer branch November 12, 2024 20:21
Successfully merging this pull request may close these issues:

Possible issue with tokenization