
Dev.ej/lexicon tokenizer #405

Merged
merged 7 commits into main from dev.ej/lexicon-tokenizer on Nov 12, 2024

Conversation

joanise
Collaborator

@joanise joanise commented Oct 31, 2024

PR Goal?

Introduce the lexicon tokenizer

Fixes?

Fixes #401

Feedback sought?

This is a draft PR, seeking early feedback on the algorithm I implemented.

Priority?

medium-high

Tests added?

yes

How to test?

Run `g2p convert "We'll let go, she'll stop." eng eng-ipa` and check that the output is `wil lɛt ɡoʊ, ʃil stɑp.` instead of `wi' lɛt ɡoʊ, ʃi' stɑp.`
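
For a scripted version of this check, here is a minimal smoke test that simply wraps the CLI command above (the command and expected output come from this description; the assertion itself is just an illustration, not part of the PR's test suite):

```python
import subprocess

# Run the exact CLI command from the "How to test?" section above.
result = subprocess.run(
    ["g2p", "convert", "We'll let go, she'll stop.", "eng", "eng-ipa"],
    capture_output=True,
    text=True,
    check=True,
)

# With the lexicon tokenizer, "We'll" should come out as "wil" rather than "wi'".
assert "wil" in result.stdout, result.stdout
```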

Confidence?

moderate only.

  • unit testing tells me it's fine
  • but I have not yet done any benchmarking to see if we need to optimize

Version change?

Probably yes, at least a patch bump, since this is a bug fix, maybe even a minor bump?

@joanise joanise marked this pull request as draft October 31, 2024 21:25
@joanise joanise requested review from dhdaines and roedoejet October 31, 2024 21:25
Contributor

github-actions bot commented Oct 31, 2024

CLI load time: 0:00.05
Pull Request HEAD: c3d73bfa8014c34c7dc004637465ca453f9e9cbf
Imports that take more than 0.1 s:
import time: self [us] | cumulative | imported package


codecov bot commented Oct 31, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.29%. Comparing base (1262cbb) to head (c3d73bf).
Report is 8 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #405      +/-   ##
==========================================
+ Coverage   93.89%   94.29%   +0.40%     
==========================================
  Files          18       18              
  Lines        2587     2664      +77     
  Branches      580      598      +18     
==========================================
+ Hits         2429     2512      +83     
+ Misses         91       88       -3     
+ Partials       67       64       -3     


@joanise joanise marked this pull request as ready for review November 1, 2024 19:19
Base automatically changed from dev.ej/misc to main November 1, 2024 19:26
@joanise
Collaborator Author

joanise commented Nov 4, 2024

Some benchmarking notes: I tested g2p'ing the UNDHR (https://www.un.org/en/about-us/universal-declaration-of-human-rights) text, and it's actually about 1% faster on this branch than on origin/main, so clearly this implementation is fine on "normal" text.

I created a degenerate case with a word surrounded by dozens of punctuation marks on each side, and in that case I can see a slow-down on this branch. That makes sense, because I'm doing O(n^2) lookups, where n is the number of candidate tokens, and each lookup costs O(log k), where k is the lexicon size. I'm inclined to say YAGNI on optimizing this: it's only a problem on artificial data. Alternatively, as a defensive-programming measure, I could cap the number of non-alphabetic characters before or after alphabetic ones that I include in the lexicon lookups.
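
For readers following the cost argument, here is a rough illustrative sketch (not the PR's actual code) of why the degenerate case is quadratic: every start position can spawn up to n candidate spans, and each candidate is checked against a sorted lexicon in O(log k):

```python
import bisect

# Toy lexicon, kept sorted so each membership test is O(log k) via binary search.
LEXICON = sorted(["we'll", "she'll", "let", "go", "stop"])

def in_lexicon(candidate: str) -> bool:
    """O(log k) membership test on the sorted toy lexicon."""
    i = bisect.bisect_left(LEXICON, candidate)
    return i < len(LEXICON) and LEXICON[i] == candidate

def longest_lexicon_match(text: str, start: int) -> int:
    """Return the end index of the longest candidate starting at `start` that is
    in the lexicon, falling back to a single character if nothing matches.
    Trying every end position for every start is what makes pathological inputs
    (a word buried in dozens of punctuation marks) O(n^2) overall."""
    best_end = start + 1
    for end in range(start + 1, len(text) + 1):
        if in_lexicon(text[start:end].lower()):
            best_end = end
    return best_end

# "we'll" stays a single token because it is in the lexicon,
# instead of being split at the apostrophe.
assert longest_lexicon_match("we'll let go", 0) == len("we'll")
```

The mitigation suggested above maps directly onto this sketch: capping how many non-alphabetic characters may appear in a candidate span bounds the inner loop, so pathological punctuation runs can no longer blow up the number of candidates.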

@roedoejet
Owner

I'm inclined to say YAGNI on optimizing this

Agreed. Thanks for looking into this though, @joanise!

Owner

@roedoejet roedoejet left a comment


Just a couple small typos, but this looks good to me. Nice that you fit in a few optimizations as well. Thanks @joanise - tested and it works well on my machine.

g2p/mappings/tokenizer.py (review comments, outdated and resolved)
g2p/mappings/utils.py (review comments, outdated and resolved)
joanise and others added 7 commits November 12, 2024 15:06
While merge_if_same_label was more generic, we never reused it, and it
was really hard to understand what it did.
Also:
 - resolve ensuing typing errors
 - Add more typing declarations to make it all coherent
 - Add an __all__ to g2p/__init__.py because otherwise mypy doesn't like
   that we import Token there without using it explicitly: it is indeed
   imported just so API users can import it, so this is logical.
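
As context for that last commit message, a minimal sketch of the kind of re-export it describes (the import path for Token is assumed here, not taken from the PR diff):

```python
# Hypothetical excerpt of g2p/__init__.py: listing Token in __all__ marks the
# import as an intentional re-export for API users, so mypy no longer flags it
# as unused even though this module never references Token itself.
from g2p.mappings.utils import Token  # assumed location; the real path may differ

__all__ = ["Token"]
```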
@joanise joanise force-pushed the dev.ej/lexicon-tokenizer branch from ba1962f to c3d73bf on November 12, 2024 20:11
@joanise joanise merged commit b3ee783 into main Nov 12, 2024
8 checks passed
@joanise joanise deleted the dev.ej/lexicon-tokenizer branch November 12, 2024 20:21
Successfully merging this pull request may close these issues:

Possible issue with tokenization