Dev.ej/compact lexicon #400
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main     #400      +/-   ##
==========================================
+ Coverage   93.82%   93.89%   +0.06%
==========================================
  Files          18       18
  Lines        2575     2587      +12
  Branches      577      580       +3
==========================================
+ Hits         2416     2429      +13
  Misses         91       91
+ Partials       68       67       -1
```

☔ View full report in Codecov by Sentry.
Force-pushed from 0c76367 to 6991b4e
Store the lexicon as joined groups of 16 entries to reduce the per-str-object memory overhead. I experimented with various block sizes to see the memory impact, measured by running `import g2p; g2p.get_arpabet_lang()`:

- original: 71MB
- blocks of 4: 59MB
- blocks of 16: 56MB
- blocks of 256: 55MB

I decided the 15MB RAM saving was worth it for blocks of 16; the gain beyond that is trivial and not worth it. In terms of speed, the original code and blocks of 16 are the same, at least within measurement error. The benchmark was running `g2p convert --file en.txt eng eng-ipa`, where en.txt is a file containing all the words in the cmudict lexicon: the original and blocks of 16 both took 20-21 seconds depending on the run. At blocks of 256 I was getting 23 seconds: not a big difference, but measurable, for no significant memory gain.
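A minimal sketch of the block-compaction idea under discussion (the names, the entry format, and the separator character are illustrative assumptions, not the actual g2p implementation):

```python
# Sketch of block compaction: join every 16 lexicon entries into one str,
# paying the per-str object overhead once per block instead of per entry.
BLOCK_SIZE = 16
SEP = "\x00"  # assumed separator: any character that never occurs in entries

def compact(entries: list[str]) -> list[str]:
    """Join entries into blocks of BLOCK_SIZE strings."""
    return [
        SEP.join(entries[i : i + BLOCK_SIZE])
        for i in range(0, len(entries), BLOCK_SIZE)
    ]

def lookup(blocks: list[str], index: int) -> str:
    """Recover entry `index` by splitting its containing block."""
    return blocks[index // BLOCK_SIZE].split(SEP)[index % BLOCK_SIZE]

entries = [f"word{i} W ER D" for i in range(1000)]
blocks = compact(entries)
assert lookup(blocks, 42) == entries[42]
```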
Force-pushed from 6991b4e to d03aabb
I think this actually is the memory mapping solution we're looking for, because if we put all the data together in a single (or multiple, whatever) block, then... we can easily memory map that block, without needing any external dependencies. But also, if we support some kind of compact trainable G2P rules, we don't need to store the lexicon entries that are predictable by the rules, so that's something to think about (I could resume my debogosifying efforts on Phonetisaurus, for instance).
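A tiny illustration of that mmap idea, standard library only (the file name and slice offsets are made up for the example):

```python
import mmap

# If the joined lexicon block is written to a plain file, the stdlib mmap
# module can expose it without reading the whole file into RAM; entries
# are then fetched as byte slices backed by the OS page cache.
with open("lexicon_block.txt", "rb") as f:   # hypothetical file name
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        entry = mm[120:157].decode("utf-8")  # illustrative offsets
```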
Before we use this for memory mapping, I'd want to convert the solution from splitting the compacted blocks in memory to reading the entries in place. But yeah, if we add the ability to access individual entries via string slices, we could turn the whole list into a single string, i.e., just one compact block (sketched below). Question for @dhdaines: do we merge this anyway, take the 15MB saving, and plan a future update with a single block as described above, or do we hold off on this PR?
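A hypothetical sketch of that single-block, read-in-place direction (the `build`/`lookup` names and the offsets encoding are illustrative, not part of this PR):

```python
import array

def build(entries: list[str]) -> tuple[str, array.array]:
    """Concatenate all entries into one string, recording end offsets."""
    offsets = array.array("I")  # compact unsigned ints, one per entry
    pos = 0
    for entry in entries:
        pos += len(entry)
        offsets.append(pos)
    return "".join(entries), offsets

def lookup(block: str, offsets: array.array, index: int) -> str:
    """Read entry `index` in place, as a slice of the single block."""
    start = offsets[index - 1] if index > 0 else 0
    return block[start : offsets[index]]

entries = [f"word{i} W ER D" for i in range(1000)]
block, offsets = build(entries)
assert lookup(block, offsets, 42) == entries[42]
```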
Yes, merge it now, because the internal representation of our rules/lexicons is not a public API/ABI, so it really doesn't matter if we change it.
Sounds good. 😁 Can you approve it, then?
Approved, though you could add the assertions / comments mentioned if you want to make sure future generations understand what's going on :)
Yes, adding that additional documentation (and possibly assertions) would be helpful; this is definitely not obvious code. That's part of why I was so extensive with the unit testing: I wanted the tests to confirm I didn't miss any corner cases.
Thanks for the careful review and these comments.
19be394 documents the algorithm much more clearly now. |
Force-pushed from 19be394 to 291708d
PR Goal?
A bit of a crazy idea to reduce the amount of memory used by the g2p library: the English lexicon takes a lot of space in memory because each word is stored in a string, and each string object in Python is quite large, well beyond the size of the actual character sequence.
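To make that overhead concrete, here is a quick check with `sys.getsizeof` (numbers are from CPython 3.x and vary slightly by version and platform):

```python
import sys

print(sys.getsizeof(""))       # ~49 bytes of object overhead, zero characters
print(sys.getsizeof("hello"))  # ~54 bytes for five ASCII characters
```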
So, I thought, let's see what happens if I compact them by joining them into blocks. And: a) it saves 15MB of RAM when loading langs, b) with no measurable speed cost. The real cost is the increased complexity of the `find_alignment()` function in `g2p.mappings.utils`, but it goes from six lines to 15 lines, so, really, not that bad. I've tested carefully to make sure the results are identical and speed is not adversely affected.
The `langs.json.gz` file is a tad larger, but the memory footprint from loading it is significantly smaller.

Fixes?
Alternative solution to the problem raised in #395
Feedback sought?
Careful review.
And, whether we can reasonably expect a memory mapping solution soon, which would make this patch irrelevant.
Priority?
low
Tests added?
Yes. Coverage was not as good as I expected at first, so I fixed that to make sure I've tested all the corner cases.
How to test?
`g2p convert "some test" eng eng-ipa` still works as expected.

Confidence?
high
Version change?
nope