Convert word counts to u64 #1433

Merged 2 commits into huggingface:main on Feb 6, 2024

Conversation

@stephenroller (Contributor) commented Jan 17, 2024

Fixes #437, where the BPE trainer will overflow and fail to merge the most common words into the vocabulary when training on a very large corpus.
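For context, a minimal Rust sketch of the failure mode (hypothetical numbers and names, not the PR's actual diff): with 32-bit counters, a word seen more than u32::MAX (~4.29 billion) times wraps around, so the corpus's most frequent words look rare and never win a BPE merge.

```rust
// Hypothetical sketch of the overflow, not the actual trainer code.
fn main() {
    // Plausible frequency for "the" in a web-scale corpus (made-up number).
    let occurrences: u64 = 5_000_000_000;

    // 32-bit tally: modular arithmetic models the wrap-around that
    // release-mode u32 addition produces past u32::MAX.
    let as_u32 = (occurrences % (u32::MAX as u64 + 1)) as u32;
    println!("u32 count: {as_u32}"); // 705032704 -- looks far rarer than it is

    // 64-bit tally: the widening this PR applies to the word counts.
    let as_u64: u64 = occurrences;
    println!("u64 count: {as_u64}"); // 5000000000
}
```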

@julien-c (Member)

Hi @stephenroller - pinging @LysandreJik @ArthurZucker.

@stephenroller (Contributor, Author)

Whoops, I meant to do this in my fork.

@stephenroller (Contributor, Author)

However, since I did open-source it, I don't mind if the patch is upstreamed. This should fix #437.

I haven't tested it yet, but I can report back on whether it fixes the issue.

@stephenroller marked this pull request as draft on January 17, 2024 14:36
@stephenroller (Contributor, Author)

This worked pretty well at fixing the overflow issue. I didn't time it, though. Can we at least let CI run?

@stephenroller marked this pull request as ready for review on January 27, 2024 12:14
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker (Collaborator)

Sure! For the tests, `make bench` will help tell whether it's slower!
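For readers reproducing the comparison reported below, a hedged sketch of the workflow (assuming `make bench` drives the Criterion benchmarks whose output follows; the commit hashes are the ones quoted in the later comment):

```sh
# Benchmark the parent revision, then the PR revision; Criterion keeps the
# previous run as a baseline and reports the change between the two runs.
git checkout 888dd4b && make bench
git checkout fd24c27 && make bench
```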

@ArthurZucker (Collaborator) left a review comment


LGTM otherwise

@ArthurZucker (comment marked as outdated)

@stephenroller (Contributor, Author)
Parent revision (888dd4b):

BPE Train vocabulary (small)
                        time:   [47.449 ms 48.051 ms 48.541 ms]
                        change: [-2.9189% -1.4847% -0.1209%] (p = 0.06 > 0.05)
                        No change in performance detected.
slope  [47.449 ms 48.541 ms] R^2            [0.9924777 0.9934990]
mean   [47.375 ms 48.547 ms] std. dev.      [643.16 µs 1.2093 ms]
median [47.092 ms 48.832 ms] med. abs. dev. [79.033 µs 1.6688 ms]

Benchmarking BPE Train vocabulary (big)
Benchmarking BPE Train vocabulary (big): Warming up for 3.0000 s

Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 15.3s.
Benchmarking BPE Train vocabulary (big): Collecting 10 samples in estimated 15.301 s (10 iterations)
Benchmarking BPE Train vocabulary (big): Analyzing
BPE Train vocabulary (big)
                        time:   [1.5374 s 1.5507 s 1.5621 s]
                        change: [-0.1107% +1.5458% +3.0647%] (p = 0.09 > 0.05)
                        No change in performance detected.
mean   [1.5374 s 1.5621 s] std. dev.      [9.5283 ms 26.677 ms]
median [1.5343 s 1.5676 s] med. abs. dev. [3.1427 ms 34.238 ms]

This revision (fd24c27):

BPE Train vocabulary (small)
                        time:   [47.520 ms 47.762 ms 48.089 ms]
slope  [47.520 ms 48.089 ms] R^2            [0.9983095 0.9979029]
mean   [47.509 ms 48.107 ms] std. dev.      [300.75 µs 638.42 µs]
median [47.403 ms 48.241 ms] med. abs. dev. [91.004 µs 873.39 µs]

Benchmarking BPE Train vocabulary (big)
Benchmarking BPE Train vocabulary (big): Warming up for 3.0000 s

Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 15.5s.
Benchmarking BPE Train vocabulary (big): Collecting 10 samples in estimated 15.538 s (10 iterations)
Benchmarking BPE Train vocabulary (big): Analyzing
BPE Train vocabulary (big)
                        time:   [1.5555 s 1.5662 s 1.5777 s]
mean   [1.5555 s 1.5777 s] std. dev.      [11.281 ms 23.900 ms]
median [1.5500 s 1.5799 s] med. abs. dev. [2.7785 ms 32.001 ms]

@ArthurZucker (Collaborator)

Thanks for the PR and for making sure we have no performance issues! Merging 🤗

@ArthurZucker merged commit 4a8105c into huggingface:main on Feb 6, 2024
12 checks passed
Successfully merging this pull request may close these issues.

"the" token is splitted to "t" "h" "e" in large scale corpus