Convert word counts to u64 #1433

Merged 2 commits into huggingface:main on Feb 6, 2024

Conversation

@stephenroller (Contributor) commented Jan 17, 2024

Fixes #437, where the BPE trainer will overflow and fail to merge the most common words into the vocabulary when training on a very large corpus.
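For context, a minimal Rust sketch of the failure mode (hypothetical numbers and names, not the PR's actual diff): with 32-bit counters, a word seen more than u32::MAX (~4.29 billion) times wraps around, so the corpus's most frequent words look rare and never win a BPE merge.

```rust
// Hypothetical sketch of the overflow, not the actual trainer code.
fn main() {
    // Plausible frequency for "the" in a web-scale corpus (made-up number).
    let occurrences: u64 = 5_000_000_000;

    // 32-bit tally: modular arithmetic models the wrap-around that
    // release-mode u32 addition produces past u32::MAX.
    let as_u32 = (occurrences % (u32::MAX as u64 + 1)) as u32;
    println!("u32 count: {as_u32}"); // 705032704 -- looks far rarer than it is

    // 64-bit tally: the widening this PR applies to the word counts.
    let as_u64: u64 = occurrences;
    println!("u64 count: {as_u64}"); // 5000000000
}
```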

@julien-c (Member)

Hi @stephenroller - pinging @LysandreJik @ArthurZucker.

@stephenroller (Contributor, Author)

Whoops, I meant to do this in my fork.

@stephenroller (Contributor, Author)

However, since I did open-source it, I don't mind if the patch is upstreamed. This should fix #437.

I haven't tested it yet, but I can report back on whether it fixes the issue.

@stephenroller marked this pull request as draft on January 17, 2024 14:36
@stephenroller (Contributor, Author)

This worked pretty well at fixing the overflow issue. I didn't time it, though. Can we at least let CI run?

@stephenroller marked this pull request as ready for review on January 27, 2024 12:14
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker (Collaborator)

Sure! For the tests, `make bench` will help tell whether it's slower!
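For readers reproducing the comparison reported below, a hedged sketch of the workflow (assuming `make bench` drives the Criterion benchmarks whose output follows; the commit hashes are the ones quoted in the later comment):

```sh
# Benchmark the parent revision, then the PR revision; Criterion keeps the
# previous run as a baseline and reports the change between the two runs.
git checkout 888dd4b && make bench
git checkout fd24c27 && make bench
```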

@ArthurZucker (Collaborator) left a review comment


LGTM otherwise

@ArthurZucker (comment marked as outdated)

@stephenroller (Contributor, Author)
Parent revision (888dd4b):

BPE Train vocabulary (small)
                        time:   [47.449 ms 48.051 ms 48.541 ms]
                        change: [-2.9189% -1.4847% -0.1209%] (p = 0.06 > 0.05)
                        No change in performance detected.
slope  [47.449 ms 48.541 ms] R^2            [0.9924777 0.9934990]
mean   [47.375 ms 48.547 ms] std. dev.      [643.16 µs 1.2093 ms]
median [47.092 ms 48.832 ms] med. abs. dev. [79.033 µs 1.6688 ms]

Benchmarking BPE Train vocabulary (big)
Benchmarking BPE Train vocabulary (big): Warming up for 3.0000 s

Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 15.3s.
Benchmarking BPE Train vocabulary (big): Collecting 10 samples in estimated 15.301 s (10 iterations)
Benchmarking BPE Train vocabulary (big): Analyzing
BPE Train vocabulary (big)
                        time:   [1.5374 s 1.5507 s 1.5621 s]
                        change: [-0.1107% +1.5458% +3.0647%] (p = 0.09 > 0.05)
                        No change in performance detected.
mean   [1.5374 s 1.5621 s] std. dev.      [9.5283 ms 26.677 ms]
median [1.5343 s 1.5676 s] med. abs. dev. [3.1427 ms 34.238 ms]

This revision (fd24c27):

BPE Train vocabulary (small)
                        time:   [47.520 ms 47.762 ms 48.089 ms]
slope  [47.520 ms 48.089 ms] R^2            [0.9983095 0.9979029]
mean   [47.509 ms 48.107 ms] std. dev.      [300.75 µs 638.42 µs]
median [47.403 ms 48.241 ms] med. abs. dev. [91.004 µs 873.39 µs]

Benchmarking BPE Train vocabulary (big)
Benchmarking BPE Train vocabulary (big): Warming up for 3.0000 s

Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 15.5s.
Benchmarking BPE Train vocabulary (big): Collecting 10 samples in estimated 15.538 s (10 iterations)
Benchmarking BPE Train vocabulary (big): Analyzing
BPE Train vocabulary (big)
                        time:   [1.5555 s 1.5662 s 1.5777 s]
mean   [1.5555 s 1.5777 s] std. dev.      [11.281 ms 23.900 ms]
median [1.5500 s 1.5799 s] med. abs. dev. [2.7785 ms 32.001 ms]

@ArthurZucker (Collaborator)

Thanks for the PR and for making sure we have no performance issues! Merging 🤗

@ArthurZucker merged commit 4a8105c into huggingface:main on Feb 6, 2024
12 checks passed
Successfully merging this pull request may close these issues.

"the" token is splitted to "t" "h" "e" in large scale corpus