
3x Faster Text Preprocessing #6711

Open · wants to merge 1 commit into main

Conversation

ashvardanian

I was preparing some datasets for AI training and noticed that Hugging Face's datasets library uses the conventional open mechanism to read files and split them into chunks. I thought it could be significantly accelerated, and started with a benchmark:

$ pip install --upgrade --force-reinstall datasets
$ python benchmark_huggingface_datasets.py xlsum.csv 
Generating train split: 1004598 examples [00:47, 21116.16 examples/s]
Time taken to load the dataset: 48.66838526725769 seconds
Time taken to chunk the dataset into parts of size 10000: 0.11466407775878906 seconds
Total time taken: 48.78304934501648 seconds

For the benchmarks I used a large CSV file with mixed UTF-8 content, of the kind most common in modern large-scale pre-training pipelines. I later patched the datasets library to use StringZilla, which resulted in significantly lower memory consumption and a 2.9x throughput improvement on AWS r7iz instances. That's with slow SSDs mounted over the network; performance on local SSDs on something like a DGX-H100 should be even higher:

$ pip install -e .
$ python benchmark_huggingface_datasets.py xlsum.csv 
Generating train split: 1004598 examples [00:15, 64529.90 examples/s]
Time taken to load the dataset: 16.45028805732727 seconds
Time taken to chunk the dataset into parts of size 10000: 0.1291060447692871 seconds
Total time taken: 16.579394102096558 seconds
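
For reference, a rough sketch of what a timing harness along these lines might look like; the builder name ("text"), the chunking loop, and the exact calls are illustrative rather than a copy of the actual script:

import sys
import time

from datasets import load_dataset

path = sys.argv[1]          # e.g. xlsum.csv
chunk_size = 10_000

start = time.time()
dataset = load_dataset("text", data_files=path, split="train")  # builder choice is an assumption
loaded = time.time()
print(f"Time taken to load the dataset: {loaded - start} seconds")

# Split the dataset into contiguous parts of `chunk_size` rows.
chunks = [
    dataset.select(range(offset, min(offset + chunk_size, len(dataset))))
    for offset in range(0, len(dataset), chunk_size)
]
chunked = time.time()
print(f"Time taken to chunk the dataset into parts of size {chunk_size}: {chunked - loaded} seconds")
print(f"Total time taken: {chunked - start} seconds")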

I've already pushed the patches to my fork, and would love to contribute them to the upstream repository.


All the tests pass, but they leave a couple of important questions open. The default Python open(..., newline=None) uses universal newlines, where \n, \r, and \r\n are all converted to \n on the fly. I am not sure whether that is a good idea for a general-purpose dataset-preparation pipeline.
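
For reference, a minimal illustration of that behavior:

# With newline=None (the default), "\r\n" and "\r" are translated to "\n" on read;
# reading the same file in binary mode leaves the bytes untouched.
payload = b"first\r\nsecond\rthird\n"
with open("newlines.txt", "wb") as f:
    f.write(payload)

with open("newlines.txt", "r", newline=None) as f:
    print(repr(f.read()))        # 'first\nsecond\nthird\n'

with open("newlines.txt", "rb") as f:
    print(f.read())              # b'first\r\nsecond\rthird\n'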

I can simulate the same behavior (which I don't yet do) for the "line" splitter; adjusting it for the "paragraph" splitter would be harder. Should we stick exactly to the old Pythonic behavior, or stay closer to how C and other programming languages handle newlines?
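
If we do keep the Pythonic behavior, replicating it at the byte level would look roughly like this (a sketch, not what the patch currently does):

# Normalize "\r\n" and lone "\r" to "\n" before splitting, mirroring newline=None.
# A "paragraph" splitter would need the same normalization before searching for
# blank-line boundaries, which is where it gets more involved.
def split_lines_universal(data: bytes) -> list[bytes]:
    normalized = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return normalized.split(b"\n")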

@lhoestq (Member) left a comment

Hi! Unfortunately, we don't like adding new dependencies to datasets.

Maybe we can improve speed and memory usage with built-in tools instead, like regex finditer?
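
Something along these lines, for example (just a sketch):

import re

# Locate line boundaries with re.finditer and slice the buffer lazily,
# instead of materializing every line up front.
def iter_lines(text: str):
    start = 0
    for match in re.finditer(r"\r\n|\r|\n", text):
        yield text[start:match.start()]
        start = match.end()
    if start < len(text):        # trailing text without a final newline
        yield text[start:]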

@ashvardanian (Author)

Unfortunately, that won't improve the performance. The StringZilla repository has extensive benchmarks comparing it against the built-in functionality of several programming languages. Using re.finditer for tokenization is practically the slowest anti-pattern I've encountered in any language: the gap between that and a SIMD-accelerated kernel can be as wide as 10 MB/s vs 10 GB/s.
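
A rough way to see the gap locally; the stringzilla calls below assume the Str/File wrappers and the splitlines() method its README describes:

import re
import time

import stringzilla as sz

path = "xlsum.csv"               # any large UTF-8 text file

with open(path, "r", encoding="utf-8", errors="ignore") as f:
    text = f.read()

start = time.time()
lines_re = [m.group() for m in re.finditer(r".+", text)]   # non-empty lines only
print(f"re.finditer:     {time.time() - start:.2f} s")

start = time.time()
lines_py = text.splitlines()
print(f"str.splitlines:  {time.time() - start:.2f} s")

start = time.time()
lines_sz = sz.Str(sz.File(path)).splitlines()              # memory-mapped, SIMD-accelerated
print(f"stringzilla:     {time.time() - start:.2f} s")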

I understand the need to keep the dependencies minimal. It helps the package remain small and portable. At this point, StringZilla provides 105 binaries for different OS and hardware versions (more portable than NumPy) and the binary size generally ranges from 50 KB to 250 KB, smaller than a single JPEG.

@mariosasko (Collaborator)

The text builder is not very popular, so I'm also not a fan of introducing a dependency for it.

Moreover, I couldn't find any projects of this size/usage depending on StringZilla (with GitHub search), so we should at least wait for wider adoption before merging this PR.

@danielealbano commented Jun 26, 2024

Moreover, I couldn't find any projects of this size/usage depending on StringZilla (with GitHub search), so we should at least wait for wider adoption before merging this PR.

Meanwhile, I understand that you want to wait for greater adoption; if things change in the future, you would be stuck with an unsupported dependency (although that really applies to everything). Still, the performance improvement is really significant!
I wonder whether it would be worth providing an additional 'datasets.extras' package maintained by Hugging Face that supports these third-party improvements.
It would reduce the risk to the core components and, at the same time, it would definitely help on the performance side!
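
For example, the packaging side could look roughly like this; the extra name and the fallback are purely illustrative:

# setup.py (sketch): expose the accelerated back-end as an optional extra,
# installable with `pip install datasets[stringzilla]`.
extras_require = {
    "stringzilla": ["stringzilla"],
}

# In the text loader (sketch): fall back to the pure-Python path when the
# optional dependency is not installed.
try:
    import stringzilla            # optional, SIMD-accelerated splitting
    HAVE_STRINGZILLA = True
except ImportError:
    HAVE_STRINGZILLA = False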
