I was preparing some datasets for AI training and noticed that `datasets` by HuggingFace uses the conventional `open` mechanism to read the file and split it into chunks. I thought it could be significantly accelerated, and started with a benchmark:

```sh
$ pip install --upgrade --force-reinstall datasets
$ python benchmark_huggingface_datasets.py xlsum.csv
Generating train split: 1004598 examples [00:47, 21116.16 examples/s]
Time taken to load the dataset: 48.66838526725769 seconds
Time taken to chunk the dataset into parts of size 10000: 0.11466407775878906 seconds
Total time taken: 48.78304934501648 seconds
```
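The benchmark script itself isn't reproduced above; a minimal sketch that produces comparable timings could look like the following, using only the public `datasets` API. The script structure and chunking strategy are illustrative assumptions, not the exact file I ran:

```python
import sys
import time

from datasets import load_dataset

# Load a CSV with the stock `datasets` loader, then slice it into fixed-size chunks.
path = sys.argv[1] if len(sys.argv) > 1 else "xlsum.csv"
chunk_size = 10_000

start = time.time()
dataset = load_dataset("csv", data_files=path, split="train")
loaded = time.time()
print(f"Time taken to load the dataset: {loaded - start} seconds")

chunks = [
    dataset.select(range(offset, min(offset + chunk_size, len(dataset))))
    for offset in range(0, len(dataset), chunk_size)
]
done = time.time()
print(f"Time taken to chunk the dataset into parts of size {chunk_size}: {done - loaded} seconds")
print(f"Total time taken: {done - start} seconds")
```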
For benchmarks I've used a large CSV file with mixed UTF-8 content, most common in modern large-scale pre-training pipelines. I've later patched the `datasets` library to use `stringzilla`, which resulted in significantly lower memory consumption and a 2.9x throughput improvement on AWS `r7iz` instances. That's using slow SSDs mounted over the network. Performance on local SSDs on something like a DGX-H100 should be even higher.

I've already pushed the patches to my fork, and would love to contribute them to the upstream repository.
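To give a sense of the direction of the change, here is a rough sketch of how `stringzilla` can memory-map a file and split it without per-line allocations, assuming the `Str`/`File` API from the PyPI package; the actual patch touches the `datasets` loading internals and is more involved:

```python
from stringzilla import File, Str

# Memory-map the file instead of reading it into a Python `str`;
# `Str` exposes string operations directly over the mapped bytes.
text = Str(File("xlsum.csv"))

# Split into lines: each element is a view into the mapping,
# not a separately allocated Python string.
lines = text.split(separator="\n")
print(len(lines))
```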
All the tests pass, but they leave a couple of important questions open. The default Python `open(..., newline=None)` uses universal newlines, where `\n`, `\r`, and `\r\n` are all converted to `\n` on the fly. I am not sure that is a good idea for a general-purpose dataset-preparation pipeline. I can simulate the same behavior (which I don't yet do) for the `"line"` splitter. Adjusting it for the `"paragraph"` splitter would be harder. Should we stick exactly to the old Pythonic behavior, or stay closer to how C and other programming languages handle newlines?