[docs] Compress data files #5691

stevhliu · 2023-03-31T17:17:26Z

This PR addresses the comments in #5687 about compressing files with text extensions before uploading them to the Hub. It also clarifies what "too large" means, based on the Git LFS docs.
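
As a rough illustration of the recommendation, a text data file could be gzip-compressed with the Python standard library before it is uploaded. This is only a minimal sketch: the helper name `gzip_file` and the file name in the usage note are hypothetical, and this is not necessarily the approach described in the docs changed by this PR.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(src: str) -> Path:
    """Write a gzip-compressed copy of `src` next to it and return the new path.

    Hypothetical helper for illustration; the original file is left untouched.
    """
    src_path = Path(src)
    # e.g. "train.csv" -> "train.csv.gz"
    dst_path = src_path.with_name(src_path.name + ".gz")
    # Stream the bytes so large files are not loaded into memory at once.
    with open(src_path, "rb") as f_in, gzip.open(dst_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst_path
```

For example, `gzip_file("train.csv")` would write `train.csv.gz` in the same directory, assuming a local `train.csv` exists.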

HuggingFaceDocBuilderDev · 2023-03-31T17:23:08Z

The documentation is not available anymore as the PR was closed or merged.

albertvillanova

Thanks a lot, @stevhliu, for the update on audio/image file extensions and the recommendation to compress text files (this will also have a positive impact on download time).

There is, however, some confusion about the size limits.

According to the GitHub docs, there are two limits on file sizes:
- One for plain "Git" files (not tracked by Git-LFS): up to 100 MB (see: https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-large-files-on-github#file-size-limits)
- One for Git-LFS files: up to 5 GB for GitHub Enterprise Cloud (see: https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-git-large-file-storage#about-git-large-file-storage)

However, the limits above are enforced when pushing to GitHub. For our Hugging Face Hub, I think the size limits are different. I guess the comment you added should refer to the 100 MB file size limit (for files not tracked by Git-LFS) instead of the 5 GB limit (for files tracked by Git-LFS). This should be confirmed.

stevhliu · 2023-04-03T17:14:26Z

Confirmed with the Hub team that the file size limit for the Hugging Face Hub is 10 MB :)
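
To make the limit concrete, here is a small sketch that lists files in a local dataset folder whose size exceeds a threshold. The 10 MB default mirrors the figure quoted above; reading it as 10 * 1000 * 1000 bytes is an assumption, and the function name `files_over_limit` is hypothetical.

```python
from pathlib import Path

# Assumption: "10 MB" is read as decimal megabytes; adjust if the Hub counts differently.
SIZE_LIMIT_BYTES = 10 * 1000 * 1000


def files_over_limit(root: str, limit: int = SIZE_LIMIT_BYTES) -> list[Path]:
    """Return all regular files under `root` that are larger than `limit` bytes."""
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.stat().st_size > limit
    )
```

Files reported by such a check are candidates for compression or Git-LFS tracking before pushing.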

albertvillanova

Thanks for the PR and also for the confirmation of file size limits. Great!

github-actions · 2023-04-19T07:33:01Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006789 / 0.011353 (-0.004564)	0.004935 / 0.011008 (-0.006073)	0.096796 / 0.038508 (0.058288)	0.032485 / 0.023109 (0.009376)	0.335342 / 0.275898 (0.059444)	0.354999 / 0.323480 (0.031519)	0.005467 / 0.007986 (-0.002519)	0.005267 / 0.004328 (0.000939)	0.073988 / 0.004250 (0.069737)	0.044402 / 0.037052 (0.007350)	0.331156 / 0.258489 (0.072666)	0.363595 / 0.293841 (0.069754)	0.035301 / 0.128546 (-0.093245)	0.012141 / 0.075646 (-0.063505)	0.333164 / 0.419271 (-0.086107)	0.048818 / 0.043533 (0.005286)	0.331458 / 0.255139 (0.076319)	0.343567 / 0.283200 (0.060367)	0.094963 / 0.141683 (-0.046720)	1.444383 / 1.452155 (-0.007772)	1.520093 / 1.492716 (0.027377)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.212311 / 0.018006 (0.194305)	0.436413 / 0.000490 (0.435923)	0.000333 / 0.000200 (0.000133)	0.000057 / 0.000054 (0.000003)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.026670 / 0.037411 (-0.010742)	0.105774 / 0.014526 (0.091248)	0.115796 / 0.176557 (-0.060760)	0.176504 / 0.737135 (-0.560631)	0.121883 / 0.296338 (-0.174456)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.400783 / 0.215209 (0.185574)	4.006608 / 2.077655 (1.928953)	1.817659 / 1.504120 (0.313539)	1.619777 / 1.541195 (0.078582)	1.684247 / 1.468490 (0.215757)	0.701116 / 4.584777 (-3.883661)	3.684056 / 3.745712 (-0.061656)	2.065258 / 5.269862 (-3.204603)	1.425460 / 4.565676 (-3.140217)	0.084519 / 0.424275 (-0.339757)	0.011949 / 0.007607 (0.004342)	0.496793 / 0.226044 (0.270749)	4.978864 / 2.268929 (2.709935)	2.303388 / 55.444624 (-53.141237)	1.978341 / 6.876477 (-4.898135)	2.055744 / 2.142072 (-0.086329)	0.832022 / 4.805227 (-3.973206)	0.164715 / 6.500664 (-6.335949)	0.062701 / 0.075469 (-0.012768)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.178723 / 1.841788 (-0.663065)	14.583986 / 8.074308 (6.509678)	14.189402 / 10.191392 (3.998010)	0.183867 / 0.680424 (-0.496557)	0.017565 / 0.534201 (-0.516636)	0.421345 / 0.579283 (-0.157938)	0.420235 / 0.434364 (-0.014129)	0.496758 / 0.540337 (-0.043580)	0.591558 / 1.386936 (-0.795378)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007019 / 0.011353 (-0.004334)	0.004996 / 0.011008 (-0.006012)	0.073345 / 0.038508 (0.034836)	0.033077 / 0.023109 (0.009968)	0.335954 / 0.275898 (0.060056)	0.372616 / 0.323480 (0.049136)	0.005678 / 0.007986 (-0.002308)	0.003906 / 0.004328 (-0.000423)	0.072841 / 0.004250 (0.068591)	0.046829 / 0.037052 (0.009777)	0.335177 / 0.258489 (0.076688)	0.382862 / 0.293841 (0.089021)	0.038406 / 0.128546 (-0.090141)	0.012110 / 0.075646 (-0.063536)	0.085796 / 0.419271 (-0.333476)	0.049896 / 0.043533 (0.006363)	0.338232 / 0.255139 (0.083093)	0.361054 / 0.283200 (0.077855)	0.103171 / 0.141683 (-0.038512)	1.556692 / 1.452155 (0.104538)	1.540023 / 1.492716 (0.047306)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.223705 / 0.018006 (0.205699)	0.438771 / 0.000490 (0.438282)	0.002838 / 0.000200 (0.002639)	0.000081 / 0.000054 (0.000026)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.028423 / 0.037411 (-0.008988)	0.110560 / 0.014526 (0.096035)	0.121629 / 0.176557 (-0.054928)	0.173638 / 0.737135 (-0.563498)	0.127062 / 0.296338 (-0.169277)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.425806 / 0.215209 (0.210597)	4.251051 / 2.077655 (2.173397)	2.059735 / 1.504120 (0.555615)	1.864886 / 1.541195 (0.323692)	1.941553 / 1.468490 (0.473063)	0.700084 / 4.584777 (-3.884693)	3.753150 / 3.745712 (0.007438)	3.218606 / 5.269862 (-2.051256)	1.439648 / 4.565676 (-3.126028)	0.085239 / 0.424275 (-0.339037)	0.012026 / 0.007607 (0.004419)	0.521564 / 0.226044 (0.295520)	5.217902 / 2.268929 (2.948973)	2.557831 / 55.444624 (-52.886793)	2.240223 / 6.876477 (-4.636254)	2.364664 / 2.142072 (0.222591)	0.825884 / 4.805227 (-3.979343)	0.167800 / 6.500664 (-6.332864)	0.063552 / 0.075469 (-0.011917)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.255532 / 1.841788 (-0.586256)	14.747783 / 8.074308 (6.673475)	14.352263 / 10.191392 (4.160871)	0.143659 / 0.680424 (-0.536765)	0.017517 / 0.534201 (-0.516684)	0.419863 / 0.579283 (-0.159421)	0.416674 / 0.434364 (-0.017690)	0.485694 / 0.540337 (-0.054643)	0.584810 / 1.386936 (-0.802126)

add doc about file extensions

3e178a4

stevhliu requested a review from albertvillanova March 31, 2023 17:17

albertvillanova reviewed Apr 3, 2023

update size limit

c48efef

stevhliu requested a review from albertvillanova April 11, 2023 16:57

albertvillanova linked an issue Apr 19, 2023 that may be closed by this pull request

Document to compress data files before uploading #5687

Closed

albertvillanova approved these changes Apr 19, 2023

albertvillanova merged commit 61db0e9 into huggingface:main Apr 19, 2023

stevhliu deleted the compress-data branch April 19, 2023 13:37
