Improved the tutorial by adding a link for loading datasets #7042

AmboThom · 2024-07-12T03:49:54Z

Improved the tutorial by letting readers know that datasets can be loaded from common file types, and added a link. I left the local files section alone because its methods are already listed with code snippets.

lhoestq

Thanks!

github-actions · 2024-08-15T10:07:43Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005135 / 0.011353 (-0.006218)	0.003389 / 0.011008 (-0.007619)	0.063053 / 0.038508 (0.024545)	0.031597 / 0.023109 (0.008487)	0.237519 / 0.275898 (-0.038379)	0.263101 / 0.323480 (-0.060379)	0.003109 / 0.007986 (-0.004877)	0.002699 / 0.004328 (-0.001630)	0.048611 / 0.004250 (0.044361)	0.042937 / 0.037052 (0.005884)	0.253760 / 0.258489 (-0.004729)	0.275444 / 0.293841 (-0.018397)	0.028952 / 0.128546 (-0.099594)	0.011837 / 0.075646 (-0.063809)	0.207620 / 0.419271 (-0.211651)	0.035727 / 0.043533 (-0.007806)	0.241770 / 0.255139 (-0.013369)	0.270509 / 0.283200 (-0.012691)	0.020709 / 0.141683 (-0.120974)	1.135722 / 1.452155 (-0.316432)	1.200355 / 1.492716 (-0.292361)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.092555 / 0.018006 (0.074549)	0.284719 / 0.000490 (0.284229)	0.000210 / 0.000200 (0.000010)	0.000049 / 0.000054 (-0.000005)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.018431 / 0.037411 (-0.018980)	0.063618 / 0.014526 (0.049092)	0.075371 / 0.176557 (-0.101185)	0.120982 / 0.737135 (-0.616153)	0.075718 / 0.296338 (-0.220620)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.279439 / 0.215209 (0.064230)	2.722274 / 2.077655 (0.644619)	1.442314 / 1.504120 (-0.061806)	1.323166 / 1.541195 (-0.218029)	1.339642 / 1.468490 (-0.128848)	0.723451 / 4.584777 (-3.861326)	2.334879 / 3.745712 (-1.410833)	2.938745 / 5.269862 (-2.331116)	1.867278 / 4.565676 (-2.698398)	0.078704 / 0.424275 (-0.345571)	0.005128 / 0.007607 (-0.002479)	0.338634 / 0.226044 (0.112589)	3.266239 / 2.268929 (0.997311)	1.815276 / 55.444624 (-53.629349)	1.487158 / 6.876477 (-5.389319)	1.547550 / 2.142072 (-0.594522)	0.804458 / 4.805227 (-4.000769)	0.139186 / 6.500664 (-6.361479)	0.042935 / 0.075469 (-0.032534)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.978223 / 1.841788 (-0.863564)	11.350997 / 8.074308 (3.276689)	10.082980 / 10.191392 (-0.108412)	0.145067 / 0.680424 (-0.535357)	0.014132 / 0.534201 (-0.520069)	0.302162 / 0.579283 (-0.277121)	0.264603 / 0.434364 (-0.169761)	0.338466 / 0.540337 (-0.201871)	0.427891 / 1.386936 (-0.959045)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006078 / 0.011353 (-0.005275)	0.004030 / 0.011008 (-0.006978)	0.051646 / 0.038508 (0.013138)	0.031263 / 0.023109 (0.008154)	0.279437 / 0.275898 (0.003539)	0.304489 / 0.323480 (-0.018991)	0.004553 / 0.007986 (-0.003433)	0.002869 / 0.004328 (-0.001459)	0.050638 / 0.004250 (0.046387)	0.041091 / 0.037052 (0.004038)	0.290681 / 0.258489 (0.032192)	0.332059 / 0.293841 (0.038218)	0.033353 / 0.128546 (-0.095193)	0.012506 / 0.075646 (-0.063141)	0.061788 / 0.419271 (-0.357484)	0.034150 / 0.043533 (-0.009382)	0.278258 / 0.255139 (0.023119)	0.298084 / 0.283200 (0.014885)	0.019106 / 0.141683 (-0.122577)	1.164475 / 1.452155 (-0.287679)	1.204804 / 1.492716 (-0.287912)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.100053 / 0.018006 (0.082047)	0.301255 / 0.000490 (0.300765)	0.000220 / 0.000200 (0.000020)	0.000057 / 0.000054 (0.000003)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.023536 / 0.037411 (-0.013876)	0.078513 / 0.014526 (0.063987)	0.090281 / 0.176557 (-0.086276)	0.129607 / 0.737135 (-0.607528)	0.090742 / 0.296338 (-0.205596)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.304082 / 0.215209 (0.088873)	2.909401 / 2.077655 (0.831747)	1.587210 / 1.504120 (0.083090)	1.458713 / 1.541195 (-0.082482)	1.472579 / 1.468490 (0.004089)	0.716542 / 4.584777 (-3.868235)	0.947557 / 3.745712 (-2.798155)	2.908044 / 5.269862 (-2.361817)	1.886382 / 4.565676 (-2.679294)	0.078105 / 0.424275 (-0.346170)	0.005802 / 0.007607 (-0.001805)	0.357883 / 0.226044 (0.131839)	3.490958 / 2.268929 (1.222029)	1.946574 / 55.444624 (-53.498050)	1.645167 / 6.876477 (-5.231310)	1.649242 / 2.142072 (-0.492830)	0.796864 / 4.805227 (-4.008363)	0.134206 / 6.500664 (-6.366458)	0.041439 / 0.075469 (-0.034030)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.012311 / 1.841788 (-0.829477)	12.396967 / 8.074308 (4.322659)	10.382494 / 10.191392 (0.191102)	0.157395 / 0.680424 (-0.523029)	0.015154 / 0.534201 (-0.519047)	0.302209 / 0.579283 (-0.277074)	0.127430 / 0.434364 (-0.306934)	0.348933 / 0.540337 (-0.191404)	0.442930 / 1.386936 (-0.944006)
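
Each cell in the benchmark tables above follows the `new / old (diff)` layout named in the first column, and the diff appears to be `new - old`, so a negative diff means the new measurement is smaller than the old one. A minimal sketch of how such a row could be parsed and sanity-checked is shown here. The helper names (`parse_cell`, `is_consistent`, `parse_row`) and the tolerance are illustrative assumptions, not part of the benchmark tooling; the sample row is copied from the `benchmark_getitem_100B.json` table for PyArrow==8.0.0.

```python
import re

# One table cell looks like "0.092555 / 0.018006 (0.074549)".
# The layout is inferred from the "new / old (diff)" label in the tables.
CELL_RE = re.compile(r"^\s*([-\d.]+)\s*/\s*([-\d.]+)\s*\(([-\d.]+)\)\s*$")


def parse_cell(cell):
    """Return (new, old, diff) as floats for one 'new / old (diff)' cell."""
    match = CELL_RE.match(cell)
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


def is_consistent(cell, tol=2e-6):
    """Check that the printed diff equals new - old.

    The tolerance is an assumption: values are printed with six decimals,
    so the printed diff can be off by one unit in the last place.
    """
    new, old, diff = parse_cell(cell)
    return abs((new - old) - diff) <= tol


def parse_row(header_line, value_line):
    """Map each metric name to its (new, old, diff) tuple.

    Both lines are tab-separated; the first column holds the row label
    ("metric" / "new / old (diff)") and is skipped.
    """
    names = header_line.split("\t")[1:]
    cells = value_line.split("\t")[1:]
    return {name: parse_cell(cell) for name, cell in zip(names, cells)}


# Sample row taken from the benchmark_getitem_100B.json table above.
HEADER = (
    "metric\tget_batch_of_1024_random_rows\tget_batch_of_1024_rows"
    "\tget_first_row\tget_last_row"
)
VALUES = (
    "new / old (diff)\t0.092555 / 0.018006 (0.074549)"
    "\t0.284719 / 0.000490 (0.284229)"
    "\t0.000210 / 0.000200 (0.000010)"
    "\t0.000049 / 0.000054 (-0.000005)"
)

row = parse_row(HEADER, VALUES)
for name, (new, old, diff) in row.items():
    direction = "smaller" if diff < 0 else "larger or equal"
    print(f"{name}: new={new} old={old} diff={diff} (new is {direction})")
```

Since this PR only touches documentation, the differences in these tables most likely reflect run-to-run variation rather than an effect of the change.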

Improved the tutorial by adding a link for loading datasets

9e33303

lhoestq approved these changes Aug 15, 2024

lhoestq merged commit 69d9f45 into huggingface:main Aug 15, 2024
