Tutorial for creating a dataset #5540

stevhliu · 2023-02-16T22:09:35Z

A tutorial for creating datasets with the folder-based builders and the from_dict and from_generator methods. I've also mentioned loading scripts as a next step, but I think we should keep the tutorial focused on the low-code methods. Let me know what you think! 🙂
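For readers unfamiliar with the two low-code methods named above, here is a minimal sketch of what they look like in practice. It is not taken from the tutorial itself: the sample data, the column names, and the helper `build_datasets` are hypothetical and exist only for illustration. It assumes the `datasets` library is installed.

```python
# Hypothetical sample data; the column names are illustrative only.
data = {"text": ["good", "bad", "fine"], "label": [1, 0, 1]}


def gen():
    """Yield one example (a dict mapping column name to value) at a time."""
    for text, label in zip(data["text"], data["label"]):
        yield {"text": text, "label": label}


def build_datasets():
    """Build the same dataset two ways (hypothetical helper, needs `datasets`)."""
    # Imported lazily so the sample data above can be inspected
    # even when the library is not installed.
    from datasets import Dataset

    ds_from_dict = Dataset.from_dict(data)     # columns passed as lists
    ds_from_gen = Dataset.from_generator(gen)  # rows yielded one by one
    return ds_from_dict, ds_from_gen
```

Calling `build_datasets()` returns two `Dataset` objects with the same rows. The generator form is usually the better fit when the examples do not fit comfortably in memory as a single dictionary.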

HuggingFaceDocBuilderDev · 2023-02-16T22:14:10Z

The documentation is not available anymore as the PR was closed or merged.

lhoestq

Thanks! A few comments:

docs/source/create_dataset.mdx

lhoestq

Thanks!

docs/source/create_dataset.mdx

mariosasko

LGTM!

github-actions · 2023-02-17T18:50:46Z

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.012018 / 0.011353 (0.000665)	0.006204 / 0.011008 (-0.004804)	0.134119 / 0.038508 (0.095611)	0.038436 / 0.023109 (0.015327)	0.381397 / 0.275898 (0.105499)	0.456362 / 0.323480 (0.132882)	0.009826 / 0.007986 (0.001840)	0.004746 / 0.004328 (0.000417)	0.103755 / 0.004250 (0.099505)	0.043867 / 0.037052 (0.006815)	0.395322 / 0.258489 (0.136833)	0.475812 / 0.293841 (0.181971)	0.057865 / 0.128546 (-0.070682)	0.019919 / 0.075646 (-0.055727)	0.465343 / 0.419271 (0.046072)	0.061574 / 0.043533 (0.018041)	0.371668 / 0.255139 (0.116529)	0.400375 / 0.283200 (0.117176)	0.106539 / 0.141683 (-0.035144)	1.822931 / 1.452155 (0.370776)	1.875535 / 1.492716 (0.382819)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.013583 / 0.018006 (-0.004423)	0.535515 / 0.000490 (0.535025)	0.007920 / 0.000200 (0.007720)	0.000305 / 0.000054 (0.000250)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030204 / 0.037411 (-0.007207)	0.131671 / 0.014526 (0.117145)	0.143977 / 0.176557 (-0.032579)	0.175498 / 0.737135 (-0.561637)	0.166134 / 0.296338 (-0.130204)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.630995 / 0.215209 (0.415786)	6.152275 / 2.077655 (4.074620)	2.519887 / 1.504120 (1.015767)	2.110926 / 1.541195 (0.569732)	2.207555 / 1.468490 (0.739064)	1.296197 / 4.584777 (-3.288580)	5.510619 / 3.745712 (1.764906)	3.167468 / 5.269862 (-2.102394)	2.043924 / 4.565676 (-2.521753)	0.144772 / 0.424275 (-0.279503)	0.014456 / 0.007607 (0.006848)	0.783629 / 0.226044 (0.557585)	7.836962 / 2.268929 (5.568033)	3.248593 / 55.444624 (-52.196032)	2.577092 / 6.876477 (-4.299385)	2.671918 / 2.142072 (0.529846)	1.471586 / 4.805227 (-3.333641)	0.251391 / 6.500664 (-6.249273)	0.091947 / 0.075469 (0.016478)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.594839 / 1.841788 (-0.246949)	18.250630 / 8.074308 (10.176322)	23.948781 / 10.191392 (13.757389)	0.275505 / 0.680424 (-0.404919)	0.045202 / 0.534201 (-0.488999)	0.545552 / 0.579283 (-0.033731)	0.639352 / 0.434364 (0.204989)	0.666345 / 0.540337 (0.126008)	0.795614 / 1.386936 (-0.591322)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.011234 / 0.011353 (-0.000119)	0.005983 / 0.011008 (-0.005025)	0.109144 / 0.038508 (0.070636)	0.036070 / 0.023109 (0.012961)	0.429313 / 0.275898 (0.153415)	0.490615 / 0.323480 (0.167135)	0.007448 / 0.007986 (-0.000538)	0.004424 / 0.004328 (0.000095)	0.097100 / 0.004250 (0.092850)	0.049719 / 0.037052 (0.012667)	0.412719 / 0.258489 (0.154230)	0.485717 / 0.293841 (0.191876)	0.061168 / 0.128546 (-0.067378)	0.021510 / 0.075646 (-0.054136)	0.116598 / 0.419271 (-0.302673)	0.066116 / 0.043533 (0.022583)	0.426212 / 0.255139 (0.171073)	0.448368 / 0.283200 (0.165168)	0.116003 / 0.141683 (-0.025680)	1.799329 / 1.452155 (0.347175)	1.967256 / 1.492716 (0.474540)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.214893 / 0.018006 (0.196887)	0.497843 / 0.000490 (0.497354)	0.000464 / 0.000200 (0.000264)	0.000094 / 0.000054 (0.000039)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.031758 / 0.037411 (-0.005653)	0.131182 / 0.014526 (0.116656)	0.141251 / 0.176557 (-0.035305)	0.186526 / 0.737135 (-0.550609)	0.142975 / 0.296338 (-0.153363)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.662094 / 0.215209 (0.446885)	6.664841 / 2.077655 (4.587186)	2.690613 / 1.504120 (1.186493)	2.305399 / 1.541195 (0.764205)	2.383697 / 1.468490 (0.915207)	1.280692 / 4.584777 (-3.304085)	5.629215 / 3.745712 (1.883503)	5.007083 / 5.269862 (-0.262778)	2.482163 / 4.565676 (-2.083513)	0.147662 / 0.424275 (-0.276613)	0.017770 / 0.007607 (0.010163)	0.818380 / 0.226044 (0.592335)	8.006521 / 2.268929 (5.737592)	3.472262 / 55.444624 (-51.972363)	2.709550 / 6.876477 (-4.166926)	2.775138 / 2.142072 (0.633066)	1.570545 / 4.805227 (-3.234683)	0.266323 / 6.500664 (-6.234341)	0.090591 / 0.075469 (0.015122)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.657927 / 1.841788 (-0.183861)	18.448981 / 8.074308 (10.374673)	20.336909 / 10.191392 (10.145517)	0.230322 / 0.680424 (-0.450102)	0.025972 / 0.534201 (-0.508229)	0.561361 / 0.579283 (-0.017922)	0.623758 / 0.434364 (0.189394)	0.664120 / 0.540337 (0.123783)	0.763144 / 1.386936 (-0.623792)

first draft of tutorial

02698b4

stevhliu requested review from albertvillanova, polinaeterna, lhoestq and mariosasko February 16, 2023 22:09

lhoestq reviewed Feb 17, 2023

docs/source/create_dataset.mdx (outdated, resolved)

docs/source/create_dataset.mdx (outdated, resolved)

apply feedbacks

efa3d30

lhoestq approved these changes Feb 17, 2023

docs/source/create_dataset.mdx (resolved)

add import iterabledataset

724be72

mariosasko approved these changes Feb 17, 2023

stevhliu merged commit 29de617 into huggingface:main Feb 17, 2023

stevhliu deleted the create-ds-tutorial branch February 17, 2023 18:41

AJDERS pushed a commit to AJDERS/datasets that referenced this pull request Feb 21, 2023

Tutorial for creating a dataset (huggingface#5540)

ca58769

* first draft of tutorial * apply feedbacks * add import iterabledataset

AJDERS added a commit to AJDERS/datasets that referenced this pull request Feb 21, 2023

Revert "Tutorial for creating a dataset (huggingface#5540)"

19bfcc6

This reverts commit ca58769.
