Add OSCAR dataset card (#1833)
* add oscar dataset card

* typo

* add oscar dataset card

* Update datasets/oscar/README.md

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* Update datasets/oscar/README.md

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* Update datasets/oscar/README.md

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* Update datasets/oscar/README.md

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* Update datasets/oscar/README.md

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* Merged tables and fixed typos

* Update sizes of deduplicated configs in the table

Co-authored-by: Quentin Lhoest <lhoest.q@gmail.com>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
3 people authored Feb 12, 2021
1 parent 2adc0ac commit f9df773
Showing 1 changed file with 5,755 additions and 1 deletion.
1 comment on commit f9df773

@github-actions
Show benchmarks

PyArrow==0.17.1

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.020511 / 0.011353 (0.009158) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.016439 / 0.011008 (0.005431) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.047983 / 0.038508 (0.009475) |
| read_batch_unformated after write_array2d | 0.034504 / 0.023109 (0.011395) |
| read_batch_unformated after write_flattened_sequence | 0.223612 / 0.275898 (-0.052286) |
| read_batch_unformated after write_nested_sequence | 0.260805 / 0.323480 (-0.062675) |
| read_col_formatted_as_numpy after write_array2d | 0.009623 / 0.007986 (0.001637) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004865 / 0.004328 (0.000537) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.006892 / 0.004250 (0.002642) |
| read_col_unformated after write_array2d | 0.047471 / 0.037052 (0.010419) |
| read_col_unformated after write_flattened_sequence | 0.221516 / 0.258489 (-0.036973) |
| read_col_unformated after write_nested_sequence | 0.258921 / 0.293841 (-0.034920) |
| read_formatted_as_numpy after write_array2d | 0.163036 / 0.128546 (0.034490) |
| read_formatted_as_numpy after write_flattened_sequence | 0.133672 / 0.075646 (0.058026) |
| read_formatted_as_numpy after write_nested_sequence | 0.465309 / 0.419271 (0.046038) |
| read_unformated after write_array2d | 0.459175 / 0.043533 (0.415642) |
| read_unformated after write_flattened_sequence | 0.219502 / 0.255139 (-0.035637) |
| read_unformated after write_nested_sequence | 0.264413 / 0.283200 (-0.018787) |
| write_array2d | 1.818752 / 0.141683 (1.677069) |
| write_flattened_sequence | 1.952732 / 1.452155 (0.500577) |
| write_nested_sequence | 2.003916 / 1.492716 (0.511200) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.042036 / 0.037411 (0.004624) |
| shard | 0.020820 / 0.014526 (0.006295) |
| shuffle | 0.028230 / 0.176557 (-0.148326) |
| sort | 0.047460 / 0.737135 (-0.689676) |
| train_test_split | 0.048125 / 0.296338 (-0.248213) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.275295 / 0.215209 (0.060086) |
| read 50000 | 2.934191 / 2.077655 (0.856536) |
| read_batch 50000 10 | 1.491390 / 1.504120 (-0.012730) |
| read_batch 50000 100 | 1.351775 / 1.541195 (-0.189420) |
| read_batch 50000 1000 | 1.435240 / 1.468490 (-0.033250) |
| read_formatted numpy 5000 | 7.576647 / 4.584777 (2.991870) |
| read_formatted pandas 5000 | 6.621372 / 3.745712 (2.875660) |
| read_formatted tensorflow 5000 | 9.219156 / 5.269862 (3.949295) |
| read_formatted torch 5000 | 8.155026 / 4.565676 (3.589349) |
| read_formatted_batch numpy 5000 10 | 0.748266 / 0.424275 (0.323991) |
| read_formatted_batch numpy 5000 1000 | 0.011427 / 0.007607 (0.003820) |
| shuffled read 5000 | 0.342845 / 0.226044 (0.116800) |
| shuffled read 50000 | 3.397014 / 2.268929 (1.128086) |
| shuffled read_batch 50000 10 | 1.988188 / 55.444624 (-53.456436) |
| shuffled read_batch 50000 100 | 1.677527 / 6.876477 (-5.198949) |
| shuffled read_batch 50000 1000 | 1.693024 / 2.142072 (-0.449049) |
| shuffled read_formatted numpy 5000 | 7.611559 / 4.805227 (2.806331) |
| shuffled read_formatted_batch numpy 5000 10 | 6.213084 / 6.500664 (-0.287580) |
| shuffled read_formatted_batch numpy 5000 1000 | 6.874019 / 0.075469 (6.798550) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 12.067157 / 1.841788 (10.225369) |
| map fast-tokenizer batched | 16.000691 / 8.074308 (7.926383) |
| map identity | 25.756312 / 10.191392 (15.564920) |
| map identity batched | 0.513365 / 0.680424 (-0.167059) |
| map no-op batched | 0.335310 / 0.534201 (-0.198891) |
| map no-op batched numpy | 0.936280 / 0.579283 (0.356997) |
| map no-op batched pandas | 0.700977 / 0.434364 (0.266613) |
| map no-op batched pytorch | 0.769123 / 0.540337 (0.228786) |
| map no-op batched tensorflow | 1.671362 / 1.386936 (0.284426) |

PyArrow==1.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.019157 / 0.011353 (0.007805) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.015521 / 0.011008 (0.004512) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.046191 / 0.038508 (0.007683) |
| read_batch_unformated after write_array2d | 0.034163 / 0.023109 (0.011054) |
| read_batch_unformated after write_flattened_sequence | 0.345404 / 0.275898 (0.069506) |
| read_batch_unformated after write_nested_sequence | 0.380235 / 0.323480 (0.056755) |
| read_col_formatted_as_numpy after write_array2d | 0.009471 / 0.007986 (0.001486) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005088 / 0.004328 (0.000760) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.006761 / 0.004250 (0.002510) |
| read_col_unformated after write_array2d | 0.047481 / 0.037052 (0.010428) |
| read_col_unformated after write_flattened_sequence | 0.348101 / 0.258489 (0.089612) |
| read_col_unformated after write_nested_sequence | 0.395925 / 0.293841 (0.102084) |
| read_formatted_as_numpy after write_array2d | 0.160053 / 0.128546 (0.031507) |
| read_formatted_as_numpy after write_flattened_sequence | 0.131553 / 0.075646 (0.055906) |
| read_formatted_as_numpy after write_nested_sequence | 0.458888 / 0.419271 (0.039617) |
| read_unformated after write_array2d | 0.449326 / 0.043533 (0.405794) |
| read_unformated after write_flattened_sequence | 0.354657 / 0.255139 (0.099518) |
| read_unformated after write_nested_sequence | 0.385237 / 0.283200 (0.102037) |
| write_array2d | 1.816263 / 0.141683 (1.674580) |
| write_flattened_sequence | 1.913135 / 1.452155 (0.460980) |
| write_nested_sequence | 2.035098 / 1.492716 (0.542381) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.044102 / 0.037411 (0.006691) |
| shard | 0.021560 / 0.014526 (0.007035) |
| shuffle | 0.038639 / 0.176557 (-0.137917) |
| sort | 0.056491 / 0.737135 (-0.680644) |
| train_test_split | 0.029547 / 0.296338 (-0.266791) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.371256 / 0.215209 (0.156047) |
| read 50000 | 3.763800 / 2.077655 (1.686146) |
| read_batch 50000 10 | 2.175321 / 1.504120 (0.671201) |
| read_batch 50000 100 | 1.943748 / 1.541195 (0.402554) |
| read_batch 50000 1000 | 1.995813 / 1.468490 (0.527323) |
| read_formatted numpy 5000 | 7.261890 / 4.584777 (2.677113) |
| read_formatted pandas 5000 | 6.292724 / 3.745712 (2.547012) |
| read_formatted tensorflow 5000 | 9.012258 / 5.269862 (3.742396) |
| read_formatted torch 5000 | 7.811911 / 4.565676 (3.246234) |
| read_formatted_batch numpy 5000 10 | 0.727577 / 0.424275 (0.303302) |
| read_formatted_batch numpy 5000 1000 | 0.011226 / 0.007607 (0.003619) |
| shuffled read 5000 | 0.424181 / 0.226044 (0.198137) |
| shuffled read 50000 | 4.280841 / 2.268929 (2.011912) |
| shuffled read_batch 50000 10 | 2.706042 / 55.444624 (-52.738583) |
| shuffled read_batch 50000 100 | 2.335405 / 6.876477 (-4.541072) |
| shuffled read_batch 50000 1000 | 2.405261 / 2.142072 (0.263189) |
| shuffled read_formatted numpy 5000 | 7.345899 / 4.805227 (2.540672) |
| shuffled read_formatted_batch numpy 5000 10 | 5.113989 / 6.500664 (-1.386675) |
| shuffled read_formatted_batch numpy 5000 1000 | 8.007380 / 0.075469 (7.931911) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 12.600367 / 1.841788 (10.758580) |
| map fast-tokenizer batched | 14.978017 / 8.074308 (6.903709) |
| map identity | 25.415928 / 10.191392 (15.224536) |
| map identity batched | 0.862584 / 0.680424 (0.182160) |
| map no-op batched | 0.627171 / 0.534201 (0.092970) |
| map no-op batched numpy | 0.828762 / 0.579283 (0.249479) |
| map no-op batched pandas | 0.657041 / 0.434364 (0.222677) |
| map no-op batched pytorch | 0.761937 / 0.540337 (0.221600) |
| map no-op batched tensorflow | 1.661794 / 1.386936 (0.274858) |
