Commit

Release: 2.8.0 (#5375)
lhoestq authored Dec 19, 2022
1 parent 4403d06 commit 037c9b5
Showing 2 changed files with 2 additions and 2 deletions.
2 changes: 1 addition & 1 deletion setup.py
@@ -239,7 +239,7 @@

setup(
name="datasets",
- version="2.7.1.dev0", # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
+ version="2.8.0", # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
description="HuggingFace community-driven open-source library of datasets",
long_description=open("README.md", encoding="utf-8").read(),
long_description_content_type="text/markdown",
2 changes: 1 addition & 1 deletion src/datasets/__init__.py
@@ -17,7 +17,7 @@
# pylint: enable=line-too-long
# pylint: disable=g-import-not-at-top,g-bad-import-order,wrong-import-position

- __version__ = "2.7.1.dev0"
+ __version__ = "2.8.0"

import platform
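
The comment in setup.py describes the accepted version formats (x.y.z, x.y.z.dev0, x.y.z.rc1, with dots and no dashes). A minimal sketch of how that rule could be checked before a release is shown below. The helper name `is_valid_version` and the regular expression are illustrative assumptions, not part of the datasets codebase, and the pattern assumes any non-negative integer may follow `dev` or `rc`.

```python
import re

# Hypothetical pattern for the formats named in the setup.py comment:
# x.y.z, x.y.z.devN, or x.y.z.rcN (dots only, no dashes).
VERSION_RE = re.compile(r"\d+\.\d+\.\d+(?:\.(?:dev|rc)\d+)?")


def is_valid_version(version: str) -> bool:
    """Return True if the whole string matches one of the accepted formats."""
    return VERSION_RE.fullmatch(version) is not None


# The two strings touched by this commit, plus a dashed form that should fail.
for candidate in ("2.7.1.dev0", "2.8.0", "2.8.0-rc1"):
    print(candidate, is_valid_version(candidate))
```

The same check could be applied to both setup.py and src/datasets/__init__.py, since the commit changes the version string in each.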


2 comments on commit 037c9b5

@github-actions
PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

metric | read_batch_formatted_as_numpy after write_array2d | read_batch_formatted_as_numpy after write_flattened_sequence | read_batch_formatted_as_numpy after write_nested_sequence | read_batch_unformated after write_array2d | read_batch_unformated after write_flattened_sequence | read_batch_unformated after write_nested_sequence | read_col_formatted_as_numpy after write_array2d | read_col_formatted_as_numpy after write_flattened_sequence | read_col_formatted_as_numpy after write_nested_sequence | read_col_unformated after write_array2d | read_col_unformated after write_flattened_sequence | read_col_unformated after write_nested_sequence | read_formatted_as_numpy after write_array2d | read_formatted_as_numpy after write_flattened_sequence | read_formatted_as_numpy after write_nested_sequence | read_unformated after write_array2d | read_unformated after write_flattened_sequence | read_unformated after write_nested_sequence | write_array2d | write_flattened_sequence | write_nested_sequence
new / old (diff) | 0.008454 / 0.011353 (-0.002899) | 0.004572 / 0.011008 (-0.006436) | 0.100610 / 0.038508 (0.062102) | 0.029775 / 0.023109 (0.006666) | 0.321501 / 0.275898 (0.045603) | 0.384631 / 0.323480 (0.061151) | 0.006635 / 0.007986 (-0.001351) | 0.004302 / 0.004328 (-0.000027) | 0.076948 / 0.004250 (0.072698) | 0.034590 / 0.037052 (-0.002462) | 0.335476 / 0.258489 (0.076987) | 0.370736 / 0.293841 (0.076895) | 0.033903 / 0.128546 (-0.094644) | 0.011713 / 0.075646 (-0.063934) | 0.322657 / 0.419271 (-0.096615) | 0.041238 / 0.043533 (-0.002295) | 0.332796 / 0.255139 (0.077657) | 0.356097 / 0.283200 (0.072897) | 0.086596 / 0.141683 (-0.055086) | 1.500251 / 1.452155 (0.048096) | 1.506240 / 1.492716 (0.013523)

Benchmark: benchmark_getitem_100B.json

metric | get_batch_of_1024_random_rows | get_batch_of_1024_rows | get_first_row | get_last_row
new / old (diff) | 0.182495 / 0.018006 (0.164488) | 0.417243 / 0.000490 (0.416753) | 0.003320 / 0.000200 (0.003120) | 0.000078 / 0.000054 (0.000023)

Benchmark: benchmark_indices_mapping.json

metric | select | shard | shuffle | sort | train_test_split
new / old (diff) | 0.022645 / 0.037411 (-0.014767) | 0.096387 / 0.014526 (0.081861) | 0.103309 / 0.176557 (-0.073247) | 0.150951 / 0.737135 (-0.586184) | 0.108085 / 0.296338 (-0.188253)

Benchmark: benchmark_iterating.json

metric | read 5000 | read 50000 | read_batch 50000 10 | read_batch 50000 100 | read_batch 50000 1000 | read_formatted numpy 5000 | read_formatted pandas 5000 | read_formatted tensorflow 5000 | read_formatted torch 5000 | read_formatted_batch numpy 5000 10 | read_formatted_batch numpy 5000 1000 | shuffled read 5000 | shuffled read 50000 | shuffled read_batch 50000 10 | shuffled read_batch 50000 100 | shuffled read_batch 50000 1000 | shuffled read_formatted numpy 5000 | shuffled read_formatted_batch numpy 5000 10 | shuffled read_formatted_batch numpy 5000 1000
new / old (diff) | 0.418283 / 0.215209 (0.203074) | 4.182318 / 2.077655 (2.104663) | 1.905551 / 1.504120 (0.401431) | 1.713482 / 1.541195 (0.172287) | 1.723555 / 1.468490 (0.255065) | 0.694402 / 4.584777 (-3.890375) | 3.440837 / 3.745712 (-0.304875) | 2.807788 / 5.269862 (-2.462074) | 1.511065 / 4.565676 (-3.054611) | 0.082896 / 0.424275 (-0.341379) | 0.012826 / 0.007607 (0.005219) | 0.534630 / 0.226044 (0.308586) | 5.344232 / 2.268929 (3.075304) | 2.343182 / 55.444624 (-53.101442) | 1.993167 / 6.876477 (-4.883310) | 2.028598 / 2.142072 (-0.113474) | 0.813747 / 4.805227 (-3.991480) | 0.150772 / 6.500664 (-6.349892) | 0.065580 / 0.075469 (-0.009890)

Benchmark: benchmark_map_filter.json

metric | filter | map fast-tokenizer batched | map identity | map identity batched | map no-op batched | map no-op batched numpy | map no-op batched pandas | map no-op batched pytorch | map no-op batched tensorflow
new / old (diff) | 1.213075 / 1.841788 (-0.628713) | 13.803558 / 8.074308 (5.729249) | 13.861570 / 10.191392 (3.670178) | 0.149327 / 0.680424 (-0.531097) | 0.028580 / 0.534201 (-0.505621) | 0.392459 / 0.579283 (-0.186824) | 0.397675 / 0.434364 (-0.036689) | 0.453008 / 0.540337 (-0.087330) | 0.540258 / 1.386936 (-0.846678)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric | read_batch_formatted_as_numpy after write_array2d | read_batch_formatted_as_numpy after write_flattened_sequence | read_batch_formatted_as_numpy after write_nested_sequence | read_batch_unformated after write_array2d | read_batch_unformated after write_flattened_sequence | read_batch_unformated after write_nested_sequence | read_col_formatted_as_numpy after write_array2d | read_col_formatted_as_numpy after write_flattened_sequence | read_col_formatted_as_numpy after write_nested_sequence | read_col_unformated after write_array2d | read_col_unformated after write_flattened_sequence | read_col_unformated after write_nested_sequence | read_formatted_as_numpy after write_array2d | read_formatted_as_numpy after write_flattened_sequence | read_formatted_as_numpy after write_nested_sequence | read_unformated after write_array2d | read_unformated after write_flattened_sequence | read_unformated after write_nested_sequence | write_array2d | write_flattened_sequence | write_nested_sequence
new / old (diff) | 0.006427 / 0.011353 (-0.004926) | 0.004501 / 0.011008 (-0.006507) | 0.096841 / 0.038508 (0.058333) | 0.027316 / 0.023109 (0.004206) | 0.412767 / 0.275898 (0.136869) | 0.444874 / 0.323480 (0.121394) | 0.004679 / 0.007986 (-0.003307) | 0.003359 / 0.004328 (-0.000970) | 0.075986 / 0.004250 (0.071735) | 0.036032 / 0.037052 (-0.001020) | 0.416549 / 0.258489 (0.158060) | 0.459225 / 0.293841 (0.165384) | 0.032233 / 0.128546 (-0.096314) | 0.011576 / 0.075646 (-0.064070) | 0.317818 / 0.419271 (-0.101453) | 0.043798 / 0.043533 (0.000266) | 0.413053 / 0.255139 (0.157914) | 0.439133 / 0.283200 (0.155933) | 0.089591 / 0.141683 (-0.052092) | 1.483676 / 1.452155 (0.031521) | 1.551023 / 1.492716 (0.058307)

Benchmark: benchmark_getitem_100B.json

metric | get_batch_of_1024_random_rows | get_batch_of_1024_rows | get_first_row | get_last_row
new / old (diff) | 0.229731 / 0.018006 (0.211725) | 0.395366 / 0.000490 (0.394876) | 0.002692 / 0.000200 (0.002492) | 0.000089 / 0.000054 (0.000034)

Benchmark: benchmark_indices_mapping.json

metric | select | shard | shuffle | sort | train_test_split
new / old (diff) | 0.024352 / 0.037411 (-0.013060) | 0.098778 / 0.014526 (0.084252) | 0.105660 / 0.176557 (-0.070897) | 0.141284 / 0.737135 (-0.595852) | 0.108269 / 0.296338 (-0.188069)

Benchmark: benchmark_iterating.json

metric | read 5000 | read 50000 | read_batch 50000 10 | read_batch 50000 100 | read_batch 50000 1000 | read_formatted numpy 5000 | read_formatted pandas 5000 | read_formatted tensorflow 5000 | read_formatted torch 5000 | read_formatted_batch numpy 5000 10 | read_formatted_batch numpy 5000 1000 | shuffled read 5000 | shuffled read 50000 | shuffled read_batch 50000 10 | shuffled read_batch 50000 100 | shuffled read_batch 50000 1000 | shuffled read_formatted numpy 5000 | shuffled read_formatted_batch numpy 5000 10 | shuffled read_formatted_batch numpy 5000 1000
new / old (diff) | 0.477104 / 0.215209 (0.261895) | 4.760421 / 2.077655 (2.682766) | 2.446552 / 1.504120 (0.942432) | 2.249353 / 1.541195 (0.708158) | 2.275934 / 1.468490 (0.807444) | 0.698203 / 4.584777 (-3.886574) | 3.433769 / 3.745712 (-0.311943) | 1.851560 / 5.269862 (-3.418301) | 1.150512 / 4.565676 (-3.415165) | 0.082462 / 0.424275 (-0.341813) | 0.012642 / 0.007607 (0.005035) | 0.581077 / 0.226044 (0.355033) | 5.813649 / 2.268929 (3.544721) | 2.924283 / 55.444624 (-52.520341) | 2.573359 / 6.876477 (-4.303118) | 2.640242 / 2.142072 (0.498169) | 0.807003 / 4.805227 (-3.998225) | 0.151566 / 6.500664 (-6.349098) | 0.066132 / 0.075469 (-0.009337)

Benchmark: benchmark_map_filter.json

metric | filter | map fast-tokenizer batched | map identity | map identity batched | map no-op batched | map no-op batched numpy | map no-op batched pandas | map no-op batched pytorch | map no-op batched tensorflow
new / old (diff) | 1.286410 / 1.841788 (-0.555378) | 13.628342 / 8.074308 (5.554034) | 13.522056 / 10.191392 (3.330664) | 0.156182 / 0.680424 (-0.524242) | 0.016672 / 0.534201 (-0.517529) | 0.403583 / 0.579283 (-0.175700) | 0.388391 / 0.434364 (-0.045973) | 0.493925 / 0.540337 (-0.046412) | 0.587141 / 1.386936 (-0.799795)

@github-actions
PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

metric | read_batch_formatted_as_numpy after write_array2d | read_batch_formatted_as_numpy after write_flattened_sequence | read_batch_formatted_as_numpy after write_nested_sequence | read_batch_unformated after write_array2d | read_batch_unformated after write_flattened_sequence | read_batch_unformated after write_nested_sequence | read_col_formatted_as_numpy after write_array2d | read_col_formatted_as_numpy after write_flattened_sequence | read_col_formatted_as_numpy after write_nested_sequence | read_col_unformated after write_array2d | read_col_unformated after write_flattened_sequence | read_col_unformated after write_nested_sequence | read_formatted_as_numpy after write_array2d | read_formatted_as_numpy after write_flattened_sequence | read_formatted_as_numpy after write_nested_sequence | read_unformated after write_array2d | read_unformated after write_flattened_sequence | read_unformated after write_nested_sequence | write_array2d | write_flattened_sequence | write_nested_sequence
new / old (diff) | 0.008491 / 0.011353 (-0.002861) | 0.004528 / 0.011008 (-0.006480) | 0.099632 / 0.038508 (0.061124) | 0.029740 / 0.023109 (0.006631) | 0.332183 / 0.275898 (0.056285) | 0.386678 / 0.323480 (0.063198) | 0.006894 / 0.007986 (-0.001091) | 0.004037 / 0.004328 (-0.000292) | 0.076546 / 0.004250 (0.072296) | 0.034898 / 0.037052 (-0.002154) | 0.320601 / 0.258489 (0.062112) | 0.347314 / 0.293841 (0.053473) | 0.033432 / 0.128546 (-0.095114) | 0.011405 / 0.075646 (-0.064241) | 0.323558 / 0.419271 (-0.095714) | 0.041256 / 0.043533 (-0.002277) | 0.346200 / 0.255139 (0.091061) | 0.364110 / 0.283200 (0.080910) | 0.088885 / 0.141683 (-0.052798) | 1.529373 / 1.452155 (0.077219) | 1.535621 / 1.492716 (0.042904)

Benchmark: benchmark_getitem_100B.json

metric | get_batch_of_1024_random_rows | get_batch_of_1024_rows | get_first_row | get_last_row
new / old (diff) | 0.178494 / 0.018006 (0.160488) | 0.411241 / 0.000490 (0.410752) | 0.005426 / 0.000200 (0.005226) | 0.000323 / 0.000054 (0.000269)

Benchmark: benchmark_indices_mapping.json

metric | select | shard | shuffle | sort | train_test_split
new / old (diff) | 0.022539 / 0.037411 (-0.014872) | 0.096906 / 0.014526 (0.082380) | 0.103463 / 0.176557 (-0.073093) | 0.138447 / 0.737135 (-0.598689) | 0.107463 / 0.296338 (-0.188875)

Benchmark: benchmark_iterating.json

metric | read 5000 | read 50000 | read_batch 50000 10 | read_batch 50000 100 | read_batch 50000 1000 | read_formatted numpy 5000 | read_formatted pandas 5000 | read_formatted tensorflow 5000 | read_formatted torch 5000 | read_formatted_batch numpy 5000 10 | read_formatted_batch numpy 5000 1000 | shuffled read 5000 | shuffled read 50000 | shuffled read_batch 50000 10 | shuffled read_batch 50000 100 | shuffled read_batch 50000 1000 | shuffled read_formatted numpy 5000 | shuffled read_formatted_batch numpy 5000 10 | shuffled read_formatted_batch numpy 5000 1000
new / old (diff) | 0.411911 / 0.215209 (0.196702) | 4.138020 / 2.077655 (2.060366) | 1.855576 / 1.504120 (0.351456) | 1.646171 / 1.541195 (0.104976) | 1.666003 / 1.468490 (0.197513) | 0.694070 / 4.584777 (-3.890707) | 3.395001 / 3.745712 (-0.350712) | 2.869958 / 5.269862 (-2.399904) | 1.530200 / 4.565676 (-3.035477) | 0.082264 / 0.424275 (-0.342011) | 0.012311 / 0.007607 (0.004704) | 0.533955 / 0.226044 (0.307911) | 5.383441 / 2.268929 (3.114512) | 2.308334 / 55.444624 (-53.136291) | 1.952901 / 6.876477 (-4.923576) | 2.011146 / 2.142072 (-0.130926) | 0.813210 / 4.805227 (-3.992017) | 0.148559 / 6.500664 (-6.352105) | 0.063885 / 0.075469 (-0.011585)

Benchmark: benchmark_map_filter.json

metric | filter | map fast-tokenizer batched | map identity | map identity batched | map no-op batched | map no-op batched numpy | map no-op batched pandas | map no-op batched pytorch | map no-op batched tensorflow
new / old (diff) | 1.204273 / 1.841788 (-0.637515) | 13.639953 / 8.074308 (5.565645) | 14.196351 / 10.191392 (4.004959) | 0.159332 / 0.680424 (-0.521092) | 0.028854 / 0.534201 (-0.505347) | 0.393369 / 0.579283 (-0.185914) | 0.395204 / 0.434364 (-0.039160) | 0.449449 / 0.540337 (-0.090889) | 0.535106 / 1.386936 (-0.851830)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric | read_batch_formatted_as_numpy after write_array2d | read_batch_formatted_as_numpy after write_flattened_sequence | read_batch_formatted_as_numpy after write_nested_sequence | read_batch_unformated after write_array2d | read_batch_unformated after write_flattened_sequence | read_batch_unformated after write_nested_sequence | read_col_formatted_as_numpy after write_array2d | read_col_formatted_as_numpy after write_flattened_sequence | read_col_formatted_as_numpy after write_nested_sequence | read_col_unformated after write_array2d | read_col_unformated after write_flattened_sequence | read_col_unformated after write_nested_sequence | read_formatted_as_numpy after write_array2d | read_formatted_as_numpy after write_flattened_sequence | read_formatted_as_numpy after write_nested_sequence | read_unformated after write_array2d | read_unformated after write_flattened_sequence | read_unformated after write_nested_sequence | write_array2d | write_flattened_sequence | write_nested_sequence
new / old (diff) | 0.006384 / 0.011353 (-0.004969) | 0.004423 / 0.011008 (-0.006585) | 0.096900 / 0.038508 (0.058392) | 0.027035 / 0.023109 (0.003926) | 0.440934 / 0.275898 (0.165036) | 0.462096 / 0.323480 (0.138616) | 0.004723 / 0.007986 (-0.003262) | 0.003387 / 0.004328 (-0.000941) | 0.076334 / 0.004250 (0.072083) | 0.035805 / 0.037052 (-0.001248) | 0.442706 / 0.258489 (0.184216) | 0.479093 / 0.293841 (0.185252) | 0.031356 / 0.128546 (-0.097190) | 0.011358 / 0.075646 (-0.064288) | 0.316216 / 0.419271 (-0.103056) | 0.041401 / 0.043533 (-0.002132) | 0.437998 / 0.255139 (0.182859) | 0.456859 / 0.283200 (0.173659) | 0.086862 / 0.141683 (-0.054821) | 1.461058 / 1.452155 (0.008904) | 1.528178 / 1.492716 (0.035462)

Benchmark: benchmark_getitem_100B.json

metric | get_batch_of_1024_random_rows | get_batch_of_1024_rows | get_first_row | get_last_row
new / old (diff) | 0.201864 / 0.018006 (0.183857) | 0.397832 / 0.000490 (0.397343) | 0.000401 / 0.000200 (0.000201) | 0.000058 / 0.000054 (0.000004)

Benchmark: benchmark_indices_mapping.json

metric | select | shard | shuffle | sort | train_test_split
new / old (diff) | 0.024469 / 0.037411 (-0.012943) | 0.100313 / 0.014526 (0.085787) | 0.107322 / 0.176557 (-0.069235) | 0.146324 / 0.737135 (-0.590812) | 0.109628 / 0.296338 (-0.186710)

Benchmark: benchmark_iterating.json

metric | read 5000 | read 50000 | read_batch 50000 10 | read_batch 50000 100 | read_batch 50000 1000 | read_formatted numpy 5000 | read_formatted pandas 5000 | read_formatted tensorflow 5000 | read_formatted torch 5000 | read_formatted_batch numpy 5000 10 | read_formatted_batch numpy 5000 1000 | shuffled read 5000 | shuffled read 50000 | shuffled read_batch 50000 10 | shuffled read_batch 50000 100 | shuffled read_batch 50000 1000 | shuffled read_formatted numpy 5000 | shuffled read_formatted_batch numpy 5000 10 | shuffled read_formatted_batch numpy 5000 1000
new / old (diff) | 0.474605 / 0.215209 (0.259396) | 4.757606 / 2.077655 (2.679951) | 2.461358 / 1.504120 (0.957238) | 2.256189 / 1.541195 (0.714994) | 2.259431 / 1.468490 (0.790941) | 0.694325 / 4.584777 (-3.890452) | 3.339407 / 3.745712 (-0.406305) | 1.845878 / 5.269862 (-3.423984) | 1.153192 / 4.565676 (-3.412484) | 0.082525 / 0.424275 (-0.341750) | 0.012521 / 0.007607 (0.004914) | 0.573641 / 0.226044 (0.347596) | 5.782644 / 2.268929 (3.513715) | 2.908973 / 55.444624 (-52.535652) | 2.561331 / 6.876477 (-4.315146) | 2.643789 / 2.142072 (0.501716) | 0.808981 / 4.805227 (-3.996246) | 0.153595 / 6.500664 (-6.347069) | 0.067294 / 0.075469 (-0.008175)

Benchmark: benchmark_map_filter.json

metric | filter | map fast-tokenizer batched | map identity | map identity batched | map no-op batched | map no-op batched numpy | map no-op batched pandas | map no-op batched pytorch | map no-op batched tensorflow
new / old (diff) | 1.267606 / 1.841788 (-0.574182) | 13.748027 / 8.074308 (5.673718) | 13.411094 / 10.191392 (3.219702) | 0.151850 / 0.680424 (-0.528574) | 0.016441 / 0.534201 (-0.517760) | 0.395012 / 0.579283 (-0.184271) | 0.383720 / 0.434364 (-0.050644) | 0.478625 / 0.540337 (-0.061712) | 0.572280 / 1.386936 (-0.814656)
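
Every cell in the benchmark rows above follows the pattern `new / old (diff)`, where the diff appears to be new minus old, rounded to six decimals. The sketch below shows one way such a row could be parsed and sanity-checked. The names `parse_cells` and `row_is_consistent` and the tolerance value are illustrative assumptions and are not taken from the benchmark tooling.

```python
import re

# One benchmark cell: "<new> / <old> (<diff>)", e.g. "0.022645 / 0.037411 (-0.014767)".
CELL_RE = re.compile(r"(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)")


def parse_cells(row: str) -> list[tuple[float, float, float]]:
    """Extract (new, old, diff) triples from one 'new / old (diff)' row."""
    return [tuple(float(x) for x in m.groups()) for m in CELL_RE.finditer(row)]


def row_is_consistent(row: str, tol: float = 2e-6) -> bool:
    """Check that each reported diff equals new - old, allowing for rounding."""
    return all(abs((new - old) - diff) <= tol for new, old, diff in parse_cells(row))


# Row copied from the first benchmark_indices_mapping.json table above.
row = (
    "0.022645 / 0.037411 (-0.014767) 0.096387 / 0.014526 (0.081861) "
    "0.103309 / 0.176557 (-0.073247) 0.150951 / 0.737135 (-0.586184) "
    "0.108085 / 0.296338 (-0.188253)"
)
print(len(parse_cells(row)), row_is_consistent(row))
```

The tolerance is slightly above 1e-6 because the reported diffs seem to be computed from unrounded timings, so the last digit can differ by one from the difference of the rounded values.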
