Release: 2.6.1
lhoestq committed Oct 14, 2022
1 parent eadc79a commit 1742cf1
Showing 2 changed files with 2 additions and 2 deletions.
2 changes: 1 addition & 1 deletion setup.py
@@ -194,7 +194,7 @@

setup(
name="datasets",
- version="2.6.1.dev0", # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
+ version="2.6.1", # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
description="HuggingFace community-driven open-source library of datasets",
long_description=open("README.md", encoding="utf-8").read(),
long_description_content_type="text/markdown",
2 changes: 1 addition & 1 deletion src/datasets/__init__.py
@@ -17,7 +17,7 @@
# pylint: enable=line-too-long
# pylint: disable=g-import-not-at-top,g-bad-import-order,wrong-import-position

- __version__ = "2.6.1.dev0"
+ __version__ = "2.6.1"

import platform

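The comment next to `version` in setup.py names three accepted shapes: `x.y.z.dev0`, `x.y.z.rc1`, or `x.y.z`, with dots rather than dashes. A minimal sketch of such a check follows; `is_expected_version` is a hypothetical helper, not something in the repository, and accepting any `rcN` suffix (rather than only `rc1`) is an assumption.

```python
import re

# Hypothetical helper for illustration; not part of the datasets repository.
# Accepts x.y.z, x.y.z.dev0, or x.y.z.rcN (N being any integer is an assumption).
_VERSION_RE = re.compile(r"\d+\.\d+\.\d+(\.dev0|\.rc\d+)?")


def is_expected_version(version: str) -> bool:
    """Return True if the string has one of the shapes named in the setup.py comment."""
    return _VERSION_RE.fullmatch(version) is not None


if __name__ == "__main__":
    for candidate in ("2.6.1.dev0", "2.6.1", "2.6.1.rc1", "2.6.1-dev0"):
        print(candidate, is_expected_version(candidate))
```

Under this sketch, both sides of the diff (`2.6.1.dev0` and `2.6.1`) pass, while a dashed form such as `2.6.1-dev0` is rejected.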

2 comments on commit 1742cf1

@github-actions

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
| --- | --- |
| read_batch_formatted_as_numpy after write_array2d | 0.008246 / 0.011353 (-0.003107) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004152 / 0.011008 (-0.006856) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.098156 / 0.038508 (0.059648) |
| read_batch_unformated after write_array2d | 0.029122 / 0.023109 (0.006012) |
| read_batch_unformated after write_flattened_sequence | 0.298711 / 0.275898 (0.022813) |
| read_batch_unformated after write_nested_sequence | 0.361961 / 0.323480 (0.038481) |
| read_col_formatted_as_numpy after write_array2d | 0.006818 / 0.007986 (-0.001167) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003180 / 0.004328 (-0.001148) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.076503 / 0.004250 (0.072253) |
| read_col_unformated after write_array2d | 0.036090 / 0.037052 (-0.000963) |
| read_col_unformated after write_flattened_sequence | 0.309586 / 0.258489 (0.051097) |
| read_col_unformated after write_nested_sequence | 0.348565 / 0.293841 (0.054724) |
| read_formatted_as_numpy after write_array2d | 0.038141 / 0.128546 (-0.090405) |
| read_formatted_as_numpy after write_flattened_sequence | 0.014282 / 0.075646 (-0.061364) |
| read_formatted_as_numpy after write_nested_sequence | 0.324653 / 0.419271 (-0.094618) |
| read_unformated after write_array2d | 0.046897 / 0.043533 (0.003364) |
| read_unformated after write_flattened_sequence | 0.303105 / 0.255139 (0.047966) |
| read_unformated after write_nested_sequence | 0.339253 / 0.283200 (0.056053) |
| write_array2d | 0.090930 / 0.141683 (-0.050753) |
| write_flattened_sequence | 1.507281 / 1.452155 (0.055127) |
| write_nested_sequence | 1.523072 / 1.492716 (0.030356) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
| --- | --- |
| get_batch_of_1024_random_rows | 0.189615 / 0.018006 (0.171609) |
| get_batch_of_1024_rows | 0.397685 / 0.000490 (0.397195) |
| get_first_row | 0.003871 / 0.000200 (0.003671) |
| get_last_row | 0.000073 / 0.000054 (0.000019) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
| --- | --- |
| select | 0.020437 / 0.037411 (-0.016974) |
| shard | 0.088297 / 0.014526 (0.073771) |
| shuffle | 0.101371 / 0.176557 (-0.075185) |
| sort | 0.143715 / 0.737135 (-0.593420) |
| train_test_split | 0.099325 / 0.296338 (-0.197014) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
| --- | --- |
| read 5000 | 0.412526 / 0.215209 (0.197317) |
| read 50000 | 4.110139 / 2.077655 (2.032485) |
| read_batch 50000 10 | 1.848033 / 1.504120 (0.343913) |
| read_batch 50000 100 | 1.648373 / 1.541195 (0.107179) |
| read_batch 50000 1000 | 1.649734 / 1.468490 (0.181244) |
| read_formatted numpy 5000 | 0.682876 / 4.584777 (-3.901901) |
| read_formatted pandas 5000 | 3.335711 / 3.745712 (-0.410001) |
| read_formatted tensorflow 5000 | 2.727744 / 5.269862 (-2.542117) |
| read_formatted torch 5000 | 1.505338 / 4.565676 (-3.060339) |
| read_formatted_batch numpy 5000 10 | 0.080269 / 0.424275 (-0.344006) |
| read_formatted_batch numpy 5000 1000 | 0.011500 / 0.007607 (0.003893) |
| shuffled read 5000 | 0.519604 / 0.226044 (0.293560) |
| shuffled read 50000 | 5.174052 / 2.268929 (2.905123) |
| shuffled read_batch 50000 10 | 2.255864 / 55.444624 (-53.188760) |
| shuffled read_batch 50000 100 | 1.926567 / 6.876477 (-4.949909) |
| shuffled read_batch 50000 1000 | 1.994304 / 2.142072 (-0.147769) |
| shuffled read_formatted numpy 5000 | 0.805081 / 4.805227 (-4.000147) |
| shuffled read_formatted_batch numpy 5000 10 | 0.146588 / 6.500664 (-6.354076) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.063846 / 0.075469 (-0.011623) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
| --- | --- |
| filter | 1.512042 / 1.841788 (-0.329745) |
| map fast-tokenizer batched | 12.058384 / 8.074308 (3.984076) |
| map identity | 26.085560 / 10.191392 (15.894168) |
| map identity batched | 0.873375 / 0.680424 (0.192952) |
| map no-op batched | 0.596758 / 0.534201 (0.062557) |
| map no-op batched numpy | 0.382937 / 0.579283 (-0.196346) |
| map no-op batched pandas | 0.389429 / 0.434364 (-0.044935) |
| map no-op batched pytorch | 0.228573 / 0.540337 (-0.311765) |
| map no-op batched tensorflow | 0.232618 / 1.386936 (-1.154318) |

PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
| --- | --- |
| read_batch_formatted_as_numpy after write_array2d | 0.006234 / 0.011353 (-0.005118) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004224 / 0.011008 (-0.006784) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.095976 / 0.038508 (0.057468) |
| read_batch_unformated after write_array2d | 0.027015 / 0.023109 (0.003906) |
| read_batch_unformated after write_flattened_sequence | 0.416027 / 0.275898 (0.140129) |
| read_batch_unformated after write_nested_sequence | 0.444594 / 0.323480 (0.121114) |
| read_col_formatted_as_numpy after write_array2d | 0.004554 / 0.007986 (-0.003432) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003191 / 0.004328 (-0.001138) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.073894 / 0.004250 (0.069643) |
| read_col_unformated after write_array2d | 0.031782 / 0.037052 (-0.005271) |
| read_col_unformated after write_flattened_sequence | 0.420661 / 0.258489 (0.162172) |
| read_col_unformated after write_nested_sequence | 0.455066 / 0.293841 (0.161225) |
| read_formatted_as_numpy after write_array2d | 0.031466 / 0.128546 (-0.097081) |
| read_formatted_as_numpy after write_flattened_sequence | 0.011548 / 0.075646 (-0.064098) |
| read_formatted_as_numpy after write_nested_sequence | 0.322864 / 0.419271 (-0.096408) |
| read_unformated after write_array2d | 0.041907 / 0.043533 (-0.001626) |
| read_unformated after write_flattened_sequence | 0.420966 / 0.255139 (0.165827) |
| read_unformated after write_nested_sequence | 0.440870 / 0.283200 (0.157671) |
| write_array2d | 0.083029 / 0.141683 (-0.058654) |
| write_flattened_sequence | 1.472407 / 1.452155 (0.020252) |
| write_nested_sequence | 1.543090 / 1.492716 (0.050373) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
| --- | --- |
| get_batch_of_1024_random_rows | 0.255509 / 0.018006 (0.237503) |
| get_batch_of_1024_rows | 0.388711 / 0.000490 (0.388221) |
| get_first_row | 0.012210 / 0.000200 (0.012010) |
| get_last_row | 0.000235 / 0.000054 (0.000181) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
| --- | --- |
| select | 0.020174 / 0.037411 (-0.017238) |
| shard | 0.092161 / 0.014526 (0.077635) |
| shuffle | 0.101638 / 0.176557 (-0.074919) |
| sort | 0.136823 / 0.737135 (-0.600312) |
| train_test_split | 0.101699 / 0.296338 (-0.194639) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
| --- | --- |
| read 5000 | 0.475576 / 0.215209 (0.260367) |
| read 50000 | 4.750959 / 2.077655 (2.673304) |
| read_batch 50000 10 | 2.464367 / 1.504120 (0.960247) |
| read_batch 50000 100 | 2.263827 / 1.541195 (0.722632) |
| read_batch 50000 1000 | 2.218332 / 1.468490 (0.749842) |
| read_formatted numpy 5000 | 0.691144 / 4.584777 (-3.893633) |
| read_formatted pandas 5000 | 3.309536 / 3.745712 (-0.436176) |
| read_formatted tensorflow 5000 | 1.811405 / 5.269862 (-3.458457) |
| read_formatted torch 5000 | 1.133598 / 4.565676 (-3.432079) |
| read_formatted_batch numpy 5000 10 | 0.081501 / 0.424275 (-0.342774) |
| read_formatted_batch numpy 5000 1000 | 0.011945 / 0.007607 (0.004338) |
| shuffled read 5000 | 0.571250 / 0.226044 (0.345206) |
| shuffled read 50000 | 5.732415 / 2.268929 (3.463486) |
| shuffled read_batch 50000 10 | 2.847106 / 55.444624 (-52.597518) |
| shuffled read_batch 50000 100 | 2.509615 / 6.876477 (-4.366862) |
| shuffled read_batch 50000 1000 | 2.566388 / 2.142072 (0.424316) |
| shuffled read_formatted numpy 5000 | 0.791163 / 4.805227 (-4.014064) |
| shuffled read_formatted_batch numpy 5000 10 | 0.146978 / 6.500664 (-6.353687) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.064211 / 0.075469 (-0.011258) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
| --- | --- |
| filter | 1.590000 / 1.841788 (-0.251787) |
| map fast-tokenizer batched | 12.148056 / 8.074308 (4.073748) |
| map identity | 11.499273 / 10.191392 (1.307881) |
| map identity batched | 0.932055 / 0.680424 (0.251631) |
| map no-op batched | 0.646575 / 0.534201 (0.112374) |
| map no-op batched numpy | 0.371758 / 0.579283 (-0.207525) |
| map no-op batched pandas | 0.372813 / 0.434364 (-0.061551) |
| map no-op batched pytorch | 0.220938 / 0.540337 (-0.319399) |
| map no-op batched tensorflow | 0.224515 / 1.386936 (-1.162421) |
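
Each cell in the tables above is printed as `new / old (diff)`, and the printed numbers are consistent with `diff = new - old` up to rounding in the last digit. The following is a minimal sketch for parsing a cell and checking that relation; `parse_cell`, `diff_is_consistent`, and the tolerance value are assumptions made for illustration, not part of the benchmark tooling.

```python
import re
from typing import Tuple

# Matches a cell such as "0.008246 / 0.011353 (-0.003107)".
CELL_RE = re.compile(r"([\d.]+) / ([\d.]+) \((-?[\d.]+)\)")


def parse_cell(cell: str) -> Tuple[float, float, float]:
    """Split one 'new / old (diff)' cell into three floats."""
    match = CELL_RE.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unrecognized cell: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


def diff_is_consistent(cell: str, tol: float = 2e-6) -> bool:
    """Check that the printed diff equals new - old within a rounding tolerance."""
    new, old, diff = parse_cell(cell)
    return abs((new - old) - diff) <= tol


if __name__ == "__main__":
    print(diff_is_consistent("0.008246 / 0.011353 (-0.003107)"))
```

The tolerance is slightly above one unit in the sixth decimal place because some printed diffs (for example `0.029122 / 0.023109 (0.006012)`) differ from the recomputed value by one unit in that place.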

@github-actions

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
| --- | --- |
| read_batch_formatted_as_numpy after write_array2d | 0.011467 / 0.011353 (0.000114) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.006177 / 0.011008 (-0.004831) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.118735 / 0.038508 (0.080227) |
| read_batch_unformated after write_array2d | 0.044718 / 0.023109 (0.021608) |
| read_batch_unformated after write_flattened_sequence | 0.351433 / 0.275898 (0.075535) |
| read_batch_unformated after write_nested_sequence | 0.434193 / 0.323480 (0.110713) |
| read_col_formatted_as_numpy after write_array2d | 0.009590 / 0.007986 (0.001604) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004827 / 0.004328 (0.000499) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.088079 / 0.004250 (0.083828) |
| read_col_unformated after write_array2d | 0.051715 / 0.037052 (0.014663) |
| read_col_unformated after write_flattened_sequence | 0.362979 / 0.258489 (0.104490) |
| read_col_unformated after write_nested_sequence | 0.405181 / 0.293841 (0.111340) |
| read_formatted_as_numpy after write_array2d | 0.050062 / 0.128546 (-0.078484) |
| read_formatted_as_numpy after write_flattened_sequence | 0.017890 / 0.075646 (-0.057757) |
| read_formatted_as_numpy after write_nested_sequence | 0.406788 / 0.419271 (-0.012484) |
| read_unformated after write_array2d | 0.060269 / 0.043533 (0.016736) |
| read_unformated after write_flattened_sequence | 0.354621 / 0.255139 (0.099482) |
| read_unformated after write_nested_sequence | 0.385962 / 0.283200 (0.102763) |
| write_array2d | 0.123475 / 0.141683 (-0.018208) |
| write_flattened_sequence | 1.727626 / 1.452155 (0.275471) |
| write_nested_sequence | 1.803855 / 1.492716 (0.311139) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
| --- | --- |
| get_batch_of_1024_random_rows | 0.021680 / 0.018006 (0.003674) |
| get_batch_of_1024_rows | 0.503381 / 0.000490 (0.502891) |
| get_first_row | 0.005957 / 0.000200 (0.005757) |
| get_last_row | 0.000402 / 0.000054 (0.000348) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
| --- | --- |
| select | 0.028242 / 0.037411 (-0.009169) |
| shard | 0.124078 / 0.014526 (0.109552) |
| shuffle | 0.135683 / 0.176557 (-0.040874) |
| sort | 0.184361 / 0.737135 (-0.552774) |
| train_test_split | 0.140354 / 0.296338 (-0.155985) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
| --- | --- |
| read 5000 | 0.480742 / 0.215209 (0.265533) |
| read 50000 | 4.779373 / 2.077655 (2.701719) |
| read_batch 50000 10 | 2.165167 / 1.504120 (0.661047) |
| read_batch 50000 100 | 1.929963 / 1.541195 (0.388769) |
| read_batch 50000 1000 | 2.011202 / 1.468490 (0.542712) |
| read_formatted numpy 5000 | 0.836269 / 4.584777 (-3.748508) |
| read_formatted pandas 5000 | 4.518374 / 3.745712 (0.772662) |
| read_formatted tensorflow 5000 | 2.432428 / 5.269862 (-2.837433) |
| read_formatted torch 5000 | 1.666226 / 4.565676 (-2.899451) |
| read_formatted_batch numpy 5000 10 | 0.101863 / 0.424275 (-0.322412) |
| read_formatted_batch numpy 5000 1000 | 0.014412 / 0.007607 (0.006805) |
| shuffled read 5000 | 0.602609 / 0.226044 (0.376564) |
| shuffled read 50000 | 6.028526 / 2.268929 (3.759598) |
| shuffled read_batch 50000 10 | 2.661940 / 55.444624 (-52.782684) |
| shuffled read_batch 50000 100 | 2.293240 / 6.876477 (-4.583237) |
| shuffled read_batch 50000 1000 | 2.480830 / 2.142072 (0.338758) |
| shuffled read_formatted numpy 5000 | 1.021821 / 4.805227 (-3.783406) |
| shuffled read_formatted_batch numpy 5000 10 | 0.197480 / 6.500664 (-6.303184) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.072843 / 0.075469 (-0.002626) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
| --- | --- |
| filter | 1.857247 / 1.841788 (0.015459) |
| map fast-tokenizer batched | 17.402325 / 8.074308 (9.328017) |
| map identity | 29.619062 / 10.191392 (19.427670) |
| map identity batched | 1.070596 / 0.680424 (0.390172) |
| map no-op batched | 0.689079 / 0.534201 (0.154878) |
| map no-op batched numpy | 0.528033 / 0.579283 (-0.051250) |
| map no-op batched pandas | 0.645106 / 0.434364 (0.210742) |
| map no-op batched pytorch | 0.425159 / 0.540337 (-0.115178) |
| map no-op batched tensorflow | 0.333632 / 1.386936 (-1.053304) |

PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
| --- | --- |
| read_batch_formatted_as_numpy after write_array2d | 0.009216 / 0.011353 (-0.002137) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.006193 / 0.011008 (-0.004815) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.116069 / 0.038508 (0.077561) |
| read_batch_unformated after write_array2d | 0.043686 / 0.023109 (0.020577) |
| read_batch_unformated after write_flattened_sequence | 0.473225 / 0.275898 (0.197327) |
| read_batch_unformated after write_nested_sequence | 0.513695 / 0.323480 (0.190215) |
| read_col_formatted_as_numpy after write_array2d | 0.007081 / 0.007986 (-0.000904) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.006343 / 0.004328 (0.002014) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.085387 / 0.004250 (0.081137) |
| read_col_unformated after write_array2d | 0.047534 / 0.037052 (0.010482) |
| read_col_unformated after write_flattened_sequence | 0.474732 / 0.258489 (0.216242) |
| read_col_unformated after write_nested_sequence | 0.546172 / 0.293841 (0.252331) |
| read_formatted_as_numpy after write_array2d | 0.058086 / 0.128546 (-0.070460) |
| read_formatted_as_numpy after write_flattened_sequence | 0.014787 / 0.075646 (-0.060859) |
| read_formatted_as_numpy after write_nested_sequence | 0.399874 / 0.419271 (-0.019398) |
| read_unformated after write_array2d | 0.058804 / 0.043533 (0.015271) |
| read_unformated after write_flattened_sequence | 0.474372 / 0.255139 (0.219233) |
| read_unformated after write_nested_sequence | 0.494759 / 0.283200 (0.211560) |
| write_array2d | 0.118139 / 0.141683 (-0.023544) |
| write_flattened_sequence | 1.785980 / 1.452155 (0.333826) |
| write_nested_sequence | 1.890910 / 1.492716 (0.398194) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
| --- | --- |
| get_batch_of_1024_random_rows | 0.267187 / 0.018006 (0.249181) |
| get_batch_of_1024_rows | 0.506637 / 0.000490 (0.506147) |
| get_first_row | 0.010951 / 0.000200 (0.010751) |
| get_last_row | 0.000160 / 0.000054 (0.000106) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
| --- | --- |
| select | 0.028019 / 0.037411 (-0.009392) |
| shard | 0.127888 / 0.014526 (0.113362) |
| shuffle | 0.138623 / 0.176557 (-0.037934) |
| sort | 0.197707 / 0.737135 (-0.539428) |
| train_test_split | 0.143614 / 0.296338 (-0.152725) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
| --- | --- |
| read 5000 | 0.514741 / 0.215209 (0.299532) |
| read 50000 | 5.128504 / 2.077655 (3.050849) |
| read_batch 50000 10 | 2.634402 / 1.504120 (1.130282) |
| read_batch 50000 100 | 2.440124 / 1.541195 (0.898929) |
| read_batch 50000 1000 | 2.456171 / 1.468490 (0.987681) |
| read_formatted numpy 5000 | 0.856596 / 4.584777 (-3.728181) |
| read_formatted pandas 5000 | 4.911497 / 3.745712 (1.165785) |
| read_formatted tensorflow 5000 | 4.403897 / 5.269862 (-0.865965) |
| read_formatted torch 5000 | 2.397983 / 4.565676 (-2.167694) |
| read_formatted_batch numpy 5000 10 | 0.103959 / 0.424275 (-0.320317) |
| read_formatted_batch numpy 5000 1000 | 0.014452 / 0.007607 (0.006845) |
| shuffled read 5000 | 0.636030 / 0.226044 (0.409986) |
| shuffled read 50000 | 6.332770 / 2.268929 (4.063841) |
| shuffled read_batch 50000 10 | 3.073837 / 55.444624 (-52.370787) |
| shuffled read_batch 50000 100 | 2.762473 / 6.876477 (-4.114003) |
| shuffled read_batch 50000 1000 | 2.847753 / 2.142072 (0.705680) |
| shuffled read_formatted numpy 5000 | 1.027539 / 4.805227 (-3.777688) |
| shuffled read_formatted_batch numpy 5000 10 | 0.201736 / 6.500664 (-6.298928) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.074734 / 0.075469 (-0.000736) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
| --- | --- |
| filter | 1.868321 / 1.841788 (0.026534) |
| map fast-tokenizer batched | 17.555944 / 8.074308 (9.481636) |
| map identity | 13.829502 / 10.191392 (3.638110) |
| map identity batched | 1.114763 / 0.680424 (0.434339) |
| map no-op batched | 0.730613 / 0.534201 (0.196412) |
| map no-op batched numpy | 0.500288 / 0.579283 (-0.078995) |
| map no-op batched pandas | 0.547529 / 0.434364 (0.113165) |
| map no-op batched pytorch | 0.301757 / 0.540337 (-0.238580) |
| map no-op batched tensorflow | 0.306784 / 1.386936 (-1.080152) |
