Update src/datasets/builder.py
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
mariosasko and lhoestq authored Dec 9, 2022
1 parent 74b3cad commit 631f3f9
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion src/datasets/builder.py
@@ -723,7 +723,7 @@ def download_and_prepare(
         force_download=bool(download_mode == DownloadMode.FORCE_REDOWNLOAD),
         force_extract=bool(download_mode == DownloadMode.FORCE_REDOWNLOAD),
         use_etag=False,
-        use_auth_token=self.use_auth_token,
+        use_auth_token=use_auth_token,
     )  # We don't use etag for data files to speed up the process
 
     dl_manager = DownloadManager(
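
The one-line change swaps which value is forwarded as `use_auth_token`: the builder's instance attribute, or a local name of the same spelling inside `download_and_prepare`. The sketch below illustrates only that distinction. `ToyBuilder` and `ToyDownloadConfig` are hypothetical stand-ins, not the `datasets` classes, and it assumes `use_auth_token` is a parameter of the method, which the visible hunk does not show.

```python
from dataclasses import dataclass
from typing import Optional, Union

Token = Optional[Union[bool, str]]


@dataclass
class ToyDownloadConfig:
    """Hypothetical stand-in for a download configuration object."""
    use_auth_token: Token = None


class ToyBuilder:
    """Hypothetical builder that stores a token at construction time."""

    def __init__(self, use_auth_token: Token = None):
        self.use_auth_token = use_auth_token

    def config_from_attribute(self, use_auth_token: Token = None) -> ToyDownloadConfig:
        # Mirrors the removed line: the per-call argument is ignored.
        return ToyDownloadConfig(use_auth_token=self.use_auth_token)

    def config_from_argument(self, use_auth_token: Token = None) -> ToyDownloadConfig:
        # Mirrors the added line: the per-call argument is forwarded.
        return ToyDownloadConfig(use_auth_token=use_auth_token)


builder = ToyBuilder(use_auth_token="token-set-at-init")
from_attr = builder.config_from_attribute(use_auth_token="token-passed-to-call")
from_arg = builder.config_from_argument(use_auth_token="token-passed-to-call")
print(from_attr.use_auth_token, from_arg.use_auth_token)
```

When the two values differ, the two spellings build different configurations, so the choice matters whenever a caller passes a token at call time.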

1 comment on commit 631f3f9

@github-actions

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.010336 / 0.011353 (-0.001017) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.006159 / 0.011008 (-0.004849) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.099798 / 0.038508 (0.061290) |
| read_batch_unformated after write_array2d | 0.038199 / 0.023109 (0.015090) |
| read_batch_unformated after write_flattened_sequence | 0.299328 / 0.275898 (0.023430) |
| read_batch_unformated after write_nested_sequence | 0.391203 / 0.323480 (0.067723) |
| read_col_formatted_as_numpy after write_array2d | 0.010338 / 0.007986 (0.002352) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005952 / 0.004328 (0.001624) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.077807 / 0.004250 (0.073557) |
| read_col_unformated after write_array2d | 0.057411 / 0.037052 (0.020358) |
| read_col_unformated after write_flattened_sequence | 0.322980 / 0.258489 (0.064491) |
| read_col_unformated after write_nested_sequence | 0.364073 / 0.293841 (0.070232) |
| read_formatted_as_numpy after write_array2d | 0.043557 / 0.128546 (-0.084990) |
| read_formatted_as_numpy after write_flattened_sequence | 0.016062 / 0.075646 (-0.059584) |
| read_formatted_as_numpy after write_nested_sequence | 0.339421 / 0.419271 (-0.079850) |
| read_unformated after write_array2d | 0.054001 / 0.043533 (0.010469) |
| read_unformated after write_flattened_sequence | 0.293897 / 0.255139 (0.038758) |
| read_unformated after write_nested_sequence | 0.317751 / 0.283200 (0.034552) |
| write_array2d | 0.134035 / 0.141683 (-0.007648) |
| write_flattened_sequence | 1.496749 / 1.452155 (0.044595) |
| write_nested_sequence | 1.540542 / 1.492716 (0.047826) |
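
Each cell in these benchmark rows has the form `new / old (diff)`, and the listed figures are consistent with `diff = new - old` up to rounding in the sixth decimal. As a rough sketch of how such a row could be parsed and checked (the helper names `parse_cells` and `relative_change` are illustrative, not part of the benchmark tooling):

```python
import re

# One cell looks like "0.010336 / 0.011353 (-0.001017)".
CELL = re.compile(r"(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)")


def parse_cells(row: str):
    """Return a list of (new, old, diff) float triples found in a row."""
    return [tuple(float(g) for g in m.groups()) for m in CELL.finditer(row)]


def relative_change(new: float, old: float) -> float:
    """Change of `new` relative to `old`, as a fraction of `old`."""
    return (new - old) / old


# First three cells of the PyArrow==6.0.0 benchmark_array_xd.json row.
row = (
    "0.010336 / 0.011353 (-0.001017) "
    "0.006159 / 0.011008 (-0.004849) "
    "0.099798 / 0.038508 (0.061290)"
)
cells = parse_cells(row)

for new, old, diff in cells:
    # The printed diff should match new - old within rounding error.
    assert abs((new - old) - diff) < 1e-5
    print(f"{new:.6f} vs {old:.6f}: {relative_change(new, old):+.1%}")
```

Reading the change relative to `old` makes rows with very different magnitudes easier to compare than the absolute differences alone.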

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.314576 / 0.018006 (0.296570) |
| get_batch_of_1024_rows | 0.754708 / 0.000490 (0.754218) |
| get_first_row | 0.004536 / 0.000200 (0.004336) |
| get_last_row | 0.000235 / 0.000054 (0.000180) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.034462 / 0.037411 (-0.002950) |
| shard | 0.113430 / 0.014526 (0.098904) |
| shuffle | 0.132895 / 0.176557 (-0.043661) |
| sort | 0.179486 / 0.737135 (-0.557649) |
| train_test_split | 0.136934 / 0.296338 (-0.159404) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.393500 / 0.215209 (0.178291) |
| read 50000 | 3.926156 / 2.077655 (1.848502) |
| read_batch 50000 10 | 1.783734 / 1.504120 (0.279614) |
| read_batch 50000 100 | 1.597872 / 1.541195 (0.056678) |
| read_batch 50000 1000 | 1.726551 / 1.468490 (0.258061) |
| read_formatted numpy 5000 | 0.687755 / 4.584777 (-3.897022) |
| read_formatted pandas 5000 | 3.806071 / 3.745712 (0.060359) |
| read_formatted tensorflow 5000 | 4.123191 / 5.269862 (-1.146671) |
| read_formatted torch 5000 | 2.048353 / 4.565676 (-2.517323) |
| read_formatted_batch numpy 5000 10 | 0.083711 / 0.424275 (-0.340564) |
| read_formatted_batch numpy 5000 1000 | 0.012384 / 0.007607 (0.004776) |
| shuffled read 5000 | 0.498734 / 0.226044 (0.272690) |
| shuffled read 50000 | 4.961611 / 2.268929 (2.692683) |
| shuffled read_batch 50000 10 | 2.285352 / 55.444624 (-53.159273) |
| shuffled read_batch 50000 100 | 1.938114 / 6.876477 (-4.938363) |
| shuffled read_batch 50000 1000 | 2.225342 / 2.142072 (0.083269) |
| shuffled read_formatted numpy 5000 | 0.840904 / 4.805227 (-3.964323) |
| shuffled read_formatted_batch numpy 5000 10 | 0.167580 / 6.500664 (-6.333084) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.064180 / 0.075469 (-0.011289) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.526015 / 1.841788 (-0.315773) |
| map fast-tokenizer batched | 15.923362 / 8.074308 (7.849054) |
| map identity | 26.080619 / 10.191392 (15.889227) |
| map identity batched | 0.890602 / 0.680424 (0.210179) |
| map no-op batched | 0.578119 / 0.534201 (0.043918) |
| map no-op batched numpy | 0.441936 / 0.579283 (-0.137347) |
| map no-op batched pandas | 0.440066 / 0.434364 (0.005702) |
| map no-op batched pytorch | 0.286783 / 0.540337 (-0.253554) |
| map no-op batched tensorflow | 0.280697 / 1.386936 (-1.106239) |

PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.008702 / 0.011353 (-0.002651) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.006132 / 0.011008 (-0.004876) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.100296 / 0.038508 (0.061788) |
| read_batch_unformated after write_array2d | 0.037639 / 0.023109 (0.014529) |
| read_batch_unformated after write_flattened_sequence | 0.381832 / 0.275898 (0.105934) |
| read_batch_unformated after write_nested_sequence | 0.429287 / 0.323480 (0.105807) |
| read_col_formatted_as_numpy after write_array2d | 0.008158 / 0.007986 (0.000173) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.006184 / 0.004328 (0.001856) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.075983 / 0.004250 (0.071733) |
| read_col_unformated after write_array2d | 0.052959 / 0.037052 (0.015907) |
| read_col_unformated after write_flattened_sequence | 0.387726 / 0.258489 (0.129237) |
| read_col_unformated after write_nested_sequence | 0.453568 / 0.293841 (0.159727) |
| read_formatted_as_numpy after write_array2d | 0.039825 / 0.128546 (-0.088722) |
| read_formatted_as_numpy after write_flattened_sequence | 0.013473 / 0.075646 (-0.062174) |
| read_formatted_as_numpy after write_nested_sequence | 0.337153 / 0.419271 (-0.082118) |
| read_unformated after write_array2d | 0.060904 / 0.043533 (0.017371) |
| read_unformated after write_flattened_sequence | 0.372106 / 0.255139 (0.116967) |
| read_unformated after write_nested_sequence | 0.407117 / 0.283200 (0.123918) |
| write_array2d | 0.141493 / 0.141683 (-0.000190) |
| write_flattened_sequence | 1.556980 / 1.452155 (0.104825) |
| write_nested_sequence | 1.668023 / 1.492716 (0.175307) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.338603 / 0.018006 (0.320597) |
| get_batch_of_1024_rows | 0.750997 / 0.000490 (0.750508) |
| get_first_row | 0.005248 / 0.000200 (0.005048) |
| get_last_row | 0.000109 / 0.000054 (0.000054) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.034599 / 0.037411 (-0.002813) |
| shard | 0.113425 / 0.014526 (0.098899) |
| shuffle | 0.139529 / 0.176557 (-0.037028) |
| sort | 0.180962 / 0.737135 (-0.556174) |
| train_test_split | 0.142556 / 0.296338 (-0.153783) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.435923 / 0.215209 (0.220714) |
| read 50000 | 4.337905 / 2.077655 (2.260251) |
| read_batch 50000 10 | 2.171462 / 1.504120 (0.667342) |
| read_batch 50000 100 | 1.989323 / 1.541195 (0.448129) |
| read_batch 50000 1000 | 2.059853 / 1.468490 (0.591363) |
| read_formatted numpy 5000 | 0.710403 / 4.584777 (-3.874374) |
| read_formatted pandas 5000 | 3.818660 / 3.745712 (0.072948) |
| read_formatted tensorflow 5000 | 2.244438 / 5.269862 (-3.025423) |
| read_formatted torch 5000 | 1.394937 / 4.565676 (-3.170740) |
| read_formatted_batch numpy 5000 10 | 0.086515 / 0.424275 (-0.337760) |
| read_formatted_batch numpy 5000 1000 | 0.012285 / 0.007607 (0.004678) |
| shuffled read 5000 | 0.537832 / 0.226044 (0.311788) |
| shuffled read 50000 | 5.391222 / 2.268929 (3.122293) |
| shuffled read_batch 50000 10 | 2.684573 / 55.444624 (-52.760051) |
| shuffled read_batch 50000 100 | 2.329362 / 6.876477 (-4.547115) |
| shuffled read_batch 50000 1000 | 2.611297 / 2.142072 (0.469224) |
| shuffled read_formatted numpy 5000 | 0.858540 / 4.805227 (-3.946688) |
| shuffled read_formatted_batch numpy 5000 10 | 0.171914 / 6.500664 (-6.328750) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.065268 / 0.075469 (-0.010201) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.522850 / 1.841788 (-0.318937) |
| map fast-tokenizer batched | 16.775622 / 8.074308 (8.701313) |
| map identity | 12.401437 / 10.191392 (2.210045) |
| map identity batched | 0.934975 / 0.680424 (0.254551) |
| map no-op batched | 0.596320 / 0.534201 (0.062119) |
| map no-op batched numpy | 0.421365 / 0.579283 (-0.157918) |
| map no-op batched pandas | 0.423118 / 0.434364 (-0.011246) |
| map no-op batched pytorch | 0.247850 / 0.540337 (-0.292487) |
| map no-op batched tensorflow | 0.272727 / 1.386936 (-1.114209) |
