Remove unnecessary condition in csv module
albertvillanova committed Dec 14, 2021
1 parent 5a32cf7 commit 88b1c5d
Showing 1 changed file with 2 additions and 4 deletions.
6 changes: 2 additions & 4 deletions src/datasets/packaged_modules/csv/csv.py
@@ -139,15 +139,13 @@ def _split_generators(self, dl_manager):
             files = data_files
             if isinstance(files, str):
                 files = [files]
-            if any(os.path.isdir(file) for file in files):
-                files = [file for file in dl_manager.iter_files(files)]
+            files = [file for file in dl_manager.iter_files(files)]
             return [datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"files": files})]
         splits = []
         for split_name, files in data_files.items():
             if isinstance(files, str):
                 files = [files]
-            if any(os.path.isdir(file) for file in files):
-                files = [file for file in dl_manager.iter_files(files)]
+            files = [file for file in dl_manager.iter_files(files)]
             splits.append(datasets.SplitGenerator(name=split_name, gen_kwargs={"files": files}))
         return splits
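
The change above drops the `os.path.isdir` guard and always passes the paths through `dl_manager.iter_files`. A minimal sketch of why the guard can be redundant follows. It assumes that `iter_files` yields plain file paths unchanged and walks directories recursively; `iter_files_sketch`, `resolve_old`, and `resolve_new` are hypothetical stand-ins written for this illustration, not the library's implementation.

```python
import os
import tempfile


def iter_files_sketch(paths):
    """Hypothetical stand-in for dl_manager.iter_files (assumed behavior):
    yield file paths as-is and expand directories into the files they contain."""
    if isinstance(paths, str):
        paths = [paths]
    for path in paths:
        if os.path.isfile(path):
            yield path
        else:
            for dirpath, _, filenames in os.walk(path):
                for filename in sorted(filenames):
                    yield os.path.join(dirpath, filename)


def resolve_old(files):
    # Logic before the commit: only expand when at least one entry is a directory.
    if any(os.path.isdir(f) for f in files):
        files = [f for f in iter_files_sketch(files)]
    return files


def resolve_new(files):
    # Logic after the commit: always expand.
    return [f for f in iter_files_sketch(files)]


with tempfile.TemporaryDirectory() as tmp:
    single = os.path.join(tmp, "a.csv")
    open(single, "w").close()
    folder = os.path.join(tmp, "folder")
    os.mkdir(folder)
    nested = os.path.join(folder, "b.csv")
    open(nested, "w").close()

    # Both code paths agree for a list of plain files and for a mixed list.
    only_files = resolve_new([single]) == resolve_old([single])
    with_dir = resolve_new([single, folder]) == resolve_old([single, folder])
    expanded = resolve_new([single, folder])
```

Under that assumption, a list that contains only existing files passes through unchanged, so the unconditional call gives the same result as the guarded one.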


1 comment on commit 88b1c5d

@github-actions

Show benchmarks

PyArrow==3.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.011734 / 0.011353 (0.000381) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.005036 / 0.011008 (-0.005972) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.038822 / 0.038508 (0.000314) |
| read_batch_unformated after write_array2d | 0.041606 / 0.023109 (0.018497) |
| read_batch_unformated after write_flattened_sequence | 0.411094 / 0.275898 (0.135196) |
| read_batch_unformated after write_nested_sequence | 0.434435 / 0.323480 (0.110955) |
| read_col_formatted_as_numpy after write_array2d | 0.009873 / 0.007986 (0.001887) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.006813 / 0.004328 (0.002485) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.011295 / 0.004250 (0.007044) |
| read_col_unformated after write_array2d | 0.053291 / 0.037052 (0.016239) |
| read_col_unformated after write_flattened_sequence | 0.401438 / 0.258489 (0.142949) |
| read_col_unformated after write_nested_sequence | 0.439770 / 0.293841 (0.145929) |
| read_formatted_as_numpy after write_array2d | 0.036646 / 0.128546 (-0.091900) |
| read_formatted_as_numpy after write_flattened_sequence | 0.010699 / 0.075646 (-0.064947) |
| read_formatted_as_numpy after write_nested_sequence | 0.312099 / 0.419271 (-0.107172) |
| read_unformated after write_array2d | 0.058888 / 0.043533 (0.015355) |
| read_unformated after write_flattened_sequence | 0.399660 / 0.255139 (0.144521) |
| read_unformated after write_nested_sequence | 0.451313 / 0.283200 (0.168113) |
| write_array2d | 0.100267 / 0.141683 (-0.041416) |
| write_flattened_sequence | 2.108117 / 1.452155 (0.655963) |
| write_nested_sequence | 2.214008 / 1.492716 (0.721292) |
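
Each benchmark cell is laid out as `new / old (diff)`. The sketch below parses one cell and checks that the reported diff matches `new - old`. It assumes the diff is computed that way, which agrees with the numbers in this comment up to a rounding difference in the last printed digit; `parse_cell` and `is_consistent` are hypothetical helper names introduced for this illustration.

```python
import re

# One cell looks like "0.011734 / 0.011353 (0.000381)".
CELL = re.compile(r"(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)")


def parse_cell(cell):
    """Split a 'new / old (diff)' cell into three floats."""
    match = CELL.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unrecognized benchmark cell: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


def is_consistent(cell, tol=2e-6):
    """Check diff == new - old, allowing for rounding of the printed values."""
    new, old, diff = parse_cell(cell)
    return abs((new - old) - diff) <= tol


first_cell_ok = is_consistent("0.011734 / 0.011353 (0.000381)")
```

A positive diff means the new timing is larger than the old one for that metric.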

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.309060 / 0.018006 (0.291053) |
| get_batch_of_1024_rows | 0.537916 / 0.000490 (0.537426) |
| get_first_row | 0.003882 / 0.000200 (0.003682) |
| get_last_row | 0.000140 / 0.000054 (0.000086) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.043702 / 0.037411 (0.006291) |
| shard | 0.025672 / 0.014526 (0.011146) |
| shuffle | 0.039521 / 0.176557 (-0.137036) |
| sort | 0.085418 / 0.737135 (-0.651717) |
| train_test_split | 0.040565 / 0.296338 (-0.255774) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.516386 / 0.215209 (0.301177) |
| read 50000 | 5.198324 / 2.077655 (3.120669) |
| read_batch 50000 10 | 2.383135 / 1.504120 (0.879015) |
| read_batch 50000 100 | 2.155045 / 1.541195 (0.613850) |
| read_batch 50000 1000 | 2.210662 / 1.468490 (0.742172) |
| read_formatted numpy 5000 | 0.531832 / 4.584777 (-4.052945) |
| read_formatted pandas 5000 | 5.741158 / 3.745712 (1.995446) |
| read_formatted tensorflow 5000 | 4.208295 / 5.269862 (-1.061567) |
| read_formatted torch 5000 | 1.080032 / 4.565676 (-3.485644) |
| read_formatted_batch numpy 5000 10 | 0.064475 / 0.424275 (-0.359800) |
| read_formatted_batch numpy 5000 1000 | 0.014542 / 0.007607 (0.006935) |
| shuffled read 5000 | 0.647845 / 0.226044 (0.421801) |
| shuffled read 50000 | 6.538639 / 2.268929 (4.269710) |
| shuffled read_batch 50000 10 | 3.007478 / 55.444624 (-52.437146) |
| shuffled read_batch 50000 100 | 2.441712 / 6.876477 (-4.434765) |
| shuffled read_batch 50000 1000 | 2.463618 / 2.142072 (0.321545) |
| shuffled read_formatted numpy 5000 | 0.668499 / 4.805227 (-4.136728) |
| shuffled read_formatted_batch numpy 5000 10 | 0.147373 / 6.500664 (-6.353292) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.073710 / 0.075469 (-0.001759) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.883932 / 1.841788 (0.042145) |
| map fast-tokenizer batched | 14.632109 / 8.074308 (6.557801) |
| map identity | 31.195233 / 10.191392 (21.003841) |
| map identity batched | 0.866089 / 0.680424 (0.185665) |
| map no-op batched | 0.630523 / 0.534201 (0.096322) |
| map no-op batched numpy | 0.593512 / 0.579283 (0.014229) |
| map no-op batched pandas | 0.624010 / 0.434364 (0.189646) |
| map no-op batched pytorch | 0.376855 / 0.540337 (-0.163483) |
| map no-op batched tensorflow | 0.387812 / 1.386936 (-0.999124) |

PyArrow==latest
Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.009665 / 0.011353 (-0.001687) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004559 / 0.011008 (-0.006449) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.034920 / 0.038508 (-0.003588) |
| read_batch_unformated after write_array2d | 0.039719 / 0.023109 (0.016610) |
| read_batch_unformated after write_flattened_sequence | 0.366152 / 0.275898 (0.090254) |
| read_batch_unformated after write_nested_sequence | 0.381764 / 0.323480 (0.058284) |
| read_col_formatted_as_numpy after write_array2d | 0.007589 / 0.007986 (-0.000397) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005572 / 0.004328 (0.001243) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.008750 / 0.004250 (0.004499) |
| read_col_unformated after write_array2d | 0.048986 / 0.037052 (0.011934) |
| read_col_unformated after write_flattened_sequence | 0.360835 / 0.258489 (0.102346) |
| read_col_unformated after write_nested_sequence | 0.390626 / 0.293841 (0.096786) |
| read_formatted_as_numpy after write_array2d | 0.035241 / 0.128546 (-0.093305) |
| read_formatted_as_numpy after write_flattened_sequence | 0.010508 / 0.075646 (-0.065138) |
| read_formatted_as_numpy after write_nested_sequence | 0.301476 / 0.419271 (-0.117796) |
| read_unformated after write_array2d | 0.057652 / 0.043533 (0.014119) |
| read_unformated after write_flattened_sequence | 0.352323 / 0.255139 (0.097184) |
| read_unformated after write_nested_sequence | 0.380280 / 0.283200 (0.097081) |
| write_array2d | 0.093966 / 0.141683 (-0.047716) |
| write_flattened_sequence | 2.085569 / 1.452155 (0.633415) |
| write_nested_sequence | 2.169071 / 1.492716 (0.676355) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.315898 / 0.018006 (0.297892) |
| get_batch_of_1024_rows | 0.541015 / 0.000490 (0.540525) |
| get_first_row | 0.007338 / 0.000200 (0.007138) |
| get_last_row | 0.000313 / 0.000054 (0.000259) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.040231 / 0.037411 (0.002819) |
| shard | 0.025254 / 0.014526 (0.010728) |
| shuffle | 0.033276 / 0.176557 (-0.143280) |
| sort | 0.088914 / 0.737135 (-0.648222) |
| train_test_split | 0.036414 / 0.296338 (-0.259924) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.495980 / 0.215209 (0.280771) |
| read 50000 | 4.986007 / 2.077655 (2.908352) |
| read_batch 50000 10 | 2.116416 / 1.504120 (0.612297) |
| read_batch 50000 100 | 1.866997 / 1.541195 (0.325802) |
| read_batch 50000 1000 | 1.961187 / 1.468490 (0.492697) |
| read_formatted numpy 5000 | 0.523904 / 4.584777 (-4.060873) |
| read_formatted pandas 5000 | 5.741138 / 3.745712 (1.995426) |
| read_formatted tensorflow 5000 | 2.607832 / 5.269862 (-2.662029) |
| read_formatted torch 5000 | 1.078122 / 4.565676 (-3.487555) |
| read_formatted_batch numpy 5000 10 | 0.063914 / 0.424275 (-0.360361) |
| read_formatted_batch numpy 5000 1000 | 0.014756 / 0.007607 (0.007149) |
| shuffled read 5000 | 0.630092 / 0.226044 (0.404048) |
| shuffled read 50000 | 6.286572 / 2.268929 (4.017643) |
| shuffled read_batch 50000 10 | 2.676798 / 55.444624 (-52.767826) |
| shuffled read_batch 50000 100 | 2.221940 / 6.876477 (-4.654537) |
| shuffled read_batch 50000 1000 | 2.285439 / 2.142072 (0.143367) |
| shuffled read_formatted numpy 5000 | 0.664837 / 4.805227 (-4.140390) |
| shuffled read_formatted_batch numpy 5000 10 | 0.144949 / 6.500664 (-6.355716) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.071889 / 0.075469 (-0.003581) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.851312 / 1.841788 (0.009524) |
| map fast-tokenizer batched | 14.206022 / 8.074308 (6.131714) |
| map identity | 32.029078 / 10.191392 (21.837686) |
| map identity batched | 0.898752 / 0.680424 (0.218328) |
| map no-op batched | 0.627135 / 0.534201 (0.092934) |
| map no-op batched numpy | 0.585815 / 0.579283 (0.006532) |
| map no-op batched pandas | 0.620676 / 0.434364 (0.186312) |
| map no-op batched pytorch | 0.376380 / 0.540337 (-0.163957) |
| map no-op batched tensorflow | 0.389399 / 1.386936 (-0.997537) |
