lucain's comment
lhoestq committed Sep 29, 2022
1 parent 136a09a commit 82554fb
Showing 1 changed file with 6 additions and 4 deletions.
10 changes: 6 additions & 4 deletions src/datasets/utils/_hf_hub_fixes.py
@@ -111,10 +111,11 @@ def dataset_info(
     use_auth_token: Optional[Union[bool, str]] = None,
 ) -> DatasetInfo:
     """
-    Get info on one specific dataset on huggingface.co.
-    Dataset can be private if you pass an acceptable token.
+    The huggingface_hub.HfApi.dataset_info parameters changed in 0.10.0 and some of them were deprecated.
+    This function checks the huggingface_hub version to call the right parameters.
+
     Args:
-        hf_api (`huggingface_hub.HfApi`): Hub client
+        hf_api (`huggingface_hub.HfApi`): Hub client
         repo_id (`str`):
             A namespace (user or an organization) and a repo name separated
             by a `/`.
@@ -164,7 +165,8 @@ def list_repo_files(
     timeout: Optional[float] = None,
 ) -> List[str]:
     """
-    Get the list of files in a given repo.
+    The huggingface_hub.HfApi.list_repo_files parameters changed in 0.10.0 and some of them were deprecated.
+    This function checks the huggingface_hub version to call the right parameters.
     """
     if version.parse(huggingface_hub.__version__) < version.parse("0.10.0"):
         return hf_api.list_repo_files(repo_id, revision=revision, repo_type=repo_type, token=token, timeout=timeout)
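The version gate described in the new docstrings can be sketched in isolation as follows. This is a minimal illustration, not the code from `_hf_hub_fixes.py`: `parse_release`, `RecordingClient`, and `list_repo_files_compat` are hypothetical names. `parse_release` is a simplified stand-in for `packaging.version.parse`, and the `use_auth_token` keyword in the newer branch is an assumption, since the diff only shows the branch for versions below 0.10.0.

```python
from itertools import takewhile
from typing import Any, Dict, List, Optional, Tuple


def parse_release(version_string: str) -> Tuple[int, ...]:
    """Simplified stand-in for packaging.version.parse.

    Keeps only the leading numeric release segments, so pre-release
    suffixes such as "rc1" or ".dev0" are ignored (unlike packaging).
    """
    numbers: List[int] = []
    for piece in version_string.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        numbers.append(int(digits))
        if digits != piece:
            # A suffix such as "rc1" ends the numeric release part.
            break
    # Pad so that "0.10" and "0.10.0" compare equal.
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


class RecordingClient:
    """Hypothetical client that records the keyword arguments it receives."""

    def __init__(self) -> None:
        self.last_kwargs: Dict[str, Any] = {}

    def list_repo_files(self, repo_id: str, **kwargs: Any) -> List[str]:
        self.last_kwargs = kwargs
        return []


def list_repo_files_compat(
    client: RecordingClient,
    repo_id: str,
    *,
    installed_version: str,
    token: Optional[str] = None,
) -> List[str]:
    """Call the client with the keyword that matches the installed version."""
    if parse_release(installed_version) < parse_release("0.10.0"):
        # Older releases: pass the credential as `token`, as in the diff above.
        return client.list_repo_files(repo_id, token=token)
    # Assumption: newer releases take the credential under `use_auth_token`.
    return client.list_repo_files(repo_id, use_auth_token=token)


client = RecordingClient()
list_repo_files_compat(client, "user/dataset", installed_version="0.9.1", token="secret")
print(client.last_kwargs)  # {'token': 'secret'}
```

Passing the installed version in as an argument, rather than reading a module attribute inside the function, keeps both branches easy to exercise in a test without installing two versions of the library.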

1 comment on commit 82554fb

@github-actions

Show benchmarks

PyArrow==6.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
| --- | --- |
| read_batch_formatted_as_numpy after write_array2d | 0.007956 / 0.011353 (-0.003397) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004046 / 0.011008 (-0.006962) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.030597 / 0.038508 (-0.007911) |
| read_batch_unformated after write_array2d | 0.035513 / 0.023109 (0.012404) |
| read_batch_unformated after write_flattened_sequence | 0.297088 / 0.275898 (0.021190) |
| read_batch_unformated after write_nested_sequence | 0.366172 / 0.323480 (0.042692) |
| read_col_formatted_as_numpy after write_array2d | 0.006107 / 0.007986 (-0.001878) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003633 / 0.004328 (-0.000696) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.007102 / 0.004250 (0.002852) |
| read_col_unformated after write_array2d | 0.048490 / 0.037052 (0.011438) |
| read_col_unformated after write_flattened_sequence | 0.307725 / 0.258489 (0.049236) |
| read_col_unformated after write_nested_sequence | 0.348017 / 0.293841 (0.054176) |
| read_formatted_as_numpy after write_array2d | 0.031087 / 0.128546 (-0.097459) |
| read_formatted_as_numpy after write_flattened_sequence | 0.009668 / 0.075646 (-0.065978) |
| read_formatted_as_numpy after write_nested_sequence | 0.265952 / 0.419271 (-0.153319) |
| read_unformated after write_array2d | 0.051282 / 0.043533 (0.007749) |
| read_unformated after write_flattened_sequence | 0.296371 / 0.255139 (0.041232) |
| read_unformated after write_nested_sequence | 0.318646 / 0.283200 (0.035447) |
| write_array2d | 0.109644 / 0.141683 (-0.032039) |
| write_flattened_sequence | 1.467956 / 1.452155 (0.015802) |
| write_nested_sequence | 1.491340 / 1.492716 (-0.001377) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
| --- | --- |
| get_batch_of_1024_random_rows | 0.286003 / 0.018006 (0.267997) |
| get_batch_of_1024_rows | 0.628553 / 0.000490 (0.628063) |
| get_first_row | 0.001282 / 0.000200 (0.001082) |
| get_last_row | 0.000108 / 0.000054 (0.000054) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
| --- | --- |
| select | 0.023018 / 0.037411 (-0.014394) |
| shard | 0.104153 / 0.014526 (0.089627) |
| shuffle | 0.114062 / 0.176557 (-0.062495) |
| sort | 0.152784 / 0.737135 (-0.584352) |
| train_test_split | 0.118304 / 0.296338 (-0.178035) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
| --- | --- |
| read 5000 | 0.397989 / 0.215209 (0.182780) |
| read 50000 | 3.971725 / 2.077655 (1.894071) |
| read_batch 50000 10 | 1.813339 / 1.504120 (0.309219) |
| read_batch 50000 100 | 1.624952 / 1.541195 (0.083757) |
| read_batch 50000 1000 | 1.685922 / 1.468490 (0.217431) |
| read_formatted numpy 5000 | 0.417372 / 4.584777 (-4.167405) |
| read_formatted pandas 5000 | 3.672589 / 3.745712 (-0.073124) |
| read_formatted tensorflow 5000 | 2.043461 / 5.269862 (-3.226401) |
| read_formatted torch 5000 | 1.376118 / 4.565676 (-3.189558) |
| read_formatted_batch numpy 5000 10 | 0.050446 / 0.424275 (-0.373829) |
| read_formatted_batch numpy 5000 1000 | 0.010897 / 0.007607 (0.003290) |
| shuffled read 5000 | 0.503157 / 0.226044 (0.277113) |
| shuffled read 50000 | 5.014016 / 2.268929 (2.745088) |
| shuffled read_batch 50000 10 | 2.286126 / 55.444624 (-53.158499) |
| shuffled read_batch 50000 100 | 1.930614 / 6.876477 (-4.945863) |
| shuffled read_batch 50000 1000 | 2.082430 / 2.142072 (-0.059643) |
| shuffled read_formatted numpy 5000 | 0.545462 / 4.805227 (-4.259765) |
| shuffled read_formatted_batch numpy 5000 10 | 0.117862 / 6.500664 (-6.382802) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.060592 / 0.075469 (-0.014877) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
| --- | --- |
| filter | 1.487483 / 1.841788 (-0.354304) |
| map fast-tokenizer batched | 14.313931 / 8.074308 (6.239622) |
| map identity | 25.566179 / 10.191392 (15.374787) |
| map identity batched | 0.840867 / 0.680424 (0.160443) |
| map no-op batched | 0.555116 / 0.534201 (0.020915) |
| map no-op batched numpy | 0.390299 / 0.579283 (-0.188984) |
| map no-op batched pandas | 0.441661 / 0.434364 (0.007297) |
| map no-op batched pytorch | 0.275077 / 0.540337 (-0.265261) |
| map no-op batched tensorflow | 0.281289 / 1.386936 (-1.105647) |

PyArrow==latest
Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
| --- | --- |
| read_batch_formatted_as_numpy after write_array2d | 0.006196 / 0.011353 (-0.005157) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004279 / 0.011008 (-0.006729) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.028134 / 0.038508 (-0.010375) |
| read_batch_unformated after write_array2d | 0.034148 / 0.023109 (0.011039) |
| read_batch_unformated after write_flattened_sequence | 0.378770 / 0.275898 (0.102872) |
| read_batch_unformated after write_nested_sequence | 0.451004 / 0.323480 (0.127524) |
| read_col_formatted_as_numpy after write_array2d | 0.004199 / 0.007986 (-0.003787) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005026 / 0.004328 (0.000698) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.005111 / 0.004250 (0.000860) |
| read_col_unformated after write_array2d | 0.042473 / 0.037052 (0.005421) |
| read_col_unformated after write_flattened_sequence | 0.381867 / 0.258489 (0.123378) |
| read_col_unformated after write_nested_sequence | 0.430145 / 0.293841 (0.136304) |
| read_formatted_as_numpy after write_array2d | 0.031011 / 0.128546 (-0.097535) |
| read_formatted_as_numpy after write_flattened_sequence | 0.009883 / 0.075646 (-0.065763) |
| read_formatted_as_numpy after write_nested_sequence | 0.258123 / 0.419271 (-0.161148) |
| read_unformated after write_array2d | 0.066289 / 0.043533 (0.022757) |
| read_unformated after write_flattened_sequence | 0.374110 / 0.255139 (0.118971) |
| read_unformated after write_nested_sequence | 0.394744 / 0.283200 (0.111544) |
| write_array2d | 0.115659 / 0.141683 (-0.026024) |
| write_flattened_sequence | 1.476616 / 1.452155 (0.024461) |
| write_nested_sequence | 1.537088 / 1.492716 (0.044372) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
| --- | --- |
| get_batch_of_1024_random_rows | 0.314258 / 0.018006 (0.296252) |
| get_batch_of_1024_rows | 0.521715 / 0.000490 (0.521225) |
| get_first_row | 0.003653 / 0.000200 (0.003453) |
| get_last_row | 0.000100 / 0.000054 (0.000045) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
| --- | --- |
| select | 0.024812 / 0.037411 (-0.012599) |
| shard | 0.103187 / 0.014526 (0.088661) |
| shuffle | 0.115937 / 0.176557 (-0.060620) |
| sort | 0.167195 / 0.737135 (-0.569941) |
| train_test_split | 0.121302 / 0.296338 (-0.175036) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
| --- | --- |
| read 5000 | 0.421377 / 0.215209 (0.206167) |
| read 50000 | 4.208804 / 2.077655 (2.131149) |
| read_batch 50000 10 | 1.997037 / 1.504120 (0.492917) |
| read_batch 50000 100 | 1.811746 / 1.541195 (0.270552) |
| read_batch 50000 1000 | 1.910478 / 1.468490 (0.441988) |
| read_formatted numpy 5000 | 0.421486 / 4.584777 (-4.163291) |
| read_formatted pandas 5000 | 3.774028 / 3.745712 (0.028316) |
| read_formatted tensorflow 5000 | 3.396143 / 5.269862 (-1.873719) |
| read_formatted torch 5000 | 1.381649 / 4.565676 (-3.184027) |
| read_formatted_batch numpy 5000 10 | 0.051394 / 0.424275 (-0.372881) |
| read_formatted_batch numpy 5000 1000 | 0.011382 / 0.007607 (0.003774) |
| shuffled read 5000 | 0.519081 / 0.226044 (0.293037) |
| shuffled read 50000 | 5.221945 / 2.268929 (2.953017) |
| shuffled read_batch 50000 10 | 2.462730 / 55.444624 (-52.981895) |
| shuffled read_batch 50000 100 | 2.119785 / 6.876477 (-4.756692) |
| shuffled read_batch 50000 1000 | 2.305585 / 2.142072 (0.163512) |
| shuffled read_formatted numpy 5000 | 0.529451 / 4.805227 (-4.275776) |
| shuffled read_formatted_batch numpy 5000 10 | 0.120795 / 6.500664 (-6.379869) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.062546 / 0.075469 (-0.012923) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
| --- | --- |
| filter | 1.535732 / 1.841788 (-0.306056) |
| map fast-tokenizer batched | 14.172269 / 8.074308 (6.097961) |
| map identity | 25.258830 / 10.191392 (15.067438) |
| map identity batched | 0.919592 / 0.680424 (0.239169) |
| map no-op batched | 0.620956 / 0.534201 (0.086755) |
| map no-op batched numpy | 0.387869 / 0.579283 (-0.191414) |
| map no-op batched pandas | 0.429871 / 0.434364 (-0.004493) |
| map no-op batched pytorch | 0.267752 / 0.540337 (-0.272586) |
| map no-op batched tensorflow | 0.277576 / 1.386936 (-1.109360) |
