datasets.filesystems: fix is_remote_filesystems #6334

ap-- · 2023-10-23T09:17:54Z

Close #6330, close #6333.

`fsspec.implementations.local.LocalFileSystem.protocol`
was changed from the `str` `"file"` to the `tuple[str, ...]` `("file", "local")` in `fsspec>=2023.10.0`.

This commit supports both styles.
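A minimal sketch of how a check can accept both protocol styles, assuming only that a filesystem object exposes a `protocol` attribute that is either a string or a tuple of strings. The names `normalize_protocol`, `is_remote_filesystem_sketch`, and `FakeFS` are hypothetical illustrations and are not taken from the actual patch:

```python
from typing import Tuple, Union

ProtocolLike = Union[str, Tuple[str, ...]]


def normalize_protocol(protocol: ProtocolLike) -> Tuple[str, ...]:
    """Return the protocol as a tuple, whichever style was given."""
    # Older fsspec releases expose a plain string such as "file";
    # fsspec>=2023.10.0 exposes a tuple such as ("file", "local").
    if isinstance(protocol, str):
        return (protocol,)
    return tuple(protocol)


def is_remote_filesystem_sketch(fs) -> bool:
    """Treat a filesystem as remote unless "file" is among its protocols."""
    return "file" not in normalize_protocol(fs.protocol)


class FakeFS:
    """Hypothetical stand-in for a filesystem object, used only for illustration."""

    def __init__(self, protocol: ProtocolLike) -> None:
        self.protocol = protocol


old_style_local = FakeFS("file")             # str protocol
new_style_local = FakeFS(("file", "local"))  # tuple protocol
remote = FakeFS(("s3", "s3a"))

print(is_remote_filesystem_sketch(old_style_local))
print(is_remote_filesystem_sketch(new_style_local))
print(is_remote_filesystem_sketch(remote))
```

Normalizing to a tuple first keeps the membership test the same for old and new fsspec releases, so no version check is needed.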

HuggingFaceDocBuilderDev · 2023-10-23T09:26:36Z

The documentation is not available anymore as the PR was closed or merged.

lhoestq

Thanks !

github-actions · 2023-10-23T10:22:51Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006648 / 0.011353 (-0.004705)	0.004104 / 0.011008 (-0.006904)	0.084718 / 0.038508 (0.046210)	0.075342 / 0.023109 (0.052232)	0.332624 / 0.275898 (0.056726)	0.376758 / 0.323480 (0.053278)	0.005371 / 0.007986 (-0.002614)	0.003317 / 0.004328 (-0.001011)	0.065153 / 0.004250 (0.060902)	0.055270 / 0.037052 (0.018218)	0.342410 / 0.258489 (0.083920)	0.397484 / 0.293841 (0.103643)	0.031168 / 0.128546 (-0.097379)	0.008545 / 0.075646 (-0.067101)	0.297641 / 0.419271 (-0.121631)	0.052404 / 0.043533 (0.008871)	0.327633 / 0.255139 (0.072494)	0.362177 / 0.283200 (0.078977)	0.025056 / 0.141683 (-0.116627)	1.459023 / 1.452155 (0.006868)	1.529651 / 1.492716 (0.036935)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.242838 / 0.018006 (0.224832)	0.451007 / 0.000490 (0.450517)	0.013732 / 0.000200 (0.013532)	0.000345 / 0.000054 (0.000290)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.028068 / 0.037411 (-0.009343)	0.081970 / 0.014526 (0.067444)	0.096148 / 0.176557 (-0.080409)	0.151758 / 0.737135 (-0.585377)	0.095617 / 0.296338 (-0.200721)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.389188 / 0.215209 (0.173979)	3.867506 / 2.077655 (1.789852)	1.941912 / 1.504120 (0.437792)	1.759270 / 1.541195 (0.218076)	1.774714 / 1.468490 (0.306224)	0.476587 / 4.584777 (-4.108190)	3.539342 / 3.745712 (-0.206370)	3.434389 / 5.269862 (-1.835472)	2.047581 / 4.565676 (-2.518096)	0.056322 / 0.424275 (-0.367954)	0.007286 / 0.007607 (-0.000321)	0.461826 / 0.226044 (0.235781)	4.604179 / 2.268929 (2.335251)	2.405267 / 55.444624 (-53.039357)	2.133998 / 6.876477 (-4.742479)	2.187724 / 2.142072 (0.045652)	0.566578 / 4.805227 (-4.238650)	0.130007 / 6.500664 (-6.370657)	0.059685 / 0.075469 (-0.015784)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.256204 / 1.841788 (-0.585584)	18.829475 / 8.074308 (10.755167)	13.937879 / 10.191392 (3.746487)	0.163948 / 0.680424 (-0.516475)	0.018118 / 0.534201 (-0.516083)	0.389369 / 0.579283 (-0.189914)	0.399988 / 0.434364 (-0.034376)	0.459504 / 0.540337 (-0.080834)	0.674696 / 1.386936 (-0.712240)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006806 / 0.011353 (-0.004547)	0.004103 / 0.011008 (-0.006905)	0.064477 / 0.038508 (0.025969)	0.079514 / 0.023109 (0.056405)	0.391657 / 0.275898 (0.115759)	0.422997 / 0.323480 (0.099517)	0.005485 / 0.007986 (-0.002501)	0.003461 / 0.004328 (-0.000868)	0.064621 / 0.004250 (0.060371)	0.057686 / 0.037052 (0.020633)	0.396885 / 0.258489 (0.138396)	0.431508 / 0.293841 (0.137667)	0.032305 / 0.128546 (-0.096241)	0.008617 / 0.075646 (-0.067030)	0.071577 / 0.419271 (-0.347694)	0.047769 / 0.043533 (0.004236)	0.394037 / 0.255139 (0.138898)	0.412593 / 0.283200 (0.129393)	0.023800 / 0.141683 (-0.117883)	1.479114 / 1.452155 (0.026959)	1.562422 / 1.492716 (0.069706)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.229822 / 0.018006 (0.211816)	0.452465 / 0.000490 (0.451975)	0.005877 / 0.000200 (0.005677)	0.000097 / 0.000054 (0.000043)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.033528 / 0.037411 (-0.003884)	0.091819 / 0.014526 (0.077294)	0.106188 / 0.176557 (-0.070368)	0.159480 / 0.737135 (-0.577655)	0.106326 / 0.296338 (-0.190013)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.427396 / 0.215209 (0.212187)	4.275196 / 2.077655 (2.197541)	2.287446 / 1.504120 (0.783326)	2.137089 / 1.541195 (0.595894)	2.198439 / 1.468490 (0.729949)	0.491006 / 4.584777 (-4.093771)	3.531067 / 3.745712 (-0.214645)	3.264357 / 5.269862 (-2.005505)	2.047760 / 4.565676 (-2.517916)	0.057982 / 0.424275 (-0.366293)	0.007278 / 0.007607 (-0.000329)	0.507471 / 0.226044 (0.281426)	5.073901 / 2.268929 (2.804973)	2.781799 / 55.444624 (-52.662825)	2.410759 / 6.876477 (-4.465718)	2.623331 / 2.142072 (0.481258)	0.601601 / 4.805227 (-4.203626)	0.131461 / 6.500664 (-6.369204)	0.060045 / 0.075469 (-0.015424)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.372946 / 1.841788 (-0.468842)	19.560818 / 8.074308 (11.486509)	14.388468 / 10.191392 (4.197076)	0.177310 / 0.680424 (-0.503114)	0.020233 / 0.534201 (-0.513967)	0.395938 / 0.579283 (-0.183345)	0.418336 / 0.434364 (-0.016028)	0.471731 / 0.540337 (-0.068607)	0.684679 / 1.386936 (-0.702257)

lhoestq · 2023-10-23T17:30:54Z

We did a patch release containing your fix @ap-- !

datasets.filesystems: fix is_remote_filesystems

8dbcb91

Close huggingface#6330 `fsspec.implementations.LocalFilesystem.protocol` was changed from `str` "file" to `tuple[str,...]` ("file", "local") in `fsspec>=2023.10.0` This commit supports both styles.

ap-- force-pushed the fsspec-local-compat branch from 02e9e16 to 8dbcb91 · October 23, 2023 09:20

lhoestq mentioned this pull request Oct 23, 2023

Support fsspec 2023.10.0 #6335

Closed

lhoestq approved these changes Oct 23, 2023

lhoestq merged commit 4bedb7d into huggingface:main Oct 23, 2023
11 of 12 checks passed

Wauplin mentioned this pull request Oct 23, 2023

Limit to inferior fsspec version huggingface/huggingface_hub#1773

Closed

albertvillanova mentioned this pull request Nov 7, 2023

Error loading wikitext data raise NotImplementedError(f"Loading a dataset cached in a {type(self._fs).__name__} is not supported.") #6352

Closed

anine09 mentioned this pull request Dec 18, 2023

Question about an error when fine-tuning huanhuan-chat KMnO4-zx/huanhuan-chat#14

Closed

tomscholz mentioned this pull request Feb 7, 2024

Support fsspec 2023.10.0 #6333

Closed
