Commit

save_to_disk note in docs

lhoestq committed May 7, 2021
1 parent f72b88d commit 986899e
Showing 1 changed file with 12 additions and 0 deletions.

12 changes: 12 additions & 0 deletions src/datasets/arrow_dataset.py
@@ -551,6 +551,18 @@ def save_to_disk(self, dataset_path: str, fs=None):
Saves a dataset to a dataset directory, or in a filesystem using either :class:`~filesystems.S3FileSystem` or
any implementation of ``fsspec.spec.AbstractFileSystem``.

Note regarding sliced datasets:

If you sliced the dataset in some way (using :func:`shard`, :func:`train_test_split` or :func:`select` for example), then an indices mapping
is added to avoid having to rewrite a new arrow Table (saving time and disk/memory usage).
It maps the indices used by ``__getitem__`` to the right rows of the arrow Table.
By default, :func:`save_to_disk` saves the full dataset table plus the mapping.

If you want to save only the shard of the dataset instead of the original arrow file and the indices,
then you have to call :func:`datasets.Dataset.flatten_indices` before saving.
This will create a new arrow Table using only the right rows of the original table.

Args:
dataset_path (:obj:`str`): Path (e.g. ``dataset/train``) or remote URI (e.g. ``s3://my-bucket/dataset/train``)
of the dataset directory where the dataset will be saved to.
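The indices-mapping behaviour described in the note can be sketched with a toy model in plain Python. This is a hypothetical stand-in, not the real `datasets`/Arrow internals: `ToyDataset`, its list-based "table", and its methods are illustrative only, mirroring how `select` records a mapping while `flatten_indices` materializes a new table.

```python
class ToyDataset:
    """Toy model of a dataset backed by a full table plus an optional indices mapping."""

    def __init__(self, table, indices=None):
        self.table = table      # the full underlying table (here just a list)
        self.indices = indices  # maps dataset rows -> rows of the full table, or None

    def __getitem__(self, i):
        # With a mapping, __getitem__ looks up the right row of the full table.
        if self.indices is not None:
            return self.table[self.indices[i]]
        return self.table[i]

    def select(self, idx):
        # Slicing only records an indices mapping; the table is NOT rewritten,
        # so saving this object would store the full table plus the mapping.
        return ToyDataset(self.table, list(idx))

    def flatten_indices(self):
        # Materialize only the selected rows into a new table and drop the
        # mapping, analogous to datasets.Dataset.flatten_indices.
        if self.indices is None:
            return self
        return ToyDataset([self.table[i] for i in self.indices])


full = ToyDataset(list(range(10)))
shard = full.select([2, 4, 6])   # still holds all 10 rows + a 3-entry mapping
flat = shard.flatten_indices()   # new 3-row table, no mapping
```

In this sketch, saving `shard` would persist all ten rows plus the mapping, while saving `flat` would persist only the three selected rows, which is the point of calling `flatten_indices` before `save_to_disk`.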

1 comment on commit 986899e

@github-actions

PyArrow==1.0.0


Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.023162 / 0.011353 (0.011809) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.015606 / 0.011008 (0.004598) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.053877 / 0.038508 (0.015369) |
| read_batch_unformated after write_array2d | 0.043469 / 0.023109 (0.020360) |
| read_batch_unformated after write_flattened_sequence | 0.377292 / 0.275898 (0.101394) |
| read_batch_unformated after write_nested_sequence | 0.395968 / 0.323480 (0.072488) |
| read_col_formatted_as_numpy after write_array2d | 0.011656 / 0.007986 (0.003671) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005112 / 0.004328 (0.000783) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.011726 / 0.004250 (0.007476) |
| read_col_unformated after write_array2d | 0.057702 / 0.037052 (0.020650) |
| read_col_unformated after write_flattened_sequence | 0.364382 / 0.258489 (0.105892) |
| read_col_unformated after write_nested_sequence | 0.413429 / 0.293841 (0.119588) |
| read_formatted_as_numpy after write_array2d | 0.166645 / 0.128546 (0.038099) |
| read_formatted_as_numpy after write_flattened_sequence | 0.118695 / 0.075646 (0.043049) |
| read_formatted_as_numpy after write_nested_sequence | 0.432385 / 0.419271 (0.013114) |
| read_unformated after write_array2d | 0.396252 / 0.043533 (0.352719) |
| read_unformated after write_flattened_sequence | 0.377944 / 0.255139 (0.122805) |
| read_unformated after write_nested_sequence | 0.390155 / 0.283200 (0.106955) |
| write_array2d | 1.619336 / 0.141683 (1.477654) |
| write_flattened_sequence | 1.860348 / 1.452155 (0.408194) |
| write_nested_sequence | 1.943134 / 1.492716 (0.450418) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.007890 / 0.018006 (-0.010117) |
| get_batch_of_1024_rows | 0.461877 / 0.000490 (0.461388) |
| get_first_row | 0.000261 / 0.000200 (0.000061) |
| get_last_row | 0.000050 / 0.000054 (-0.000005) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.046147 / 0.037411 (0.008736) |
| shard | 0.025201 / 0.014526 (0.010675) |
| shuffle | 0.034777 / 0.176557 (-0.141780) |
| sort | 0.053382 / 0.737135 (-0.683754) |
| train_test_split | 0.030720 / 0.296338 (-0.265619) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.410970 / 0.215209 (0.195761) |
| read 50000 | 4.141559 / 2.077655 (2.063905) |
| read_batch 50000 10 | 2.144003 / 1.504120 (0.639883) |
| read_batch 50000 100 | 1.938338 / 1.541195 (0.397143) |
| read_batch 50000 1000 | 1.976579 / 1.468490 (0.508089) |
| read_formatted numpy 5000 | 6.707909 / 4.584777 (2.123132) |
| read_formatted pandas 5000 | 5.922603 / 3.745712 (2.176891) |
| read_formatted tensorflow 5000 | 8.369964 / 5.269862 (3.100102) |
| read_formatted torch 5000 | 7.266887 / 4.565676 (2.701211) |
| read_formatted_batch numpy 5000 10 | 0.670373 / 0.424275 (0.246098) |
| read_formatted_batch numpy 5000 1000 | 0.010828 / 0.007607 (0.003220) |
| shuffled read 5000 | 0.539119 / 0.226044 (0.313074) |
| shuffled read 50000 | 5.377854 / 2.268929 (3.108926) |
| shuffled read_batch 50000 10 | 2.587194 / 55.444624 (-52.857431) |
| shuffled read_batch 50000 100 | 2.196341 / 6.876477 (-4.680135) |
| shuffled read_batch 50000 1000 | 2.239568 / 2.142072 (0.097496) |
| shuffled read_formatted numpy 5000 | 6.786019 / 4.805227 (1.980791) |
| shuffled read_formatted_batch numpy 5000 10 | 3.780940 / 6.500664 (-2.719724) |
| shuffled read_formatted_batch numpy 5000 1000 | 5.829370 / 0.075469 (5.753901) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 10.368326 / 1.841788 (8.526539) |
| map fast-tokenizer batched | 13.474037 / 8.074308 (5.399729) |
| map identity | 30.372457 / 10.191392 (20.181065) |
| map identity batched | 0.839509 / 0.680424 (0.159085) |
| map no-op batched | 0.622895 / 0.534201 (0.088694) |
| map no-op batched numpy | 0.756864 / 0.579283 (0.177581) |
| map no-op batched pandas | 0.594463 / 0.434364 (0.160099) |
| map no-op batched pytorch | 0.690657 / 0.540337 (0.150320) |
| map no-op batched tensorflow | 1.532802 / 1.386936 (0.145866) |
PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.022718 / 0.011353 (0.011365) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.014501 / 0.011008 (0.003492) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.052568 / 0.038508 (0.014060) |
| read_batch_unformated after write_array2d | 0.041178 / 0.023109 (0.018068) |
| read_batch_unformated after write_flattened_sequence | 0.316541 / 0.275898 (0.040643) |
| read_batch_unformated after write_nested_sequence | 0.380009 / 0.323480 (0.056529) |
| read_col_formatted_as_numpy after write_array2d | 0.012095 / 0.007986 (0.004109) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005205 / 0.004328 (0.000876) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.013146 / 0.004250 (0.008895) |
| read_col_unformated after write_array2d | 0.059177 / 0.037052 (0.022124) |
| read_col_unformated after write_flattened_sequence | 0.347324 / 0.258489 (0.088835) |
| read_col_unformated after write_nested_sequence | 0.383088 / 0.293841 (0.089247) |
| read_formatted_as_numpy after write_array2d | 0.161188 / 0.128546 (0.032642) |
| read_formatted_as_numpy after write_flattened_sequence | 0.112994 / 0.075646 (0.037347) |
| read_formatted_as_numpy after write_nested_sequence | 0.439307 / 0.419271 (0.020036) |
| read_unformated after write_array2d | 0.590221 / 0.043533 (0.546688) |
| read_unformated after write_flattened_sequence | 0.321151 / 0.255139 (0.066012) |
| read_unformated after write_nested_sequence | 0.378160 / 0.283200 (0.094961) |
| write_array2d | 3.539116 / 0.141683 (3.397433) |
| write_flattened_sequence | 1.839280 / 1.452155 (0.387125) |
| write_nested_sequence | 1.900426 / 1.492716 (0.407709) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.008211 / 0.018006 (-0.009795) |
| get_batch_of_1024_rows | 0.466844 / 0.000490 (0.466355) |
| get_first_row | 0.000359 / 0.000200 (0.000159) |
| get_last_row | 0.000050 / 0.000054 (-0.000005) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.042893 / 0.037411 (0.005481) |
| shard | 0.026474 / 0.014526 (0.011948) |
| shuffle | 0.029204 / 0.176557 (-0.147352) |
| sort | 0.049679 / 0.737135 (-0.687456) |
| train_test_split | 0.030088 / 0.296338 (-0.266250) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.419312 / 0.215209 (0.204103) |
| read 50000 | 4.123229 / 2.077655 (2.045574) |
| read_batch 50000 10 | 2.122243 / 1.504120 (0.618123) |
| read_batch 50000 100 | 1.962395 / 1.541195 (0.421200) |
| read_batch 50000 1000 | 1.998624 / 1.468490 (0.530133) |
| read_formatted numpy 5000 | 6.433827 / 4.584777 (1.849050) |
| read_formatted pandas 5000 | 5.471348 / 3.745712 (1.725636) |
| read_formatted tensorflow 5000 | 8.224529 / 5.269862 (2.954668) |
| read_formatted torch 5000 | 7.274890 / 4.565676 (2.709213) |
| read_formatted_batch numpy 5000 10 | 0.638836 / 0.424275 (0.214561) |
| read_formatted_batch numpy 5000 1000 | 0.010674 / 0.007607 (0.003067) |
| shuffled read 5000 | 0.537315 / 0.226044 (0.311271) |
| shuffled read 50000 | 5.358118 / 2.268929 (3.089190) |
| shuffled read_batch 50000 10 | 2.670066 / 55.444624 (-52.774558) |
| shuffled read_batch 50000 100 | 2.254971 / 6.876477 (-4.621506) |
| shuffled read_batch 50000 1000 | 2.345918 / 2.142072 (0.203846) |
| shuffled read_formatted numpy 5000 | 6.575333 / 4.805227 (1.770105) |
| shuffled read_formatted_batch numpy 5000 10 | 3.757150 / 6.500664 (-2.743514) |
| shuffled read_formatted_batch numpy 5000 1000 | 4.354212 / 0.075469 (4.278743) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 10.564374 / 1.841788 (8.722586) |
| map fast-tokenizer batched | 13.415969 / 8.074308 (5.341661) |
| map identity | 29.547836 / 10.191392 (19.356444) |
| map identity batched | 0.925183 / 0.680424 (0.244759) |
| map no-op batched | 0.651351 / 0.534201 (0.117150) |
| map no-op batched numpy | 0.729825 / 0.579283 (0.150542) |
| map no-op batched pandas | 0.551268 / 0.434364 (0.116904) |
| map no-op batched pytorch | 0.651677 / 0.540337 (0.111339) |
| map no-op batched tensorflow | 1.533151 / 1.386936 (0.146215) |
