Fix max_shard_size docs (#5267)
lhoestq authored Nov 18, 2022
1 parent 4eccb22 commit 7ef5f6d
Showing 2 changed files with 7 additions and 7 deletions.
2 changes: 1 addition & 1 deletion docs/source/filesystems.mdx

````diff
@@ -140,7 +140,7 @@ Use your own data files (see [how to load local and remote files](./loading#loca
 It is highly recommended to save the files as compressed Parquet files to optimize I/O by specifying `file_format="parquet"`.
 Otherwise the dataset is saved as an uncompressed Arrow file.
 
-You can also specify the size of the Parquet shard using `max_shard_size` (default is 500MB):
+You can also specify the size of the shards using `max_shard_size` (default is 500MB):
 
 ```py
 >>> builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet", max_shard_size="1GB")
 ```
````
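For context beyond the diff: since `max_shard_size` is typed `Union[str, int]`, it accepts a raw byte count as well as a human-readable string. A minimal sketch, assuming a builder obtained with `load_dataset_builder` ("imdb" and the output path are placeholder choices, not part of the commit):

```py
>>> from datasets import load_dataset_builder
>>> builder = load_dataset_builder("imdb")  # placeholder dataset
>>> # an int works as well as a string like "500MB" or "1GB"
>>> builder.download_and_prepare("./imdb_parquet", file_format="parquet", max_shard_size=500 * 1024 * 1024)
```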
12 changes: 6 additions & 6 deletions src/datasets/builder.py

```diff
@@ -638,9 +638,9 @@ def download_and_prepare(
                 If the format is "parquet", then image and audio data are embedded into the Parquet files instead of pointing to local files.
                 <Added version="2.5.0"/>
-            max_shard_size (:obj:`Union[str, int]`, optional): Maximum number of bytes written per shard.
-                Only available for the "parquet" format with a default of "500MB". The size is based on uncompressed data size,
-                so in practice your shard files may be smaller than `max_shard_size` thanks to Parquet compression.
+            max_shard_size (:obj:`Union[str, int]`, optional): Maximum number of bytes written per shard, default is "500MB".
+                The size is based on uncompressed data size, so in practice your shard files may be smaller than
+                `max_shard_size` thanks to Parquet compression for example.
                 <Added version="2.5.0"/>
             num_proc (:obj:`int`, optional, default `None`): Number of processes when downloading and generating the dataset locally.
```
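A side note, not part of the commit: size strings like "500MB" are converted to a byte count internally. The sketch below assumes the `convert_file_size_to_int` helper in `datasets.utils.py_utils` (which builder.py uses for this conversion) and the usual decimal-vs-binary unit convention:

```py
>>> from datasets.utils.py_utils import convert_file_size_to_int
>>> convert_file_size_to_int("500MB")   # decimal units: 500 * 10**6 bytes
500000000
>>> convert_file_size_to_int("500MiB")  # binary units: 500 * 2**20 bytes
524288000
```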
```diff
@@ -1262,9 +1262,9 @@ def _prepare_split(
             split_generator: `SplitGenerator`, Split generator to process
             file_format (:obj:`str`, optional): format of the data files in which the dataset will be written.
                 Supported formats: "arrow", "parquet". Default to "arrow" format.
-            max_shard_size (:obj:`Union[str, int]`, optional): Approximate maximum number of bytes written per shard.
-                Only available for the "parquet" format with a default of "500MB". The size is based on uncompressed data size,
-                so in practice your shard files may be smaller than `max_shard_size` thanks to Parquet compression.
+            max_shard_size (:obj:`Union[str, int]`, optional): Maximum number of bytes written per shard, default is "500MB".
+                The size is based on uncompressed data size, so in practice your shard files may be smaller than
+                `max_shard_size` thanks to Parquet compression for example.
             num_proc (:obj:`int`, optional, default `None`): Number of processes when downloading and generating the dataset locally.
                 Multiprocessing is disabled by default.
```
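Since `_prepare_split` writes each split as numbered shards, a quick way to see how `max_shard_size` played out is to list the output directory. For illustration only; the file names below are hypothetical and depend on the builder name, split, and data size:

```py
>>> import os
>>> sorted(f for f in os.listdir("./imdb_parquet") if f.endswith(".parquet"))  # hypothetical listing
['imdb-train-00000-of-00002.parquet', 'imdb-train-00001-of-00002.parquet']
```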

1 comment on commit 7ef5f6d

@github-actions

Show benchmarks

PyArrow==6.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new | old | diff |
|---|---|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.011073 | 0.011353 | -0.000280 |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.005484 | 0.011008 | -0.005524 |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.115306 | 0.038508 | 0.076798 |
| read_batch_unformated after write_array2d | 0.042395 | 0.023109 | 0.019285 |
| read_batch_unformated after write_flattened_sequence | 0.343413 | 0.275898 | 0.067515 |
| read_batch_unformated after write_nested_sequence | 0.424444 | 0.323480 | 0.100964 |
| read_col_formatted_as_numpy after write_array2d | 0.009155 | 0.007986 | 0.001169 |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005489 | 0.004328 | 0.001160 |
| read_col_formatted_as_numpy after write_nested_sequence | 0.085350 | 0.004250 | 0.081099 |
| read_col_unformated after write_array2d | 0.050784 | 0.037052 | 0.013731 |
| read_col_unformated after write_flattened_sequence | 0.368642 | 0.258489 | 0.110153 |
| read_col_unformated after write_nested_sequence | 0.413206 | 0.293841 | 0.119365 |
| read_formatted_as_numpy after write_array2d | 0.049601 | 0.128546 | -0.078945 |
| read_formatted_as_numpy after write_flattened_sequence | 0.017367 | 0.075646 | -0.058279 |
| read_formatted_as_numpy after write_nested_sequence | 0.400343 | 0.419271 | -0.018929 |
| read_unformated after write_array2d | 0.064526 | 0.043533 | 0.020993 |
| read_unformated after write_flattened_sequence | 0.343120 | 0.255139 | 0.087981 |
| read_unformated after write_nested_sequence | 0.378312 | 0.283200 | 0.095112 |
| write_array2d | 0.123112 | 0.141683 | -0.018571 |
| write_flattened_sequence | 1.761038 | 1.452155 | 0.308883 |
| write_nested_sequence | 1.798257 | 1.492716 | 0.305540 |

Benchmark: benchmark_getitem_100B.json

| metric | new | old | diff |
|---|---|---|---|
| get_batch_of_1024_random_rows | 0.206371 | 0.018006 | 0.188365 |
| get_batch_of_1024_rows | 0.459825 | 0.000490 | 0.459336 |
| get_first_row | 0.005775 | 0.000200 | 0.005575 |
| get_last_row | 0.000095 | 0.000054 | 0.000041 |

Benchmark: benchmark_indices_mapping.json

| metric | new | old | diff |
|---|---|---|---|
| select | 0.031806 | 0.037411 | -0.005605 |
| shard | 0.132351 | 0.014526 | 0.117825 |
| shuffle | 0.141981 | 0.176557 | -0.034576 |
| sort | 0.193260 | 0.737135 | -0.543875 |
| train_test_split | 0.148360 | 0.296338 | -0.147979 |

Benchmark: benchmark_iterating.json

| metric | new | old | diff |
|---|---|---|---|
| read 5000 | 0.475418 | 0.215209 | 0.260209 |
| read 50000 | 4.712441 | 2.077655 | 2.634786 |
| read_batch 50000 10 | 2.179472 | 1.504120 | 0.675352 |
| read_batch 50000 100 | 1.958418 | 1.541195 | 0.417223 |
| read_batch 50000 1000 | 2.032576 | 1.468490 | 0.564086 |
| read_formatted numpy 5000 | 0.806982 | 4.584777 | -3.777795 |
| read_formatted pandas 5000 | 4.558561 | 3.745712 | 0.812848 |
| read_formatted tensorflow 5000 | 3.887119 | 5.269862 | -1.382743 |
| read_formatted torch 5000 | 2.045730 | 4.565676 | -2.519946 |
| read_formatted_batch numpy 5000 10 | 0.098682 | 0.424275 | -0.325593 |
| read_formatted_batch numpy 5000 1000 | 0.014341 | 0.007607 | 0.006734 |
| shuffled read 5000 | 0.603446 | 0.226044 | 0.377401 |
| shuffled read 50000 | 6.047170 | 2.268929 | 3.778241 |
| shuffled read_batch 50000 10 | 2.764427 | 55.444624 | -52.680197 |
| shuffled read_batch 50000 100 | 2.344062 | 6.876477 | -4.532415 |
| shuffled read_batch 50000 1000 | 2.501922 | 2.142072 | 0.359849 |
| shuffled read_formatted numpy 5000 | 0.982312 | 4.805227 | -3.822915 |
| shuffled read_formatted_batch numpy 5000 10 | 0.197039 | 6.500664 | -6.303626 |
| shuffled read_formatted_batch numpy 5000 1000 | 0.075762 | 0.075469 | 0.000293 |

Benchmark: benchmark_map_filter.json

| metric | new | old | diff |
|---|---|---|---|
| filter | 1.742089 | 1.841788 | -0.099699 |
| map fast-tokenizer batched | 16.197975 | 8.074308 | 8.123667 |
| map identity | 29.633409 | 10.191392 | 19.442017 |
| map identity batched | 1.005976 | 0.680424 | 0.325552 |
| map no-op batched | 0.651368 | 0.534201 | 0.117167 |
| map no-op batched numpy | 0.527047 | 0.579283 | -0.052236 |
| map no-op batched pandas | 0.500853 | 0.434364 | 0.066489 |
| map no-op batched pytorch | 0.316412 | 0.540337 | -0.223926 |
| map no-op batched tensorflow | 0.320935 | 1.386936 | -1.066001 |

PyArrow==latest
Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new | old | diff |
|---|---|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.008472 | 0.011353 | -0.002881 |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.005862 | 0.011008 | -0.005146 |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.113373 | 0.038508 | 0.074865 |
| read_batch_unformated after write_array2d | 0.038897 | 0.023109 | 0.015788 |
| read_batch_unformated after write_flattened_sequence | 0.399845 | 0.275898 | 0.123947 |
| read_batch_unformated after write_nested_sequence | 0.443008 | 0.323480 | 0.119528 |
| read_col_formatted_as_numpy after write_array2d | 0.006541 | 0.007986 | -0.001445 |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004351 | 0.004328 | 0.000022 |
| read_col_formatted_as_numpy after write_nested_sequence | 0.087023 | 0.004250 | 0.082772 |
| read_col_unformated after write_array2d | 0.045922 | 0.037052 | 0.008869 |
| read_col_unformated after write_flattened_sequence | 0.400247 | 0.258489 | 0.141758 |
| read_col_unformated after write_nested_sequence | 0.463007 | 0.293841 | 0.169166 |
| read_formatted_as_numpy after write_array2d | 0.042312 | 0.128546 | -0.086235 |
| read_formatted_as_numpy after write_flattened_sequence | 0.014082 | 0.075646 | -0.061564 |
| read_formatted_as_numpy after write_nested_sequence | 0.393038 | 0.419271 | -0.026233 |
| read_unformated after write_array2d | 0.056288 | 0.043533 | 0.012755 |
| read_unformated after write_flattened_sequence | 0.396150 | 0.255139 | 0.141011 |
| read_unformated after write_nested_sequence | 0.420499 | 0.283200 | 0.137299 |
| write_array2d | 0.122321 | 0.141683 | -0.019362 |
| write_flattened_sequence | 1.799000 | 1.452155 | 0.346845 |
| write_nested_sequence | 1.868746 | 1.492716 | 0.376030 |

Benchmark: benchmark_getitem_100B.json

| metric | new | old | diff |
|---|---|---|---|
| get_batch_of_1024_random_rows | 0.235673 | 0.018006 | 0.217667 |
| get_batch_of_1024_rows | 0.457638 | 0.000490 | 0.457148 |
| get_first_row | 0.007405 | 0.000200 | 0.007205 |
| get_last_row | 0.000125 | 0.000054 | 0.000070 |

Benchmark: benchmark_indices_mapping.json

| metric | new | old | diff |
|---|---|---|---|
| select | 0.034717 | 0.037411 | -0.002694 |
| shard | 0.133172 | 0.014526 | 0.118646 |
| shuffle | 0.145556 | 0.176557 | -0.031001 |
| sort | 0.193938 | 0.737135 | -0.543197 |
| train_test_split | 0.150274 | 0.296338 | -0.146065 |

Benchmark: benchmark_iterating.json

| metric | new | old | diff |
|---|---|---|---|
| read 5000 | 0.499109 | 0.215209 | 0.283900 |
| read 50000 | 4.965919 | 2.077655 | 2.888264 |
| read_batch 50000 10 | 2.395092 | 1.504120 | 0.890972 |
| read_batch 50000 100 | 2.170365 | 1.541195 | 0.629171 |
| read_batch 50000 1000 | 2.235573 | 1.468490 | 0.767083 |
| read_formatted numpy 5000 | 0.813743 | 4.584777 | -3.771034 |
| read_formatted pandas 5000 | 4.643213 | 3.745712 | 0.897501 |
| read_formatted tensorflow 5000 | 2.424351 | 5.269862 | -2.845510 |
| read_formatted torch 5000 | 1.553961 | 4.565676 | -3.011716 |
| read_formatted_batch numpy 5000 10 | 0.100707 | 0.424275 | -0.323568 |
| read_formatted_batch numpy 5000 1000 | 0.014105 | 0.007607 | 0.006498 |
| shuffled read 5000 | 0.618538 | 0.226044 | 0.392493 |
| shuffled read 50000 | 6.144881 | 2.268929 | 3.875953 |
| shuffled read_batch 50000 10 | 2.979170 | 55.444624 | -52.465454 |
| shuffled read_batch 50000 100 | 2.575294 | 6.876477 | -4.301182 |
| shuffled read_batch 50000 1000 | 2.703772 | 2.142072 | 0.561700 |
| shuffled read_formatted numpy 5000 | 0.983350 | 4.805227 | -3.821877 |
| shuffled read_formatted_batch numpy 5000 10 | 0.197577 | 6.500664 | -6.303087 |
| shuffled read_formatted_batch numpy 5000 1000 | 0.076393 | 0.075469 | 0.000924 |

Benchmark: benchmark_map_filter.json

| metric | new | old | diff |
|---|---|---|---|
| filter | 1.818415 | 1.841788 | -0.023373 |
| map fast-tokenizer batched | 16.335548 | 8.074308 | 8.261240 |
| map identity | 14.052463 | 10.191392 | 3.861071 |
| map identity batched | 1.035579 | 0.680424 | 0.355155 |
| map no-op batched | 0.692833 | 0.534201 | 0.158632 |
| map no-op batched numpy | 0.494219 | 0.579283 | -0.085064 |
| map no-op batched pandas | 0.482415 | 0.434364 | 0.048051 |
| map no-op batched pytorch | 0.295239 | 0.540337 | -0.245099 |
| map no-op batched tensorflow | 0.309190 | 1.386936 | -1.077746 |
