Fix bookcorpusopen RAM usage (#3280)
* fix bookcorpusopen ram usage

* add tags
lhoestq authored Nov 16, 2021
1 parent 2a64976 commit ed1b492
Showing 3 changed files with 39 additions and 4 deletions.
19 changes: 18 additions & 1 deletion datasets/bookcorpus/README.md
@@ -1,10 +1,27 @@
 ---
+annotations_creators:
+- no-annotation
+language_creators:
+- found
 languages:
 - en
+licenses:
+- unknown
+multilinguality:
+- monolingual
+pretty_name: BookCorpus
+size_categories:
+- 10M<n<100M
+source_datasets:
+- original
+task_categories:
+- sequence-modeling
+task_ids:
+- language-modeling
 paperswithcode_id: bookcorpus
 ---
 
-# Dataset Card for "bookcorpus"
+# Dataset Card for BookCorpus
 
 ## Table of Contents
 - [Dataset Description](#dataset-description)
22 changes: 19 additions & 3 deletions datasets/bookcorpusopen/README.md
@@ -1,11 +1,27 @@
 ---
-pretty_name: BookCorpusOpen
+annotations_creators:
+- no-annotation
+language_creators:
+- found
 languages:
 - en
-paperswithcode_id: null
+licenses:
+- unknown
+multilinguality:
+- monolingual
+pretty_name: BookCorpusOpen
+size_categories:
+- 10K<n<100K
+source_datasets:
+- original
+task_categories:
+- sequence-modeling
+task_ids:
+- language-modeling
+paperswithcode_id: bookcorpus
 ---
 
-# Dataset Card for "bookcorpusopen"
+# Dataset Card for BookCorpusOpen
 
 ## Table of Contents
 - [Dataset Description](#dataset-description)
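For scale: each row of bookcorpusopen is an entire book, which is why the card lands in the 10K<n<100K size category even though a single example can hold megabytes of text. A minimal sketch of inspecting the dataset, assuming the `title`/`text` features listed in the dataset card:

```python
from datasets import load_dataset

# Each example is one whole book, so a single row can be megabytes of text.
books = load_dataset("bookcorpusopen", split="train")

print(books.features)         # expect 'title' and 'text' string features
print(len(books[0]["text"]))  # character count of the first book
```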
2 changes: 2 additions & 0 deletions datasets/bookcorpusopen/bookcorpusopen.py
@@ -64,6 +64,8 @@ def __init__(self, **kwargs):
 class BookCorpusOpen(datasets.GeneratorBasedBuilder):
     """BookCorpus dataset."""
 
+    DEFAULT_WRITER_BATCH_SIZE = 256  # documents are full books and are quite heavy
+
     BUILDER_CONFIGS = [
         BookCorpusOpenConfig(
             name="plain_text",
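The two added lines above are the whole fix: a `GeneratorBasedBuilder` buffers generated examples in RAM and only flushes them to the on-disk Arrow file every `writer_batch_size` examples (10,000 by default at the time). Because each bookcorpusopen example is a complete book, the default meant holding thousands of books in memory at once; capping the batch at 256 bounds the buffer. A minimal sketch of the pattern, with a hypothetical `TinyBooks` builder standing in for the real script (only `GeneratorBasedBuilder` and `DEFAULT_WRITER_BATCH_SIZE` are actual `datasets` APIs):

```python
import datasets


class TinyBooks(datasets.GeneratorBasedBuilder):
    """Hypothetical builder whose examples are whole books."""

    # Flush the Arrow writer every 256 examples instead of the much larger
    # default, so at most ~256 books are buffered in RAM during preparation.
    DEFAULT_WRITER_BATCH_SIZE = 256

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {"title": datasets.Value("string"), "text": datasets.Value("string")}
            )
        )

    def _split_generators(self, dl_manager):
        return [datasets.SplitGenerator(name=datasets.Split.TRAIN)]

    def _generate_examples(self):
        # In the real script this iterates over book files; each yielded
        # example can be several MB of text.
        for idx in range(1000):
            yield idx, {"title": f"book_{idx}", "text": "full book text..."}
```

The trade-off is more frequent flushes to disk during dataset preparation, in exchange for peak memory bounded by roughly the batch size times the typical example size.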

1 comment on commit ed1b492

@github-actions

PyArrow==3.0.0


Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.081857 / 0.011353 (0.070504) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004388 / 0.011008 (-0.006620) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.037163 / 0.038508 (-0.001345) |
| read_batch_unformated after write_array2d | 0.040607 / 0.023109 (0.017498) |
| read_batch_unformated after write_flattened_sequence | 0.365060 / 0.275898 (0.089162) |
| read_batch_unformated after write_nested_sequence | 0.406773 / 0.323480 (0.083293) |
| read_col_formatted_as_numpy after write_array2d | 0.095299 / 0.007986 (0.087314) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004834 / 0.004328 (0.000506) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.010609 / 0.004250 (0.006358) |
| read_col_unformated after write_array2d | 0.047172 / 0.037052 (0.010119) |
| read_col_unformated after write_flattened_sequence | 0.366245 / 0.258489 (0.107756) |
| read_col_unformated after write_nested_sequence | 0.414755 / 0.293841 (0.120914) |
| read_formatted_as_numpy after write_array2d | 0.099590 / 0.128546 (-0.028956) |
| read_formatted_as_numpy after write_flattened_sequence | 0.010128 / 0.075646 (-0.065519) |
| read_formatted_as_numpy after write_nested_sequence | 0.298910 / 0.419271 (-0.120362) |
| read_unformated after write_array2d | 0.053226 / 0.043533 (0.009693) |
| read_unformated after write_flattened_sequence | 0.359920 / 0.255139 (0.104781) |
| read_unformated after write_nested_sequence | 0.398549 / 0.283200 (0.115350) |
| write_array2d | 0.092322 / 0.141683 (-0.049361) |
| write_flattened_sequence | 2.083043 / 1.452155 (0.630889) |
| write_nested_sequence | 2.123706 / 1.492716 (0.630989) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.266145 / 0.018006 (0.248139) |
| get_batch_of_1024_rows | 0.486885 / 0.000490 (0.486395) |
| get_first_row | 0.005023 / 0.000200 (0.004824) |
| get_last_row | 0.000096 / 0.000054 (0.000042) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.043372 / 0.037411 (0.005961) |
| shard | 0.025473 / 0.014526 (0.010947) |
| shuffle | 0.030447 / 0.176557 (-0.146109) |
| sort | 0.231008 / 0.737135 (-0.506128) |
| train_test_split | 0.031654 / 0.296338 (-0.264684) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.495416 / 0.215209 (0.280207) |
| read 50000 | 4.915116 / 2.077655 (2.837462) |
| read_batch 50000 10 | 2.132964 / 1.504120 (0.628844) |
| read_batch 50000 100 | 1.887751 / 1.541195 (0.346556) |
| read_batch 50000 1000 | 1.962137 / 1.468490 (0.493647) |
| read_formatted numpy 5000 | 0.495626 / 4.584777 (-4.089151) |
| read_formatted pandas 5000 | 5.854865 / 3.745712 (2.109153) |
| read_formatted tensorflow 5000 | 2.602279 / 5.269862 (-2.667583) |
| read_formatted torch 5000 | 1.038505 / 4.565676 (-3.527172) |
| read_formatted_batch numpy 5000 10 | 0.059655 / 0.424275 (-0.364620) |
| read_formatted_batch numpy 5000 1000 | 0.013014 / 0.007607 (0.005407) |
| shuffled read 5000 | 0.623897 / 0.226044 (0.397853) |
| shuffled read 50000 | 6.215239 / 2.268929 (3.946311) |
| shuffled read_batch 50000 10 | 2.756838 / 55.444624 (-52.687786) |
| shuffled read_batch 50000 100 | 2.228969 / 6.876477 (-4.647508) |
| shuffled read_batch 50000 1000 | 2.316861 / 2.142072 (0.174788) |
| shuffled read_formatted numpy 5000 | 0.621982 / 4.805227 (-4.183245) |
| shuffled read_formatted_batch numpy 5000 10 | 0.134140 / 6.500664 (-6.366525) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.065565 / 0.075469 (-0.009904) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.825945 / 1.841788 (-0.015842) |
| map fast-tokenizer batched | 14.070411 / 8.074308 (5.996103) |
| map identity | 30.854243 / 10.191392 (20.662851) |
| map identity batched | 0.874932 / 0.680424 (0.194508) |
| map no-op batched | 0.608280 / 0.534201 (0.074079) |
| map no-op batched numpy | 0.437656 / 0.579283 (-0.141627) |
| map no-op batched pandas | 0.628999 / 0.434364 (0.194635) |
| map no-op batched pytorch | 0.298962 / 0.540337 (-0.241376) |
| map no-op batched tensorflow | 0.319340 / 1.386936 (-1.067596) |
PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.079747 / 0.011353 (0.068394) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004481 / 0.011008 (-0.006527) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.035008 / 0.038508 (-0.003500) |
| read_batch_unformated after write_array2d | 0.038096 / 0.023109 (0.014987) |
| read_batch_unformated after write_flattened_sequence | 0.349311 / 0.275898 (0.073413) |
| read_batch_unformated after write_nested_sequence | 0.382265 / 0.323480 (0.058785) |
| read_col_formatted_as_numpy after write_array2d | 0.093284 / 0.007986 (0.085298) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005460 / 0.004328 (0.001132) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.008413 / 0.004250 (0.004163) |
| read_col_unformated after write_array2d | 0.039980 / 0.037052 (0.002928) |
| read_col_unformated after write_flattened_sequence | 0.345536 / 0.258489 (0.087047) |
| read_col_unformated after write_nested_sequence | 0.399789 / 0.293841 (0.105948) |
| read_formatted_as_numpy after write_array2d | 0.099879 / 0.128546 (-0.028667) |
| read_formatted_as_numpy after write_flattened_sequence | 0.010244 / 0.075646 (-0.065402) |
| read_formatted_as_numpy after write_nested_sequence | 0.296414 / 0.419271 (-0.122857) |
| read_unformated after write_array2d | 0.052745 / 0.043533 (0.009213) |
| read_unformated after write_flattened_sequence | 0.349604 / 0.255139 (0.094465) |
| read_unformated after write_nested_sequence | 0.381753 / 0.283200 (0.098553) |
| write_array2d | 0.088288 / 0.141683 (-0.053395) |
| write_flattened_sequence | 1.947022 / 1.452155 (0.494868) |
| write_nested_sequence | 1.989497 / 1.492716 (0.496780) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.281184 / 0.018006 (0.263178) |
| get_batch_of_1024_rows | 0.488578 / 0.000490 (0.488088) |
| get_first_row | 0.005631 / 0.000200 (0.005431) |
| get_last_row | 0.000127 / 0.000054 (0.000072) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.039799 / 0.037411 (0.002388) |
| shard | 0.024891 / 0.014526 (0.010365) |
| shuffle | 0.028901 / 0.176557 (-0.147655) |
| sort | 0.228627 / 0.737135 (-0.508508) |
| train_test_split | 0.030334 / 0.296338 (-0.266004) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.494716 / 0.215209 (0.279507) |
| read 50000 | 4.975615 / 2.077655 (2.897960) |
| read_batch 50000 10 | 2.181503 / 1.504120 (0.677383) |
| read_batch 50000 100 | 1.943726 / 1.541195 (0.402531) |
| read_batch 50000 1000 | 2.010411 / 1.468490 (0.541921) |
| read_formatted numpy 5000 | 0.487131 / 4.584777 (-4.097646) |
| read_formatted pandas 5000 | 5.930115 / 3.745712 (2.184403) |
| read_formatted tensorflow 5000 | 2.451512 / 5.269862 (-2.818350) |
| read_formatted torch 5000 | 1.045552 / 4.565676 (-3.520124) |
| read_formatted_batch numpy 5000 10 | 0.059296 / 0.424275 (-0.364979) |
| read_formatted_batch numpy 5000 1000 | 0.013270 / 0.007607 (0.005663) |
| shuffled read 5000 | 0.613032 / 0.226044 (0.386987) |
| shuffled read 50000 | 6.157650 / 2.268929 (3.888722) |
| shuffled read_batch 50000 10 | 2.671311 / 55.444624 (-52.773313) |
| shuffled read_batch 50000 100 | 2.232013 / 6.876477 (-4.644463) |
| shuffled read_batch 50000 1000 | 2.329119 / 2.142072 (0.187046) |
| shuffled read_formatted numpy 5000 | 0.630458 / 4.805227 (-4.174769) |
| shuffled read_formatted_batch numpy 5000 10 | 0.135463 / 6.500664 (-6.365201) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.064467 / 0.075469 (-0.011003) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.851893 / 1.841788 (0.010106) |
| map fast-tokenizer batched | 13.942856 / 8.074308 (5.868548) |
| map identity | 31.502644 / 10.191392 (21.311252) |
| map identity batched | 0.961222 / 0.680424 (0.280798) |
| map no-op batched | 0.655116 / 0.534201 (0.120916) |
| map no-op batched numpy | 0.438665 / 0.579283 (-0.140618) |
| map no-op batched pandas | 0.628870 / 0.434364 (0.194506) |
| map no-op batched pytorch | 0.310226 / 0.540337 (-0.230111) |
| map no-op batched tensorflow | 0.323134 / 1.386936 (-1.063802) |
