Update docs/source/use_dataset.rst
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Rocketknight1 and stevhliu authored Nov 4, 2021
1 parent d03695d commit 1d6c174
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion docs/source/use_dataset.rst
@@ -132,6 +132,6 @@ means they can be passed directly to methods like `model.fit()`. `to_tf_dataset(
.. tip::

- ``to_tf_dataset`` is the easiest way to create a TensorFlow compatible dataset. If, however, you don't want a `tf.data.Dataset`, but you would like the dataset to emit `tf.Tensor` objects, take a look at the :ref:`format` section instead!
+ ``to_tf_dataset`` is the easiest way to create a TensorFlow compatible dataset. If you don't want a `tf.data.Dataset` and would rather the dataset emit `tf.Tensor` objects, take a look at the :ref:`format` section instead!

Your dataset is now ready for use in a training loop!

1 comment on commit 1d6c174

@github-actions

Show benchmarks

PyArrow==3.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.082096 / 0.011353 (0.070743) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.005056 / 0.011008 (-0.005952) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.046865 / 0.038508 (0.008357) |
| read_batch_unformated after write_array2d | 0.041431 / 0.023109 (0.018321) |
| read_batch_unformated after write_flattened_sequence | 0.371460 / 0.275898 (0.095562) |
| read_batch_unformated after write_nested_sequence | 0.430239 / 0.323480 (0.106759) |
| read_col_formatted_as_numpy after write_array2d | 0.096005 / 0.007986 (0.088020) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005249 / 0.004328 (0.000921) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.011366 / 0.004250 (0.007116) |
| read_col_unformated after write_array2d | 0.045347 / 0.037052 (0.008295) |
| read_col_unformated after write_flattened_sequence | 0.382293 / 0.258489 (0.123804) |
| read_col_unformated after write_nested_sequence | 0.416125 / 0.293841 (0.122284) |
| read_formatted_as_numpy after write_array2d | 0.108974 / 0.128546 (-0.019573) |
| read_formatted_as_numpy after write_flattened_sequence | 0.014305 / 0.075646 (-0.061342) |
| read_formatted_as_numpy after write_nested_sequence | 0.331830 / 0.419271 (-0.087441) |
| read_unformated after write_array2d | 0.060355 / 0.043533 (0.016822) |
| read_unformated after write_flattened_sequence | 0.377713 / 0.255139 (0.122574) |
| read_unformated after write_nested_sequence | 0.430035 / 0.283200 (0.146835) |
| write_array2d | 0.097013 / 0.141683 (-0.044670) |
| write_flattened_sequence | 2.144914 / 1.452155 (0.692760) |
| write_nested_sequence | 2.213675 / 1.492716 (0.720959) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.295824 / 0.018006 (0.277817) |
| get_batch_of_1024_rows | 0.554859 / 0.000490 (0.554369) |
| get_first_row | 0.004074 / 0.000200 (0.003874) |
| get_last_row | 0.000125 / 0.000054 (0.000071) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.046005 / 0.037411 (0.008593) |
| shard | 0.030849 / 0.014526 (0.016323) |
| shuffle | 0.048028 / 0.176557 (-0.128528) |
| sort | 0.241477 / 0.737135 (-0.495658) |
| train_test_split | 0.036493 / 0.296338 (-0.259846) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.642361 / 0.215209 (0.427152) |
| read 50000 | 6.421359 / 2.077655 (4.343704) |
| read_batch 50000 10 | 2.468523 / 1.504120 (0.964403) |
| read_batch 50000 100 | 2.103756 / 1.541195 (0.562561) |
| read_batch 50000 1000 | 2.160179 / 1.468490 (0.691689) |
| read_formatted numpy 5000 | 0.729837 / 4.584777 (-3.854940) |
| read_formatted pandas 5000 | 6.781262 / 3.745712 (3.035550) |
| read_formatted tensorflow 5000 | 3.376745 / 5.269862 (-1.893117) |
| read_formatted torch 5000 | 1.488720 / 4.565676 (-3.076956) |
| read_formatted_batch numpy 5000 10 | 0.082994 / 0.424275 (-0.341281) |
| read_formatted_batch numpy 5000 1000 | 0.014198 / 0.007607 (0.006591) |
| shuffled read 5000 | 0.819222 / 0.226044 (0.593178) |
| shuffled read 50000 | 7.982167 / 2.268929 (5.713239) |
| shuffled read_batch 50000 10 | 3.149283 / 55.444624 (-52.295342) |
| shuffled read_batch 50000 100 | 2.448434 / 6.876477 (-4.428043) |
| shuffled read_batch 50000 1000 | 2.555578 / 2.142072 (0.413505) |
| shuffled read_formatted numpy 5000 | 0.885832 / 4.805227 (-3.919395) |
| shuffled read_formatted_batch numpy 5000 10 | 0.187633 / 6.500664 (-6.313031) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.075481 / 0.075469 (0.000012) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 2.093043 / 1.841788 (0.251255) |
| map fast-tokenizer batched | 15.488186 / 8.074308 (7.413878) |
| map identity | 45.074859 / 10.191392 (34.883467) |
| map identity batched | 1.008077 / 0.680424 (0.327654) |
| map no-op batched | 0.681423 / 0.534201 (0.147222) |
| map no-op batched numpy | 0.491142 / 0.579283 (-0.088141) |
| map no-op batched pandas | 0.721010 / 0.434364 (0.286646) |
| map no-op batched pytorch | 0.352517 / 0.540337 (-0.187820) |
| map no-op batched tensorflow | 0.371958 / 1.386936 (-1.014978) |

PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.082476 / 0.011353 (0.071123) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.005440 / 0.011008 (-0.005568) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.041905 / 0.038508 (0.003397) |
| read_batch_unformated after write_array2d | 0.040841 / 0.023109 (0.017732) |
| read_batch_unformated after write_flattened_sequence | 0.391031 / 0.275898 (0.115133) |
| read_batch_unformated after write_nested_sequence | 0.435257 / 0.323480 (0.111778) |
| read_col_formatted_as_numpy after write_array2d | 0.118225 / 0.007986 (0.110240) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.006114 / 0.004328 (0.001786) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.009743 / 0.004250 (0.005493) |
| read_col_unformated after write_array2d | 0.048389 / 0.037052 (0.011337) |
| read_col_unformated after write_flattened_sequence | 0.391140 / 0.258489 (0.132650) |
| read_col_unformated after write_nested_sequence | 0.456186 / 0.293841 (0.162345) |
| read_formatted_as_numpy after write_array2d | 0.107388 / 0.128546 (-0.021158) |
| read_formatted_as_numpy after write_flattened_sequence | 0.014285 / 0.075646 (-0.061361) |
| read_formatted_as_numpy after write_nested_sequence | 0.365151 / 0.419271 (-0.054121) |
| read_unformated after write_array2d | 0.067227 / 0.043533 (0.023694) |
| read_unformated after write_flattened_sequence | 0.400961 / 0.255139 (0.145823) |
| read_unformated after write_nested_sequence | 0.429963 / 0.283200 (0.146763) |
| write_array2d | 0.110971 / 0.141683 (-0.030712) |
| write_flattened_sequence | 2.107069 / 1.452155 (0.654914) |
| write_nested_sequence | 2.199129 / 1.492716 (0.706413) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.396309 / 0.018006 (0.378303) |
| get_batch_of_1024_rows | 0.603780 / 0.000490 (0.603290) |
| get_first_row | 0.025242 / 0.000200 (0.025042) |
| get_last_row | 0.000450 / 0.000054 (0.000396) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.039870 / 0.037411 (0.002458) |
| shard | 0.028708 / 0.014526 (0.014182) |
| shuffle | 0.034555 / 0.176557 (-0.142002) |
| sort | 0.252287 / 0.737135 (-0.484849) |
| train_test_split | 0.036107 / 0.296338 (-0.260232) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.660394 / 0.215209 (0.445185) |
| read 50000 | 6.447103 / 2.077655 (4.369449) |
| read_batch 50000 10 | 2.520186 / 1.504120 (1.016066) |
| read_batch 50000 100 | 2.378869 / 1.541195 (0.837674) |
| read_batch 50000 1000 | 2.188633 / 1.468490 (0.720143) |
| read_formatted numpy 5000 | 0.754651 / 4.584777 (-3.830126) |
| read_formatted pandas 5000 | 6.498995 / 3.745712 (2.753283) |
| read_formatted tensorflow 5000 | 3.108369 / 5.269862 (-2.161493) |
| read_formatted torch 5000 | 1.500493 / 4.565676 (-3.065183) |
| read_formatted_batch numpy 5000 10 | 0.083854 / 0.424275 (-0.340421) |
| read_formatted_batch numpy 5000 1000 | 0.014068 / 0.007607 (0.006461) |
| shuffled read 5000 | 0.786766 / 0.226044 (0.560721) |
| shuffled read 50000 | 7.999407 / 2.268929 (5.730478) |
| shuffled read_batch 50000 10 | 3.223927 / 55.444624 (-52.220697) |
| shuffled read_batch 50000 100 | 2.503074 / 6.876477 (-4.373402) |
| shuffled read_batch 50000 1000 | 2.631013 / 2.142072 (0.488940) |
| shuffled read_formatted numpy 5000 | 0.914269 / 4.805227 (-3.890958) |
| shuffled read_formatted_batch numpy 5000 10 | 0.185732 / 6.500664 (-6.314932) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.074319 / 0.075469 (-0.001150) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 2.065228 / 1.841788 (0.223441) |
| map fast-tokenizer batched | 15.192041 / 8.074308 (7.117733) |
| map identity | 43.776302 / 10.191392 (33.584910) |
| map identity batched | 1.003565 / 0.680424 (0.323141) |
| map no-op batched | 0.726144 / 0.534201 (0.191943) |
| map no-op batched numpy | 0.489537 / 0.579283 (-0.089746) |
| map no-op batched pandas | 0.722266 / 0.434364 (0.287902) |
| map no-op batched pytorch | 0.345511 / 0.540337 (-0.194826) |
| map no-op batched tensorflow | 0.362300 / 1.386936 (-1.024636) |
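
Each benchmark value above is reported as `new / old (diff)`. The following is a minimal sketch for reading such a value, assuming that `diff` is simply `new - old` rounded to six decimals; the helper names `parse_cell` and `is_consistent` are hypothetical and not part of the benchmark tooling. The two sample values are the `get_first_row` and `get_batch_of_1024_random_rows` entries from the PyArrow==3.0.0 run.

```python
import re

# Matches a value of the form "0.004074 / 0.000200 (0.003874)".
CELL = re.compile(r"(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)")


def parse_cell(cell):
    """Split a 'new / old (diff)' string into three floats (hypothetical helper)."""
    match = CELL.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


def is_consistent(cell, tol=2e-6):
    """Check the assumption diff == new - old, allowing for six-decimal rounding."""
    new, old, diff = parse_cell(cell)
    return abs((new - old) - diff) <= tol


# Sample values copied from the benchmark_getitem_100B.json results above.
print(is_consistent("0.004074 / 0.000200 (0.003874)"))
print(is_consistent("0.295824 / 0.018006 (0.277817)"))
```

The tolerance is slightly larger than one unit in the last printed digit because the reported numbers appear to be rounded independently, so `new - old` can differ from the printed `diff` in the sixth decimal place.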
