typos + grammar (#1955)
* typos + grammar

* more typos
stas00 authored Mar 1, 2021
1 parent 9745965 commit dfebdfe
Showing 2 changed files with 15 additions and 15 deletions.
24 changes: 12 additions & 12 deletions docs/source/loading_metrics.rst
@@ -119,46 +119,46 @@ To select a specific configuration of a metric, just provide its name as the sec
Distributed setups
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- In several settings, computing metrics in distributed or parrallel processing environments can be tricky since the evaluation on different sub-sets of the data is done in separate python processes. The ``datasets`` library make this easier to deal with as we detail in this section.
+ Computing metrics in distributed and parallel processing environments can be tricky since the evaluation on different sub-sets of the data is done in separate python processes. The ``datasets`` library overcomes this difficulty using the method described in this section.

.. note::

- When a metric score is additive with regards to the dataset sub-set (meaning that ``f(A∪B) = f(A) + f(B)``) you can use distributed reduce operations to gather the scores computed by different processes. But when a metric is non additive (``f(A∪B) ≠ f(A) + f(B)``) which happens even for simple metrics like F1, you cannot simply gather the results of metrics evaluation on different sub-sets. A usual way to overcome this issue is to fallback on (inefficient) single process evaluation (e.g. evaluating metrics on a single GPU). The ``datasets`` library solve this problem by allowing distributed evaluation for any type of metric as detailed in this section.
+ When a metric score is additive with regard to the dataset sub-sets (meaning that ``f(A∪B) = f(A) + f(B)``) you can use distributed reduce operations to gather the scores computed by different processes. But when a metric is non-additive (``f(A∪B) ≠ f(A) + f(B)``), which happens even for simple metrics like F1, you cannot simply gather the results of metric evaluations on different sub-sets. A usual way to overcome this issue is to fall back on (inefficient) single-process evaluation (e.g. evaluating metrics on a single GPU). The ``datasets`` library solves this problem by allowing distributed evaluation for any type of metric, as detailed in this section.
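To make the note concrete, here is a small self-contained sketch (plain Python with made-up label lists, not library code) showing that binary F1 computed on the union of two sub-sets is neither the sum nor the mean of the per-sub-set scores:

```python
def f1(preds, refs):
    """Binary F1 score from parallel lists of 0/1 labels."""
    tp = sum(p == 1 and r == 1 for p, r in zip(preds, refs))
    fp = sum(p == 1 and r == 0 for p, r in zip(preds, refs))
    fn = sum(p == 0 and r == 1 for p, r in zip(preds, refs))
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0

# Two sub-sets, e.g. the shards evaluated by two different processes.
preds_a, refs_a = [1, 0, 1], [1, 1, 0]
preds_b, refs_b = [1, 1, 0, 0], [1, 1, 1, 0]

f1_a = f1(preds_a, refs_a)                          # 0.5
f1_b = f1(preds_b, refs_b)                          # 0.8
f1_union = f1(preds_a + preds_b, refs_a + refs_b)   # ~0.667: not 1.3, not 0.65
```

Since ``f1_union`` differs from any combination of the per-shard scores, the raw predictions and references, not the scores, have to be gathered before computing.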

Let's first see how to use a metric in a distributed setting before giving a few words about the internals. Let's say we train and evaluate a model in 8 parallel processes (e.g. using PyTorch's `DistributedDataParallel <https://pytorch.org/tutorials/intermediate/ddp_tutorial.html>`__ on a server with 8 GPUs).

- We assume your python script can have access to:
+ We assume your python script has access to:

- - the total number of processes as an integer we'll call ``num_process`` (in our example 8),
- - the process id of each process as an integer between 0 and ``num_process-1`` that we'll call ``rank`` (in our case betwen 0 and 7 included).
+ 1. the total number of processes as an integer we'll call ``num_process`` (in our example 8).
+ 2. the process rank as an integer between 0 and ``num_process-1`` that we'll call ``rank`` (in our example between 0 and 7 included).
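How these two values reach the script depends on the launcher. As one common convention (an assumption about your setup, not something the library requires), launchers in the ``torch.distributed`` family export them as the ``RANK`` and ``WORLD_SIZE`` environment variables:

```python
import os

# Assumption: the script was started by a torch.distributed-style launcher,
# which exports RANK and WORLD_SIZE; fall back to a single-process setup otherwise.
rank = int(os.environ.get("RANK", 0))
num_process = int(os.environ.get("WORLD_SIZE", 1))
```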

Here is how we can instantiate the metric in such a distributed script:

.. code-block::

    >>> from datasets import load_metric
-   >>> metric = load_metric('glue', 'mrpc', num_process=num_process, process_id=process_id)
+   >>> metric = load_metric('glue', 'mrpc', num_process=num_process, process_id=rank)

And that's it, you can use the metric on each node as described in :doc:`using_metrics` without taking special care for the distributed setting. In particular, the predictions and references can be computed and provided to the metric separately on each process. By default, the final evaluation of the metric will be done on the first node (rank 0) only when calling :func:`datasets.Metric.compute` after gathering the predictions and references from all the nodes. Computing on other processes (rank > 0) returns ``None``.

- Under the hood :class:`datasets.Metric` use an Apache Arrow table to store (temporarly) predictions and references for each node on the hard-drive thereby avoiding to cluter the GPU or CPU memory. Once the final metric evalution is requested with :func:`datasets.Metric.compute`, the first node get access to all the nodes temp files and read them to compute the metric in one time.
+ Under the hood :class:`datasets.Metric` uses an Apache Arrow table to store (temporarily) predictions and references for each node on the filesystem, thereby not cluttering the GPU or CPU memory. Once the final metric evaluation is requested with :func:`datasets.Metric.compute`, the first node gets access to all the nodes' temp files and reads them to compute the metric at once.

- This way it's possible to perform distributed predictions (which is important for evaluation speed in distributed setting) while allowing to use complex non-additive metrics and avoiding to cluter GPU/CPU memory for prediction storage.
+ This way it's possible to perform distributed predictions (which is important for evaluation speed in a distributed setting) while allowing the use of complex non-additive metrics without wasting GPU/CPU memory on prediction storage.

- The synchronization is basically provided by the hard drive file access and filelocks.
+ The synchronization is performed with the help of file locks on the filesystem.
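The mechanism described above can be imitated with a toy single-machine sketch. This is not the actual ``datasets`` implementation (the real temp files are Apache Arrow tables guarded by file locks); plain JSON files and a simple accuracy metric stand in for illustration:

```python
import json
import os
import tempfile

def accuracy(preds, refs):
    """Toy metric: fraction of predictions matching references."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

cache_dir = tempfile.mkdtemp()
# rank -> (predictions, references); stand-ins for each process's shard.
shards = {0: ([1, 0, 1], [1, 1, 1]), 1: ([0, 0], [0, 1])}

# Step 1: each "process" dumps its predictions/references to its own temp file.
for rank, (preds, refs) in shards.items():
    with open(os.path.join(cache_dir, f"rank-{rank}.json"), "w") as f:
        json.dump({"predictions": preds, "references": refs}, f)

# Step 2: rank 0 reads every temp file and computes the metric once over all data.
all_preds, all_refs = [], []
for rank in sorted(shards):
    with open(os.path.join(cache_dir, f"rank-{rank}.json")) as f:
        shard = json.load(f)
    all_preds += shard["predictions"]
    all_refs += shard["references"]

score = accuracy(all_preds, all_refs)  # 3 of 5 correct
```

Because only rank 0 ever reads the gathered data, the non-additivity problem never arises: the metric is always computed on the full set of predictions.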


- Multiple and independant distributed setups
+ Multiple and independent distributed setups
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- In some cases, several **independant and not related** distributed evaluations might be running on the same server and the same file system at the same time (e.g. two independant multi-processing trainings running on the same server) and it is then important to distinguish these experiemnts and allow them to operate in independantly.
+ In some cases, several **independent and not related** distributed evaluations might be running on the same server and the same file system at the same time (e.g. two independent multi-processing trainings running on the same server), and it is then important to distinguish these experiments and allow them to operate independently.

In this situation you should provide an ``experiment_id`` to :func:`datasets.load_metric` which has to be a unique identifier of the current distributed experiment.

This identifier will be added to the cache file used by each process of this evaluation to avoid conflicting access to the same cache files for storing predictions and references for each node.
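Why this avoids conflicts can be illustrated with a hypothetical naming scheme. The function below is illustrative only, not the real internal naming used by ``datasets``; it just shows that folding the experiment id into the cache-file name keeps two unrelated runs apart:

```python
import os

def metric_cache_file(cache_dir, experiment_id, num_process, process_id):
    """Hypothetical scheme: one cache file per (experiment, process) pair."""
    return os.path.join(cache_dir, f"{experiment_id}-{num_process}-{process_id}.arrow")

# Two unrelated 8-process evaluations; same rank, different experiment ids.
file_a = metric_cache_file("/tmp/metrics", "experiment_a", 8, 3)
file_b = metric_cache_file("/tmp/metrics", "experiment_b", 8, 3)
assert file_a != file_b  # no collision between the two experiments
```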

.. note::
- Specifying an ``experiment_id`` to :func:`datasets.load_metric` is only required in the specific situation where you have **independant (i.e. not related) distributed** evaluations running on the same file system at the same time.
+ Specifying an ``experiment_id`` to :func:`datasets.load_metric` is only required in the specific situation where you have **independent (i.e. not related) distributed** evaluations running on the same file system at the same time.

Here is an example:

6 changes: 3 additions & 3 deletions src/datasets/metric.py
@@ -246,7 +246,7 @@ def _create_cache_file(self, timeout=1) -> Tuple[str, FileLock]:
if self.num_process != 1:
raise ValueError(
f"Error in _create_cache_file: another metric instance is already using the local cache file at {file_path}. "
- f"Please specify an experiment_id (currently: {self.experiment_id}) to avoid colision "
+ f"Please specify an experiment_id (currently: {self.experiment_id}) to avoid collision "
f"between distributed metric instances."
)
if i == self.max_concurrent_cache_files - 1:
@@ -357,7 +357,7 @@ def _finalize(self):
except FileNotFoundError:
raise ValueError(
"Error in finalize: another metric instance is already using the local cache file. "
- "Please specify an experiment_id to avoid colision between distributed metric instances."
+ "Please specify an experiment_id to avoid collision between distributed metric instances."
)

# Store file paths and locks and we will release/delete them after the computation.
@@ -461,7 +461,7 @@ def _init_writer(self, timeout=1):
except TimeoutError:
raise ValueError(
f"Error in _init_writer: another metric instance is already using the local cache file at {file_path}. "
- f"Please specify an experiment_id (currently: {self.experiment_id}) to avoid colision "
+ f"Please specify an experiment_id (currently: {self.experiment_id}) to avoid collision "
f"between distributed metric instances."
)


1 comment on commit dfebdfe

@github-actions


PyArrow==0.17.1


Benchmark: benchmark_array_xd.json

metric: new / old (diff)

read_batch_formatted_as_numpy after write_array2d: 0.014010 / 0.011353 (0.002657)
read_batch_formatted_as_numpy after write_flattened_sequence: 0.012233 / 0.011008 (0.001225)
read_batch_formatted_as_numpy after write_nested_sequence: 0.037580 / 0.038508 (-0.000928)
read_batch_unformated after write_array2d: 0.031879 / 0.023109 (0.008770)
read_batch_unformated after write_flattened_sequence: 0.175814 / 0.275898 (-0.100084)
read_batch_unformated after write_nested_sequence: 0.195134 / 0.323480 (-0.128345)
read_col_formatted_as_numpy after write_array2d: 0.004057 / 0.007986 (-0.003929)
read_col_formatted_as_numpy after write_flattened_sequence: 0.003860 / 0.004328 (-0.000469)
read_col_formatted_as_numpy after write_nested_sequence: 0.005831 / 0.004250 (0.001580)
read_col_unformated after write_array2d: 0.041939 / 0.037052 (0.004887)
read_col_unformated after write_flattened_sequence: 0.176463 / 0.258489 (-0.082026)
read_col_unformated after write_nested_sequence: 0.204812 / 0.293841 (-0.089029)
read_formatted_as_numpy after write_array2d: 0.122424 / 0.128546 (-0.006122)
read_formatted_as_numpy after write_flattened_sequence: 0.095164 / 0.075646 (0.019518)
read_formatted_as_numpy after write_nested_sequence: 0.357933 / 0.419271 (-0.061338)
read_unformated after write_array2d: 0.356663 / 0.043533 (0.313130)
read_unformated after write_flattened_sequence: 0.180608 / 0.255139 (-0.074531)
read_unformated after write_nested_sequence: 0.194551 / 0.283200 (-0.088649)
write_array2d: 1.420723 / 0.141683 (1.279040)
write_flattened_sequence: 1.531722 / 1.452155 (0.079568)
write_nested_sequence: 1.597025 / 1.492716 (0.104309)

Benchmark: benchmark_indices_mapping.json

metric: new / old (diff)

select: 0.035478 / 0.037411 (-0.001933)
shard: 0.016508 / 0.014526 (0.001982)
shuffle: 0.023139 / 0.176557 (-0.153417)
sort: 0.040941 / 0.737135 (-0.696195)
train_test_split: 0.046671 / 0.296338 (-0.249667)

Benchmark: benchmark_iterating.json

metric: new / old (diff)

read 5000: 0.198535 / 0.215209 (-0.016674)
read 50000: 1.984957 / 2.077655 (-0.092698)
read_batch 50000 10: 1.083001 / 1.504120 (-0.421119)
read_batch 50000 100: 0.982915 / 1.541195 (-0.558279)
read_batch 50000 1000: 1.008379 / 1.468490 (-0.460111)
read_formatted numpy 5000: 5.180178 / 4.584777 (0.595401)
read_formatted pandas 5000: 4.598820 / 3.745712 (0.853108)
read_formatted tensorflow 5000: 6.964049 / 5.269862 (1.694188)
read_formatted torch 5000: 5.879927 / 4.565676 (1.314251)
read_formatted_batch numpy 5000 10: 0.518568 / 0.424275 (0.094293)
read_formatted_batch numpy 5000 1000: 0.009225 / 0.007607 (0.001617)
shuffled read 5000: 0.224823 / 0.226044 (-0.001222)
shuffled read 50000: 2.355326 / 2.268929 (0.086397)
shuffled read_batch 50000 10: 1.458459 / 55.444624 (-53.986166)
shuffled read_batch 50000 100: 1.297904 / 6.876477 (-5.578573)
shuffled read_batch 50000 1000: 1.320950 / 2.142072 (-0.821123)
shuffled read_formatted numpy 5000: 5.237895 / 4.805227 (0.432668)
shuffled read_formatted_batch numpy 5000 10: 3.294766 / 6.500664 (-3.205898)
shuffled read_formatted_batch numpy 5000 1000: 4.259462 / 0.075469 (4.183993)

Benchmark: benchmark_map_filter.json

metric: new / old (diff)

filter: 9.023966 / 1.841788 (7.182178)
map fast-tokenizer batched: 12.592502 / 8.074308 (4.518194)
map identity: 15.129525 / 10.191392 (4.938133)
map identity batched: 0.408668 / 0.680424 (-0.271756)
map no-op batched: 0.244384 / 0.534201 (-0.289817)
map no-op batched numpy: 0.813322 / 0.579283 (0.234039)
map no-op batched pandas: 0.475940 / 0.434364 (0.041576)
map no-op batched pytorch: 0.548975 / 0.540337 (0.008637)
map no-op batched tensorflow: 1.302643 / 1.386936 (-0.084293)

PyArrow==1.0

Benchmark: benchmark_array_xd.json

metric: new / old (diff)

read_batch_formatted_as_numpy after write_array2d: 0.013994 / 0.011353 (0.002641)
read_batch_formatted_as_numpy after write_flattened_sequence: 0.012399 / 0.011008 (0.001391)
read_batch_formatted_as_numpy after write_nested_sequence: 0.037389 / 0.038508 (-0.001120)
read_batch_unformated after write_array2d: 0.030360 / 0.023109 (0.007251)
read_batch_unformated after write_flattened_sequence: 0.291766 / 0.275898 (0.015868)
read_batch_unformated after write_nested_sequence: 0.323776 / 0.323480 (0.000296)
read_col_formatted_as_numpy after write_array2d: 0.004133 / 0.007986 (-0.003852)
read_col_formatted_as_numpy after write_flattened_sequence: 0.003819 / 0.004328 (-0.000510)
read_col_formatted_as_numpy after write_nested_sequence: 0.005863 / 0.004250 (0.001613)
read_col_unformated after write_array2d: 0.047159 / 0.037052 (0.010107)
read_col_unformated after write_flattened_sequence: 0.289145 / 0.258489 (0.030656)
read_col_unformated after write_nested_sequence: 0.329097 / 0.293841 (0.035256)
read_formatted_as_numpy after write_array2d: 0.118816 / 0.128546 (-0.009731)
read_formatted_as_numpy after write_flattened_sequence: 0.093813 / 0.075646 (0.018166)
read_formatted_as_numpy after write_nested_sequence: 0.360630 / 0.419271 (-0.058642)
read_unformated after write_array2d: 0.347609 / 0.043533 (0.304077)
read_unformated after write_flattened_sequence: 0.293286 / 0.255139 (0.038147)
read_unformated after write_nested_sequence: 0.316057 / 0.283200 (0.032857)
write_array2d: 1.384448 / 0.141683 (1.242765)
write_flattened_sequence: 1.568506 / 1.452155 (0.116352)
write_nested_sequence: 1.600118 / 1.492716 (0.107401)

Benchmark: benchmark_indices_mapping.json

metric: new / old (diff)

select: 0.037179 / 0.037411 (-0.000232)
shard: 0.018880 / 0.014526 (0.004354)
shuffle: 0.023913 / 0.176557 (-0.152644)
sort: 0.041324 / 0.737135 (-0.695811)
train_test_split: 0.024923 / 0.296338 (-0.271415)

Benchmark: benchmark_iterating.json

metric: new / old (diff)

read 5000: 0.261839 / 0.215209 (0.046630)
read 50000: 2.619827 / 2.077655 (0.542172)
read_batch 50000 10: 1.690678 / 1.504120 (0.186558)
read_batch 50000 100: 1.603531 / 1.541195 (0.062336)
read_batch 50000 1000: 1.701047 / 1.468490 (0.232556)
read_formatted numpy 5000: 4.994856 / 4.584777 (0.410079)
read_formatted pandas 5000: 4.446247 / 3.745712 (0.700534)
read_formatted tensorflow 5000: 6.836544 / 5.269862 (1.566683)
read_formatted torch 5000: 5.542108 / 4.565676 (0.976431)
read_formatted_batch numpy 5000 10: 0.501187 / 0.424275 (0.076912)
read_formatted_batch numpy 5000 1000: 0.008995 / 0.007607 (0.001388)
shuffled read 5000: 0.287394 / 0.226044 (0.061350)
shuffled read 50000: 2.911328 / 2.268929 (0.642400)
shuffled read_batch 50000 10: 2.077029 / 55.444624 (-53.367596)
shuffled read_batch 50000 100: 1.905699 / 6.876477 (-4.970778)
shuffled read_batch 50000 1000: 1.957738 / 2.142072 (-0.184335)
shuffled read_formatted numpy 5000: 5.035964 / 4.805227 (0.230737)
shuffled read_formatted_batch numpy 5000 10: 3.241213 / 6.500664 (-3.259451)
shuffled read_formatted_batch numpy 5000 1000: 3.968385 / 0.075469 (3.892916)

Benchmark: benchmark_map_filter.json

metric: new / old (diff)

filter: 9.398312 / 1.841788 (7.556524)
map fast-tokenizer batched: 13.775471 / 8.074308 (5.701163)
map identity: 15.701319 / 10.191392 (5.509927)
map identity batched: 0.991242 / 0.680424 (0.310818)
map no-op batched: 0.512964 / 0.534201 (-0.021237)
map no-op batched numpy: 0.588664 / 0.579283 (0.009381)
map no-op batched pandas: 0.454859 / 0.434364 (0.020495)
map no-op batched pytorch: 0.521543 / 0.540337 (-0.018795)
map no-op batched tensorflow: 1.254258 / 1.386936 (-0.132678)
