Add docs for weighted sampling
Add a new section and adjust existing content accordingly.
victorlin committed Aug 21, 2024
1 parent 43c0224 commit e6f8cc8
Showing 1 changed file with 46 additions and 4 deletions.
50 changes: 46 additions & 4 deletions src/guides/bioinformatics/filtering-and-subsampling.rst
@@ -53,6 +53,7 @@ Selection method:
   * - Subsampling
     - * ``--subsample-max-sequences``
       * ``--group-by``
       * ``--group-by-weights``
       * ``--sequences-per-group``
       * ``--probabilistic-sampling``
       * ``--no-probabilistic-sampling``
@@ -190,11 +191,17 @@ Grouped sampling
``--group-by`` allows you to partition the data into groups based on column
values and sample a number of sequences per group.

Grouped sampling comes in two types, uniform and weighted, followed by a final
section on caveats:

.. contents::
   :local:

Uniform sampling
~~~~~~~~~~~~~~~~

``--group-by`` samples uniformly across groups. For example, sample evenly
across regions over time:
By default (i.e. without ``--group-by-weights``), ``--group-by`` will sample
uniformly across groups. For example, sample evenly across regions over time:

.. code-block:: bash
@@ -225,6 +232,40 @@ per month from each region:
     --output-sequences subsampled_sequences.fasta \
     --output-metadata subsampled_metadata.tsv

Weighted sampling
~~~~~~~~~~~~~~~~~

``--group-by-weights`` can be specified in addition to ``--group-by`` to allow
different target sizes per group. For example, target twice as many sequences
from Asia as from any other region using this ``weights.tsv`` file:

.. list-table::
   :header-rows: 1

   * - region
     - weight
   * - Asia
     - 2
   * - default
     - 1

and command:

.. code-block:: bash

   augur filter \
     --sequences data/sequences.fasta \
     --metadata data/metadata.tsv \
     --min-date 2012 \
     --exclude exclude.txt \
     --group-by region year month \
     --group-by-weights weights.tsv \
     --subsample-max-sequences 100 \
     --output-sequences subsampled_sequences.fasta \
     --output-metadata subsampled_metadata.tsv

The weights file format is described in the ``augur filter`` docs for
``--group-by-weights``.
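
As a minimal sketch, the example ``weights.tsv`` could be written from the
shell as follows. The tab delimiter is an assumption based on the ``.tsv``
extension; check the ``augur filter`` docs for the authoritative format.

```shell
# Write the example weights file, one row per line.
# Assumption: columns are separated by a single tab character.
printf 'region\tweight\n'  > weights.tsv
printf 'Asia\t2\n'        >> weights.tsv
printf 'default\t1\n'     >> weights.tsv

# Print the file to confirm its contents before passing it to
# --group-by-weights.
cat weights.tsv
```

Using ``printf`` avoids editors that silently convert tabs to spaces.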

Caveats
~~~~~~~
@@ -249,8 +290,9 @@ is noted in the output:
possible to have more than the requested maximum of 100 sequences after
filtering.
This is automatically enabled. To force the command to exit with an error in
these situations, use ``--no-probabilistic-sampling``.
This is automatically enabled. ``--no-probabilistic-sampling`` can be used with
uniform sampling to force the command to exit with an error in these situations.
Probabilistic sampling is always enabled for weighted sampling.

Undersampling
`````````````