🚧 describe order of operations and add more filtering examples
victorlin committed Aug 16, 2024
1 parent b9ec0bb commit 555a3e1
Showing 1 changed file with 86 additions and 25 deletions.
111 changes: 86 additions & 25 deletions src/guides/bioinformatics/filtering-and-subsampling.rst
@@ -12,7 +12,8 @@ Filtering
=========

The filter command allows you to select various subsets of your input data for
different types of analysis. A simple example use of this command would be
different types of analysis. A simple example would be to select all sequences
with collection date in 2012 or later:

.. code-block:: bash
@@ -23,34 +24,94 @@ different types of analysis. A simple example use of this command would be
--output-sequences filtered_sequences.fasta \
--output-metadata filtered_metadata.tsv
This command will select all sequences with collection date in 2012 or later.
The filter command has a large number of options that allow flexible filtering
for many common situations. One such use-case is the exclusion of sequences that
are known to be outliers (e.g. because of sequencing errors, cell-culture
adaptation, ...). These can be specified in a separate text file (e.g.
``exclude.txt``):

.. code-block::
BRA/2016/FC_DQ75D1
COL/FLR_00034/2015
...
To drop such strains, you can pass the filename to ``--exclude``:

.. code-block:: bash
augur filter \
--sequences data/sequences.fasta \
--metadata data/metadata.tsv \
--min-date 2012 \
--exclude exclude.txt \
--output-sequences filtered_sequences.fasta \
--output-metadata filtered_metadata.tsv
There are several options that allow flexible filtering for many common
situations. Options can be divided into the following categories:

- **Metadata-based** options work with data available from ``--metadata``.
- **Sequence-based** options work with data available from ``--sequences`` or
``--sequence-index``.
- **Standard** options work by selecting or dropping sequences that match
certain criteria.
- **Force-inclusive** options work by ensuring sequences that match certain
criteria are always included in the output, ignoring all standard filter
options.

.. list-table:: Categories for filter options
   :header-rows: 1
   :stub-columns: 1

   * -
     - Metadata-based
     - Sequence-based
   * - Standard
     - * ``--min-date``
       * ``--max-date``
       * ``--exclude-ambiguous-dates-by``
       * ``--exclude``
       * ``--exclude-where``
       * ``--query``
     - * ``--min-length``
       * ``--max-length``
       * ``--non-nucleotide``
   * - Force-inclusive
     - * ``--include``
       * ``--include-where``
     - *None*

Below are additional examples.

- Exclude outliers (e.g. because of sequencing errors, cell-culture adaptation)
  using ``--exclude``. First, create a text file ``exclude.txt`` with one line
  per sequence ID:

  .. code-block::

     BRA/2016/FC_DQ75D1
     COL/FLR_00034/2015
     ...

  Add the option by using ``--exclude exclude.txt`` in the command:

  .. code-block:: bash

     augur filter \
       --sequences data/sequences.fasta \
       --metadata data/metadata.tsv \
       --min-date 2012 \
       --exclude exclude.txt \
       --output-sequences filtered_sequences.fasta \
       --output-metadata filtered_metadata.tsv

- Include sequences from a specific region using ``--query``:

  .. code-block:: bash

     augur filter \
       --sequences data/sequences.fasta \
       --metadata data/metadata.tsv \
       --min-date 2012 \
       --exclude exclude.txt \
       --query 'region == "Asia"' \
       --output-sequences filtered_sequences.fasta \
       --output-metadata filtered_metadata.tsv

.. tip::

   ``--query 'region == "Asia"'`` is functionally equivalent to
   ``--exclude-where region!=Asia``. However, ``--query`` allows for more
   complex expressions such as
   ``--query '(region in ["Asia", "Europe"]) & (coverage >= 0.95)'``.

   ``--query 'region == "Asia"'`` is **not** equivalent to ``--include-where
   region=Asia``, since force-inclusive options ignore all standard filter
   options (i.e. ``--min-date`` and ``--exclude`` in the example above).
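
The difference between standard and force-inclusive options can be modelled
in a few lines of Python. This is only an illustrative sketch of the order of
operations described in this guide (standard filters, then subsampling, then
force-inclusion), not augur's actual implementation. The function
``filter_records``, the ``strain`` ID field, and the toy records are all
assumptions made up for the example.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, str]
Predicate = Callable[[Record], bool]


def filter_records(
    records: List[Record],
    standard: List[Predicate],
    force_include: List[Predicate],
    subsample: Optional[Callable[[List[Record]], List[Record]]] = None,
) -> List[Record]:
    """Hypothetical model of the filtering order; not augur's real code."""
    # 1. Standard options: keep a record only if it passes every filter.
    kept = [r for r in records if all(f(r) for f in standard)]
    # 2. Subsampling (if any) only sees records that survived step 1.
    if subsample is not None:
        kept = subsample(kept)
    # 3. Force-inclusive options add matching records back in,
    #    regardless of what steps 1 and 2 decided.
    kept_ids = {r["strain"] for r in kept}
    forced = [
        r
        for r in records
        if r["strain"] not in kept_ids and any(f(r) for f in force_include)
    ]
    return kept + forced


def min_date_2012(record: Record) -> bool:
    return record["date"] >= "2012"


def region_is_asia(record: Record) -> bool:
    return record["region"] == "Asia"


# Toy metadata; "strain" is assumed to be the sequence ID column.
records = [
    {"strain": "A", "region": "Asia", "date": "2015"},
    {"strain": "B", "region": "Asia", "date": "2010"},
    {"strain": "C", "region": "Europe", "date": "2016"},
]

# Region as a standard filter (like --query): the date filter still applies.
as_query = filter_records(records, [min_date_2012, region_is_asia], [])

# Region as a force-inclusive filter (like --include-where): "B" comes back
# even though it fails the date filter, and "C" is never dropped.
as_include_where = filter_records(records, [min_date_2012], [region_is_asia])

print([r["strain"] for r in as_query])
print([r["strain"] for r in as_include_where])
```

In this model, treating the region as a standard filter keeps only ``A``,
while treating it as a force-inclusive filter yields ``A``, ``C`` and ``B``:
nothing is removed by the region criterion, and ``B`` is restored despite its
2010 date.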

Subsampling within ``augur filter``
===================================

.. note:: FIXME: add this text somewhere:

Subsampling is applied after all standard filter options and before force-inclusive filter options.

Another common filtering operation is subsetting of data to achieve a more
even spatio-temporal distribution or to cut down data set size to more
manageable numbers. The filter command allows you to partition the data into