From 555a3e11002c4d03be91272d247d97ea91c5ffc9 Mon Sep 17 00:00:00 2001
From: Victor Lin <13424970+victorlin@users.noreply.github.com>
Date: Fri, 16 Aug 2024 12:40:10 -0700
Subject: [PATCH] =?UTF-8?q?=F0=9F=9A=A7=20describe=20order=20of=20operatio?=
 =?UTF-8?q?ns=20and=20add=20more=20filtering=20examples?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .../filtering-and-subsampling.rst             | 111 ++++++++++++++----
 1 file changed, 86 insertions(+), 25 deletions(-)

diff --git a/src/guides/bioinformatics/filtering-and-subsampling.rst b/src/guides/bioinformatics/filtering-and-subsampling.rst
index b2f8b6d9..94010baf 100644
--- a/src/guides/bioinformatics/filtering-and-subsampling.rst
+++ b/src/guides/bioinformatics/filtering-and-subsampling.rst
@@ -12,7 +12,8 @@ Filtering
 =========
 
 The filter command allows you to select various subsets of your input data for
-different types of analysis. A simple example use of this command would be
+different types of analysis. A simple example would be to select all sequences
+with a collection date in 2012 or later:
 
 .. code-block:: bash
 
@@ -23,34 +24,94 @@ different types of analysis. A simple example use of this command would be
      --output-sequences filtered_sequences.fasta \
      --output-metadata filtered_metadata.tsv
 
-This command will select all sequences with collection date in 2012 or later.
-The filter command has a large number of options that allow flexible filtering
-for many common situations. One such use-case is the exclusion of sequences that
-are known to be outliers (e.g. because of sequencing errors, cell-culture
-adaptation, ...). These can be specified in a separate text file (e.g.
-``exclude.txt``):
-
-.. code-block::
-
-   BRA/2016/FC_DQ75D1
-   COL/FLR_00034/2015
-   ...
-
-To drop such strains, you can pass the filename to ``--exclude``:
-
-.. code-block:: bash
-
-   augur filter \
-     --sequences data/sequences.fasta \
-     --metadata data/metadata.tsv \
-     --min-date 2012 \
-     --exclude exclude.txt \
-     --output-sequences filtered_sequences.fasta \
-     --output-metadata filtered_metadata.tsv
+There are several options that allow flexible filtering for many common
+situations. Options can be divided into the following categories:
+
+- **Metadata-based** options work with data available from ``--metadata``.
+- **Sequence-based** options work with data available from ``--sequences`` or
+  ``--sequence-index``.
+- **Standard** options work by selecting or dropping sequences that match
+  certain criteria.
+- **Force-inclusive** options work by ensuring sequences that match certain
+  criteria are always included in the output, ignoring all standard filter
+  options.
+
+.. list-table:: Categories for filter options
+   :header-rows: 1
+   :stub-columns: 1
+
+   * - 
+     - Metadata-based
+     - Sequence-based
+   * - Standard
+     - * ``--min-date``
+       * ``--max-date``
+       * ``--exclude-ambiguous-dates-by``
+       * ``--exclude``
+       * ``--exclude-where``
+       * ``--query``
+     - * ``--min-length``
+       * ``--max-length``
+       * ``--non-nucleotide``
+   * - Force-inclusive
+     - * ``--include``
+       * ``--include-where``
+     - *None*
+
+Below are additional examples.
+
+- Exclude outliers (e.g. because of sequencing errors, cell-culture adaptation)
+  using ``--exclude``. First, create a text file ``exclude.txt`` with one line
+  per sequence ID:
+
+  .. code-block::
+
+      BRA/2016/FC_DQ75D1
+      COL/FLR_00034/2015
+      ...
+
+  Then pass the file to ``--exclude`` in the command:
+
+  .. code-block:: bash
+
+      augur filter \
+        --sequences data/sequences.fasta \
+        --metadata data/metadata.tsv \
+        --min-date 2012 \
+        --exclude exclude.txt \
+        --output-sequences filtered_sequences.fasta \
+        --output-metadata filtered_metadata.tsv
+
+- Select only sequences from a specific region using ``--query``:
+
+  .. code-block:: bash
+
+      augur filter \
+        --sequences data/sequences.fasta \
+        --metadata data/metadata.tsv \
+        --min-date 2012 \
+        --exclude exclude.txt \
+        --query 'region=="Asia"' \
+        --output-sequences filtered_sequences.fasta \
+        --output-metadata filtered_metadata.tsv
+
+  .. tip::
+
+      ``--query 'region=="Asia"'`` is functionally equivalent to ``--exclude-where
+      region!=Asia``. However, ``--query`` allows for more complex expressions such
+      as ``--query '(region in {"Asia", "Europe"}) & (coverage >= 0.95)'``.
+
+      ``--query 'region=="Asia"'`` is **not** equivalent to ``--include-where
+      region=Asia`` since force-inclusive options ignore any other standard filter
+      options (i.e. ``--min-date`` and ``--exclude`` in the example above).
 
 Subsampling within ``augur filter``
 ===================================
 
+.. note:: FIXME: add this text somewhere:
+
+   Subsampling is applied after all standard filter options and before force-inclusive filter options.
+
 Another common filtering operation is subsetting of data to a achieve a more
 even spatio-temporal distribution or to cut-down data set size to more
 manageable numbers. The filter command allows you to partition the data into