From 555a3e11002c4d03be91272d247d97ea91c5ffc9 Mon Sep 17 00:00:00 2001
From: Victor Lin <13424970+victorlin@users.noreply.github.com>
Date: Fri, 16 Aug 2024 12:40:10 -0700
Subject: [PATCH] =?UTF-8?q?=F0=9F=9A=A7=20describe=20order=20of=20operatio?=
 =?UTF-8?q?ns=20and=20add=20more=20filtering=20examples?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .../filtering-and-subsampling.rst             | 111 ++++++++++++++----
 1 file changed, 86 insertions(+), 25 deletions(-)

diff --git a/src/guides/bioinformatics/filtering-and-subsampling.rst b/src/guides/bioinformatics/filtering-and-subsampling.rst
index b2f8b6d9..94010baf 100644
--- a/src/guides/bioinformatics/filtering-and-subsampling.rst
+++ b/src/guides/bioinformatics/filtering-and-subsampling.rst
@@ -12,7 +12,8 @@ Filtering
 =========
 
 The filter command allows you to select various subsets of your input data for
-different types of analysis. A simple example use of this command would be
+different types of analysis. A simple example would be to select all sequences
+with a collection date in 2012 or later:
 
 .. code-block:: bash
 
@@ -23,34 +24,94 @@ different types of analysis. A simple example use of this command would be
     --output-sequences filtered_sequences.fasta \
     --output-metadata filtered_metadata.tsv
 
-This command will select all sequences with collection date in 2012 or later.
-The filter command has a large number of options that allow flexible filtering
-for many common situations. One such use-case is the exclusion of sequences that
-are known to be outliers (e.g. because of sequencing errors, cell-culture
-adaptation, ...). These can be specified in a separate text file (e.g.
-``exclude.txt``):
-
-.. code-block::
-
-  BRA/2016/FC_DQ75D1
-  COL/FLR_00034/2015
-  ...
-
-To drop such strains, you can pass the filename to ``--exclude``:
-
-.. code-block:: bash
-
-  augur filter \
-    --sequences data/sequences.fasta \
-    --metadata data/metadata.tsv \
-    --min-date 2012 \
-    --exclude exclude.txt \
-    --output-sequences filtered_sequences.fasta \
-    --output-metadata filtered_metadata.tsv
+There are several options that allow flexible filtering for many common
+situations. Options can be divided into the following categories:
+
+- **Metadata-based** options work with data available from ``--metadata``.
+- **Sequence-based** options work with data available from ``--sequences`` or
+  ``--sequence-index``.
+- **Standard** options work by selecting or dropping sequences that match
+  certain criteria.
+- **Force-inclusive** options work by ensuring sequences that match certain
+  criteria are always included in the output, ignoring all standard filter
+  options.
+
+.. list-table:: Categories for filter options
+   :header-rows: 1
+   :stub-columns: 1
+
+   * -
+     - Metadata-based
+     - Sequence-based
+   * - Standard
+     - * ``--min-date``
+       * ``--max-date``
+       * ``--exclude-ambiguous-dates-by``
+       * ``--exclude``
+       * ``--exclude-where``
+       * ``--query``
+     - * ``--min-length``
+       * ``--max-length``
+       * ``--non-nucleotide``
+   * - Force-inclusive
+     - * ``--include``
+       * ``--include-where``
+     - *None*
+
+Below are additional examples.
+
+- Exclude outliers (e.g. because of sequencing errors, cell-culture adaptation)
+  using ``--exclude``. First, create a text file ``exclude.txt`` with one line
+  per sequence ID:
+
+  .. code-block::
+
+    BRA/2016/FC_DQ75D1
+    COL/FLR_00034/2015
+    ...
+
+  Add the option by using ``--exclude exclude.txt`` in the command:
+
+  .. code-block:: bash
+
+    augur filter \
+      --sequences data/sequences.fasta \
+      --metadata data/metadata.tsv \
+      --min-date 2012 \
+      --exclude exclude.txt \
+      --output-sequences filtered_sequences.fasta \
+      --output-metadata filtered_metadata.tsv
+
+- Include sequences from a specific region using ``--query``:
+
+  .. code-block:: bash
+
+    augur filter \
+      --sequences data/sequences.fasta \
+      --metadata data/metadata.tsv \
+      --min-date 2012 \
+      --exclude exclude.txt \
+      --query 'region=="Asia"' \
+      --output-sequences filtered_sequences.fasta \
+      --output-metadata filtered_metadata.tsv
+
+  .. tip::
+
+    ``--query 'region=="Asia"'`` is functionally equivalent to ``--exclude-where
+    region!=Asia``. However, ``--query`` allows for more complex expressions such
+    as ``--query '(region in ["Asia", "Europe"]) & (coverage >= 0.95)'``.
+
+    ``--query 'region=="Asia"'`` is **not** equivalent to ``--include-where
+    region=Asia`` since force-inclusive options ignore any other standard filter
+    options (i.e. ``--min-date`` and ``--exclude`` in the example above).
 
 Subsampling within ``augur filter``
 ===================================
 
+.. note:: FIXME: add this text somewhere:
+
+    Subsampling is applied after all standard filter options and before force-inclusive filter options.
+
 Another common filtering operation is subsetting of data to a achieve a more
 even spatio-temporal distribution or to cut-down data set size to more
 manageable numbers. The filter command allows you to partition the data into