Fix tiered subsampling example #226

Merged
merged 1 commit into from
Aug 28, 2024
22 changes: 11 additions & 11 deletions src/guides/bioinformatics/filtering-and-subsampling.rst
@@ -350,7 +350,7 @@ This approach has some caveats:

{n_{\text{other sequences}}} * \frac{1}{{n_{\text{other states}}}}
= 100 * \frac{1}{49}
-\approx 1.02
+\approx 2.04

2. Achieving a full *100 sequences from the rest of the United States* requires
at least 2 sequences from each of the remaining states. This may not be
@@ -366,8 +366,8 @@ An alternative approach is to decompose this into multiple schemes, each handled
by a single call to ``augur filter``. Additionally, there is an extra step to
combine the intermediate samples.

-1. Sample 100 sequences from Washington state.
-2. Sample 50 sequences from the rest of the United States.
+1. Sample 200 sequences from Washington state.
+2. Sample 100 sequences from the rest of the United States.
3. Combine the samples.

Calling ``augur filter`` multiple times
@@ -378,20 +378,20 @@ well for ad-hoc analyses.

.. code-block:: bash

-# 1. Sample 100 sequences from Washington state
+# 1. Sample 200 sequences from Washington state
augur filter \
--sequences sequences.fasta \
--metadata metadata.tsv \
--query "state == 'WA'" \
---subsample-max-sequences 100 \
+--subsample-max-sequences 200 \
--output-strains sample_strains_state.txt

-# 2. Sample 50 sequences from the rest of the United States
+# 2. Sample 100 sequences from the rest of the United States
augur filter \
--sequences sequences.fasta \
--metadata metadata.tsv \
--query "state != 'WA' & country == 'USA'" \
---subsample-max-sequences 50 \
+--subsample-max-sequences 100 \
--output-strains sample_strains_country.txt

# 3. Combine using augur filter
@@ -428,8 +428,8 @@ system can be used. The following examples use `Snakemake`_.
.. code-block:: yaml

subsampling:
-state: --query "state == 'WA'" --subsample-max-sequences 100
-country: --query "state != 'WA' & country == 'USA'" --subsample-max-sequences 50
+state: --query "state == 'WA'" --subsample-max-sequences 200
+country: --query "state != 'WA' & country == 'USA'" --subsample-max-sequences 100

2. Add two rules in a `Snakefile`_. If you are building a standard Nextstrain
workflow, the output files should be used as input to sequence alignment. See
@@ -438,8 +438,8 @@ system can be used. The following examples use `Snakemake`_.

.. code-block:: python

-# 1. Sample 100 sequences from Washington state
-# 2. Sample 50 sequences from the rest of the United States
+# 1. Sample 200 sequences from Washington state
+# 2. Sample 100 sequences from the rest of the United States
rule intermediate_sample:
input:
metadata = "data/metadata.tsv",
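A quick way to sanity-check the arithmetic in the first hunk is to recompute the per-state share directly. The sketch below is only an illustration: the helper name `per_group_target` is hypothetical and not part of Augur, and the inputs (100 sequences, 49 other states) are the values shown in the guide's formula. It also shows that the previous value of 1.02 is what the same formula gives for 50 sequences, which may be where the old number came from.

```python
def per_group_target(total_sequences: int, n_groups: int) -> float:
    """Average number of sequences each group contributes to reach the total.

    Hypothetical helper for illustration; it mirrors the guide's formula
    n_other_sequences * (1 / n_other_states).
    """
    return total_sequences * (1 / n_groups)


# Values from the guide: 100 sequences spread over the 49 other states.
corrected = per_group_target(100, 49)
print(f"100 sequences over 49 states: {corrected:.2f} per state")

# The same formula with the previous target of 50 sequences.
previous = per_group_target(50, 49)
print(f"50 sequences over 49 states: {previous:.2f} per state")
```

The first result rounds to 2.04, in line with the guide's statement that reaching 100 sequences needs at least 2 sequences from each remaining state.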