MRG: add scaled FAQ, adjust ksize answer (#2921)
Adds the following FAQ entry to address #2918:

> ## What scaled values should I use with sourmash?
>
> We recommend scaled=1000 or scaled=10000 when working with bacterial
> and archaeal sketches and DNA. We have quite a bit of experience with
> this, and even some
> [published benchmarks](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05103-0)
> showing that this works very well. You may need to use lower scaled
> values with smaller query and target sequences, such as viral genomes
> or genes, but we do not have systematic advice on this.
>
> That having been said, you can always use a lower scaled value - the only
> consequence is that memory and compute requirements increase.
>
> Also, sourmash will automatically use the larger of two scaled values
> when comparing two sketches with different scaled values. So if, for example,
> you use [the precomputed databases](databases.md), you will always end up
> using your query sketches at a minimum scaled of 1000, even if you created
> them with a lower scaled value.
>
> Please also see [What resolution should my signatures be?](using-sourmash-a-guide.md#what-resolution-should-my-signatures-be-how-should-i-create-them).

Fixes #2918

---------

Co-authored-by: Colton Baumler <63077899+ccbaumler@users.noreply.github.com>
ctb and ccbaumler authored Jan 15, 2024
1 parent 4f32abc commit 11af4d5
Showing 1 changed file with 23 additions and 1 deletion.
24 changes: 23 additions & 1 deletion doc/faq.md
@@ -113,12 +113,34 @@ and k=51 are negligible; and that (b) k=31 works fine for most
 day-to-day use of sourmash.

 We also provide [Genbank and GTDB databases](databases.md) for k=21,
-k=31, and k=51.
+k=31, and k=51, so choosing from those k-mer sizes for your own sketches
+will allow you to directly use those databases.

 For some background on k-mer specificity, we recommend this paper:
 [MetaPalette: a k-mer Painting Approach for Metagenomic Taxonomic Profiling and Quantification of Novel Strain Variation](https://journals.asm.org/doi/10.1128/msystems.00020-16),
 Koslicki & Falush, 2016.

+## What scaled values should I use with sourmash?
+
+We recommend scaled=1000 or scaled=10000 when working with bacterial
+and archaeal sketches and DNA. We have quite a bit of experience with
+this, and even some
+[published benchmarks](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05103-0)
+showing that this works very well. You may need to use lower scaled
+values with smaller query and target sequences, such as viral genomes
+or genes, but we do not have systematic advice on this.
+
+That having been said, you can always use a lower scaled value - the only
+consequence is that memory and compute requirements increase.
+
+Also, sourmash will automatically use the larger of two scaled values
+when comparing two sketches with different scaled values. So if, for example,
+you use [the precomputed databases](databases.md), you will always end up
+using your query sketches at a minimum scaled of 1000, even if you created
+them with a lower scaled value.
+
+Please also see [What resolution should my signatures be?](using-sourmash-a-guide.md#what-resolution-should-my-signatures-be-how-should-i-create-them).
+
 ## How do k-mer-based analyses compare with read mapping?

 tl;dr very well! But it's a bit one sided: if k-mers match, reads will
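
The scaled behaviour described in the new FAQ entry can be illustrated with a toy model in plain Python. This is only a sketch, not sourmash's implementation. It assumes the usual description of scaled sketches, where a hash is kept when it falls in the lowest 1/scaled fraction of a 64-bit hash space. The helpers `hash_kmer`, `sketch`, `downsample` and `jaccard` are hypothetical names, BLAKE2b stands in for whatever hash function sourmash uses, and reverse complements are ignored. The scaled values used here (10 and 100) are far smaller than the recommended 1000 or 10000, so that a short random sequence still yields a non-trivial sketch.

```python
import hashlib
import random

MAX_HASH = 2**64  # size of the 64-bit hash space assumed in this toy model


def kmers(seq, ksize):
    """Yield every k-mer of length ksize in seq."""
    for i in range(len(seq) - ksize + 1):
        yield seq[i:i + ksize]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (stand-in hash, an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, ksize, scaled):
    """Keep hashes below MAX_HASH / scaled, roughly 1 in `scaled` k-mers."""
    threshold = MAX_HASH // scaled
    return {h for h in map(hash_kmer, kmers(seq, ksize)) if h < threshold}


def downsample(hashes, scaled):
    """Move a sketch to a larger scaled value by dropping high hashes."""
    threshold = MAX_HASH // scaled
    return {h for h in hashes if h < threshold}


def jaccard(a, scaled_a, b, scaled_b):
    """Compare two sketches at the larger of their two scaled values."""
    common = max(scaled_a, scaled_b)
    a, b = downsample(a, common), downsample(b, common)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# A made-up 20 kb random sequence, sketched at two resolutions.
rng = random.Random(42)
genome = "".join(rng.choice("ACGT") for _ in range(20000))
fine = sketch(genome, 21, scaled=10)     # lower scaled: more hashes kept
coarse = sketch(genome, 21, scaled=100)  # higher scaled: fewer hashes kept

print(len(fine), len(coarse))
print(jaccard(fine, 10, coarse, 100))
```

In this model the coarse sketch is always a subset of the fine one, so a lower scaled value only ever adds hashes, which is where the extra memory and compute come from. It also shows why a sketch built at a low scaled value loses nothing when compared against a database built at a higher one: downsampling the fine sketch reproduces the coarse sketch exactly.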
