Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] document gather abund tests a bit better; minor refactoring #886

Merged
merged 9 commits into from
Apr 28, 2020
77 changes: 71 additions & 6 deletions doc/classifying-signatures.md
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,8 @@ One useful modification to `search` is to calculate containment with
matches where the query is contained within the subject, but the
subject may have many other k-mers in it. For example, if you are using
a plasmid as a query, you would use `--containment` to find genomes
that contained that plasmid.
that contained that plasmid. `gather` (discussed below) uses containment
analysis only.

See [the main sourmash
tutorial](http://sourmash.readthedocs.io/en/latest/tutorials.html#make-and-search-a-database-quickly)
Expand All @@ -39,17 +40,17 @@ mixture of many different genomes. While you might use containment to
see if a query genome is present in one or more metagenomes, a common
question to ask is the reverse: **what genomes are in my metagenome?**

We have implemented two algorithms in sourmash to do this.
We have implemented two approaches in sourmash to do this.

One algorithm uses taxonomic information from e.g. GenBank to classify
One approach uses taxonomic information from e.g. GenBank to classify
individual k-mers, and then infers taxonomic distributions of
metagenome contents from the presence of these individual
k-mers. (This is the approach pioneered by
[Kraken](https://ccb.jhu.edu/software/kraken/) and many other tools.)
[Kraken](https://ccb.jhu.edu/software/kraken/) and used by many other tools.)
`sourmash lca` can be used to classify individual genome bins with
`classify`, or summarize metagenome taxonomy with `summarize`. The
[sourmash lca tutorial](http://sourmash.readthedocs.io/en/latest/tutorials-lca.html)
shows how to use the `lca classify` and `summarize` commands, and also
shows how to use the `lca classify` and `lca summarize` commands, and also
provides guidance on building your own database.

The other approach, `gather`, breaks a metagenome down into individual
Expand Down Expand Up @@ -93,13 +94,77 @@ that the particular species isn't known.

## Abundance weighting

By default, sourmash tracks k-mer presence, *not* their abundance. The
proportions and fractions reported also ignore abundance. So, if
`sourmash gather` reports that a genome is 5% of a metagenome, it is
reporting Jaccard containment of that genome in the metagenome, and it
is ignoring information like the number of reads in the metagenome
that come from that genome. Similarly, when `sourmash compare`
compares genome or metagenome signatures, it's reporting Jaccard
similarity *without* abundance.

However, it is possible to take into account abundance information by
computing signatures with `--track-abundance`. The abundance
information will be used if it's present in the signature, and it can
be ignored with `--ignore-abundance` in any signature comparison.

There are two ways that abundance weighting can be used. One is in
containment queries for metagenomes, e.g. with `sourmash
gather`, and the other is in comparisons of abundance-weighted signatures,
e.g. with `sourmash search` and `sourmash compare`. Below, we refer to the
first as "abundance projection" and the second as "angular similarity".

### Projecting abundances in `sourmash gather`:

`sourmash gather` can report approximate abundance information for
containment queries against genome databases. This will give you
numbers that (approximately) match what you get from counting mapped
reads.

If you compute your input signatures with `--track-abundance`, both
`sourmash gather` and `sourmash lca gather` will use that information
to calculate an abundance-weighted result. Briefly, this will weight
to calculate an abundance-weighted result. This will weight
each match to a hash value by the multiplicity of the hash value in
the query signature. You can turn off this behavior with
`--ignore-abundance`.

For example, if you have a metagenome composed of two equal sized genomes
A and B, with A present at 10 times the abundance of B, `gather` on
abundance-weighted signatures will report that approximately 91% of the
metagenome is A and approximately 9% is B. (If you use `--ignore-abundance`,
then `gather` will report approximately 50:50, since the genomes are equal
sized.)

You can also get count-like information from the CSV output of `sourmash
gather`; the column `median_abund` contains the median abundance of the k-mers
in the match to the given genome.

**Buyer beware:** There are substantial challenges in doing this kind
of analysis on real metagenomic samples, relating to genome representation
and strain overlap; see [this issue](https://github.com/dib-lab/sourmash/issues/461) for a discussion.

### Computing signature similarity with angular similarity.

If signatures that have abundance information are compared with
`sourmash search` or `sourmash compare`, the default comparison is
done with
[angular similarity](https://en.wikipedia.org/wiki/Cosine_similarity#Angular_distance_and_similarity). This
is a distance metric based on cosine similarity, and it is suitable
for use in clustering.

For more information on the value of this kind of comparison for
metagenomics, please see the simka paper,
[Multiple comparative metagenomics using multiset k-mer counting](https://peerj.com/articles/cs-94/),
Benoit et al., 2016. Initial comparisons of metagenome similarity
approximations computed with sourmash to the output of simka suggest a
significant correlation.

**Implementation note:** Angular similarity searches cannot be done on
SBT or LCA databases currently; you have to provide lists of signature
files to `sourmash search` and `sourmash compare`. sourmash will
provide a warning if you run `sourmash search` on an LCA or SBT with
an abundance-weighted query, and automatically apply `--ignore-abundance`.

## What commands should I use?

It's not always easy to figure that out, we know! We're thinking about
Expand Down
2 changes: 1 addition & 1 deletion doc/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -59,7 +59,7 @@

# General information about the project.
project = 'sourmash'
copyright = '2016, C. Titus Brown and Luiz Irber'
copyright = '2016-2020, C. Titus Brown and Luiz Irber'
author = 'C. Titus Brown and Luiz Irber'

# The version info for the project you're documenting, acts as replacement for
Expand Down
169 changes: 100 additions & 69 deletions tests/test_sourmash.py
Original file line number Diff line number Diff line change
Expand Up @@ -2998,110 +2998,141 @@ def test_gather_deduce_moltype():
assert '1.9 kbp 100.0% 100.0%' in out


def test_gather_abund_1_1():
@utils.in_thisdir
def test_gather_abund_1_1(c):
#
# make r1.fa with 1x coverage of genome s10
# make r2.fa with 10x coverage of genome s10.
# make r3.fa with 1x coverage of genome s11.
#
# nullgraph/make-reads.py -S 1 -r 200 -C 2 tests/test-data/genome-s10.fa.gz > r1.fa
# nullgraph/make-reads.py -S 1 -r 200 -C 20 tests/test-data/genome-s10.fa.gz > r2.fa
# nullgraph/make-reads.py -S 1 -r 200 -C 2 tests/test-data/genome-s11.fa.gz > r3.fa

#
# make signature s10-s11 with r1 and r3, i.e. 1:1 abundance
# make signature s10x10-s11 with r2 and r3, i.e. 10:1 abundance
#
# ./sourmash compute -k 21 --scaled 1000 --merge=1-1 -o reads-s10-s11.sig r[13].fa --track-abundance
# ./sourmash compute -k 21 --scaled 1000 --merge=10-1 -o reads-s10x10-s11.sig r[23].fa --track-abundance

with utils.TempDirectory() as location:
query = utils.get_test_data('gather-abund/reads-s10-s11.sig')
against_list = ['genome-s10', 'genome-s11', 'genome-s12']
against_list = [ 'gather-abund/' + i + '.fa.gz.sig' \
for i in against_list ]
against_list = [ utils.get_test_data(i) for i in against_list ]
query = utils.get_test_data('gather-abund/reads-s10-s11.sig')
against_list = ['genome-s10', 'genome-s11', 'genome-s12']
against_list = [ 'gather-abund/' + i + '.fa.gz.sig' \
for i in against_list ]
against_list = [ utils.get_test_data(i) for i in against_list ]

status, out, err = utils.runscript('sourmash',
['gather', query] + against_list,
in_directory=location)
status, out, err = c.run_sourmash('gather', query, *against_list)

print(out)
print(err)
print(out)
print(err)

assert '49.6% 78.5% 1.8 tests/test-data/genome-s10.fa.gz' in out
assert '50.4% 80.0% 1.9 tests/test-data/genome-s11.fa.gz' in out
assert 'genome-s12.fa.gz' not in out
# when we project s10-s11 (r1+r3), 1:1 abundance,
# onto s10 and s11 genomes with gather, we get:
# * approximately 50% of each query matching (first column, p_query)
# * approximately 80% of subject genomes contents being matched
# (this is due to the low coverage of 2 used to build queries)
# * approximately 2.0 abundance (third column, avg_abund)

assert '49.6% 78.5% 1.8 tests/test-data/genome-s10.fa.gz' in out
assert '50.4% 80.0% 1.9 tests/test-data/genome-s11.fa.gz' in out
assert 'genome-s12.fa.gz' not in out

def test_gather_abund_10_1():

@utils.in_tempdir
def test_gather_abund_10_1(c):
# see comments in test_gather_abund_1_1, above.
# nullgraph/make-reads.py -S 1 -r 200 -C 2 tests/test-data/genome-s10.fa.gz > r1.fa
# nullgraph/make-reads.py -S 1 -r 200 -C 20 tests/test-data/genome-s10.fa.gz > r2.fa
# nullgraph/make-reads.py -S 1 -r 200 -C 2 tests/test-data/genome-s11.fa.gz > r3.fa
# ./sourmash compute -k 21 --scaled 1000 --merge=1-1 -o reads-s10-s11.sig r[13].fa --track-abundance
# ./sourmash compute -k 21 --scaled 1000 --merge=10-1 -o reads-s10x10-s11.sig r[23].fa --track-abundance

with utils.TempDirectory() as location:
query = utils.get_test_data('gather-abund/reads-s10x10-s11.sig')
against_list = ['genome-s10', 'genome-s11', 'genome-s12']
against_list = [ 'gather-abund/' + i + '.fa.gz.sig' \
for i in against_list ]
against_list = [ utils.get_test_data(i) for i in against_list ]
query = utils.get_test_data('gather-abund/reads-s10x10-s11.sig')
against_list = ['genome-s10', 'genome-s11', 'genome-s12']
against_list = [ 'gather-abund/' + i + '.fa.gz.sig' \
for i in against_list ]
against_list = [ utils.get_test_data(i) for i in against_list ]

status, out, err = utils.runscript('sourmash',
['gather', query, '-o', 'xxx.csv'] \
+ against_list,
in_directory=location)
status, out, err = c.run_sourmash('gather', query, '-o', 'xxx.csv',
*against_list)

print(out)
print(err)
assert '91.0% 100.0% 14.5 tests/test-data/genome-s10.fa.gz' in out
assert '9.0% 80.0% 1.9 tests/test-data/genome-s11.fa.gz' in out
assert 'genome-s12.fa.gz' not in out
print(out)
print(err)

# check the calculations behind the above output by looking into
# the CSV.
with open(os.path.join(location, 'xxx.csv'), 'rt') as fp:
r = csv.DictReader(fp)
# when we project s10x10-s11 (r2+r3), 10:1 abundance,
# onto s10 and s11 genomes with gather, we get:
# * approximately 91% of s10 matching
# * approximately 9% of s11 matching
# * approximately 100% of the high coverage genome being matched,
# with only 80% of the low coverage genome
# * approximately 2.0 abundance (third column, avg_abund) for s11,
# and (very) approximately 20x abundance for genome s10.

assert '91.0% 100.0% 14.5 tests/test-data/genome-s10.fa.gz' in out
assert '9.0% 80.0% 1.9 tests/test-data/genome-s11.fa.gz' in out
assert 'genome-s12.fa.gz' not in out

# check the calculations behind the above output by looking into
# the CSV.
with open(c.output('xxx.csv'), 'rt') as fp:
r = csv.DictReader(fp)

overlaps = []
f_weighted_list = []
average_abunds = []
overlaps = []
f_weighted_list = []
average_abunds = []

for row in r:
overlap = float(row['intersect_bp'])
f_weighted = float(row['f_unique_weighted'])
average_abund = float(row['average_abund'])
for row in r:
overlap = float(row['intersect_bp'])
f_weighted = float(row['f_unique_weighted'])
average_abund = float(row['average_abund'])

overlaps.append(overlap)
f_weighted_list.append(f_weighted)
average_abunds.append(average_abund)
overlaps.append(overlap)
f_weighted_list.append(f_weighted)
average_abunds.append(average_abund)

weighted_calc = []
for (overlap, average_abund) in zip(overlaps, average_abunds):
prod = overlap*average_abund
weighted_calc.append(prod)
weighted_calc = []
for (overlap, average_abund) in zip(overlaps, average_abunds):
prod = overlap*average_abund
weighted_calc.append(prod)

total_weighted = sum(weighted_calc)
for prod, f_weighted in zip(weighted_calc, f_weighted_list):
assert prod / total_weighted == f_weighted, (prod, f_weighted)
total_weighted = sum(weighted_calc)
for prod, f_weighted in zip(weighted_calc, f_weighted_list):
assert prod / total_weighted == f_weighted, (prod, f_weighted)


def test_gather_abund_10_1_ignore_abundance():
@utils.in_thisdir
def test_gather_abund_10_1_ignore_abundance(c):
# see comments in test_gather_abund_1_1, above.
# nullgraph/make-reads.py -S 1 -r 200 -C 2 tests/test-data/genome-s10.fa.gz > r1.fa
# nullgraph/make-reads.py -S 1 -r 200 -C 20 tests/test-data/genome-s10.fa.gz > r2.fa
# nullgraph/make-reads.py -S 1 -r 200 -C 2 tests/test-data/genome-s11.fa.gz > r3.fa
# ./sourmash compute -k 21 --scaled 1000 --merge=1-1 -o reads-s10-s11.sig r[13].fa --track-abundance
# ./sourmash compute -k 21 --scaled 1000 --merge=10-1 -o reads-s10x10-s11.sig r[23].fa --track-abundance

with utils.TempDirectory() as location:
query = utils.get_test_data('gather-abund/reads-s10x10-s11.sig')
against_list = ['genome-s10', 'genome-s11', 'genome-s12']
against_list = [ 'gather-abund/' + i + '.fa.gz.sig' \
for i in against_list ]
against_list = [ utils.get_test_data(i) for i in against_list ]
query = utils.get_test_data('gather-abund/reads-s10x10-s11.sig')
against_list = ['genome-s10', 'genome-s11', 'genome-s12']
against_list = [ 'gather-abund/' + i + '.fa.gz.sig' \
for i in against_list ]
against_list = [ utils.get_test_data(i) for i in against_list ]

status, out, err = utils.runscript('sourmash',
['gather', query] \
+ ['--ignore-abundance'] + \
against_list,
in_directory=location)
status, out, err = c.run_sourmash('gather', query,
'--ignore-abundance',
*against_list)

print(out)
print(err)
assert all(('57.2% 100.0%', 'tests/test-data/genome-s10.fa.gz' in out))
assert all(('42.8% 80.0%', 'tests/test-data/genome-s11.fa.gz' in out))
assert 'genome-s12.fa.gz' not in out
print(out)
print(err)

# when we project s10x10-s11 (r2+r3), 10:1 abundance,
# onto s10 and s11 genomes with gather --ignore-abundance, we get:
# * approximately 50% of s10 and s11 matching (first column)
# * approximately 100% of the high coverage genome being matched,
# with only 80% of the low coverage genome
# no abundance-weighted information is provided here. @CTB check?

assert all(('57.2% 100.0%', 'tests/test-data/genome-s10.fa.gz' in out))
assert all(('42.8% 80.0%', 'tests/test-data/genome-s11.fa.gz' in out))
assert 'genome-s12.fa.gz' not in out


@utils.in_tempdir
Expand Down