MRG: minor refactoring of gather code, small doc updates (#2953)
This PR eliminates some unnecessary / duplicate code and renames some
variables to be more consistent.

Fixes #1831.
Fixes #1737.
Fixes #2835.
ctb committed Jan 30, 2024
1 parent eee9bdb commit 846e8ca
Showing 4 changed files with 56 additions and 35 deletions.
26 changes: 14 additions & 12 deletions doc/classifying-signatures.md
@@ -190,15 +190,17 @@ first as "abundance projection" and the second as "angular similarity".

`sourmash gather` can report approximate abundance information for
containment queries against genome databases. This will give you
-numbers that (approximately) match what you get from counting mapped
-reads.
+numbers that (approximately) match what you get from counting the coverage
+of each contig by mapping reads.

-If you create your input signatures with `-p abund`,
-`sourmash gather` will use that information
-to calculate an abundance-weighted result. This will weight
-each match to a hash value by the multiplicity of the hash value in
+If you create your query signature with `-p abund`,
+`sourmash gather` will use the resulting k-mer multiplicity information
+to calculate an abundance-weighted result, weighting
+each hash value match by the multiplicity of the hash value in
the query signature. You can turn off this behavior with
-`--ignore-abundance`.
+`--ignore-abundance`. The abundance is reported as column `avg_abund`
+in the console output, and columns `average_abund`, `median_abund`, and
+`std_abund` in the CSV output.
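
To make the weighting concrete, here is a minimal sketch of the
bookkeeping, with made-up hash values and multiplicities standing in for
the hash-to-abundance mapping that `sourmash gather` builds from the
query:

```
# Toy sketch of abundance weighting: each query hash carries a
# multiplicity, and a match is credited with the summed multiplicity of
# the hashes it shares with the query.
query_abunds = {101: 5, 102: 5, 103: 1, 104: 1}  # hash -> multiplicity
match_hashes = {101, 102}                        # hashes shared with a match

total_weighted = sum(query_abunds.values())                  # 12
found_weighted = sum(query_abunds[h] for h in match_hashes)  # 10

print(len(match_hashes) / len(query_abunds))  # unweighted: 0.5
print(found_weighted / total_weighted)        # weighted: ~0.83
```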

For example, if you have a metagenome composed of two equal sized genomes
A and B, with A present at 10 times the abundance of B, `gather` on
@@ -347,7 +349,7 @@ First, we make some synthetic data sets:
then we make signature s10-s11 with r1 and r3, i.e. 1:1 abundance, and
make signature s10x10-s11 with r2 and r3, i.e. 10:1 abundance.

-### A first experiment: 1:1 abundance.
+### A first experiment: 1:1 abundance ratio.

When we project r1+r3, 1:1 abundance, onto s10, s11, and s12 genomes
with gather:
@@ -367,10 +369,10 @@ overlap p_query p_match avg_abund

* approximately 50% of each query matching (second column, `p_query`)
* approximately 80% of each subject genome's contents being matched (this is due to the low coverage of 2 used to build the queries; `p_match`)
-* approximately 2.0 abundance (third column, `avg_abund`)
+* approximately 2.0 coverage (fourth column, `avg_abund`)
* no match to genome s12.

-### A second experiment: 10:1 abundance.
+### A second experiment: 10:1 abundance ratio.

When we project r2+r3, 10:1 abundance, onto s10, s11, and s12 genomes
with gather:
@@ -391,11 +393,11 @@ overlap p_query p_match avg_abund
* approximately 91% of s10 matching
* approximately 9% of s11 matching
* approximately 100% of the high coverage genome being matched, with only 80% of the low coverage genome
-* approximately 2.0 abundance (third column, avg_abund) for s11, and (very) approximately 20x abundance for genome s10.
+* approximately 2.0 coverage (fourth column, avg_abund) for s11, and (very) approximately 20x coverage for genome s10; see the arithmetic check below.
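
The ~91%/~9% split and the coverage estimates in the list above follow
directly from how the synthetic data were built; a quick arithmetic
check, assuming equal genome sizes and the base coverage of 2 used for
the queries:

```
# Expected values for two equal-sized genomes mixed at 10:1, with reads
# generated at ~2x base coverage (as in the synthetic data above).
print(10 / 11)  # ~0.909 -> ~91% of the weighted query assigned to s10
print(1 / 11)   # ~0.091 -> ~9% assigned to s11
print(2 * 10)   # expected avg_abund for s10: ~20
print(2 * 1)    # expected avg_abund for s11: ~2.0
```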

The cause of the poor approximation for genome s10 is unclear; it
could be due to low coverage, or small genome size, or something
-else. More investigation needed.
+else. More investigation is needed.

## Appendix C: sourmash gather output examples

6 changes: 3 additions & 3 deletions doc/command-line.md
@@ -373,9 +373,9 @@ the recovered matches hit 45.6% of the query k-mers (unweighted).
```

For each match,
-* 'overlap', the first column, is the estimated number of k-mers shared between the match and the query.
-* 'p_query' is the _percentage_ of the query that overlaps with the match; it is the amount of the metagenome "explained" by this match.
-* 'p_match' is the percentage of the _match_ that overlaps with the query; it is the "detection" of the match in the metagenome.
+* 'overlap', the first column, is the estimated number of base pairs shared between the match and the query, based on the number of shared hashes.
+* 'p_query' is the _percentage_ of the query that overlaps with the match; it is the amount of the metagenome "explained" by this match. It is typically a lower bound on the percent of metagenome reads that will map to this genome.
+* 'p_match' is the percentage of the _match_ that overlaps with the query; it is the "detection" of the match in the metagenome. It is typically a lower bound on the fraction of the match genome that will be covered by mapped reads. (All three columns are reconstructed in the sketch below.)
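
As a rough reconstruction (not sourmash's actual implementation), all
three columns can be derived from the two FracMinHash sketches plus the
`scaled` value; the hash sets below are toy stand-ins:

```
# Toy reconstruction of the gather summary columns: in a FracMinHash
# sketch, each retained hash represents roughly `scaled` base pairs.
scaled = 1000
query_hashes = set(range(500))       # stand-in for the metagenome sketch
match_hashes = set(range(300, 700))  # stand-in for one genome sketch

shared = query_hashes & match_hashes

overlap_bp = len(shared) * scaled          # 'overlap': estimated shared bp
p_query = len(shared) / len(query_hashes)  # 'p_query': query explained
p_match = len(shared) / len(match_hashes)  # 'p_match': detection of match

print(f"{overlap_bp / 1e6:.1f} Mbp  {p_query:.1%}  {p_match:.1%}")
# -> 0.2 Mbp  40.0%  50.0%
```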

Quite a bit more information per match row is available in the CSV
output saved with `-o`; for details, see
13 changes: 12 additions & 1 deletion src/sourmash/cli/gather.py
@@ -24,7 +24,7 @@
sourmash gather query.sig [ list of signatures or SBTs ]
```
-Example output:
+Example output for an unweighted/noabund query:
```
overlap     p_query p_match
---------   ------- --------
@@ -35,6 +35,17 @@
0.7 Mbp 5.3%% 17.6%% AE017285.1 Desulfovibrio vulgaris sub...
```
+Example output for a weighted query:
+```
+overlap     p_query   p_match   avg_abund
+---------   -------   -------   ---------
+9.3 Mbp       0.8%%     97.5%%       6.7    NC_007951.1 Burkholderia xenovorans ...
+7.3 Mbp       2.3%%     99.9%%      23.9    NC_003272.1 Nostoc sp. PCC 7120 DNA,...
+7.0 Mbp       8.9%%    100.0%%      94.5    BX119912.1 Rhodopirellula baltica SH...
+6.6 Mbp       1.4%%    100.0%%      16.3    NC_009972.1 Herpetosiphon aurantiacu...
+...
+```
The command line option `--threshold-bp` sets the threshold below
which matches are no longer reported; by default, this is set to
50kb. See the Appendix in Classifying Signatures [1] for details.
46 changes: 27 additions & 19 deletions src/sourmash/search.py
@@ -433,18 +433,27 @@ class GatherResult(PrefetchResult):
sum_weighted_found: int = None
total_weighted_hashes: int = None

-gather_write_cols = ['intersect_bp', 'f_orig_query', 'f_match', 'f_unique_to_query',
-'f_unique_weighted','average_abund', 'median_abund', 'std_abund', 'filename', # here we use 'filename'
-'name', 'md5', 'f_match_orig', 'unique_intersect_bp', 'gather_result_rank',
-'remaining_bp', 'query_filename', 'query_name', 'query_md5', 'query_bp', 'ksize',
-'moltype', 'scaled', 'query_n_hashes', 'query_abundance', 'query_containment_ani',
-'match_containment_ani', 'average_containment_ani', 'max_containment_ani',
+gather_write_cols = ['intersect_bp', 'f_orig_query', 'f_match',
+'f_unique_to_query',
+'f_unique_weighted','average_abund',
+'median_abund', 'std_abund', 'filename',
+'name', 'md5',
+'f_match_orig', 'unique_intersect_bp',
+'gather_result_rank', 'remaining_bp',
+'query_filename', 'query_name', 'query_md5',
+'query_bp', 'ksize', 'moltype', 'scaled',
+'query_n_hashes', 'query_abundance',
+'query_containment_ani',
+'match_containment_ani',
+'average_containment_ani',
+'max_containment_ani',
'potential_false_negative',
-'n_unique_weighted_found', 'sum_weighted_found',
+'n_unique_weighted_found',
+'sum_weighted_found',
'total_weighted_hashes']

ci_cols = ["query_containment_ani_low", "query_containment_ani_high",
"match_containment_ani_low", "match_containment_ani_high"]
"match_containment_ani_low", "match_containment_ani_high"]

gather_write_cols_ci = gather_write_cols + ci_cols
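
These column lists pin down the header of the gather CSV saved with
`-o`. A hedged sketch of how such a list can drive a writer (the
`csv.DictWriter` usage is illustrative, not sourmash's actual CSV code,
and the column subset is abbreviated):

```
import csv, io

# Illustrative only: an ordered column list fixes the CSV header, and
# missing values are written as empty fields.
cols = ['intersect_bp', 'f_orig_query', 'f_match', 'name']

buf = io.StringIO()
w = csv.DictWriter(buf, fieldnames=cols)
w.writeheader()
w.writerow({'intersect_bp': 4100000, 'f_orig_query': 0.33,
            'name': 'NC_003197.2'})
print(buf.getvalue())
```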

@@ -671,9 +680,10 @@ def __init__(self, query, counters, *,
# do we pay attention to abundances?
query_mh = query.minhash
query_hashes = query_mh.hashes
-orig_query_abunds = { k: 1 for k in query_hashes }
if track_abundance:
orig_query_abunds = query_hashes
+else:
+orig_query_abunds = { k: 1 for k in query_hashes }

# adjust for not found...
if noident_mh is None: # create empty
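
The refactored branch above makes the two abundance modes explicit:
with `track_abundance` off, every hash gets weight 1, so weighted totals
collapse to plain hash counts. A toy check with made-up hashes and
multiplicities:

```
# Toy check of the branch above: weight-1 abundances make weighted sums
# equal plain hash counts.
query_hashes = {11: 3, 22: 7, 33: 1}  # hash -> multiplicity

for track_abundance in (True, False):
    if track_abundance:
        orig_query_abunds = query_hashes
    else:
        orig_query_abunds = {k: 1 for k in query_hashes}
    print(track_abundance, sum(orig_query_abunds.values()))
# True 11
# False 3
```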
@@ -721,9 +731,9 @@ def _update_scaled(self, scaled):
orig_query_abunds = self.orig_query_abunds
self.noident_query_sum_abunds = sum(( orig_query_abunds[k] \
for k in self.noident_mh.hashes ))
-self.sum_abunds = sum(( orig_query_abunds[k] \
+self.total_weighted_hashes = sum(( orig_query_abunds[k] \
for k in self.orig_query_mh.hashes ))
-self.sum_abunds += self.noident_query_sum_abunds
+self.total_weighted_hashes += self.noident_query_sum_abunds

if max_scaled != scaled:
return max_scaled
@@ -746,7 +756,6 @@ def __next__(self):
cmp_scaled = self.cmp_scaled

# will not be changed:
-track_abundance = self.track_abundance
threshold_bp = self.threshold_bp
orig_query_abunds = self.orig_query_abunds

@@ -770,7 +779,7 @@

# retrieve various saved things, after potential downsampling
orig_query_mh = self.orig_query_mh
-sum_abunds = self.sum_abunds
+total_weighted_hashes = self.total_weighted_hashes
noident_mh = self.noident_mh
orig_query_len = len(orig_query_mh) + len(noident_mh)

@@ -784,29 +793,28 @@
new_query = SourmashSignature(new_query_mh)

# compute weighted information for remaining query hashes
-query_hashes = set(query_mh.hashes) - set(found_mh.hashes)
+query_hashes = set(new_query_mh.hashes)
n_weighted_missed = sum((orig_query_abunds[k] for k in query_hashes))
n_weighted_missed += self.noident_query_sum_abunds
-sum_weighted_found = sum_abunds - n_weighted_missed
+sum_weighted_found = total_weighted_hashes - n_weighted_missed

# build a GatherResult
result = GatherResult(self.orig_query, best_match,
cmp_scaled=scaled,
filename=filename,
gather_result_rank=self.result_n,
gather_querymh=query.minhash,
-ignore_abundance= not track_abundance,
+ignore_abundance=not self.track_abundance,
threshold_bp=threshold_bp,
orig_query_len=orig_query_len,
-orig_query_abunds = self.orig_query_abunds,
+orig_query_abunds=self.orig_query_abunds,
estimate_ani_ci=self.estimate_ani_ci,
sum_weighted_found=sum_weighted_found,
-total_weighted_hashes=sum_abunds,
+total_weighted_hashes=total_weighted_hashes,
)

self.result_n += 1
self.query = new_query
self.orig_query_mh = orig_query_mh

return result
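
The weighted bookkeeping in `__next__` boils down to a running
subtraction: the weight of everything still unexplained is recomputed
each round, and `sum_weighted_found` is the fixed total minus that. A
toy sketch with made-up hashes and multiplicities:

```
# Toy sketch of the loop above: after each match claims some hashes,
# whatever weight remains unexplained is subtracted from the total.
orig_query_abunds = {1: 4, 2: 4, 3: 1, 4: 1}  # hash -> multiplicity
total_weighted_hashes = sum(orig_query_abunds.values())  # 10

remaining = set(orig_query_abunds)   # hashes not yet assigned to a match
for found in [{1, 2}, {3}]:          # hashes claimed by successive matches
    remaining -= found
    n_weighted_missed = sum(orig_query_abunds[k] for k in remaining)
    sum_weighted_found = total_weighted_hashes - n_weighted_missed
    print(sum_weighted_found)  # 8, then 9
```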

