[MRG] Add FrozenMinHash (#1508)
* have the 'find' function for SBTs return signatures

* fix majority of tests

* comment & then fix test

* torture the tests into working

* split find and _find_nodes to take different kinds of functions

* redo 'find' on index

* refactor lca_db to use new find

* refactor SBT to use new find

* comment/cleanup

* refactor out common code

* fix up gather

* use 'passes' properly

* attempted cleanup

* minor fixes

* get a start on correct downsampling

* adjust tree downsampling for regular minhashes, too

* remove now-unused search functions in sbtmh

* refactor categorize to use new find

* cleanup and removal

* remove redundant code in lca_db

* remove redundant code in SBT

* add notes

* remove more unused code

* refactor most of the test_sbt tests

* fix one minor issue

* fix jaccard calculation in sbt

* check for compatibility of search fn and query signature

* switch tests over to jaccard similarity, not containment

* fix test

* remove test for unimplemented LCA_Database.find method

* document threshold change; update test

* refuse to run abund signatures

* flatten sigs internally for gather

* reinflate abundances for saving

* fix problem where sbt indices could be created with abund signatures

* more

* split flat and abund search

* make ignore_abundance work again for categorize

* turn off best-only, since it triggers on self-hits.

* add test: 'sourmash index' flattens sigs

* add note about something to test

* fix typo; still broken tho

* location is now a property

* move search code into search.py

* remove redundant scaled checking code

* best-only now works properly for two tests

* 'fix' tests by removing v1 and v2 SBT compatibility

* simplify (?) downsampling code

* require keyword args in MinHash.downsample(...)

* fix bug with downsample

* require keyword args in MinHash.downsample(...)

* fix test to use proper downsampling, reverse order to match scaled

* add test for revealed bug

* remove unnecessary comment

* flatten subject MinHash, too

* add testme comment

* clean up sbt find

* clean up lca find

* add IndexSearchResult namedtuple for search and gather results

* add more tests for Index classes

* add tests for subj & query num downsampling

* tests for Index.search_abund

* refactor a bit

* refactor make_jaccard_search_query; start tests

* even more tests

* test collect, best_only

* more search tests

* remove unnec space

* add minor comment

* deal with status == None on SystemExit

* upgrade and simplify categorize

* restore test

* merge

* fix abundance search in SBT for categorize

* code cleanup and refactoring; check for proper error messages

* add explicit test for incompatible num

* refactor MinHash.downsample

* deal with status == None on SystemExit

* fix test

* fix comment misspelling

* properly pass kwargs; fix search_sbt_index

* add simple tests for SBT load and search API

* allow arbitrary kwargs for LCA_Database.find

* add testing of passthru-kwargs

* re-enable test

* add notes to update docstrings

* docstring updates

* fix test

* fix location reporting in prefetch

* fix prefetch location by fixing MultiIndex

* temporary prefetch_gather intervention

* 'gather' only returns best match

* turn prefetch on by default, for now

* better tests for gather --save-unassigned

* remove unused print

* remove unnecessary check-me comment

* clear out docstring

* SBT search doesn't work on v1 and v2 SBTs b/c no min_n_below

* start adding tests

* test some basic prefetch stuff

* update index for prefetch

* add fairly thorough tests

* fix my dumb mistake with gather

* simplify, refactor, fix

* fix remaining tests

* propagate ValueErrors better

* fix tests

* flatten prefetch queries

* fix for genome-grist alpha test

* fix threshold bugarooni

* fix gather/prefetch interactions

* fix sourmash prefetch return value

* minor fixes

* pay proper attention to threshold

* cleanup and refactoring

* remove unnecessary 'scaled'

* minor cleanup

* added LazyLinearIndex and prefetch --linear

* fix abundance problem

* save matches to a directory

* test for saving matches to a directory

* add a flexible progressive signature output class

* add tests for .sig.gz and .zip outputs

* update save_signatures code; add tests; use in gather and search too

* update comment

* cleanup and refactor of SaveSignaturesToLocation code

* docstrings & cleanup

* add 'run' and 'runtmp' test fixtures

* remove unnecessary track_abundance fixture call

* restore original;

* linear and prefetch fixtures + runtmp

* fix use of runtmp

* copy over SaveSignaturesToLocation code from other branch

* docs for sourmash prefetch

* more doc

* minor edits

* Re-implement the actual gather protocol with a cleaner interface. (#1489)

* initial refactor of CounterGather stuff

* refactor into peek and consume

* move next method over to query specific class

* replace gather implementation with new CounterGather

* many more tests for CounterGather

* remove scaled arg from peek

* open-box test for counter internal data structures

* add num query & subj tests

* add repr; add tests; support stdout

* refactor signature saving to use new sourmash_args collection saving

* specify utf-8 encoding for output

* add flexible output to compute/sketch

* add test to trigger rust panic

* test search --save-matches

* add --save-prefetch to sourmash gather

* remove --no-prefetch option :)

* added --save-prefetch functionality

* add back a mostly-functioning --no-prefetch argument :)

* add --no-prefetch back in

* check for JSON in first byte of LCA DB file

* start adding linear tests

* use fixtures to test prefetch and linear more thoroughly

* comments, etc

* upgrade docs for --linear and --prefetch

* 'fix' issue and test

* fix a last test ;)

* Update doc/command-line.md

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>

* Update src/sourmash/cli/sig/rename.py

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>

* Update tests/test_sourmash_args.py

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>

* Update tests/test_sourmash_args.py

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>

* Update tests/test_sourmash_args.py

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>

* Update tests/test_sourmash_args.py

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>

* Update tests/test_sourmash_args.py

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>

* Update doc/command-line.md

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>

* write tests for LazyLinearIndex

* add some basic prefetch tests

* properly test linear!

* add more tests for LazyLinearIndex

* test zipfile bool

* remove unnecessary try/except; comment

* fix signatures() call

* fix --prefetch snafu; doc

* do not overwrite signature even if duplicate md5sum (#1497)

* try adding loc to return values from Index.find

* made use of new IndexSearchResult.find throughout

* adjust note

* provide signatures_with_location on all Index objects

* cleanup and fix

* Update doc/command-line.md

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>

* Update doc/command-line.md

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>

* fix bug around --save-prefetch with multiple databases

* comment/doc minor updates

* initial trial implementation of ImmutableMinHash

* fix tests

* provide our own pickle for ImmutableMinHash

* ok, a few more places to change.

* rename to FrozenMinHash per luiz

* finish renaming, add some tests

* thanks, I hate the old behavior

* copy.copy is no longer needed

* docs and an explicit 'frozen' method

* switch to using 'to_frozen' and 'to_mutable'

Co-authored-by: Luiz Irber <luizirber@users.noreply.github.com>
Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>
3 people committed May 15, 2021
1 parent 4c48f39 commit 9712009
Showing 8 changed files with 152 additions and 8 deletions.
5 changes: 3 additions & 2 deletions src/sourmash/commands.py
@@ -1106,7 +1106,8 @@ def prefetch(args):
 
     # iterate over signatures in db one at a time, for each db;
     # find those with sufficient overlap
-    noident_mh = copy.copy(query_mh)
+    noident_mh = query_mh.to_mutable()
+
     did_a_search = False # track whether we did _any_ search at all!
     for dbfilename in args.databases:
         notify(f"loading signatures from '{dbfilename}'")
@@ -1164,7 +1165,7 @@ def prefetch(args):
         notify(f"saved {matches_out.count} matches to CSV file '{args.output}'")
         csvout_fp.close()
 
-    matched_query_mh = copy.copy(query_mh)
+    matched_query_mh = query_mh.to_mutable()
     matched_query_mh.remove_many(noident_mh.hashes)
     notify(f"of {len(query_mh)} distinct query hashes, {len(matched_query_mh)} were found in matches above threshold.")
     notify(f"a total of {len(noident_mh)} query hashes remain unmatched.")
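The hash bookkeeping at the end of `prefetch` can be sketched with plain Python sets. This is only an illustration under simplifying assumptions, not sourmash code: `partition_query_hashes` and its arguments are hypothetical names, and a real MinHash carries more state than a set of integers.

```python
def partition_query_hashes(query_hashes, match_hash_sets):
    """Split query hashes into (matched, unmatched) given a list of matches."""
    # Start from a mutable copy so the caller's query is never modified
    # (the role played by query_mh.to_mutable() in the hunk above).
    unmatched = set(query_hashes)
    for match in match_hash_sets:
        unmatched.difference_update(match)

    # Whatever was removed from the working copy must have been matched.
    matched = set(query_hashes) - unmatched
    return matched, unmatched


query = frozenset({1, 2, 3, 4})
matched, unmatched = partition_query_hashes(query, [{1, 2}, {2, 5}])
print(sorted(matched), sorted(unmatched))  # prints: [1, 2] [3, 4]
```

The query itself stays untouched; only the working copy shrinks, which is why the diff swaps `copy.copy(...)` for an explicit `to_mutable()` call.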
106 changes: 105 additions & 1 deletion src/sourmash/minhash.py
@@ -588,7 +588,7 @@ def __add__(self, other):
         if self.num != other.num:
             raise TypeError(f"incompatible num values: self={self.num} other={other.num}")
 
-        new_obj = self.__copy__()
+        new_obj = self.to_mutable()
         new_obj += other
         return new_obj
 
@@ -645,3 +645,107 @@ def moltype(self): # TODO: test in minhash tests
             return 'hp'
         else:
             return 'DNA'
+
+    def to_mutable(self):
+        "Return a copy of this MinHash that can be changed."
+        return self.__copy__()
+
+    def to_frozen(self):
+        "Return a frozen copy of this MinHash that cannot be changed."
+        new_mh = self.__copy__()
+        new_mh.__class__ = FrozenMinHash
+        return new_mh
+
+
+class FrozenMinHash(MinHash):
+    def add_sequence(self, *args, **kwargs):
+        raise TypeError('FrozenMinHash does not support modification')
+
+    def add_kmer(self, *args, **kwargs):
+        raise TypeError('FrozenMinHash does not support modification')
+
+    def add_many(self, *args, **kwargs):
+        raise TypeError('FrozenMinHash does not support modification')
+
+    def remove_many(self, *args, **kwargs):
+        raise TypeError('FrozenMinHash does not support modification')
+
+    def add_hash(self, *args, **kwargs):
+        raise TypeError('FrozenMinHash does not support modification')
+
+    def add_hash_with_abundance(self, *args, **kwargs):
+        raise TypeError('FrozenMinHash does not support modification')
+
+    def clear(self, *args, **kwargs):
+        raise TypeError('FrozenMinHash does not support modification')
+
+    def remove_many(self, *args, **kwargs):
+        raise TypeError('FrozenMinHash does not support modification')
+
+    def set_abundances(self, *args, **kwargs):
+        raise TypeError('FrozenMinHash does not support modification')
+
+    def add_protein(self, *args, **kwargs):
+        raise TypeError('FrozenMinHash does not support modification')
+
+    def downsample(self, *, num=None, scaled=None):
+        if scaled and self.scaled == scaled:
+            return self
+        if num and self.num == num:
+            return self
+
+        return MinHash.downsample(self, num=num, scaled=scaled).to_frozen()
+
+    def flatten(self):
+        if not self.track_abundance:
+            return self
+        return MinHash.flatten(self).to_frozen()
+
+    def __iadd__(self, *args, **kwargs):
+        raise TypeError('FrozenMinHash does not support modification')
+
+    def merge(self, *args, **kwargs):
+        raise TypeError('FrozenMinHash does not support modification')
+
+    def to_mutable(self):
+        "Return a copy of this MinHash that can be changed."
+        mut = MinHash.__new__(MinHash)
+        state_tup = self.__getstate__()
+
+        # is protein/hp/dayhoff?
+        if state_tup[2] or state_tup[3] or state_tup[4]:
+            state_tup = list(state_tup)
+            # adjust ksize.
+            state_tup[1] = state_tup[1] * 3
+        mut.__setstate__(state_tup)
+        return mut
+
+    def to_frozen(self):
+        "Return a frozen copy of this MinHash that cannot be changed."
+        return self
+
+    def __setstate__(self, tup):
+        "support pickling via __getstate__/__setstate__"
+        (n, ksize, is_protein, dayhoff, hp, mins, _, track_abundance,
+         max_hash, seed) = tup
+
+        self.__del__()
+
+        hash_function = (
+            lib.HASH_FUNCTIONS_MURMUR64_DAYHOFF if dayhoff else
+            lib.HASH_FUNCTIONS_MURMUR64_HP if hp else
+            lib.HASH_FUNCTIONS_MURMUR64_PROTEIN if is_protein else
+            lib.HASH_FUNCTIONS_MURMUR64_DNA
+        )
+
+        scaled = _get_scaled_for_max_hash(max_hash)
+        self._objptr = lib.kmerminhash_new(
+            scaled, ksize, hash_function, seed, track_abundance, n
+        )
+        if track_abundance:
+            MinHash.set_abundances(self, mins)
+        else:
+            MinHash.add_many(self, mins)
+
+    def __copy__(self):
+        return self
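The frozen/mutable split introduced in this hunk can be illustrated with a small self-contained toy. The sketch below is an assumption-laden stand-in, not the sourmash implementation: `ToySketch` and `FrozenToySketch` are hypothetical names, and a plain Python set replaces the Rust-backed object. It mirrors three ideas from the diff: `to_frozen` copies and then reassigns `__class__`, the frozen subclass raises `TypeError` on mutation, and the frozen `to_mutable` must build a fresh object because the frozen `__copy__` returns `self`.

```python
class ToySketch:
    """Hypothetical mutable sketch: a set of integer hashes."""

    def __init__(self, hashes=()):
        self._hashes = set(hashes)

    @property
    def hashes(self):
        return frozenset(self._hashes)

    def add_hash(self, h):
        self._hashes.add(h)

    def __copy__(self):
        # A real copy: the new object owns its own set.
        return ToySketch(self._hashes)

    def to_mutable(self):
        "Return a copy of this sketch that can be changed."
        return self.__copy__()

    def to_frozen(self):
        "Return a frozen copy of this sketch that cannot be changed."
        new = self.__copy__()
        new.__class__ = FrozenToySketch  # same data, stricter behavior
        return new


class FrozenToySketch(ToySketch):
    def add_hash(self, *args, **kwargs):
        raise TypeError("FrozenToySketch does not support modification")

    def to_mutable(self):
        # Build a fresh ToySketch; the __copy__ below would return self.
        return ToySketch(self._hashes)

    def to_frozen(self):
        return self  # already immutable, so sharing is safe

    def __copy__(self):
        return self


frozen = ToySketch([10]).to_frozen()
mutable = frozen.to_mutable()
mutable.add_hash(12)
print(sorted(frozen.hashes), sorted(mutable.hashes))  # prints: [10] [10, 12]
```

Returning `self` from the frozen `__copy__` and `to_frozen` is only safe because no method can change the object afterwards; that is the trade the design appears to make in exchange for avoiding copies.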
1 change: 1 addition & 0 deletions src/sourmash/search.py
@@ -354,6 +354,7 @@ def gather_databases(query, counters, threshold_bp, ignore_abundance):
 
         # construct a new query, subtracting hashes found in previous one.
         new_query_mh = query.minhash.downsample(scaled=cmp_scaled)
+        new_query_mh = new_query_mh.to_mutable()
         new_query_mh.remove_many(set(found_mh.hashes))
         new_query = SourmashSignature(new_query_mh)
 
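The "subtract what was found, then search again" step in this hunk can be sketched as a greedy loop over plain sets. This is a simplified illustration only: `greedy_gather` and the candidate names are hypothetical, and the sketch leaves out the thresholds, downsampling, and counter objects that the real `gather_databases` works with.

```python
def greedy_gather(query_hashes, candidates):
    """Return ([(name, newly_covered_hashes), ...], leftover_hashes)."""
    remaining = set(query_hashes)  # mutable working copy of the query
    results = []
    while remaining and candidates:
        # Pick the candidate covering the most still-unassigned hashes.
        name, hashes = max(candidates.items(),
                           key=lambda item: len(remaining & item[1]))
        found = remaining & hashes
        if not found:
            break  # nothing left overlaps the query
        results.append((name, found))
        # Analogous to new_query_mh.remove_many(found_mh.hashes).
        remaining -= found
    return results, remaining


candidates = {"a": {1, 2, 3, 7}, "b": {3, 4}, "c": {5}}
results, leftover = greedy_gather({1, 2, 3, 4, 5, 6}, candidates)
print([name for name, _ in results], sorted(leftover))  # prints: ['a', 'b', 'c'] [6]
```

Because each round mutates the working copy, a frozen query has to be converted first, which is what the added `to_mutable()` line does.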
4 changes: 2 additions & 2 deletions src/sourmash/signature.py
@@ -9,7 +9,7 @@
 
 from .logging import error
 from . import MinHash
-from .minhash import to_bytes
+from .minhash import to_bytes, FrozenMinHash
 from ._lowlevel import ffi, lib
 from .utils import RustObject, rustcall, decode_str
 
@@ -42,7 +42,7 @@ def __init__(self, minhash, name="", filename=""):
 
     @property
     def minhash(self):
-        return MinHash._from_objptr(
+        return FrozenMinHash._from_objptr(
             self._methodcall(lib.signature_first_mh)
         )
 
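Having the `minhash` property hand back a frozen object means callers can no longer change a signature's sketch by accident. A rough analogy using only built-in types is shown below; `ToySignature` is a hypothetical class, and `frozenset` merely stands in for `FrozenMinHash`.

```python
class ToySignature:
    """Hypothetical container that owns a set of hashes."""

    def __init__(self, hashes, name=""):
        self._hashes = set(hashes)
        self.name = name

    @property
    def minhash(self):
        # Hand out an immutable snapshot; callers cannot alter our state.
        return frozenset(self._hashes)


sig = ToySignature({1, 2, 3}, name="example")
view = sig.minhash        # read-only, like the frozen return value
working = set(view)       # explicit mutable copy, like .to_mutable()
working.discard(1)
print(sorted(sig.minhash), sorted(working))  # prints: [1, 2, 3] [2, 3]
```

Any code that wants to modify the hashes has to ask for a mutable copy explicitly, which is the pattern the test changes in this commit adopt.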
3 changes: 2 additions & 1 deletion tests/test_index.py
@@ -1326,6 +1326,7 @@ def is_found(ss, xx):
 
 def _consume_all(query_mh, counter, threshold_bp=0):
     results = []
+    query_mh = query_mh.to_mutable()
 
     last_intersect_size = None
     while 1:
@@ -1891,7 +1892,7 @@ def test_counter_gather_3_test_consume():
 
     ## round 1
 
-    cur_query = copy.copy(query_ss.minhash)
+    cur_query = query_ss.minhash.to_mutable()
     (sr, intersect_mh) = counter.peek(cur_query)
     assert sr.signature == match_ss_1
     assert len(intersect_mh) == 10
37 changes: 37 additions & 0 deletions tests/test__minhash.py → tests/test_minhash.py
@@ -42,6 +42,7 @@
 import sourmash
 from sourmash.minhash import (
     MinHash,
+    FrozenMinHash,
     hash_murmur,
     _get_scaled_for_max_hash,
     _get_max_hash_for_scaled,
@@ -1908,3 +1909,39 @@ def test_max_containment_equal():
     assert mh2.contained_by(mh1) == 1
     assert mh1.max_containment(mh2) == 1
     assert mh2.max_containment(mh1) == 1
+
+
+def test_frozen_and_mutable_1(track_abundance):
+    # mutable minhashes -> mutable minhashes creates new copy
+    mh1 = MinHash(0, 21, scaled=1, track_abundance=track_abundance)
+    mh2 = mh1.to_mutable()
+
+    mh1.add_hash(10)
+    assert 10 not in mh2.hashes
+
+
+def test_frozen_and_mutable_2(track_abundance):
+    # check that mutable -> frozen are separate
+    mh1 = MinHash(0, 21, scaled=1, track_abundance=track_abundance)
+    mh1.add_hash(10)
+
+    mh2 = mh1.to_frozen()
+    assert 10 in mh2.hashes
+    mh1.add_hash(11)
+    assert 11 not in mh2.hashes
+
+
+def test_frozen_and_mutable_3(track_abundance):
+    # check that mutable -> frozen -> mutable are all separate from each other
+    mh1 = MinHash(0, 21, scaled=1, track_abundance=track_abundance)
+    mh1.add_hash(10)
+
+    mh2 = mh1.to_frozen()
+    assert 10 in mh2.hashes
+    mh1.add_hash(11)
+    assert 11 not in mh2.hashes
+
+    mh3 = mh2.to_mutable()
+    mh3.add_hash(12)
+    assert 12 not in mh2.hashes
+    assert 12 not in mh1.hashes
2 changes: 1 addition & 1 deletion tests/test_prefetch.py
@@ -295,7 +295,7 @@ def test_prefetch_nomatch_hashes(runtmp, linear_gather):
     ss47 = sourmash.load_one_signature(sig47, ksize=31)
     ss63 = sourmash.load_one_signature(sig63, ksize=31)
 
-    remain = ss47.minhash
+    remain = ss47.minhash.to_mutable()
     remain.remove_many(ss63.minhash.hashes)
 
     ss = sourmash.load_one_signature(nomatch_out)
2 changes: 1 addition & 1 deletion tests/test_sourmash.py
@@ -3111,7 +3111,7 @@ def test_gather_f_match_orig(runtmp, linear_gather, prefetch_gather):
     print(runtmp.last_result.err)
 
     combined_sig = sourmash.load_one_signature(testdata_combined, ksize=21)
-    remaining_mh = copy.copy(combined_sig.minhash)
+    remaining_mh = combined_sig.minhash.to_mutable()
 
     def approx_equal(a, b, n=5):
         return round(a, n) == round(b, n)
