-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Design of hash(::BioSequence). What to do? #243
Comments
😬 This is not good. We might want a separate issue to get a bug-fix on that specifically.
Regarding this, I have a question - is changing the content of a hash breaking, or only the behavior. That is, as long as we fulfill the hashing contract(s), if we change the underlying behavior to improve speed, is that breaking? If the answer is no, I propose that we do (1) as a stop-gap to fix the current bug, since it's simple, and then we can work on further optimization separately.
Isn't there some nuance about the difference between Another (possibly orthogonal) question - do we want a kmer of a sequence to be I am interested in this topic because I remain interested in MinHash and the things that it enables (there's a bunch of neat microbiome work on this by Titus Brown), but I'm not super knowledgeable about the computer science of hashing. Happy to be involved though. |
I'm pretty sure changing the value of a hash is not breaking. That's fine. Yeah, let's do option 1 as a stop-gap. W.r.t the difference between
So if we choose that option, it implies that we can have:
which is also weird. |
Current hashing behaviour hashes equal biosequences with different encodings to different values, which violate the interface of hash. This is a fix until (and if) we figure out a better method of hashing. See issue BioJulia#243
Current hashing behaviour hashes equal biosequences with different encodings to different values, which violate the interface of hash. This is a fix until (and if) we figure out a better method of hashing. See issue #243
Okay here are some before/after benchmarks:
So 3x for short seqs, ~25x for long sequences. Sigh. To speed it up, the only way I can think of is this:
It's doable, but quite a bit of work. |
And all of this would have to be done at the same time to maintain the contract? That is, we can't deal with point 2 first, then handle point 3 and point 4 later? |
Unfortunately, all have to be implemented simultaneously |
The problem with the suggestion above is that the only BioSequences I can imagine anyone wanting to hash in large numbers (and thus where the performance really matters) are Kmers. Hashing of Kmers needs to be really fast. |
Excellent point, you are right. We should do the iterator thing above, but prioritise kmers - that should be extremely fast. |
Current hashing behaviour of Kmers is: # TODO: Ensure this is the right way to go.
# See https://github.com/BioJulia/BioSequences.jl/pull/121#discussion_r475234270
Base.hash(x::Kmer{A,K,N}, h::UInt) where {A,K,N} = hash(x.data, h ⊻ K) Which so it just hashes the underlying Tuple. which is implemented like so: hash(t::Tuple, h::UInt) = hash(t[1], hash(tail(t), h))
function hash(t::Any32, h::UInt)
out = h + tuplehash_seed
for i = length(t):-1:1
out = hash(t[i], out)
end
return out
end |
Also maybe the descisions we make here should be added to the docs on the definition of a BioSequence, as it's really important we maintain this contract if people are going to write generic code and mix packages. |
Trying to think of why this hasnt been a big problem for me already in practical setting, and the only answer I can think of is when use a sequence type with a given alphabet I use it for my entire scripts / application, and given Set and Dict are parametric on the sequence type, any hashing has all been with one consistent type, and so one partiular hashing scheme. |
From discussion on uBioinfo Slack and @ctb:
|
On further consideration and re-reading the previous issue, this isn't super relevant for this issue - I don't think we want to pull in murmurhash as a dependency, and to be fully compatible we'd need to hash the ASCII, which doesn't make sense for |
BioSequences already implement murmur3 in the hash.jl file :) |
So I've been trying to think about this for kmers, and it looks like on the fly translation of bit patterns seems sensible, and it makes sense in that promtion rules of two different alphabets - a two bit and a four bit one favour the alphabet that covers the larger set of symbols. I'm struggling to come up with an elegant set of bit flips to do the 2bit -> 4 bit translation. The best I have so far is these two ideas: const fourbitnucs = (UInt64(1), UInt64(2), UInt64(4), UInt64(8))
function translate_bits(bits::UInt64, from::DNAAlphabet{2}, to::DNAAlphabet{4})
bits = bits
tbits = zero(UInt64)
@inbounds for i in 0:31
element = (bits >> (i * 2)) & UInt64(3)
newelement = foutbitnucs[element + one(UInt64)]
tbits = (tbits << (i * 4)) | newelement
end
end
function translate_bits2(bits::UInt64, from::DNAAlphabet{2}, to::DNAAlphabet{4})
bits = bits
tbits = zero(UInt64)
@inbounds for i in 0:31
element = bits & UInt64(3)
newelement = fourbitnucs[element + one(UInt64)]
tbits = (tbits << 4) | newelement
bits = bits >> 2
end
return tbits
end I'm not a fan of the loop though, I was hoping for a clever combination of shifts and masks as in the bit-parallel counting code. EDIT: These functions are also wrong, as two UInt64's should be output for the single input UInt64, but you get where my thought processes were going with the translation. EDIT EDIT: |
A lookup table also has the advantage that plebs like me can understand it 😆 (bit flipping still seems like black magic to me) |
Here are some relevant bitpacking code: @inline function four_to_two_bit(x::UInt64)::UInt32
m1 = 0x1111111111111111
m2 = m1 << 1
m3 = m1 | m2
y = (x >> 3) & m1
y |= (x >> 2) & m2
y |= (x >> 1) & m3
pack(y)
end
@inline function pack(x::UInt64)::UInt32
m1 = 0x0f0f0f0f0f0f0f0f
m2 = 0x00ff00ff00ff00ff
m3 = 0x0000ffff0000ffff
m4 = 0x00000000ffffffff
x = (x & m1) | (x & ~m1) >> 2
x = (x & m2) | (x & ~m2) >> 4
x = (x & m3) | (x & ~m3) >> 8
x = (x & m4) | (x & ~m4) >> 16
x % UInt32
end
@inline function two_to_four_bits(x::UInt32)::UInt64
m = 0x1111111111111111
y = expand(x)
z = (y & m) << 1 | (m & ~y)
m2 = y & (m << 1)
m2 = m2 | m2 >> 1
(z & m2) << 2 | (z & ~m2)
end
@inline function expand(x::UInt32)::UInt64
m1 = 0x000000000000ffff
m2 = 0x000000ff000000ff
m3 = 0x000f000f000f000f
m4 = 0x0303030303030303
y = x % UInt64
y = (y & m1) | (y & ~m1) << 16
y = (y & m2) | (y & ~m2) << 8
y = (y & m3) | (y & ~m3) << 4
(y & m4) | (y & ~m4) << 2
end |
Like I said... 🧙 🪄 |
I'm thinking the only solution is to just define |
For the next breaking change, we might want to reconsider what equality means for biosequences:
Currently I lean towards erroring on |
So the design of BioSequences makes it difficult to implement efficient and correct hashing.
We want efficient hashing, because hashing underlies operations like putting stuff in a
Set
orDict
, which users expect to be fast.The issue is
isequal(a, b) === isequal(hash(a, x), hash(b, x))
isequal
ought to allow objects of different types to be equal if it represents the same value. E.g.isequal(0, 0.0)
. Breaking this leads to lots of confusion.So, to follow these two rules
BioSequence
should beisequal
if they have the same content, i.e.isequal(dna"TAG", DNAKmer("TAG"))
Now, how do we get two
BioSequence
s with arbitrary encoding to hash equivalently? As I see it, it means we can't hash the encoded data, because the encoded data may vary between subtypes. However, this is presumably the only way to avoid decoding, which is the only way to make hashing fast!Incidentally, the current implementation of
hash
forLongSequence
is broken:Here are some possible solutions:
LongSequence
's. When hashing any other type, we re-code it to that encoding before hashing. This will keep LongSequence hashing fast, but will make everything else both slower and much more complex. If we do this, we also need to recodeLongSequence{NucleicAcidAlphabet{2}}
.There may be other solutions.
The text was updated successfully, but these errors were encountered: