LeetCode 187. Repeated DNA Sequences
The DNA sequence is composed of a series of nucleotides abbreviated as 'A'
, 'C'
, 'G'
, and 'T'
.
- For example,
"ACGAATTCCG"
is a DNA sequence.
When studying DNA , it is useful to identify repeated sequences within the DNA.
Given a string s
that represents a DNA sequence , return all the 10
-letter-long sequences (substrings) that occur more than once in a DNA molecule. You may return the answer in any order.
Example 1:
Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC","CCCCCAAAAA"]
Constraints:
1 <= s.length <= 10^5
s[i]
is either'A'
,'C'
,'G'
, or'T'
.
Hash Table, Bit Manipulation
We first encode each type of nucleotide into an interger between 0 (00) and 3 (11). Then, we slide a window with length of 10 over the given string and use a hash table to check duplicates. The encoding precedures are as follows:
- Map
'A'
,'C'
,'G'
, and'T'
into 0, 1, 2 and 3, respectively, and use an integerbitmask
to represent the substring a.k.a the sliding window in binary form; - Left shift
bitmask
by 2 bits, since we are using 00~11 to represent each type of nucleotide; - Add the incoming nucleotide (binary form) onto the
bitmask
; - Clear the highest 2 bits by
& (1111 1111 1111 1111 1111)
.
- Time complexity:
$$O(n-L) \sim O(n)$$ - Space complexity:
$$O(n-L) \sim O(n)$$
func findRepeatedDnaSequences(s string) []string {
var ans []string
bitSet, dict, L, bitmask := map[int]int{}, map[byte]int{'A': 0, 'T': 1, 'C': 2, 'G': 3}, 10, 0
for i := 0; i < len(s); i++ {
bitmask = bitmask<<2 | dict[s[i]]
bitmask &= (1<<32 - 1) >> (32 - L*2) // 1111 1111 1111 1111 1111
if i >= L-1 { // start from s[0:10)
if bitSet[bitmask] == 1 {
ans = append(ans, s[i-L+1:i+1])
}
bitSet[bitmask]++
}
}
return ans
}