187. Repeated DNA Sequences

LeetCode 187. Repeated DNA Sequences

Description

The DNA sequence is composed of a series of nucleotides abbreviated as 'A', 'C', 'G', and 'T'.

For example, "ACGAATTCCG" is a DNA sequence.

When studying DNA , it is useful to identify repeated sequences within the DNA.

Given a string s that represents a DNA sequence , return all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule. You may return the answer in any order.

Example 1:

Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC","CCCCCAAAAA"]

Constraints:

1 <= s.length <= 10^5
s[i] is either 'A', 'C', 'G', or 'T'.

Solution

We first encode each type of nucleotide into an interger between 0 (00) and 3 (11). Then, we slide a window with length of 10 over the given string and use a hash table to check duplicates. The encoding precedures are as follows:

Map 'A', 'C', 'G', and 'T' into 0, 1, 2 and 3, respectively, and use an integer bitmask to represent the substring a.k.a the sliding window in binary form;
Left shift bitmask by 2 bits, since we are using 00~11 to represent each type of nucleotide;
Add the incoming nucleotide (binary form) onto the bitmask;
Clear the highest 2 bits by & (1111 1111 1111 1111 1111).

Complexity

Time complexity: $$O(n-L) \sim O(n)$$
Space complexity: $$O(n-L) \sim O(n)$$

Code

func findRepeatedDnaSequences(s string) []string {
	var ans []string
	bitSet, dict, L, bitmask := map[int]int{}, map[byte]int{'A': 0, 'T': 1, 'C': 2, 'G': 3}, 10, 0
	for i := 0; i < len(s); i++ {
		bitmask = bitmask<<2 | dict[s[i]]
		bitmask &= (1<<32 - 1) >> (32 - L*2) // 1111 1111 1111 1111 1111
		if i >= L-1 {                        // start from s[0:10)
			if bitSet[bitmask] == 1 {
				ans = append(ans, s[i-L+1:i+1])
			}
			bitSet[bitmask]++
		}
	}
	return ans
}

Reference

二进制方法

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

187.-repeated-dna-sequences.md

187.-repeated-dna-sequences.md

187. Repeated DNA Sequences

LeetCode 187. Repeated DNA Sequences

Description

Tags

Solution

Complexity

Code

Reference

Files

187.-repeated-dna-sequences.md

Latest commit

History

187.-repeated-dna-sequences.md

File metadata and controls

187. Repeated DNA Sequences

LeetCode 187. Repeated DNA Sequences

Description

Tags

Solution

Complexity

Code

Reference