Implement a fast algorithm which can solve the following type of matching problem.
Full is the type of matcher that the input string must exactly equal to the pattern.
Substr is the type of matcher that the input string must contain the pattern as a sub-string.
Domain is the type of matcher that the input string must be a sub-domain or itself of the pattern.
The DomainMatcher is divided into two parts:
full
anddomain
patterns are matched by Rabin-Karp algorithm & minimal perfect hash table;substr
patterns are matched by ac automaton;
Matching problem definition:
- a
domain
rulebaidu.com
can be seen asexact
matchmoc.udiab
andmoc.udiab.
when traversing the domain names in reverse order. Andmoc.udiab
andmoc.udiab.
should not appear in the middle of the string. - a
full
rulebaidu.com
can be seen asexact
matchmoc.udiab
when traversing the domain names in reverse order. Andmoc.udiab
should not appear in the middle of the string. - a
substr
rulebaidu.com
is a matching problem that check ifbaidu.com
is substring of the given domain names.substr
rules can be matched byACAutomaton
.
Through the above definition, we can merge the full
and domain
rules together to match. The simplest way is to store these rules in the HashMap
. However, when we query, we need to calculate the hash value of the same string and its substrings. This additional overhead can be reduced by rolling hash.
We choose 32bit FNV-prime 16777619
to calculate our rolling hash.
Inspired by "Hash, displace, and compress" algorithm, we can design a minimal perfect hash table through two rounds hashes. The first round of hash is rolling hash, which we get directly from the process of traversing the string. The second round of hash is memhash.
In this way, when checking whether the rule is hit, we only need to calculate the hash and compare it once.
#[inline(always)]
fn lookup(&self, h: RollingHashType, query_string: &str) -> bool {
let level0_idx = h & self.level0_mask;
let seed = self.level0[level0_idx as usize] as Level1HashType;
let level1_idx = seed.mem_hash(query_string) & self.level1_mask;
return self.rules[self.level1[level1_idx as usize] as usize] == query_string;
}