From b18effe413281bcbe64eafd26528da8f91a5a170 Mon Sep 17 00:00:00 2001 From: xumingming Date: Sat, 27 Jan 2024 18:22:05 +0800 Subject: [PATCH] Add a blog post for LIKE optimizations --- website/blog/2024-01-27-like-optimization.mdx | 146 ++++++++++++++++++ website/blog/authors.yml | 5 + 2 files changed, 151 insertions(+) create mode 100644 website/blog/2024-01-27-like-optimization.mdx diff --git a/website/blog/2024-01-27-like-optimization.mdx b/website/blog/2024-01-27-like-optimization.mdx new file mode 100644 index 000000000000..37c8828f58be --- /dev/null +++ b/website/blog/2024-01-27-like-optimization.mdx @@ -0,0 +1,146 @@ +--- +slug: like +title: "Improve LIKE's performance" +authors: [xumingming] +tags: [tech-blog,performance] +--- + +## What is LIKE? + +LIKE is a very useful operation +used for string pattern matching. The following examples are from the Presto documentation: + +``` +SELECT * FROM (VALUES ('abc'), ('bcd'), ('cde')) AS t (name) +WHERE name LIKE '%b%' +--returns 'abc' and 'bcd' + +SELECT * FROM (VALUES ('abc'), ('bcd'), ('cde')) AS t (name) +WHERE name LIKE '_b%' +--returns 'abc' + +SELECT * FROM (VALUES ('a_c'), ('_cd'), ('cde')) AS t (name) +WHERE name LIKE '%#_%' ESCAPE '#' +--returns 'a_c' and '_cd' +``` + +These examples show the basic usage of LIKE: + +- Use `%` to match zero or more characters. +- Use `_` to match exactly one character. +- If we need to match `%` and `_` literally, we can specify an escape character to escape them. + +When Velox is used as the backend to evaluate Presto queries, a LIKE operation is translated +into a Velox function call, e.g. `name LIKE '%b%'` is translated to +`like(name, '%b%')`. Internally, Velox converts the pattern string into a regular +expression and then uses the regular expression library RE2 +to do the pattern matching. RE2 is a very good regular expression library: it is fast +and safe, which gives Velox's LIKE good performance. 
But some commonly used simple patterns +can be implemented directly with simple C++ string functions. +For example, the pattern `hello%` matches inputs that start with `hello`, which can be implemented by +comparing the prefix bytes of the input in memory: + +``` +// Returns true if the first 'length' bytes of 'input' equal the first 'length' bytes of 'pattern'. +bool matchPrefixPattern( + StringView input, + const std::string& pattern, + size_t length) { + return input.size() >= length && + std::memcmp(input.data(), pattern.data(), length) == 0; +} +``` + +This is much faster than using RE2: our benchmark shows a 750x speedup. We can do similar +optimizations for some other patterns: + +- `%hello`: matches inputs that end with `hello`. It can be optimized by memory comparing the suffix bytes of the inputs. +- `%hello%`: matches inputs that contain `hello`. It can be optimized by using `std::string_view::find` to check whether inputs contain `hello`. + +These simple patterns are straightforward to optimize. Some more +relaxed patterns are not so straightforward: + +- `hello_velox%`: matches inputs that start with 'hello', followed by any single character, followed by 'velox'. +- `%hello_velox`: matches inputs that end with 'hello', followed by any single character, followed by 'velox'. +- `%hello_velox%`: matches inputs that contain both 'hello' and 'velox', with exactly one character separating them. + +Although these patterns look similar to the previous ones, they are not as straightforward +to optimize: `_` matches any single character, so we cannot simply use memory comparison to +do the matching. And if the input is not pure ASCII, `_` might match more than one byte, which +makes the implementation even more complex. Note also that the patterns above are just for +illustration; actual patterns can be more complex, e.g. `h_e_l_l_o`, so a trivial algorithm +will not work. + +## Optimizing Relaxed Patterns + +We optimized these patterns as follows. 
First, we split the pattern into a list of sub-patterns, e.g. +`hello_velox%` is split into the sub-patterns `hello`, `_`, `velox`, `%`. Because there is +a `%` at the end, we classify it as a `kRelaxedPrefix` pattern, which means we need to do some prefix +matching, but not a trivial one: we need to match three sub-patterns: + +- kLiteralString: hello +- kSingleCharWildcard: _ +- kLiteralString: velox + +For `kLiteralString` we simply do a memory comparison: + +``` +if (subPattern.kind == SubPatternKind::kLiteralString && + std::memcmp( + input.data() + start + subPattern.start, + patternMetadata.fixedPattern().data() + subPattern.start, + subPattern.length) != 0) { + return false; +} +``` + +Note that since it is a memory comparison, it handles both pure ASCII inputs and inputs that +contain Unicode characters. + +Matching `_` is more complex because Unicode inputs contain variable-length multi-byte characters. +Fortunately, there is an existing library that provides Unicode-related operations: +utf8proc. It provides functions that tell +us whether a byte in the input is the start of a character, how many bytes the current character +consists of, etc. So to match a sequence of `_` our algorithm is: + +``` +if (subPattern.kind == SubPatternKind::kSingleCharWildcard) { + // Match every single char wildcard. + for (auto i = 0; i < subPattern.length; i++) { + if (cursor >= input.size()) { + return false; + } + + auto numBytes = unicodeCharLength(input.data() + cursor); + cursor += numBytes; + } +} +``` + +Here `cursor` is the index in the input we are trying to match, and `unicodeCharLength` is +a function that wraps a utf8proc function to determine how many bytes the current character consists of. +The logic basically repeatedly calculates the size of the current character and skips it. 
+ +This is not that complex, but note that this logic is inefficient for pure ASCII input. +In pure ASCII input every character is one byte, so to match a sequence of `_` we don't need to +calculate the size of each character and don't need the for loop. In fact, we don't need to explicitly +match `_` for pure ASCII input at all. The following is the whole logic for ASCII input: + +``` +for (const auto& subPattern : patternMetadata.subPatterns()) { + if (subPattern.kind == SubPatternKind::kLiteralString && + std::memcmp( + input.data() + start + subPattern.start, + patternMetadata.fixedPattern().data() + subPattern.start, + subPattern.length) != 0) { + return false; + } +} +``` + +It only matches the kLiteralString sub-patterns at the right positions of the input; `_` is automatically +matched (actually skipped), so there is no need to match it explicitly. With this optimization we get a 40x speedup +for kRelaxedPrefix patterns and a 100x speedup for kRelaxedSuffix patterns. + +Thank you Maria Basmanova for spending a lot of time +reviewing the code. diff --git a/website/blog/authors.yml b/website/blog/authors.yml index 1ceeec1ead2d..82ac28b151b3 100644 --- a/website/blog/authors.yml +++ b/website/blog/authors.yml @@ -54,3 +54,8 @@ raulcd: url: https://github.com/raulcd image_url: https://github.com/raulcd.png +xumingming: + name: James Xu + title: Software Engineer @ Alibaba + url: https://github.com/xumingming + image_url: https://github.com/xumingming.png