Optimizing the sentences

facebookincubator · Feb 2, 2024 · 9d2dfd8
1 parent 5dcd2f1
commit 9d2dfd8
Showing 1 changed file with 27 additions and 26 deletions.
diff --git a/website/blog/2024-01-27-like-optimization.mdx b/website/blog/2024-01-27-like-optimization.mdx
@@ -7,8 +7,8 @@ tags: [tech-blog,performance]
 
 ## What is LIKE?
 
-<a href="https://prestodb.io/docs/current/functions/comparison.html#like">LIKE</a> is a very useful operation,
-it is used to do string pattern matching, the following examples are from Presto doc:
+<a href="https://prestodb.io/docs/current/functions/comparison.html#like">LIKE</a> is a very useful SQL operator.
+It is used for string pattern matching. The following examples of LIKE usage are from the Presto documentation:
 
 ```
 SELECT * FROM (VALUES ('abc'), ('bcd'), ('cde')) AS t (name)
@@ -28,17 +28,17 @@ These examples show the basic usage of LIKE:
 
 - Use `%` to match zero or more characters.
 - Use `_` to match exactly one character.
-- If we need to match `%` and `_` literally, we can specify escape char to escape them.
+- If we need to match `%` and `_` literally, we can specify an escape char to escape them.
 
 When we use Velox as the backend to evaluate a Presto query, a LIKE operation is translated
 into a Velox function call, e.g. `name LIKE '%b%'` is translated to
 `like(name, '%b%')`. Internally, Velox converts the pattern string into a regular
 expression and then uses the regular expression library <a href="https://github.com/google/re2">RE2</a>
-to do the pattern matching. RE2 is a very good regular expression library, it is fast
-and safe which gives Velox LIKE a good performance. But some popularly used simple patterns
-can be optimized to use simple C++ string functions to implement directly,
-e.g. Pattern `hello%` matches inputs that start with `hello`, which can be implemented by
-memory comparing the prefix bytes of inputs:
+to do the pattern matching. RE2 is a very good regular expression library. It is fast
+and safe, which gives the Velox LIKE function good performance. But some commonly used simple patterns
+can be implemented directly with simple C++ string functions instead of regular expressions.
+For example, the pattern `hello%` matches inputs that start with `hello`, which can be implemented by a direct memory
+comparison of the prefix bytes (`hello` in this case) of the input:
 
 ```
 // Match the first 'length' characters of string 'input' and prefix pattern.
@@ -51,14 +51,14 @@ bool matchPrefixPattern(
 }
 ```
 
-It is much faster than using RE2, benchmark shows it gives us a 750x speedup. We can do similar
+It is much faster than using RE2. A benchmark shows it gives us a 750x speedup. We can do similar
 optimizations for some other patterns:
 
-- `%hello`: matches inputs that end with `hello`. It can be optimized by memory comparing the suffix bytes of the inputs.
+- `%hello`: matches inputs that end with `hello`. It can be optimized by a direct memory comparison of the suffix bytes of the inputs.
 - `%hello%`: matches inputs that contain `hello`. It can be optimized by using `std::string_view::find` to check whether inputs contain `hello`.
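
As a rough illustration of the three simple cases listed above, the following sketch uses only `std::string_view`. The names `matchPrefix`, `matchSuffix`, and `matchSubstring` are hypothetical and are not Velox's actual functions; the sketch also assumes the pattern has already been reduced to its literal part (`hello`).

```cpp
#include <string_view>

// Hypothetical helpers for illustration only; not Velox's implementation.

// `hello%`: the input must start with the literal bytes.
bool matchPrefix(std::string_view input, std::string_view literal) {
  return input.size() >= literal.size() &&
      input.substr(0, literal.size()) == literal;
}

// `%hello`: the input must end with the literal bytes.
bool matchSuffix(std::string_view input, std::string_view literal) {
  return input.size() >= literal.size() &&
      input.substr(input.size() - literal.size()) == literal;
}

// `%hello%`: the literal may appear anywhere in the input.
bool matchSubstring(std::string_view input, std::string_view literal) {
  return input.find(literal) != std::string_view::npos;
}
```

Because these are plain byte comparisons, they behave the same for pure ASCII inputs and for UTF-8 inputs.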
 
-These simple patterns are straightforward to optimize, there are some more
-relaxed patterns that are not so straightforward:
+These simple patterns are straightforward to optimize. There are some more relaxed patterns that
+are not so straightforward:
 
 - `hello_velox%`: matches inputs that start with 'hello', followed by any character, then followed by 'velox'.
 - `%hello_velox`: matches inputs that end with 'hello', followed by any character, then followed by 'velox'.
@@ -67,8 +67,8 @@ relaxed patterns that are not so straightforward:
 Although these patterns look similar to the previous ones, they are not so straightforward
 to optimize. Because `_` matches any single character, we cannot simply use memory comparison to
 do the matching. And if the user's input is not pure ASCII, `_` might match more than one byte, which
-makes the implementation even more complex. And also note that the patterns above are just for
-illustrative purpose, actual patterns can be more complex, e.g. `h_e_l_l_o`, so trivial algorithm
+makes the implementation even more complex. Also note that the above patterns are just for
+illustrative purposes. Actual patterns can be more complex, e.g. `h_e_l_l_o`, so a trivial algorithm
 will not work.
 
 ## Optimizing Relaxed Patterns
@@ -98,10 +98,9 @@ Note that since it is a memory comparison, it handles both pure ASCII inputs and
 contain Unicode characters.
 
 Matching `_` is more complex considering that there are variable-length multi-byte characters in
-unicode inputs. Fortunately there are existing libraries which provides unicode related operations:
-<a href="https://juliastrings.github.io/utf8proc/">utf8proc</a>. It provides functions that tells
-us whether a byte in input is the start of a character or not, how many bytes current character
-consists of etc. So to match a sequence of `_` our algorithm is:
+Unicode inputs. Fortunately, there are existing libraries that provide Unicode-related operations, such as <a href="https://juliastrings.github.io/utf8proc/">utf8proc</a>.
+It provides functions that tell us whether a byte in the input is the start of a character,
+how many bytes the current character consists of, etc. So to match a sequence of `_`, our algorithm is:
 
 ```
 if (subPattern.kind == SubPatternKind::kSingleCharWildcard) {
@@ -117,15 +116,17 @@ if (subPattern.kind == SubPatternKind::kSingleCharWildcard) {
 }
 ```
 
-Here `cursor` is the index in the input we are trying to match, `unicodeCharLength` is
-a function which wraps utf8proc function to determine how many bytes current character consists of,
-so the logic is basically repeatedly calculate size of current character and skip it.
+Here:
 
-It seems not that complex, but we should note that this logic is not effective for pure ASCII input,
-for pure ASCII input, every character is one byte, to match a sequence of `_`, we don't need to
-calculate the size of each character, don't need the for loop, actually we don't need to explicitly
-match `_` for pure ASCII input at all, following is the whole logic for ASCII input:
+- `cursor` is the index in the input we are trying to match.
+- `unicodeCharLength` is a function that wraps a utf8proc function to determine how many bytes the current character consists of.
 
+So the logic is basically to repeatedly calculate the size of the current character and skip it.
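
The skipping loop described above can be sketched without any external dependency. This is only an approximation under stated assumptions: `utf8CharLength` is a hypothetical stand-in for the utf8proc-based `unicodeCharLength`, it derives the length from the lead byte alone, and it assumes the input is valid UTF-8. `skipCharacters` is likewise an illustrative name, not a Velox function.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for unicodeCharLength: returns the byte length of a
// UTF-8 character given its lead byte. Assumes valid UTF-8 input.
inline std::size_t utf8CharLength(unsigned char lead) {
  if (lead < 0x80) return 1;           // 0xxxxxxx: ASCII
  if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
  return 1; // continuation or invalid byte: advance by one byte
}

// Advances `cursor` past `count` characters, one per `_` in the pattern.
// Returns false if the input ends before all characters are skipped.
bool skipCharacters(
    std::string_view input, std::size_t& cursor, std::size_t count) {
  for (std::size_t i = 0; i < count; ++i) {
    if (cursor >= input.size()) {
      return false;
    }
    cursor += utf8CharLength(static_cast<unsigned char>(input[cursor]));
  }
  // A truncated multi-byte character at the end would overshoot the input.
  return cursor <= input.size();
}
```

For an input such as `"a\xC3\xA9z"` (the characters `a`, `é`, `z`), skipping two characters moves the cursor by three bytes, which is why a plain byte offset is not enough for non-ASCII input.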
+
+It does not seem that complex, but we should note that this logic is not efficient for pure ASCII input.
+Every character is one byte in pure ASCII input. So to match a sequence of `_`, we don't need to calculate the size
+of each character in a for-loop. In fact, we don't need to explicitly match `_` for pure ASCII input at all.
+We can use the following logic instead:
 ```
 for (const auto& subPattern : patternMetadata.subPatterns()) {
     if (subPattern.kind == SubPatternKind::kLiteralString &&
@@ -139,7 +140,7 @@ for (const auto& subPattern : patternMetadata.subPatterns()) {
 ```
 
 It only matches the kLiteralString sub-patterns at the right positions of the input; `_` is automatically
-matched(actually skipped), no need to match it explicitly. With this optimization we get 40x speedup
+matched (actually skipped), so there is no need to match it explicitly. With this optimization, we get a 40x speedup
 for kRelaxedPrefix patterns and a 100x speedup for kRelaxedSuffix patterns.
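
To make the ASCII fast path concrete, here is a minimal, self-contained sketch for a relaxed prefix pattern. It is not Velox's code: `LiteralPiece`, `splitFixedPattern`, and `matchRelaxedPrefixAscii` are hypothetical names, the `fixed` argument is assumed to be the pattern without its trailing `%` (for example `hello_velox`), and escape characters are not handled.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical: one literal run of the pattern and its byte offset.
struct LiteralPiece {
  std::size_t offset;
  std::string text;
};

// Splits the fixed part of a pattern (literals and '_' only) into literal
// runs, recording where each run starts. The '_' positions are dropped.
std::vector<LiteralPiece> splitFixedPattern(std::string_view fixed) {
  std::vector<LiteralPiece> pieces;
  std::size_t i = 0;
  while (i < fixed.size()) {
    if (fixed[i] == '_') {
      ++i;
      continue;
    }
    const std::size_t start = i;
    while (i < fixed.size() && fixed[i] != '_') {
      ++i;
    }
    pieces.push_back({start, std::string(fixed.substr(start, i - start))});
  }
  return pieces;
}

// ASCII-only relaxed prefix match: every character is one byte, so each
// literal run is compared at its fixed offset and '_' is never visited.
bool matchRelaxedPrefixAscii(std::string_view input, std::string_view fixed) {
  if (input.size() < fixed.size()) {
    return false;
  }
  for (const auto& piece : splitFixedPattern(fixed)) {
    if (input.compare(piece.offset, piece.text.size(), piece.text) != 0) {
      return false;
    }
  }
  return true;
}
```

In a real implementation the split would be done once when the pattern is compiled rather than on every call; it is done inline here only to keep the sketch short. A relaxed suffix match would work the same way, with offsets measured from the end of the input.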
 
 Thank you <a href="https://github.com/mbasmanova">Maria Basmanova</a> for spending a lot of time