Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a blog post for LIKE optimizations #8576

Closed
wants to merge 3 commits into from

Conversation

xumingming
Copy link
Contributor

No description provided.

Copy link

netlify bot commented Jan 27, 2024

Deploy Preview for meta-velox ready!

Name Link
🔨 Latest commit 9d2dfd8
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/65bc375bdfe85b0008d561b9
😎 Deploy Preview https://deploy-preview-8576--meta-velox.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jan 27, 2024
@xumingming
Copy link
Contributor Author

cc @mbasmanova

relaxed patterns that are not so straightforward:

- `hello_velox%`: matches inputs that start with 'hello', followed by any character, then followed by 'velox'.
- `hello_velox%`: matches inputs that end with 'hello', followed by any character, then followed by 'velox'.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: hello_velox% -> %hello_velox ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch! Fixed.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xumingming James, thank you for writing this nice. It reads very well.

@pedroerp Pedro, would you also take a look?

@facebook-github-bot
Copy link
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Collaborator

@aditi-pandit aditi-pandit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @xumingming for this blog post. It is very useful.

Had some nits in the writing.


## What is LIKE?

<a href="https://prestodb.io/docs/current/functions/comparison.html#like">LIKE</a> is a very useful operation,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Use multiple short sentences instead of commas.

"LIKE is a very useful SQL operator. It is used to do string pattern matching. The following examples for LIKE usage are from the Presto doc:"


- Use `%` to match zero or more characters.
- Use `_` to match exactly one character.
- If we need to match `%` and `_` literally, we can specify escape char to escape them.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit : specify "an" escape character

into Velox's function call, e.g. `name LIKE '%b%'` is translated to
`like(name, '%b%')`. Internally Velox converts the pattern string into a regular
expression and then uses regular expression library <a href="https://github.com/google/re2">RE2</a>
to do the pattern matching. RE2 is a very good regular expression library, it is fast
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here : Use full stop between the sentences "RE2 is a very good regular expression library. It is fast and safe, which gives Velox LIKE function a good performance."

expression and then uses regular expression library <a href="https://github.com/google/re2">RE2</a>
to do the pattern matching. RE2 is a very good regular expression library, it is fast
and safe which gives Velox LIKE a good performance. But some popularly used simple patterns
can be optimized to use simple C++ string functions to implement directly,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit : "can be optimized using direct simple C++ string functions instead of regex."

and safe which gives Velox LIKE a good performance. But some popularly used simple patterns
can be optimized to use simple C++ string functions to implement directly,
e.g. Pattern `hello%` matches inputs that start with `hello`, which can be implemented by
memory comparing the prefix bytes of inputs:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit : " can be implemented by direct memory comparison of prefix ('hello' in this case) bytes of input"

Although these patterns look similar to previous ones, but they are not so straightforward
to optimize, `_` here matches any single character, we can not simply use memory comparison to
do the matching. And if user's input is not pure ASCII, `_` might match more than one byte which
makes the implementation even more complex. And also note that the patterns above are just for
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit : "Also note that the above patterns are just for illustrative purposes. Actual patterns in practice can be more complex."

}
```

Here `cursor` is the index in the input we are trying to match, `unicodeCharLength` is
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe format this as follows

Here :

  • 'cursor' is the index in the input we are trying to match.
  • 'unicodeCharLength' ....

So the logic is basically repeatedly....

a function which wraps utf8proc function to determine how many bytes current character consists of,
so the logic is basically repeatedly calculate size of current character and skip it.

It seems not that complex, but we should note that this logic is not effective for pure ASCII input,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

End sentence here.

so the logic is basically repeatedly calculate size of current character and skip it.

It seems not that complex, but we should note that this logic is not effective for pure ASCII input,
for pure ASCII input, every character is one byte, to match a sequence of `_`, we don't need to
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit sentence :
"Every character is one byte in pure ASCII input. So to match a sequence of '', we don't need to calculate the size of each character and compare in a for-loop. Infact, we don't need to explicitly match '' for pure ASCII input as all. We can use the following logic instead:"

}
```

It only matches the kLiteralString pattern at the right position of the inputs, `_` is automatically
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

End this sentence with full-stop.

@xumingming
Copy link
Contributor Author

@aditi-pandit Thanks for the review, made corresponding changes to all the comments.

@facebook-github-bot
Copy link
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@mbasmanova merged this pull request in 3004e34.

Copy link

Conbench analyzed the 1 benchmark run on commit 3004e349.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

FelixYBW pushed a commit to FelixYBW/velox that referenced this pull request Feb 12, 2024
Summary: Pull Request resolved: facebookincubator#8576

Reviewed By: Yuhta, kgpai

Differential Revision: D53308906

Pulled By: mbasmanova

fbshipit-source-id: 31a1efe0d5472ccc9f2a1c81602e402d2f8c8e8a
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants