Tokenizer implementation in Rust #2641

Merged
merged 8 commits into main from rust-tokenizer on Dec 9, 2023
Conversation

izeigerman (Collaborator) commented Dec 8, 2023

This update introduces a native implementation of the Tokenizer written in Rust. All parsing features available in the Python version have been implemented, but they haven't been tested against the complete test suite yet.

Here are the remaining items that are out of the scope of this PR and will be addressed in future PRs:

  1. Codegen the Rust TokenType enum from the Python version to ensure that new tokens are added in one place only.
  2. A configuration flag which enables the native implementation in all parsers.
  3. Configure tests to run with the native implementation enabled.
  4. Update CI/CD to build and release the native package for all supported platforms.
  5. Add unit tests to the Rust codebase to cover some internals (e.g. the trie implementation).
  6. Refactor the Rust code to make it more idiomatic and to clean up Python-related workarounds.

Review threads: sqlglot/tokens.py, sqlglotrs/src/settings.rs, sqlglotrs/src/tokenizer.rs (outdated, resolved)
self.char_at(self.current)
};

if alnum && self.current_char.is_alphanumeric() {
Owner: this whole section is unnecessary

Collaborator (Author): why not?

Owner: it's unnecessary complexity for Python performance
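The section under debate is an alnum fast path: consuming a run of alphanumeric characters in a tight inner loop rather than going back through the main dispatch loop per character. That trick pays off in CPython, where each loop iteration carries interpreter overhead; in Rust the main loop is already cheap, which is the reviewer's point. A minimal sketch of the pattern (names are illustrative, not the actual sqlglotrs code):

```rust
// Illustrative alnum fast path: advance past a contiguous run of
// alphanumeric characters in one tight loop instead of re-entering
// the tokenizer's main dispatch loop for every character.
fn consume_alnum_run(chars: &[char], mut current: usize) -> usize {
    while current < chars.len() && chars[current].is_alphanumeric() {
        current += 1;
    }
    current
}

fn main() {
    let sql: Vec<char> = "abc123 rest".chars().collect();
    let end = consume_alnum_run(&sql, 0);
    assert_eq!(end, 6); // stops at the space after "abc123"
    println!("run ends at index {}", end);
}
```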

if index < self.size {
self.char_at(index)
} else {
'\0'
Owner: what happens if you have \0 as a char? Isn't that confusing? Why isn't this just an optional char?

Collaborator (Author): Because using Option becomes cumbersome very quickly. You basically need to unwrap it explicitly everywhere you use it.

Owner: can you ever have \0 in a user-submitted char?

Collaborator (Author): I can hardly imagine that being the case, since \0 is universally used as the null terminator for a string.

Collaborator: +1 for '\0'. I landed on the same conclusion separately.
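The two peek styles debated above can be put side by side. The sentinel version returns '\0' past the end of input, keeping every call site a plain comparison; the Option version makes the out-of-bounds case explicit but needs unwrapping or matching everywhere. A sketch with illustrative names:

```rust
// Sentinel style: out-of-bounds reads yield '\0'.
fn peek_sentinel(sql: &[char], index: usize) -> char {
    if index < sql.len() { sql[index] } else { '\0' }
}

// Option style: out-of-bounds reads yield None.
fn peek_option(sql: &[char], index: usize) -> Option<char> {
    sql.get(index).copied()
}

fn main() {
    let sql: Vec<char> = "ab".chars().collect();
    assert_eq!(peek_sentinel(&sql, 1), 'b');
    assert_eq!(peek_sentinel(&sql, 5), '\0');
    assert_eq!(peek_option(&sql, 5), None);

    // With the sentinel, comparisons stay terse at every call site:
    if peek_sentinel(&sql, 5) == ',' { /* ... */ }
    // With Option, the same check needs Some(...) or a match:
    if peek_option(&sql, 5) == Some(',') { /* ... */ }
}
```

The sentinel trades a little type safety for terseness; it is sound here only because, as the thread concludes, \0 is not expected in user-submitted SQL.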

if end <= self.size {
self.sql[start..end].iter().collect()
} else {
String::from("")
Owner: is this a constant? or is it being allocated every time?

Collaborator (Author): It's allocated on the heap every time. The type system doesn't permit using a constant value here, since the String type is mutable in Rust.
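One nuance worth noting on the allocation question: in Rust, an empty String has zero capacity and owns no heap buffer until something is pushed into it, so the empty-string fallback branch is effectively free even though it cannot be a constant. An illustrative version of the helper (names are assumptions, not the actual sqlglotrs code):

```rust
// Illustrative substring helper. The non-empty branch does allocate
// (collecting chars into a new String), but the fallback does not:
// String::new() / String::from("") start with zero capacity and no
// heap buffer.
fn text_between(sql: &[char], start: usize, end: usize) -> String {
    if end <= sql.len() {
        sql[start..end].iter().collect()
    } else {
        String::new() // equivalent to String::from(""), allocation-free
    }
}

fn main() {
    let sql: Vec<char> = "SELECT 1".chars().collect();
    assert_eq!(text_between(&sql, 0, 6), "SELECT");
    assert_eq!(text_between(&sql, 0, 99), "");
    assert_eq!(String::new().capacity(), 0); // no heap buffer yet
}
```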

self.scan_radix_string(16, TokenType::HEX_STRING);
}

fn scan_radix_string(&mut self, radix: u32, radix_token_type: TokenType) {
Owner: why is this called radix string?

Collaborator (Author): I'm open to suggestions.
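For context on the naming question: the call site passes a radix of 16 together with TokenType::HEX_STRING, so "radix string" presumably means "a string literal whose body must be valid digits in some base" (hex strings, bit strings, and so on). A hedged reconstruction of the idea, with illustrative names and behavior rather than the actual sqlglotrs implementation:

```rust
// Hypothetical sketch: validate that scanned text forms a valid number
// in the given radix, producing a radix-string token on success.
#[derive(Debug, PartialEq)]
enum Scanned {
    RadixString(String), // e.g. the body of a HEX_STRING token
    NotANumber,
}

fn scan_radix_string(text: &str, radix: u32) -> Scanned {
    // char::is_digit accepts radixes 2..=36 and handles a-f/A-F for hex.
    if !text.is_empty() && text.chars().all(|c| c.is_digit(radix)) {
        Scanned::RadixString(text.to_string())
    } else {
        Scanned::NotANumber
    }
}

fn main() {
    assert_eq!(
        scan_radix_string("deadBEEF", 16),
        Scanned::RadixString("deadBEEF".to_string())
    );
    assert_eq!(scan_radix_string("102", 2), Scanned::NotANumber);
}
```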

Collaborator @barakalon left a comment: beautiful

Review threads: sqlglotrs/src/token.rs (outdated, resolved), sqlglotrs/src/tokenizer.rs (resolved)
Review threads: setup.py (resolved), sqlglot/tokens.py (outdated, resolved)
@izeigerman izeigerman merged commit ef308c0 into main Dec 9, 2023
5 checks passed
@izeigerman izeigerman deleted the rust-tokenizer branch December 9, 2023 06:30
5 participants