Tokenizer implementation in Rust #2641

Merged
merged 8 commits into main from rust-tokenizer on Dec 9, 2023
Conversation

izeigerman (Collaborator) commented Dec 8, 2023

This update introduces a native implementation of the Tokenizer written in Rust. All parsing features available in the Python version have been implemented, but they haven't been tested against the complete test suite yet.

Here are the remaining items that are out of the scope of this PR and will be addressed in future PRs:

  1. Codegen the Rust TokenType enum from the Python version to ensure that new tokens are added in one place only.
  2. A configuration flag which enables the native implementation in all parsers.
  3. Configure tests to run with the native implementation enabled.
  4. Update CI/CD to build and release the native package for all supported platforms.
  5. Add unit tests to the Rust codebase to cover some internals (e.g. the trie implementation).
  6. Refactor the Rust code to make it more idiomatic and to clean up Python-related workarounds.

Review threads: sqlglot/tokens.py, sqlglotrs/src/settings.rs, sqlglotrs/src/tokenizer.rs (outdated, resolved)
self.char_at(self.current)
};

if alnum && self.current_char.is_alphanumeric() {
Owner: this whole section is unnecessary

Collaborator (Author): why not?

Owner: it's unnecessary complexity for Python performance
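The section under debate is an alnum fast path: consuming a run of alphanumeric characters in a tight inner loop rather than going back through the main dispatch loop per character. That trick pays off in CPython, where each loop iteration carries interpreter overhead; in Rust the main loop is already cheap, which is the reviewer's point. A minimal sketch of the pattern (names are illustrative, not the actual sqlglotrs code):

```rust
// Illustrative alnum fast path: advance past a contiguous run of
// alphanumeric characters in one tight loop instead of re-entering
// the tokenizer's main dispatch loop for every character.
fn consume_alnum_run(chars: &[char], mut current: usize) -> usize {
    while current < chars.len() && chars[current].is_alphanumeric() {
        current += 1;
    }
    current
}

fn main() {
    let sql: Vec<char> = "abc123 rest".chars().collect();
    let end = consume_alnum_run(&sql, 0);
    assert_eq!(end, 6); // stops at the space after "abc123"
    println!("run ends at index {}", end);
}
```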

if index < self.size {
self.char_at(index)
} else {
'\0'
Owner: what happens if you have \0 as a char? Isn't that confusing? Why isn't this just an optional char?

Collaborator (Author): Because using Option becomes cumbersome very quickly. You basically need to unwrap it explicitly everywhere you use it.

Owner: can you ever have \0 in a user-submitted char?

Collaborator (Author): I can hardly imagine that being the case, since \0 is universally used as the null terminator for a string.

Collaborator: +1 for '\0'. I landed on the same conclusion separately.
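The two peek styles debated above can be put side by side. The sentinel version returns '\0' past the end of input, keeping every call site a plain comparison; the Option version makes the out-of-bounds case explicit but needs unwrapping or matching everywhere. A sketch with illustrative names:

```rust
// Sentinel style: out-of-bounds reads yield '\0'.
fn peek_sentinel(sql: &[char], index: usize) -> char {
    if index < sql.len() { sql[index] } else { '\0' }
}

// Option style: out-of-bounds reads yield None.
fn peek_option(sql: &[char], index: usize) -> Option<char> {
    sql.get(index).copied()
}

fn main() {
    let sql: Vec<char> = "ab".chars().collect();
    assert_eq!(peek_sentinel(&sql, 1), 'b');
    assert_eq!(peek_sentinel(&sql, 5), '\0');
    assert_eq!(peek_option(&sql, 5), None);

    // With the sentinel, comparisons stay terse at every call site:
    if peek_sentinel(&sql, 5) == ',' { /* ... */ }
    // With Option, the same check needs Some(...) or a match:
    if peek_option(&sql, 5) == Some(',') { /* ... */ }
}
```

The sentinel trades a little type safety for terseness; it is sound here only because, as the thread concludes, \0 is not expected in user-submitted SQL.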

if end <= self.size {
self.sql[start..end].iter().collect()
} else {
String::from("")
Owner: is this a constant? or is it being allocated every time?

Collaborator (Author): It's allocated on the heap every time. The type system doesn't permit using a constant value here, since the String type is mutable in Rust.
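One nuance worth noting on the allocation question: in Rust, an empty String has zero capacity and owns no heap buffer until something is pushed into it, so the empty-string fallback branch is effectively free even though it cannot be a constant. An illustrative version of the helper (names are assumptions, not the actual sqlglotrs code):

```rust
// Illustrative substring helper. The non-empty branch does allocate
// (collecting chars into a new String), but the fallback does not:
// String::new() / String::from("") start with zero capacity and no
// heap buffer.
fn text_between(sql: &[char], start: usize, end: usize) -> String {
    if end <= sql.len() {
        sql[start..end].iter().collect()
    } else {
        String::new() // equivalent to String::from(""), allocation-free
    }
}

fn main() {
    let sql: Vec<char> = "SELECT 1".chars().collect();
    assert_eq!(text_between(&sql, 0, 6), "SELECT");
    assert_eq!(text_between(&sql, 0, 99), "");
    assert_eq!(String::new().capacity(), 0); // no heap buffer yet
}
```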

self.scan_radix_string(16, TokenType::HEX_STRING);
}

fn scan_radix_string(&mut self, radix: u32, radix_token_type: TokenType) {
Owner: why is this called radix string?

Collaborator (Author): I'm open to suggestions.
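For context on the naming question: the call site passes a radix of 16 together with TokenType::HEX_STRING, so "radix string" presumably means "a string literal whose body must be valid digits in some base" (hex strings, bit strings, and so on). A hedged reconstruction of the idea, with illustrative names and behavior rather than the actual sqlglotrs implementation:

```rust
// Hypothetical sketch: validate that scanned text forms a valid number
// in the given radix, producing a radix-string token on success.
#[derive(Debug, PartialEq)]
enum Scanned {
    RadixString(String), // e.g. the body of a HEX_STRING token
    NotANumber,
}

fn scan_radix_string(text: &str, radix: u32) -> Scanned {
    // char::is_digit accepts radixes 2..=36 and handles a-f/A-F for hex.
    if !text.is_empty() && text.chars().all(|c| c.is_digit(radix)) {
        Scanned::RadixString(text.to_string())
    } else {
        Scanned::NotANumber
    }
}

fn main() {
    assert_eq!(
        scan_radix_string("deadBEEF", 16),
        Scanned::RadixString("deadBEEF".to_string())
    );
    assert_eq!(scan_radix_string("102", 2), Scanned::NotANumber);
}
```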

Collaborator @barakalon left a comment: beautiful

Review threads: sqlglotrs/src/token.rs (outdated, resolved), sqlglotrs/src/tokenizer.rs (resolved)
Review threads: setup.py (resolved), sqlglot/tokens.py (outdated, resolved)
@izeigerman izeigerman merged commit ef308c0 into main Dec 9, 2023
5 checks passed
@izeigerman izeigerman deleted the rust-tokenizer branch December 9, 2023 06:30
5 participants