Lossless syntax tree prototype and discussion #189

nickolay · 2020-06-07T08:03:10Z

I'm opening this draft PR to share the prototype I have at https://github.com/nickolay/sqlparser-rs/tree/wip-cst

I hope that people, who contribute, already use, or want to use sqlparser-rs, react here (if only via a GitHub reaction) so that we can judge if this is something worth pursuing as part of this library or if I'm completely off-track. There's a detailed write-up in the branch's README, but the short version is that current AST is lossy (w.r.t whitespace, comments, unimportant syntax, and source code locations) and to fix that I propose an additional intermediate layer, which is lossless. Some changes to the AST will be necessary too, but to what extent is up for discussion.

Apart from gauging the general mood and collecting feedback, I hope that documenting what I have so far will enable someone to beat me at coming up with the specific design for the CST/AST, as I have not been able to dedicate much time to this.

I'll include a quote from the motivational section of the README:

Preserving full source code information (#161) would enable SQL rewriting/refactoring tools based on sqlparser-rs. For example:

Error reporting, both in the parser and in later stages of query processing, would benefit from knowing the source code location of SQL constructs (Add line/column information consistently to every parser message. #179)

SQL pretty-printing requires comments to be preserved in AST (see Comment handling #175, mentioning forma)

Refactoring via AST transformations would also benefit from having full control over serialization, a possible solution for dialect-specific "writers" (Implement SQL writer #18)

Analyzing partially invalid code may be useful in the context of an IDE or other tooling.

I think that adopting rust-analyzer's design, that includes a lossless syntax tree, is the right direction for sqlparser-rs. In addition to solving the use-cases described above, it helps in other ways:

We can omit syntax that does not affect semantics of the query (e.g. ROW vs ROWS) from the typed AST by default, reducing the implementation effort.

Having a homogenous syntax tree also alleviates the need for a "visitor" (Add an AST visitor #114), also reducing the burden on implementors of new syntax

`Parser::parse_sql()` is intended to be only a helper, the user should be able to use it or not, without having to re-implement parts of the parsing logic.

Adjusting parser's `self.index` always felt like a hack, and it becomes more important as I need to store more state in my "lossless syntax tree" prototype.

Part of the `cst` branch, see its README for details.

These are re-enabled in the "Support parser backtracking in the GreenNodeBuilder" commit.

This adds the common code to allow building a lossless syntax tree based on the `rowan` crate, in parallel with our current AST. The motivation is discussed in the README. The CST is a flat list of tokens by default, structure can be added incrementally in future commits. Two changes are split into their own commits, that should be squashed with this, once the GreenNodeBuilder change is upstreamed: - Fix `TBD: rowan's builder does not allow reverting to a checkpoint` - Hide rowan behind a non-default `cst` feature Other tasks to be done later, marked with `TBD` in the code: - Fix `Token`/`SyntaxKind` duplication, changing the former to store a slice of the original source code, e.g. `(SyntaxKind, SmolStr)` This should also fix the currently disabled test-cases where `Token`'s `to_string()` does not return the original string: * `parse_escaped_single_quote_string_predicate` * `parse_literal_string` (the case of `HexLiteralString`) - Fix the hack in `parse_keyword()` to remap the token type (RA has `bump_remap() for this) - Fix the `Parser::pending` hack (needs rethinking parser's API) Probably related is the issue of handling whitespace and comments: the way this prototype handles it looks wrong.

Copy GreenNodeBuilder from rowan with the minimal changes to make it work as part of our crate.

This adds a `reset()` method to the GreenNodeBuilder previously imported from rowan and uses it to fix the CST in the testcases where parser has to backtrack in `parse_table_factor`.

The naming scheme for `SyntaxKind`s and the CST structure is only an example, see the README for the discussion. Note also the 'TBD' markers in the patch.

Dandandan · 2020-06-07T10:30:46Z

I am using sqlparser-rs for a prototype.

Locations of ast nodes (for parser error messages, but also generated type errors, linting messages, etc.) will be incredibly valuable for this!

I will try reviewing this prototype.

maxcountryman · 2020-06-08T13:55:12Z

This is great. I'm very interested in this sort of thing since it opens the door to all sorts of things that simply aren't possible with the current implementation. I'm personally interested in using this for a project to format SQL in a way that preserves comments.

panarch · 2020-06-09T14:19:56Z

This is awesome, it is obviously worth fully enough, I'd really appreciate.

I'm currently making a new sql database engine based on key-value storage, planning to release the first version in Aug, this year under MPL 2.0 license.

Attractive points to see may be like these below,

Purely written in Rust
All core sql executing code is fully written in functional, no even single mut is used.
Easily swappable storage engine - At first I'll use sled as a default storage, but anyone else can attach their own storage engine easily by implementing simple traits it provides.
WebAssembly support, so it can be used in web frontend development. I'm expecting this db to be used in state management.

And the parser I've used was nom-sql but it didn't have many necessary features I need and looks not updated recently, so I was considering to find some alternatives.
Then by Materialize db announcement, I got to know about this project.
So, what I'm currently working is replacing the parser from nom-sql to this sqlparser-rs.

I'll really look forward to see this new change. thanks.

hwchen · 2020-06-29T16:43:54Z

This looks pretty interesting. I'm wondering if this will also increase performance by being zero-copy? (If I'm misunderstanding the design, let me know). For my use-case, speed is a priority over features, assuming basic parsing is equal.

Dandandan · 2020-06-29T17:35:13Z

@hwchen I think the main reason of the change is not performance, but rather functionality: easily allow tools to generate the same output, keeping comments, capitalization, spacing, etc. This allows for better errors, tools like formatters, etc.
I think you are right that zero copy could improve performance by a bit, although it probably won't be a lot in this case.

If you are interested in contributing to performance (I think sqlparser is already quite fast in the general case), I would suggest to see if you can add something to the benchmarks https://github.com/ballista-compute/sqlparser-rs/tree/main/sqlparser_bench , see if you can improve the performance through some profiling or maybe try to solve a performance issue #216

SKalt · 2021-03-21T13:58:25Z

HI, what steps are still needed to get this PR ready for review? Specifically, how might I help?

maxcountryman · 2021-03-22T14:47:02Z

HI, what steps are still needed to get this PR ready for review? Specifically, how might I help?

I'd also love to help.

SKalt · 2021-03-23T20:49:17Z

Looks like the 1st item of business would be updating the CST-experiment branch with the features and fixes now in the mainline. The 0th item of business would be identifying what those features/fixes are.

SKalt · 2021-03-24T12:49:12Z

There's a fair amount of work on sqlparser-rs that's happened since June (that's a good thing!). I'm cautiously optimistic that a careful merge and will work rather than redoing the existing rowan/CST implementation.

git log --oneline wip-cst..real/main output if you're interested

43fef23 (cargo-release) start next development iteration 0.8.1-alpha.0
34cd794 (cargo-release) version 0.8.0
a868ff6 Add release notes
add8991 feat: support sqlite insert or statement (#281)
07342d5 Support parsing multiple show variables. (#290)
f40955e Parse floats without leading number (#294)
6f0b2dc Implement SUBSTRING(col [FROM <expr>] [FOR <expr>]) syntax (#293)
8a214f9 Implement Hive QL Parsing (#235)
17f8eb9 Fix clippy lints (#287)
200ed5e (cargo-release) start next development iteration 0.7.1-alpha.0
e11b80e (cargo-release) version 0.7.0
97cd1c0 Release 0.7.0 instead
9930bdf (cargo-release) start next development iteration 0.6.3-alpha.0
26c281e (cargo-release) version 0.6.2
d66294f Add date
e18e8dc Prepare 0.6.2
94ff468 Support ANALYZE TABLE syntax (#285)
17f2930 Introduce support for EXPLAIN [ANALYZE] [VERBOSE] <STATEMENT> syntax
cbd3c6b Merge pull request #263 from ballista-compute/documentation/cargo-release-config
929fc67 Merge pull request #260 from eyalleshem/single_tables_in_parens
ad72cda [snowflake] Support specifying an alias after `FROM (table_factor)`
d9e044a Extend one_statement_parses_to to also test parsing canonical SQL
4128dfe Introduce tests/test_utils/mod.rs and use it consistently
9f772f0 Add support for Recursive CTEs (#278)
99fb633 Move existing SF tests to sqlparser_snowflake.rs
54be391 Update CHANGELOG
7dc5d4c Follow-up to #275: Bump simple_logger version in Cargo.toml
1ac2083 Support IF NOT EXISTS for CREATE SCHEMA (#276)
926b03a Add parsing for PostgreSQL math operators (#267)
0ac343a Don't publish on the push of any tag
cc4f51f Update releasing.md docs
2f71324 Fix deprecated way of initializing SimpleLogger (#275)
cf7263c Fix typo in the README
01a2a6b Update CHANGELOG (#261)
1c6077c [snowflake] Support single line comments starting with '#' or '//' (#264)
e9aa87f Update bigdecimal requirement from 0.1 to 0.2 (#268)
a5b7524 Fix clippy linting error, use enumerate (#266)
038ef98 Add Ballista to the README as a user of this crate (#262)
fcf1eb1 add a note about cargo release config
118a345 Merge pull request #238 from maxcountryman/feature/gh-action-releases
f500a42 Add snowflake dialect (#259)
2c6c295 Refactor <column definition> parsing (#251)
66505eb Don't fail parsing a column definition with unexpected tokens
23f5c7e Move parse_column_def below parse_columns
1b46e82 Enable dialect specific behaviours in the parser (#254)
3871bbc Enable dependabot for this repository (#256)
5f3a40e Dependency updates (#255)
61431b0 Support TABLE functions in FROM (#253)
a246d5d Undo accidental commit
caeb046 Enable dependabot for this repository
1cc3bf4 Support named arguments in function invocations (#250)
9351efb update release instructions
580e4b1 Merge pull request #252 from ballista-compute/fixup/use-nightly-fmt
6b37c16 fix typo
1a70c6e document initial release process
cac3a8e provide missing license header
76a911b ensure we use nightly with fmt
9c1a5a7 Don't fail parsing ALTER TABLE ADD COLUMN ending with a semicolon (#246)
f8feff4 Add SQLite dialect (#248)
4452f9b Support specifying ASC/DESC in index columns (#249)
9e7e302 Support identifiers quoted with backticks in the MySQL dialect (#247)
1337820 Merge pull request #245 from nickolay/cleanups
a6e30b3 Fix typo in JSONFILE serialization
9a2d86d Change CREATE INDEX serialization to not end with a semicolon
9371652 Fix "unused stmt" warning in tests, with default features
3e880b5 Use consistent style for Display impls
d0db8a2 Run cargo fmt
09ca14f Support dialect-specific auto-increment column options for MySQL and SQLite (#234)
8020b2e Add Postgres-specific PREPARE, EXECUTE and DEALLOCATE (#243)
d2e4340 Support create or replace view/table (#239)
bc9bfae automate crate publishing
f053383 Release 0.6.1
3a42b69 Release 0.6.0 (#232)
583f22b Remove PostgreSQL version of assert (#229)
c24b0e0 Implement ASSERT statement (#226)
5cab189 Add TPCH reggression tests (#221)
f3b9edc update travis badge to point to actions status (#219)
8cc7702 update branch references to `main` (#215)
2a6d5f2 update cargo manifest (#214)
a53f1d2 Support SQLite `CREATE VIRTUAL TABLE` (#209)
0c83e5d Support SQLite's WITHOUT ROWID in CREATE TABLE (#208)
0c82be5 Follow-up to the recent release (CHANGELOG and the bench crate)
1946791 (cargo-release) start next development iteration 0.5.2-alpha.0
05f8992 (cargo-release) version 0.5.1
15d5f71 Add CREATE TABLE AS support (#206)
26361fd Implement ALTER TABLE DROP COLUMN (#148)
faeb7d4 Implement ALTER TABLE ADD COLUMN and RENAME (#203)
fab6e28 Output DataType capitalized (#202)
b24dbe5 Replace FromStr with normal parser function for FileFormat (#201)
68afa2a Make FileFormat case insensitive (#200)
a0f076a Make the cli example print JSON (#197)
f4fbd9b Take slice as input for parse_keywords (#199)
6cdd4a1 Support general "typed string" literals (#187)
34548e8 Change Word::keyword to a enum (#193)
0fe3a8e Use Token::EOF instead of Option<Token> (#195)
2f10153 Add serde support to AST structs and enums (#196)
d9a7491 Various follow-ups to recent pushes
846c52f Allow omitting units after INTERVAL (#184)
d842f49 Add line and column number to TokenizerError (#194)
10b0b7f Update CHANGELOG (#192)
a42121d Use binary search to speed up matching keywords (#191)
af54eb0 Rework github actions, add code coverage (#186)
6e6fae7 Add benchmarks using cargo bench / criterion (#190)

SKalt · 2021-04-03T20:13:25Z

requesting debugging: I merged the main branch into the wip-cst branch, leaving the product on branch cst-refresh of skalt/sqlparser-rs, PR'd against wip-cst. I'm new to both rust and this codebase¹, so I could use help debugging a failing unit test, parser::tests::test_prev_index. It looks like a call to prev_token() is chopping the "version" off of "SELECT version" despite the best efforts of parser.syntax() to consume any remaining tokens.

1: This codebase has amazing comments!

nickolay · 2021-04-05T02:22:00Z

@SKalt Thanks for taking an interest in this and sorry to keep you waiting!

Regarding the failing test, one of the problems is that you didn't sync up SyntaxNode with Token.

More generally, I suggest rebasing instead of trying to blindly merge this:

Otherwise it's really hard to spot the merge mistakes in a huge merge commit.
Redoing the patches is also a good way to get familiar with the code,
You'll be able to see what change introduces the problem
The first few commits from this branch are not really related to the CST work and could probably be landed separately - rebasing will make it possible to land them independently of finishing up the rest of the work.

As for the next steps, I tried to detail the current status and the outstanding tasks in the branch's README, the most important and the hardest one being coming up with a new typed AST design. Try working with raw SyntaxNodes to see what I mean.

alamb · 2021-08-20T18:26:04Z

Hi @nickolay -- sorry for the delay in review. I am going to help out now with this repo and we are working to clear the backlog. Is this PR still something you would like to work on to help contribute?

nickolay added 13 commits June 6, 2020 08:00

Change CASE representation in AST

2f54dcb

Improve code comments

46b2b61

Split parse_statements from parse_sql

06fe8db

`Parser::parse_sql()` is intended to be only a helper, the user should be able to use it or not, without having to re-implement parts of the parsing logic.

Introduce parser.start/reset to support back-tracking

77a8adf

Adjusting parser's `self.index` always felt like a hack, and it becomes more important as I need to store more state in my "lossless syntax tree" prototype.

Update the README for the CST branch

f172113

Add a script to extract the list of AST fields into a table

8f2019a

Part of the `cst` branch, see its README for details.

Temporarily disable tests that require backtracking support in rowan

e088c8a

These are re-enabled in the "Support parser backtracking in the GreenNodeBuilder" commit.

Support parser backtracking in the GreenNodeBuilder, part 1

b5fbe42

Copy GreenNodeBuilder from rowan with the minimal changes to make it work as part of our crate.

Support parser backtracking in the GreenNodeBuilder, part 2

8f6b0d8

This adds a `reset()` method to the GreenNodeBuilder previously imported from rowan and uses it to fix the CST in the testcases where parser has to backtrack in `parse_table_factor`.

Hide rowan behind a non-default cst feature

d977eaa

Add some structure to the CST (TBD)

75c8326

The naming scheme for `SyntaxKind`s and the CST structure is only an example, see the README for the discussion. Note also the 'TBD' markers in the patch.

Add serde to AST nodes

5ba4a55

nickolay mentioned this pull request Jun 7, 2020

Comment handling #175

Open

panarch mentioned this pull request Jun 9, 2020

Add serde support to AST structs and enums #183

Closed

nickolay mentioned this pull request Jun 22, 2020

Add conditional compilation to support different SQL dialect #207

Closed

maxcountryman mentioned this pull request Jul 5, 2020

Experiment with lossless parser maxcountryman/forma#67

Open

This was referenced Jul 14, 2020

[question] Is it possible to parse incomplete SQL? #224

Open

Produce design for writing custom SQL parsers by extending this library #7

Open

Dandandan changed the base branch from master to main July 20, 2020 16:33

nickolay mentioned this pull request Jul 20, 2020

Does not support the AS keyword #230

Open

eyalleshem mentioned this pull request Jul 27, 2020

Dialect-specific parsing and Snowflake JSON support #241

Open

maxcountryman mentioned this pull request Mar 24, 2021

Parse errors should show location of error in input sql string #301

Open

ever0de mentioned this pull request Aug 25, 2021

[Discussion] Remove use of dialect_of macro #339

Open

alamb closed this Dec 8, 2021

olalonde mentioned this pull request Jul 15, 2022

Add support of locations in AST #524

Closed

lovasoa mentioned this pull request Sep 4, 2022

Add an AST visitor #601

Closed

aochagavia mentioned this pull request Mar 21, 2024

Add Span to Tokens, AST nodes #161

Closed

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lossless syntax tree prototype and discussion #189

Lossless syntax tree prototype and discussion #189

nickolay commented Jun 7, 2020

Dandandan commented Jun 7, 2020

maxcountryman commented Jun 8, 2020

panarch commented Jun 9, 2020

hwchen commented Jun 29, 2020

Dandandan commented Jun 29, 2020

SKalt commented Mar 21, 2021

maxcountryman commented Mar 22, 2021

SKalt commented Mar 23, 2021

SKalt commented Mar 24, 2021

SKalt commented Apr 3, 2021

nickolay commented Apr 5, 2021

alamb commented Aug 20, 2021

Lossless syntax tree prototype and discussion #189

Lossless syntax tree prototype and discussion #189

Conversation

nickolay commented Jun 7, 2020

Dandandan commented Jun 7, 2020

maxcountryman commented Jun 8, 2020

panarch commented Jun 9, 2020

hwchen commented Jun 29, 2020

Dandandan commented Jun 29, 2020

SKalt commented Mar 21, 2021

maxcountryman commented Mar 22, 2021

SKalt commented Mar 23, 2021

SKalt commented Mar 24, 2021

SKalt commented Apr 3, 2021

nickolay commented Apr 5, 2021

alamb commented Aug 20, 2021