-
Notifications
You must be signed in to change notification settings - Fork 567
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Lossless syntax tree prototype and discussion #189
Conversation
`Parser::parse_sql()` is intended to be only a helper, the user should be able to use it or not, without having to re-implement parts of the parsing logic.
Adjusting parser's `self.index` always felt like a hack, and it becomes more important as I need to store more state in my "lossless syntax tree" prototype.
Part of the `cst` branch, see its README for details.
These are re-enabled in the "Support parser backtracking in the GreenNodeBuilder" commit.
This adds the common code to allow building a lossless syntax tree based on the `rowan` crate, in parallel with our current AST. The motivation is discussed in the README. The CST is a flat list of tokens by default, structure can be added incrementally in future commits. Two changes are split into their own commits, that should be squashed with this, once the GreenNodeBuilder change is upstreamed: - Fix `TBD: rowan's builder does not allow reverting to a checkpoint` - Hide rowan behind a non-default `cst` feature Other tasks to be done later, marked with `TBD` in the code: - Fix `Token`/`SyntaxKind` duplication, changing the former to store a slice of the original source code, e.g. `(SyntaxKind, SmolStr)` This should also fix the currently disabled test-cases where `Token`'s `to_string()` does not return the original string: * `parse_escaped_single_quote_string_predicate` * `parse_literal_string` (the case of `HexLiteralString`) - Fix the hack in `parse_keyword()` to remap the token type (RA has `bump_remap() for this) - Fix the `Parser::pending` hack (needs rethinking parser's API) Probably related is the issue of handling whitespace and comments: the way this prototype handles it looks wrong.
Copy GreenNodeBuilder from rowan with the minimal changes to make it work as part of our crate.
This adds a `reset()` method to the GreenNodeBuilder previously imported from rowan and uses it to fix the CST in the testcases where parser has to backtrack in `parse_table_factor`.
The naming scheme for `SyntaxKind`s and the CST structure is only an example, see the README for the discussion. Note also the 'TBD' markers in the patch.
I am using sqlparser-rs for a prototype. Locations of ast nodes (for parser error messages, but also generated type errors, linting messages, etc.) will be incredibly valuable for this! I will try reviewing this prototype. |
This is great. I'm very interested in this sort of thing since it opens the door to all sorts of things that simply aren't possible with the current implementation. I'm personally interested in using this for a project to format SQL in a way that preserves comments. |
This is awesome, it is obviously worth fully enough, I'd really appreciate. I'm currently making a new sql database engine based on key-value storage, planning to release the first version in Aug, this year under MPL 2.0 license. Attractive points to see may be like these below,
And the parser I've used was nom-sql but it didn't have many necessary features I need and looks not updated recently, so I was considering to find some alternatives. I'll really look forward to see this new change. thanks. |
This looks pretty interesting. I'm wondering if this will also increase performance by being zero-copy? (If I'm misunderstanding the design, let me know). For my use-case, speed is a priority over features, assuming basic parsing is equal. |
@hwchen I think the main reason of the change is not performance, but rather functionality: easily allow tools to generate the same output, keeping comments, capitalization, spacing, etc. This allows for better errors, tools like formatters, etc. If you are interested in contributing to performance (I think sqlparser is already quite fast in the general case), I would suggest to see if you can add something to the benchmarks https://github.com/ballista-compute/sqlparser-rs/tree/main/sqlparser_bench , see if you can improve the performance through some profiling or maybe try to solve a performance issue #216 |
HI, what steps are still needed to get this PR ready for review? Specifically, how might I help? |
I'd also love to help. |
Looks like the 1st item of business would be updating the CST-experiment branch with the features and fixes now in the mainline. The 0th item of business would be identifying what those features/fixes are. |
There's a fair amount of work on
|
requesting debugging: I merged the main branch into the wip-cst branch, leaving the product on branch cst-refresh of skalt/sqlparser-rs, PR'd against wip-cst. I'm new to both rust and this codebase1, so I could use help debugging a failing unit test, 1: This codebase has amazing comments! |
@SKalt Thanks for taking an interest in this and sorry to keep you waiting! Regarding the failing test, one of the problems is that you didn't sync up More generally, I suggest rebasing instead of trying to blindly merge this:
As for the next steps, I tried to detail the current status and the outstanding tasks in the branch's README, the most important and the hardest one being coming up with a new typed AST design. Try working with raw |
Hi @nickolay -- sorry for the delay in review. I am going to help out now with this repo and we are working to clear the backlog. Is this PR still something you would like to work on to help contribute? |
I'm opening this draft PR to share the prototype I have at https://github.com/nickolay/sqlparser-rs/tree/wip-cst
I hope that people, who contribute, already use, or want to use
sqlparser-rs
, react here (if only via a GitHub reaction) so that we can judge if this is something worth pursuing as part of this library or if I'm completely off-track. There's a detailed write-up in the branch's README, but the short version is that current AST is lossy (w.r.t whitespace, comments, unimportant syntax, and source code locations) and to fix that I propose an additional intermediate layer, which is lossless. Some changes to the AST will be necessary too, but to what extent is up for discussion.Apart from gauging the general mood and collecting feedback, I hope that documenting what I have so far will enable someone to beat me at coming up with the specific design for the CST/AST, as I have not been able to dedicate much time to this.
I'll include a quote from the motivational section of the README: