Support syntax highlighting with tree-sitter #50140

Open
fcurts opened this issue May 19, 2018 · 120 comments
Assignees
Labels
feature-request Request for new features or functionality languages-basic Basic language support issues tokenization Text tokenization
Milestone

Comments

@fcurts

fcurts commented May 19, 2018

Please consider supporting tree-sitter grammars in addition to TextMate grammars. TextMate grammars are incredibly difficult to author and maintain and impossible to get right. The over 500 (!) issues reported against https://github.com/Microsoft/TypeScript-TmLanguage are living proof of this.

This presentation explains the motivation and goals for tree-sitter: https://www.youtube.com/watch?v=a1rC79DHpmY

tree-sitter already ships with Atom and is also used on github.com.

@aeschli aeschli added feature-request Request for new features or functionality languages-basic Basic language support issues labels May 22, 2018
@aeschli
Contributor

aeschli commented May 22, 2018

tree-sitter is cool technology, and we have our eyes on it.
I fully agree with you that TextMate grammars are challenging to implement and have limitations, but it's always a lot of work to create and maintain a grammar. That will not be different with Tree-Sitter.

If you already have experience with specific grammars, e.g. the TypeScript grammar or the C grammar, and think they are superior to the TextMate grammars, let us know. That would be the criterion for us to invest.

@aeschli aeschli added this to the Backlog milestone May 22, 2018
@Kroc

Kroc commented May 22, 2018

This may help in the future with the whole 'embedding one language in another', which is an enfant terrible when it comes to TextMate grammars.

@omniomi
Contributor

omniomi commented May 24, 2018

There's also a request in #5408 for .sublime-syntax, open since Apr 2016, which would also be a step up from .tmLanguage.

While tree-sitter has an awesome concept, I can't say the idea of writing grammars in JavaScript is all that appealing.

@fcurts
Author

fcurts commented May 24, 2018

@omniomi tree-sitter also supports writing grammars in pure JSON if that's what you prefer. The main & dramatic advantage of tree-sitter is that it's a full parsing system and not an ad-hoc, underspecified, horrifyingly complex yet extremely limited regex contraption.

@ahuertabhg

Integrating tree-sitter would help solve this issue dotnet/vscode-csharp#2461

@sean-mcmanus
Contributor

@aeschli Atom has switched to tree-sitter for C++ and is no longer fixing issues with the TextMate grammar: atom/language-c#232 (comment). Please advise on how we should proceed for improving the C++ syntax highlighting/etc. experience.

@maxbrunsfeld

👋 Just to reiterate - the Atom team doesn't intend to disrupt other apps like VSCode that are using modules like language-c. We will definitely continue to accept good PRs that update the text-mate grammar.

The reason that we've been closing issues like that is just to be explicit about the fact that our team won't be prioritizing work on them in the future, since Atom is moving away from text-mate grammars.

@bobbrow
Member

bobbrow commented Oct 1, 2018

@sean-mcmanus we already have our own syntax highlighting stuff (shared with Visual Studio), but haven't been able to use it because we are waiting on an API that lets us turn off tmLanguage and provide the coloring ourselves: #585. Moving to tree-sitter is only relevant to us so long as #585 is incomplete.

@bklebe

bklebe commented Oct 22, 2018

Tree-sitter is extensible for other programming languages, and in particular already supports Rust and Ruby as well. Are the Visual Studio APIs ready to be extended with new language support in those ways?

@Geobert

Geobert commented Nov 2, 2018

I'm wondering if tree-sitter can solve this #51157

@Stanzilla

@bobbrow is that a final decision? Would have been nice to share the code with Atom here.

@Astrantia

No plans for this in 2018?

@github-yxb

It's going to be 2019!!

@sean-mcmanus
Contributor

sean-mcmanus commented Jan 8, 2019

Yeah, Atom 1.33 ships with tree sitter and most of the C/C++ colorization bugs have been fixed with it -- the Atom/language-c team is closing the non-tree sitter bugs.

@fcurts
Author

fcurts commented Feb 13, 2019

> I fully agree with you that TextMate grammars are challenging to implement and have limitations, but it's always a lot of work to create and maintain a grammar. That will not be different with Tree-Sitter.

@aeschli I meanwhile re-implemented my TextMate grammar with tree-sitter because the former proved unmaintainable (templated regexes up to 400 characters long, etc.). Developing the tree-sitter grammar and highlighter from scratch took three days, compared to three weeks for the TextMate grammar. The new highlighter works better and is dramatically easier to maintain. I wish I could use it in VS Code as well.

@jeff-hykin

jeff-hykin commented Dec 17, 2021

@jasonwilliams true I guess that is the topic of that post.

Sadly though, it is definitely not just a tokenizer that can be swapped out. Semantic highlighting loosened this somewhat, but from my understanding the theme system, internal functions, code folding, and the syntax highlighting tags, along with many assumptions/optimizations, are still tightly coupled to TextMate.

Here's a post that covers some of the early integrations. You may already know about it
https://code.visualstudio.com/blogs/2017/02/08/syntax-highlighting-optimizations

Atom was less inter-dependent internally because it was designed to be hackable, but it still had/has per-language manually-written conversion layers from tree sitter tokens to TextMate tokens because it's just so hard to fully do the tree sitter justice and not break everything.

@jasonwilliams
Contributor

jasonwilliams commented Apr 3, 2022

I took a look at the current state of play, and to see if it's possible to break this up.

It looks like, for each language, it goes off to fetch a tokenization implementation (which, for now, happens to be the TMTokenization class, created here).

I wonder if as a start this can also support a treeSitterTokenization class which can then map its tree types back into textmate types so that things further upstream continue to work. It would need to implement ITokenizationSupport.

My understanding is that https://www.npmjs.com/package/web-tree-sitter would work on all platforms, web and desktop.

Tokenize is currently called line by line and expects a result in return.
The new tokenization class would need to return a tokenizationResult to the implementation of EncodedTokenizationResult

From what I understand this is a similar approach the Atom team took when beginning to migrate.
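
Since tokenization is requested one line at a time while tree-sitter parses the whole document, one way to bridge the two is to keep the results of a full parse and slice them per line on request. A minimal sketch of that slicing step, assuming a hypothetical `Capture` shape (a name plus row/column ranges); these are illustrative types, not VS Code's or web-tree-sitter's actual ones:

```typescript
// Hypothetical shapes for illustration; not VS Code's or web-tree-sitter's real types.
interface Point { row: number; column: number; }
interface Capture { name: string; start: Point; end: Point; }
interface LineToken { startColumn: number; endColumn: number; name: string; }

// Return the part of each capture that falls on `row`, clipped to that line.
function tokensForLine(captures: Capture[], row: number, lineLength: number): LineToken[] {
  const tokens: LineToken[] = [];
  for (const c of captures) {
    if (c.end.row < row || c.start.row > row) continue; // capture does not touch this line
    const startColumn = c.start.row < row ? 0 : c.start.column;
    const endColumn = c.end.row > row ? lineLength : c.end.column;
    if (endColumn > startColumn) tokens.push({ startColumn, endColumn, name: c.name });
  }
  return tokens.sort((a, b) => a.startColumn - b.startColumn);
}
```

A multi-line capture (a block comment, say) then shows up as one clipped token on each line it covers, which is roughly what a line-based consumer expects.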

Grammar Registry

One of the first tasks would be to support tree-sitter grammars alongside TextMate. A “type” field can be added and the path can point to the wasm file; this would be backwards compatible, with TextMate being the default. See atom/atom@9762685

The current grammar registry lives in vscode-textmate and it is specific to TMGrammars. You would most likely need a resolver and a registry equivalent which only fetch grammars that have a type tree-sitter.
You would probably want the TMGrammar resolver to filter out grammars that are of type tree-sitter.
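
To make the “type” field idea concrete, here is a rough sketch. The `GrammarContribution` shape is simplified and the `type` field is the proposed addition, so treat every name here as hypothetical rather than as VS Code's actual contribution schema:

```typescript
// Hypothetical contribution shape: `type` is the proposed new field, not an existing one.
type GrammarType = "textmate" | "tree-sitter";

interface GrammarContribution {
  language: string;
  path: string;        // a TextMate grammar file, or a .wasm file for tree-sitter
  type?: GrammarType;  // omitted means "textmate", so existing extensions keep working
}

function grammarType(g: GrammarContribution): GrammarType {
  return g.type ?? "textmate";
}

// Split contributions so that each registry only sees the grammars it can load.
function partitionGrammars(all: GrammarContribution[]) {
  return {
    textmate: all.filter(g => grammarType(g) === "textmate"),
    treeSitter: all.filter(g => grammarType(g) === "tree-sitter"),
  };
}
```

Because a missing `type` falls back to "textmate", grammars published today would end up in the TextMate registry unchanged.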

Another option is to have a new format entirely a la what anycode does.

I believe those changes could be done today without causing breaking changes.

TreeSitterTokenization

There would need to be a treeSitterTokenization class which is equivalent to the TMTokenization class. This class would also call on the grammar registry to fetch the right grammars and load them in. I think for now the tokenizer class may need to return tokens in the same format, to stay compatible with everything upstream?

There would need to be some mapping from tree-sitter types back to TM types; https://github.com/georgewfraser/vscode-tree-sitter does this, so I'm guessing it's possible. If you map back, there aren't a huge number of changes, so you could probably stop here and change things further upstream at a later point.
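
As a rough idea of what such a mapping could look like, here is a sketch with a small lookup table and a fallback that strips dotted suffixes. The capture names and target scopes are examples picked for illustration, not an agreed standard and not taken from vscode-tree-sitter:

```typescript
// Illustrative table only; these capture names and scopes are examples.
const captureToScope = new Map<string, string>([
  ["comment", "comment"],
  ["string", "string.quoted"],
  ["keyword", "keyword.control"],
  ["function", "entity.name.function"],
  ["type", "entity.name.type"],
]);

// Resolve "function.method" by trying "function.method" first, then "function".
function toTextMateScope(capture: string): string | undefined {
  let name = capture;
  while (name.length > 0) {
    const scope = captureToScope.get(name);
    if (scope !== undefined) return scope;
    const dot = name.lastIndexOf(".");
    if (dot < 0) return undefined; // no shorter prefix left to try
    name = name.slice(0, dot);
  }
  return undefined;
}
```

Captures without a mapping return `undefined`, leaving it to the caller to decide on a default scope.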

@zm-cttae

Is there real tangible data on the performance change from using Tree-sitter?

@jasonwilliams
Contributor

@meche-gh #161479

@haikyuu

haikyuu commented Feb 2, 2023

Slightly tangential to this, I find tree-sitter to be interesting for more than syntax highlighting.
It's used in neovim, helix and other editors to power some very useful features: incremental selection, folding, indentation out of the box.

And most importantly, external modules built around tree sitter are extremely useful.

I'd say we approach including tree-sitter in VSCode more holistically:

  • How to make tree-sitter trees available for extensions: (attach the tree to the open file, get parent tree of current selection, and other utils ...)
  • How grammars are loaded: Currently, helix and nvim have a runtime directory that contains tree sitter grammars and queries. And it's picked up by the editor when it starts.
  • What utilities to bake into VSCode by default?
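
As an illustration of the kind of utility the first bullet has in mind, here is a sketch that finds the smallest syntax node enclosing a selection, which is the building block for incremental selection. The `SyntaxNode` shape is a simplified stand-in invented for this example, not an existing VS Code or tree-sitter API:

```typescript
// Simplified, hypothetical node shape for illustration.
interface SyntaxNode {
  type: string;
  startIndex: number; // inclusive offset
  endIndex: number;   // exclusive offset
  children: SyntaxNode[];
}

// Walk down from the root to the smallest node that fully contains [start, end).
function smallestEnclosingNode(root: SyntaxNode, start: number, end: number): SyntaxNode | undefined {
  if (start < root.startIndex || end > root.endIndex) return undefined;
  let node = root;
  for (;;) {
    const child = node.children.find(c => c.startIndex <= start && end <= c.endIndex);
    if (!child) return node; // no child contains the whole range
    node = child;
  }
}
```

Growing a selection would then amount to selecting the range of the returned node, or of its parent if the selection already matches that node.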

These decisions will likely impact the first inclusion of tree-sitter into VSCode, be it syntax highlighting or others.

I don't know if it's clear or not, but including tree-sitter in VSCode is a huge benefit because it makes the editor aware of the code instead of treating it like text. It may start with syntax highlighting (which is already somewhat solved by TextMate grammars) but doesn't end there.

If the benefit of having tree-sitter syntax highlighting isn't very big, I'd say it would be better to start with other simpler features that can live at the borders of VSCode as opposed to being in the core (syntax highlighting isn't simple to get right and not critical since it is working relatively well atm.)

Once the basic setup of tree-sitter is done, a PR to add syntax highlighting will be much easier to build, review and merge.

@jasonwilliams
Contributor

jasonwilliams commented Feb 2, 2023

Just giving my update and 2 cents.

@haikyuu those are some interesting thoughts, and I agree it’s a huge benefit all round.

I don’t agree about syntax highlighting being a solved problem because even though it “works” the performance is hitting its ceiling. I wrote about it here https://jason-williams.co.uk/posts/speeding-up-vscode-extensions-in-2022/ (see Tree Sitter section). If VSCode wants to stay competitive it will eventually need to migrate towards this in my opinion. Last time I looked at the performance of large files a lot of time was attributed to parsing.

I do agree with starting simple, but this will need to be in the core. I don’t want to see us go down a path of “everyone needs a tree sitter extension”, not that that’s what you were suggesting, but it would be good to see some roadmap for actually having it be the primary syntax system.
My comment above looks into some first steps of adding it as a service then utilising it bit by bit, but still having the textmate system used primarily. This should provide at least some migration path for extensions going forward.

I did look into branching off #161479 but it’s a monumental effort as it touches so many parts of the code base. So it isn’t something I could take on alone, especially if the maintainers are already planning to work on this (we don’t know, they are quiet on this topic, although there are still positive signals that they’re interested in investigating).

ABI Stability

There was concern over stability which may have been the reason progress in this area went quiet.

@alexdima did raise concerns around the ABI potentially changing causing extensions to break.

However, I reached out to the Tree Sitter maintainers, who declared the library stable and said there shouldn’t be any backwards-incompatible changes.
Secondly, Neovim, which has been using Tree Sitter for over 2 years, has only had forwards-compatibility issues, not backwards ones. The former are more easily solved by having extensions build trees against a specific version before publishing.
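
The backwards/forwards distinction can be pictured as a range check when a grammar is loaded: a grammar built against an older ABI should keep loading (backwards compatibility), while one built against a newer ABI than the editor's runtime understands will not (the forwards-compatibility case). A minimal sketch, with hypothetical names and placeholder version numbers rather than tree-sitter's real values:

```typescript
// Names and version numbers are placeholders for illustration.
interface RuntimeAbi { min: number; current: number; }

type AbiCheck = "ok" | "grammar-too-old" | "grammar-too-new";

function checkGrammarAbi(grammarAbi: number, runtime: RuntimeAbi): AbiCheck {
  if (grammarAbi < runtime.min) return "grammar-too-old";     // would be a backwards-compatibility break
  if (grammarAbi > runtime.current) return "grammar-too-new"; // forwards-compatibility issue
  return "ok";
}
```

Under this picture, an extension that builds its grammar against the version the editor ships would stay inside the supported range.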

@haikyuu

haikyuu commented Feb 2, 2023

@jasonwilliams I agree this should land in core for an optimal experience. And the performance benefit is not to be neglected (I am personally using neovim at the moment and everything feels way faster).

@zm-cttae-archive

zm-cttae-archive commented Feb 2, 2023

> There would need to be some mapping somehow from tree-sitter types back to TM Types, although vscode-tree-sitter does this so I'm guessing it's possible. If you map back there aren't a huge amount of changes, you could probably stop here and change things further upstream at a later point.

If the Tree-Sitter community wants to scale to existing themes, they need to plan their token names ahead of time and standardise them the way that Sublime and TextMate have done, and also the way Microsoft began to do with the LSP token format a couple of months in.

@zm-cttae-archive

Nvm, if the mapping is done by the grammar owner, that would be a small portion of the current effort needed to maintain TextMate grammars. It would suck more if there was no TM grammar, but even then the mapping would only be painful once.

@MixusMinimax

> I fully agree with you that TextMate grammars are challenging to implement and have limitations, but it's always a lot of work to create and maintain a grammar. That will not be different with Tree-Sitter.

While I agree with the fact that tree-sitter grammars also have their difficulties, the difference is that many of us need to define a tree-sitter grammar anyway, since you can use it for other things, like an LSP server, or a compiler. A textmate grammar would have to be maintained in parallel with whatever other parser generator you're using for other components.

@codethief

codethief commented Feb 26, 2024

Am I reading #161479 (comment) correctly that tree-sitter support is not going to happen any time soon? :\

@texastoland

The team hasn't been active in any linked issues, hasn't publicly expressed intent to change direction, and arguably has competing interests with unifying VS Code syntax definitions with either VS and/or Monaco/Monarch. I consider this issue closed in practice.

@serverhorror

> unifying VS Code syntax definitions with either VS and/or Monaco/Monarch

so ... where do we go to ask ... "either VS and/or Monaco/Monarch" to adopt treesitter :)

@texastoland

texastoland commented Feb 28, 2024

@heartacker
Contributor

#206739 🛩️

Code Editor

Explore using the new EditContext API #204371 @hediet
💪 Explore hover enriching #195394 @aiday-mar @hediet
🔴 Explore tree-sitter parser ecosystem @alexr00

@michaelblyons

#207416 for those who missed it.
