-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Typed Template Haskell quotations / splices not handled correctly #116
Comments
Added to the TODO list for my in-progress redesign, and fixed the most egregious bug for now. But I don't see what's broken with the typed splices – do you have an example? |
Thanks for the quick response. I think you're right that typed splices are fine, actually. I didn't have any top-level typed splices when I experienced the problem, so I couldn't observe that they're handled ok as they were occurring inside of typed quotations. But AFAICS, typed splices are indeed handled ok. |
cool, thanks for the report! |
* Parses the GHC codebase! I'm using a trimmed set of the source directories of the compiler and most core libraries in [this repo](https://github.com/tek/tsh-test-ghc). This used to break horribly in many files because explicit brace layouts weren't supported very well. * Faster in most cases! Here are a few simple benchmarks to illustrate the difference, not to be taken _too_ seriously, using the test codebases in `test/libs`: Old: ``` effects: 32ms postgrest: 91ms ivory: 224ms polysemy: 84ms semantic: 1336ms haskell-language-server: 532ms flatparse: 45ms ``` New: ``` effects: 29ms postgrest: 64ms ivory: 178ms polysemy: 70ms semantic: 692ms haskell-language-server: 390ms flatparse: 36ms ``` GHC's `compiler` directory takes 3000ms, but is among the fastest repos for per-line and per-character times! To get more detailed info (including new codebases I added, consisting mostly of core libraries), run `test/parse-libs`. I also added an interface for running `hyperfine`, exposed as a Nix app – execute `nix run .#bench-libs -- stm mtl transformers` with the desired set of libraries in `test/libs` or `test/libs/tsh-test-ghc/libraries`. * Smaller size of the shared object. `tree-sitter generate` produces a `haskell.so` with a size of 4.4MB for the old grammar, and 3.0MB for the new one. * Significantly faster time to generate, and slightly faster build. On my machine, generation takes 9.34s vs 2.85s, and compiling takes 3.75s vs 3.33s. * All terminals now have proper text nodes when possible, like the `.` in modules. Fixes #102, #107, #115 (partially?). * Semicolons are now forced after newlines even if the current parse state doesn't allow them, to fail alternative interpretations in GLR conflicts that sometimes produced top-level expression splices for valid (and invalid) code. Fixes #89, #105, #111. * Comments aren't pulled into preceding layouts anymore. Fixes #82, #109. (Can probably still be improved with a few heuristics for e.g. postfix haddock) * Similarly, whitespace is kept out of layout-related nodes as much as possible. Fixes #74. * Hashes can now be operators in all situations, without sacrificing unboxed tuples. Fixes #108. * Expression quotes are now handled separately from quasiquotes and their contents parsed properly. Fixes #116. * Explicit brace layouts are now handled correctly. Fixes #92. * Function application with multiple block arguments is handled correctly. * Unicode categories for identifiers now match GHC, and the full unicode character set is supported for things like prefix operator detection. * Haddock comments have dedicated nodes now. * Use named precedences instead of closely replicating the GHC parser's productions. * Different layouts are tracked and closed with their special cases considered. In particular, multi-way if now has layout. * Fixed CPP bug where mid-line `#endif` would be false positive. * CPP only matches legal directives now. * Generally more lenient parsing than GHC, and in the presence of errors: * Missing closing tokens at EOF are tolerated for: * CPP * Comment * TH Quotation * Multiple semicolons in some positions like `if/then` * Unboxed tuples and sums are allowed to have arbitrary numbers of filled positions * List comprehensions can have multiple sets of qualifiers (`ParallelListComp`). * Deriving clauses after GADTs don't require layout anymore. * Newtype instance heads are working properly now. * Escaping newlines in comments and cpp works now. Escaping newlines on regular lines won't be implemented. * One remaining issue is that qualified left sections that contain infix ops are broken: `(a + a A.+)` I haven't managed to figure out a good strategy for this – my suspicion is that it's impossible to correctly parse application, infix and negation without lexing all qualified names in the scanner. I will try that out at some point, but for now I'm planning to just accept that this one thing doesn't work. For what it's worth, none of the codebases I use for testing contain this construct in a way that breaks parsing. * Repo now includes a Haskell program that generates C code for classifying characters as belonging to some sets of Unicode categories, using bitmaps. I might need to change this to write them all to a shared file, so the set of source files stays the same.
* Parses the GHC codebase! I'm using a trimmed set of the source directories of the compiler and most core libraries in [this repo](https://github.com/tek/tsh-test-ghc). This used to break horribly in many files because explicit brace layouts weren't supported very well. * Faster in most cases! Here are a few simple benchmarks to illustrate the difference, not to be taken _too_ seriously, using the test codebases in `test/libs`: Old: ``` effects: 32ms postgrest: 91ms ivory: 224ms polysemy: 84ms semantic: 1336ms haskell-language-server: 532ms flatparse: 45ms ``` New: ``` effects: 29ms postgrest: 64ms ivory: 178ms polysemy: 70ms semantic: 692ms haskell-language-server: 390ms flatparse: 36ms ``` GHC's `compiler` directory takes 3000ms, but is among the fastest repos for per-line and per-character times! To get more detailed info (including new codebases I added, consisting mostly of core libraries), run `test/parse-libs`. I also added an interface for running `hyperfine`, exposed as a Nix app – execute `nix run .#bench-libs -- stm mtl transformers` with the desired set of libraries in `test/libs` or `test/libs/tsh-test-ghc/libraries`. * Smaller size of the shared object. `tree-sitter generate` produces a `haskell.so` with a size of 4.4MB for the old grammar, and 3.0MB for the new one. * Significantly faster time to generate, and slightly faster build. On my machine, generation takes 9.34s vs 2.85s, and compiling takes 3.75s vs 3.33s. * All terminals now have proper text nodes when possible, like the `.` in modules. Fixes #102, #107, #115 (partially?). * Semicolons are now forced after newlines even if the current parse state doesn't allow them, to fail alternative interpretations in GLR conflicts that sometimes produced top-level expression splices for valid (and invalid) code. Fixes #89, #105, #111. * Comments aren't pulled into preceding layouts anymore. Fixes #82, #109. (Can probably still be improved with a few heuristics for e.g. postfix haddock) * Similarly, whitespace is kept out of layout-related nodes as much as possible. Fixes #74. * Hashes can now be operators in all situations, without sacrificing unboxed tuples. Fixes #108. * Expression quotes are now handled separately from quasiquotes and their contents parsed properly. Fixes #116. * Explicit brace layouts are now handled correctly. Fixes #92. * Function application with multiple block arguments is handled correctly. * Unicode categories for identifiers now match GHC, and the full unicode character set is supported for things like prefix operator detection. * Haddock comments have dedicated nodes now. * Use named precedences instead of closely replicating the GHC parser's productions. * Different layouts are tracked and closed with their special cases considered. In particular, multi-way if now has layout. * Fixed CPP bug where mid-line `#endif` would be false positive. * CPP only matches legal directives now. * Generally more lenient parsing than GHC, and in the presence of errors: * Missing closing tokens at EOF are tolerated for: * CPP * Comment * TH Quotation * Multiple semicolons in some positions like `if/then` * Unboxed tuples and sums are allowed to have arbitrary numbers of filled positions * List comprehensions can have multiple sets of qualifiers (`ParallelListComp`). * Deriving clauses after GADTs don't require layout anymore. * Newtype instance heads are working properly now. * Escaping newlines in comments and cpp works now. Escaping newlines on regular lines won't be implemented. * One remaining issue is that qualified left sections that contain infix ops are broken: `(a + a A.+)` I haven't managed to figure out a good strategy for this – my suspicion is that it's impossible to correctly parse application, infix and negation without lexing all qualified names in the scanner. I will try that out at some point, but for now I'm planning to just accept that this one thing doesn't work. For what it's worth, none of the codebases I use for testing contain this construct in a way that breaks parsing. * Repo now includes a Haskell program that generates C code for classifying characters as belonging to some sets of Unicode categories, using bitmaps. I might need to change this to write them all to a shared file, so the set of source files stays the same.
* Parses the GHC codebase! I'm using a trimmed set of the source directories of the compiler and most core libraries in [this repo](https://github.com/tek/tsh-test-ghc). This used to break horribly in many files because explicit brace layouts weren't supported very well. * Faster in most cases! Here are a few simple benchmarks to illustrate the difference, not to be taken _too_ seriously, using the test codebases in `test/libs`: Old: ``` effects: 32ms postgrest: 91ms ivory: 224ms polysemy: 84ms semantic: 1336ms haskell-language-server: 532ms flatparse: 45ms ``` New: ``` effects: 29ms postgrest: 64ms ivory: 178ms polysemy: 70ms semantic: 692ms haskell-language-server: 390ms flatparse: 36ms ``` GHC's `compiler` directory takes 3000ms, but is among the fastest repos for per-line and per-character times! To get more detailed info (including new codebases I added, consisting mostly of core libraries), run `test/parse-libs`. I also added an interface for running `hyperfine`, exposed as a Nix app – execute `nix run .#bench-libs -- stm mtl transformers` with the desired set of libraries in `test/libs` or `test/libs/tsh-test-ghc/libraries`. * Smaller size of the shared object. `tree-sitter generate` produces a `haskell.so` with a size of 4.4MB for the old grammar, and 3.0MB for the new one. * Significantly faster time to generate, and slightly faster build. On my machine, generation takes 9.34s vs 2.85s, and compiling takes 3.75s vs 3.33s. * All terminals now have proper text nodes when possible, like the `.` in modules. Fixes #102, #107, #115 (partially?). * Semicolons are now forced after newlines even if the current parse state doesn't allow them, to fail alternative interpretations in GLR conflicts that sometimes produced top-level expression splices for valid (and invalid) code. Fixes #89, #105, #111. * Comments aren't pulled into preceding layouts anymore. Fixes #82, #109. (Can probably still be improved with a few heuristics for e.g. postfix haddock) * Similarly, whitespace is kept out of layout-related nodes as much as possible. Fixes #74. * Hashes can now be operators in all situations, without sacrificing unboxed tuples. Fixes #108. * Expression quotes are now handled separately from quasiquotes and their contents parsed properly. Fixes #116. * Explicit brace layouts are now handled correctly. Fixes #92. * Function application with multiple block arguments is handled correctly. * Unicode categories for identifiers now match GHC, and the full unicode character set is supported for things like prefix operator detection. * Haddock comments have dedicated nodes now. * Use named precedences instead of closely replicating the GHC parser's productions. * Different layouts are tracked and closed with their special cases considered. In particular, multi-way if now has layout. * Fixed CPP bug where mid-line `#endif` would be false positive. * CPP only matches legal directives now. * Generally more lenient parsing than GHC, and in the presence of errors: * Missing closing tokens at EOF are tolerated for: * CPP * Comment * TH Quotation * Multiple semicolons in some positions like `if/then` * Unboxed tuples and sums are allowed to have arbitrary numbers of filled positions * List comprehensions can have multiple sets of qualifiers (`ParallelListComp`). * Deriving clauses after GADTs don't require layout anymore. * Newtype instance heads are working properly now. * Escaping newlines in comments and cpp works now. Escaping newlines on regular lines won't be implemented. * One remaining issue is that qualified left sections that contain infix ops are broken: `(a + a A.+)` I haven't managed to figure out a good strategy for this – my suspicion is that it's impossible to correctly parse application, infix and negation without lexing all qualified names in the scanner. I will try that out at some point, but for now I'm planning to just accept that this one thing doesn't work. For what it's worth, none of the codebases I use for testing contain this construct in a way that breaks parsing. * Repo now includes a Haskell program that generates C code for classifying characters as belonging to some sets of Unicode categories, using bitmaps. I might need to change this to write them all to a shared file, so the set of source files stays the same.
* Parses the GHC codebase! I'm using a trimmed set of the source directories of the compiler and most core libraries in [this repo](https://github.com/tek/tsh-test-ghc). This used to break horribly in many files because explicit brace layouts weren't supported very well. * Faster in most cases! Here are a few simple benchmarks to illustrate the difference, not to be taken _too_ seriously, using the test codebases in `test/libs`: Old: ``` effects: 32ms postgrest: 91ms ivory: 224ms polysemy: 84ms semantic: 1336ms haskell-language-server: 532ms flatparse: 45ms ``` New: ``` effects: 29ms postgrest: 64ms ivory: 178ms polysemy: 70ms semantic: 692ms haskell-language-server: 390ms flatparse: 36ms ``` GHC's `compiler` directory takes 3000ms, but is among the fastest repos for per-line and per-character times! To get more detailed info (including new codebases I added, consisting mostly of core libraries), run `test/parse-libs`. I also added an interface for running `hyperfine`, exposed as a Nix app – execute `nix run .#bench-libs -- stm mtl transformers` with the desired set of libraries in `test/libs` or `test/libs/tsh-test-ghc/libraries`. * Smaller size of the shared object. `tree-sitter generate` produces a `haskell.so` with a size of 4.4MB for the old grammar, and 3.0MB for the new one. * Significantly faster time to generate, and slightly faster build. On my machine, generation takes 9.34s vs 2.85s, and compiling takes 3.75s vs 3.33s. * All terminals now have proper text nodes when possible, like the `.` in modules. Fixes #102, #107, #115 (partially?). * Semicolons are now forced after newlines even if the current parse state doesn't allow them, to fail alternative interpretations in GLR conflicts that sometimes produced top-level expression splices for valid (and invalid) code. Fixes #89, #105, #111. * Comments aren't pulled into preceding layouts anymore. Fixes #82, #109. (Can probably still be improved with a few heuristics for e.g. postfix haddock) * Similarly, whitespace is kept out of layout-related nodes as much as possible. Fixes #74. * Hashes can now be operators in all situations, without sacrificing unboxed tuples. Fixes #108. * Expression quotes are now handled separately from quasiquotes and their contents parsed properly. Fixes #116. * Explicit brace layouts are now handled correctly. Fixes #92. * Function application with multiple block arguments is handled correctly. * Unicode categories for identifiers now match GHC, and the full unicode character set is supported for things like prefix operator detection. * Haddock comments have dedicated nodes now. * Use named precedences instead of closely replicating the GHC parser's productions. * Different layouts are tracked and closed with their special cases considered. In particular, multi-way if now has layout. * Fixed CPP bug where mid-line `#endif` would be false positive. * CPP only matches legal directives now. * Generally more lenient parsing than GHC, and in the presence of errors: * Missing closing tokens at EOF are tolerated for: * CPP * Comment * TH Quotation * Multiple semicolons in some positions like `if/then` * Unboxed tuples and sums are allowed to have arbitrary numbers of filled positions * List comprehensions can have multiple sets of qualifiers (`ParallelListComp`). * Deriving clauses after GADTs don't require layout anymore. * Newtype instance heads are working properly now. * Escaping newlines in comments and cpp works now. Escaping newlines on regular lines won't be implemented. * One remaining issue is that qualified left sections that contain infix ops are broken: `(a + a A.+)` I haven't managed to figure out a good strategy for this – my suspicion is that it's impossible to correctly parse application, infix and negation without lexing all qualified names in the scanner. I will try that out at some point, but for now I'm planning to just accept that this one thing doesn't work. For what it's worth, none of the codebases I use for testing contain this construct in a way that breaks parsing. * Repo now includes a Haskell program that generates C code for classifying characters as belonging to some sets of Unicode categories, using bitmaps. I might need to change this to write them all to a shared file, so the set of source files stays the same.
* Parses the GHC codebase! I'm using a trimmed set of the source directories of the compiler and most core libraries in [this repo](https://github.com/tek/tsh-test-ghc). This used to break horribly in many files because explicit brace layouts weren't supported very well. * Faster in most cases! Here are a few simple benchmarks to illustrate the difference, not to be taken _too_ seriously, using the test codebases in `test/libs`: Old: ``` effects: 32ms postgrest: 91ms ivory: 224ms polysemy: 84ms semantic: 1336ms haskell-language-server: 532ms flatparse: 45ms ``` New: ``` effects: 29ms postgrest: 64ms ivory: 178ms polysemy: 70ms semantic: 692ms haskell-language-server: 390ms flatparse: 36ms ``` GHC's `compiler` directory takes 3000ms, but is among the fastest repos for per-line and per-character times! To get more detailed info (including new codebases I added, consisting mostly of core libraries), run `test/parse-libs`. I also added an interface for running `hyperfine`, exposed as a Nix app – execute `nix run .#bench-libs -- stm mtl transformers` with the desired set of libraries in `test/libs` or `test/libs/tsh-test-ghc/libraries`. * Smaller size of the shared object. `tree-sitter generate` produces a `haskell.so` with a size of 4.4MB for the old grammar, and 3.0MB for the new one. * Significantly faster time to generate, and slightly faster build. On my machine, generation takes 9.34s vs 2.85s, and compiling takes 3.75s vs 3.33s. * All terminals now have proper text nodes when possible, like the `.` in modules. Fixes #102, #107, #115 (partially?). * Semicolons are now forced after newlines even if the current parse state doesn't allow them, to fail alternative interpretations in GLR conflicts that sometimes produced top-level expression splices for valid (and invalid) code. Fixes #89, #105, #111. * Comments aren't pulled into preceding layouts anymore. Fixes #82, #109. (Can probably still be improved with a few heuristics for e.g. postfix haddock) * Similarly, whitespace is kept out of layout-related nodes as much as possible. Fixes #74. * Hashes can now be operators in all situations, without sacrificing unboxed tuples. Fixes #108. * Expression quotes are now handled separately from quasiquotes and their contents parsed properly. Fixes #116. * Explicit brace layouts are now handled correctly. Fixes #92. * Function application with multiple block arguments is handled correctly. * Unicode categories for identifiers now match GHC, and the full unicode character set is supported for things like prefix operator detection. * Haddock comments have dedicated nodes now. * Use named precedences instead of closely replicating the GHC parser's productions. * Different layouts are tracked and closed with their special cases considered. In particular, multi-way if now has layout. * Fixed CPP bug where mid-line `#endif` would be false positive. * CPP only matches legal directives now. * Generally more lenient parsing than GHC, and in the presence of errors: * Missing closing tokens at EOF are tolerated for: * CPP * Comment * TH Quotation * Multiple semicolons in some positions like `if/then` * Unboxed tuples and sums are allowed to have arbitrary numbers of filled positions * List comprehensions can have multiple sets of qualifiers (`ParallelListComp`). * Deriving clauses after GADTs don't require layout anymore. * Newtype instance heads are working properly now. * Escaping newlines in comments and cpp works now. Escaping newlines on regular lines won't be implemented. * One remaining issue is that qualified left sections that contain infix ops are broken: `(a + a A.+)` I haven't managed to figure out a good strategy for this – my suspicion is that it's impossible to correctly parse application, infix and negation without lexing all qualified names in the scanner. I will try that out at some point, but for now I'm planning to just accept that this one thing doesn't work. For what it's worth, none of the codebases I use for testing contain this construct in a way that breaks parsing. * Repo now includes a Haskell program that generates C code for classifying characters as belonging to some sets of Unicode categories, using bitmaps. I might need to change this to write them all to a shared file, so the set of source files stays the same.
* Parses the GHC codebase! I'm using a trimmed set of the source directories of the compiler and most core libraries in [this repo](https://github.com/tek/tsh-test-ghc). This used to break horribly in many files because explicit brace layouts weren't supported very well. * Faster in most cases! Here are a few simple benchmarks to illustrate the difference, not to be taken _too_ seriously, using the test codebases in `test/libs`: Old: ``` effects: 32ms postgrest: 91ms ivory: 224ms polysemy: 84ms semantic: 1336ms haskell-language-server: 532ms flatparse: 45ms ``` New: ``` effects: 29ms postgrest: 64ms ivory: 178ms polysemy: 70ms semantic: 692ms haskell-language-server: 390ms flatparse: 36ms ``` GHC's `compiler` directory takes 3000ms, but is among the fastest repos for per-line and per-character times! To get more detailed info (including new codebases I added, consisting mostly of core libraries), run `test/parse-libs`. I also added an interface for running `hyperfine`, exposed as a Nix app – execute `nix run .#bench-libs -- stm mtl transformers` with the desired set of libraries in `test/libs` or `test/libs/tsh-test-ghc/libraries`. * Smaller size of the shared object. `tree-sitter generate` produces a `haskell.so` with a size of 4.4MB for the old grammar, and 3.0MB for the new one. * Significantly faster time to generate, and slightly faster build. On my machine, generation takes 9.34s vs 2.85s, and compiling takes 3.75s vs 3.33s. * All terminals now have proper text nodes when possible, like the `.` in modules. Fixes #102, #107, #115 (partially?). * Semicolons are now forced after newlines even if the current parse state doesn't allow them, to fail alternative interpretations in GLR conflicts that sometimes produced top-level expression splices for valid (and invalid) code. Fixes #89, #105, #111. * Comments aren't pulled into preceding layouts anymore. Fixes #82, #109. (Can probably still be improved with a few heuristics for e.g. postfix haddock) * Similarly, whitespace is kept out of layout-related nodes as much as possible. Fixes #74. * Hashes can now be operators in all situations, without sacrificing unboxed tuples. Fixes #108. * Expression quotes are now handled separately from quasiquotes and their contents parsed properly. Fixes #116. * Explicit brace layouts are now handled correctly. Fixes #92. * Function application with multiple block arguments is handled correctly. * Unicode categories for identifiers now match GHC, and the full unicode character set is supported for things like prefix operator detection. * Haddock comments have dedicated nodes now. * Use named precedences instead of closely replicating the GHC parser's productions. * Different layouts are tracked and closed with their special cases considered. In particular, multi-way if now has layout. * Fixed CPP bug where mid-line `#endif` would be false positive. * CPP only matches legal directives now. * Generally more lenient parsing than GHC, and in the presence of errors: * Missing closing tokens at EOF are tolerated for: * CPP * Comment * TH Quotation * Multiple semicolons in some positions like `if/then` * Unboxed tuples and sums are allowed to have arbitrary numbers of filled positions * List comprehensions can have multiple sets of qualifiers (`ParallelListComp`). * Deriving clauses after GADTs don't require layout anymore. * Newtype instance heads are working properly now. * Escaping newlines in comments and cpp works now. Escaping newlines on regular lines won't be implemented. * One remaining issue is that qualified left sections that contain infix ops are broken: `(a + a A.+)` I haven't managed to figure out a good strategy for this – my suspicion is that it's impossible to correctly parse application, infix and negation without lexing all qualified names in the scanner. I will try that out at some point, but for now I'm planning to just accept that this one thing doesn't work. For what it's worth, none of the codebases I use for testing contain this construct in a way that breaks parsing. * Repo now includes a Haskell program that generates C code for classifying characters as belonging to some sets of Unicode categories, using bitmaps. I might need to change this to write them all to a shared file, so the set of source files stays the same.
* Parses the GHC codebase! I'm using a trimmed set of the source directories of the compiler and most core libraries in [this repo](https://github.com/tek/tsh-test-ghc). This used to break horribly in many files because explicit brace layouts weren't supported very well. * Faster in most cases! Here are a few simple benchmarks to illustrate the difference, not to be taken _too_ seriously, using the test codebases in `test/libs`: Old: ``` effects: 32ms postgrest: 91ms ivory: 224ms polysemy: 84ms semantic: 1336ms haskell-language-server: 532ms flatparse: 45ms ``` New: ``` effects: 29ms postgrest: 64ms ivory: 178ms polysemy: 70ms semantic: 692ms haskell-language-server: 390ms flatparse: 36ms ``` GHC's `compiler` directory takes 3000ms, but is among the fastest repos for per-line and per-character times! To get more detailed info (including new codebases I added, consisting mostly of core libraries), run `test/parse-libs`. I also added an interface for running `hyperfine`, exposed as a Nix app – execute `nix run .#bench-libs -- stm mtl transformers` with the desired set of libraries in `test/libs` or `test/libs/tsh-test-ghc/libraries`. * Smaller size of the shared object. `tree-sitter generate` produces a `haskell.so` with a size of 4.4MB for the old grammar, and 3.0MB for the new one. * Significantly faster time to generate, and slightly faster build. On my machine, generation takes 9.34s vs 2.85s, and compiling takes 3.75s vs 3.33s. * All terminals now have proper text nodes when possible, like the `.` in modules. Fixes #102, #107, #115 (partially?). * Semicolons are now forced after newlines even if the current parse state doesn't allow them, to fail alternative interpretations in GLR conflicts that sometimes produced top-level expression splices for valid (and invalid) code. Fixes #89, #105, #111. * Comments aren't pulled into preceding layouts anymore. Fixes #82, #109. (Can probably still be improved with a few heuristics for e.g. postfix haddock) * Similarly, whitespace is kept out of layout-related nodes as much as possible. Fixes #74. * Hashes can now be operators in all situations, without sacrificing unboxed tuples. Fixes #108. * Expression quotes are now handled separately from quasiquotes and their contents parsed properly. Fixes #116. * Explicit brace layouts are now handled correctly. Fixes #92. * Function application with multiple block arguments is handled correctly. * Unicode categories for identifiers now match GHC, and the full unicode character set is supported for things like prefix operator detection. * Haddock comments have dedicated nodes now. * Use named precedences instead of closely replicating the GHC parser's productions. * Different layouts are tracked and closed with their special cases considered. In particular, multi-way if now has layout. * Fixed CPP bug where mid-line `#endif` would be false positive. * CPP only matches legal directives now. * Generally more lenient parsing than GHC, and in the presence of errors: * Missing closing tokens at EOF are tolerated for: * CPP * Comment * TH Quotation * Multiple semicolons in some positions like `if/then` * Unboxed tuples and sums are allowed to have arbitrary numbers of filled positions * List comprehensions can have multiple sets of qualifiers (`ParallelListComp`). * Deriving clauses after GADTs don't require layout anymore. * Newtype instance heads are working properly now. * Escaping newlines in comments and cpp works now. Escaping newlines on regular lines won't be implemented. * One remaining issue is that qualified left sections that contain infix ops are broken: `(a + a A.+)` I haven't managed to figure out a good strategy for this – my suspicion is that it's impossible to correctly parse application, infix and negation without lexing all qualified names in the scanner. I will try that out at some point, but for now I'm planning to just accept that this one thing doesn't work. For what it's worth, none of the codebases I use for testing contain this construct in a way that breaks parsing. * Repo now includes a Haskell program that generates C code for classifying characters as belonging to some sets of Unicode categories, using bitmaps. I might need to change this to write them all to a shared file, so the set of source files stays the same.
* Parses the GHC codebase! I'm using a trimmed set of the source directories of the compiler and most core libraries in [this repo](https://github.com/tek/tsh-test-ghc). This used to break horribly in many files because explicit brace layouts weren't supported very well. * Faster in most cases! Here are a few simple benchmarks to illustrate the difference, not to be taken _too_ seriously, using the test codebases in `test/libs`: Old: ``` effects: 32ms postgrest: 91ms ivory: 224ms polysemy: 84ms semantic: 1336ms haskell-language-server: 532ms flatparse: 45ms ``` New: ``` effects: 29ms postgrest: 64ms ivory: 178ms polysemy: 70ms semantic: 692ms haskell-language-server: 390ms flatparse: 36ms ``` GHC's `compiler` directory takes 3000ms, but is among the fastest repos for per-line and per-character times! To get more detailed info (including new codebases I added, consisting mostly of core libraries), run `test/parse-libs`. I also added an interface for running `hyperfine`, exposed as a Nix app – execute `nix run .#bench-libs -- stm mtl transformers` with the desired set of libraries in `test/libs` or `test/libs/tsh-test-ghc/libraries`. * Smaller size of the shared object. `tree-sitter generate` produces a `haskell.so` with a size of 4.4MB for the old grammar, and 3.0MB for the new one. * Significantly faster time to generate, and slightly faster build. On my machine, generation takes 9.34s vs 2.85s, and compiling takes 3.75s vs 3.33s. * All terminals now have proper text nodes when possible, like the `.` in modules. Fixes #102, #107, #115 (partially?). * Semicolons are now forced after newlines even if the current parse state doesn't allow them, to fail alternative interpretations in GLR conflicts that sometimes produced top-level expression splices for valid (and invalid) code. Fixes #89, #105, #111. * Comments aren't pulled into preceding layouts anymore. Fixes #82, #109. (Can probably still be improved with a few heuristics for e.g. postfix haddock) * Similarly, whitespace is kept out of layout-related nodes as much as possible. Fixes #74. * Hashes can now be operators in all situations, without sacrificing unboxed tuples. Fixes #108. * Expression quotes are now handled separately from quasiquotes and their contents parsed properly. Fixes #116. * Explicit brace layouts are now handled correctly. Fixes #92. * Function application with multiple block arguments is handled correctly. * Unicode categories for identifiers now match GHC, and the full unicode character set is supported for things like prefix operator detection. * Haddock comments have dedicated nodes now. * Use named precedences instead of closely replicating the GHC parser's productions. * Different layouts are tracked and closed with their special cases considered. In particular, multi-way if now has layout. * Fixed CPP bug where mid-line `#endif` would be false positive. * CPP only matches legal directives now. * Generally more lenient parsing than GHC, and in the presence of errors: * Missing closing tokens at EOF are tolerated for: * CPP * Comment * TH Quotation * Multiple semicolons in some positions like `if/then` * Unboxed tuples and sums are allowed to have arbitrary numbers of filled positions * List comprehensions can have multiple sets of qualifiers (`ParallelListComp`). * Deriving clauses after GADTs don't require layout anymore. * Newtype instance heads are working properly now. * Escaping newlines in comments and cpp works now. Escaping newlines on regular lines won't be implemented. * One remaining issue is that qualified left sections that contain infix ops are broken: `(a + a A.+)` I haven't managed to figure out a good strategy for this – my suspicion is that it's impossible to correctly parse application, infix and negation without lexing all qualified names in the scanner. I will try that out at some point, but for now I'm planning to just accept that this one thing doesn't work. For what it's worth, none of the codebases I use for testing contain this construct in a way that breaks parsing. * Repo now includes a Haskell program that generates C code for classifying characters as belonging to some sets of Unicode categories, using bitmaps. I might need to change this to write them all to a shared file, so the set of source files stays the same.
* Parses the GHC codebase! I'm using a trimmed set of the source directories of the compiler and most core libraries in [this repo](https://github.com/tek/tsh-test-ghc). This used to break horribly in many files because explicit brace layouts weren't supported very well. * Faster in most cases! Here are a few simple benchmarks to illustrate the difference, not to be taken _too_ seriously, using the test codebases in `test/libs`: Old: ``` effects: 32ms postgrest: 91ms ivory: 224ms polysemy: 84ms semantic: 1336ms haskell-language-server: 532ms flatparse: 45ms ``` New: ``` effects: 29ms postgrest: 64ms ivory: 178ms polysemy: 70ms semantic: 692ms haskell-language-server: 390ms flatparse: 36ms ``` GHC's `compiler` directory takes 3000ms, but is among the fastest repos for per-line and per-character times! To get more detailed info (including new codebases I added, consisting mostly of core libraries), run `test/parse-libs`. I also added an interface for running `hyperfine`, exposed as a Nix app – execute `nix run .#bench-libs -- stm mtl transformers` with the desired set of libraries in `test/libs` or `test/libs/tsh-test-ghc/libraries`. * Smaller size of the shared object. `tree-sitter generate` produces a `haskell.so` with a size of 4.4MB for the old grammar, and 3.0MB for the new one. * Significantly faster time to generate, and slightly faster build. On my machine, generation takes 9.34s vs 2.85s, and compiling takes 3.75s vs 3.33s. * All terminals now have proper text nodes when possible, like the `.` in modules. Fixes #102, #107, #115 (partially?). * Semicolons are now forced after newlines even if the current parse state doesn't allow them, to fail alternative interpretations in GLR conflicts that sometimes produced top-level expression splices for valid (and invalid) code. Fixes #89, #105, #111. * Comments aren't pulled into preceding layouts anymore. Fixes #82, #109. (Can probably still be improved with a few heuristics for e.g. postfix haddock) * Similarly, whitespace is kept out of layout-related nodes as much as possible. Fixes #74. * Hashes can now be operators in all situations, without sacrificing unboxed tuples. Fixes #108. * Expression quotes are now handled separately from quasiquotes and their contents parsed properly. Fixes #116. * Explicit brace layouts are now handled correctly. Fixes #92. * Function application with multiple block arguments is handled correctly. * Unicode categories for identifiers now match GHC, and the full unicode character set is supported for things like prefix operator detection. * Haddock comments have dedicated nodes now. * Use named precedences instead of closely replicating the GHC parser's productions. * Different layouts are tracked and closed with their special cases considered. In particular, multi-way if now has layout. * Fixed CPP bug where mid-line `#endif` would be false positive. * CPP only matches legal directives now. * Generally more lenient parsing than GHC, and in the presence of errors: * Missing closing tokens at EOF are tolerated for: * CPP * Comment * TH Quotation * Multiple semicolons in some positions like `if/then` * Unboxed tuples and sums are allowed to have arbitrary numbers of filled positions * List comprehensions can have multiple sets of qualifiers (`ParallelListComp`). * Deriving clauses after GADTs don't require layout anymore. * Newtype instance heads are working properly now. * Escaping newlines in comments and cpp works now. Escaping newlines on regular lines won't be implemented. * One remaining issue is that qualified left sections that contain infix ops are broken: `(a + a A.+)` I haven't managed to figure out a good strategy for this – my suspicion is that it's impossible to correctly parse application, infix and negation without lexing all qualified names in the scanner. I will try that out at some point, but for now I'm planning to just accept that this one thing doesn't work. For what it's worth, none of the codebases I use for testing contain this construct in a way that breaks parsing. * Repo now includes a Haskell program that generates C code for classifying characters as belonging to some sets of Unicode categories, using bitmaps. I might need to change this to write them all to a shared file, so the set of source files stays the same.
* Parses the GHC codebase! I'm using a trimmed set of the source directories of the compiler and most core libraries in [this repo](https://github.com/tek/tsh-test-ghc). This used to break horribly in many files because explicit brace layouts weren't supported very well. * Faster in most cases! Here are a few simple benchmarks to illustrate the difference, not to be taken _too_ seriously, using the test codebases in `test/libs`: Old: ``` effects: 32ms postgrest: 91ms ivory: 224ms polysemy: 84ms semantic: 1336ms haskell-language-server: 532ms flatparse: 45ms ``` New: ``` effects: 29ms postgrest: 64ms ivory: 178ms polysemy: 70ms semantic: 692ms haskell-language-server: 390ms flatparse: 36ms ``` GHC's `compiler` directory takes 3000ms, but is among the fastest repos for per-line and per-character times! To get more detailed info (including new codebases I added, consisting mostly of core libraries), run `test/parse-libs`. I also added an interface for running `hyperfine`, exposed as a Nix app – execute `nix run .#bench-libs -- stm mtl transformers` with the desired set of libraries in `test/libs` or `test/libs/tsh-test-ghc/libraries`. * Smaller size of the shared object. `tree-sitter generate` produces a `haskell.so` with a size of 4.4MB for the old grammar, and 3.0MB for the new one. * Significantly faster time to generate, and slightly faster build. On my machine, generation takes 9.34s vs 2.85s, and compiling takes 3.75s vs 3.33s. * All terminals now have proper text nodes when possible, like the `.` in modules. Fixes #102, #107, #115 (partially?). * Semicolons are now forced after newlines even if the current parse state doesn't allow them, to fail alternative interpretations in GLR conflicts that sometimes produced top-level expression splices for valid (and invalid) code. Fixes #89, #105, #111. * Comments aren't pulled into preceding layouts anymore. Fixes #82, #109. (Can probably still be improved with a few heuristics for e.g. postfix haddock) * Similarly, whitespace is kept out of layout-related nodes as much as possible. Fixes #74. * Hashes can now be operators in all situations, without sacrificing unboxed tuples. Fixes #108. * Expression quotes are now handled separately from quasiquotes and their contents parsed properly. Fixes #116. * Explicit brace layouts are now handled correctly. Fixes #92. * Function application with multiple block arguments is handled correctly. * Unicode categories for identifiers now match GHC, and the full unicode character set is supported for things like prefix operator detection. * Haddock comments have dedicated nodes now. * Use named precedences instead of closely replicating the GHC parser's productions. * Different layouts are tracked and closed with their special cases considered. In particular, multi-way if now has layout. * Fixed CPP bug where mid-line `#endif` would be false positive. * CPP only matches legal directives now. * Generally more lenient parsing than GHC, and in the presence of errors: * Missing closing tokens at EOF are tolerated for: * CPP * Comment * TH Quotation * Multiple semicolons in some positions like `if/then` * Unboxed tuples and sums are allowed to have arbitrary numbers of filled positions * List comprehensions can have multiple sets of qualifiers (`ParallelListComp`). * Deriving clauses after GADTs don't require layout anymore. * Newtype instance heads are working properly now. * Escaping newlines in comments and cpp works now. Escaping newlines on regular lines won't be implemented. * One remaining issue is that qualified left sections that contain infix ops are broken: `(a + a A.+)` I haven't managed to figure out a good strategy for this – my suspicion is that it's impossible to correctly parse application, infix and negation without lexing all qualified names in the scanner. I will try that out at some point, but for now I'm planning to just accept that this one thing doesn't work. For what it's worth, none of the codebases I use for testing contain this construct in a way that breaks parsing. * Repo now includes a Haskell program that generates C code for classifying characters as belonging to some sets of Unicode categories, using bitmaps. I might need to change this to write them all to a shared file, so the set of source files stays the same.
* Parses the GHC codebase! I'm using a trimmed set of the source directories of the compiler and most core libraries in [this repo](https://github.com/tek/tsh-test-ghc). This used to break horribly in many files because explicit brace layouts weren't supported very well. * Faster in most cases! Here are a few simple benchmarks to illustrate the difference, not to be taken _too_ seriously, using the test codebases in `test/libs`: Old: ``` effects: 32ms postgrest: 91ms ivory: 224ms polysemy: 84ms semantic: 1336ms haskell-language-server: 532ms flatparse: 45ms ``` New: ``` effects: 29ms postgrest: 64ms ivory: 178ms polysemy: 70ms semantic: 692ms haskell-language-server: 390ms flatparse: 36ms ``` GHC's `compiler` directory takes 3000ms, but is among the fastest repos for per-line and per-character times! To get more detailed info (including new codebases I added, consisting mostly of core libraries), run `test/parse-libs`. I also added an interface for running `hyperfine`, exposed as a Nix app – execute `nix run .#bench-libs -- stm mtl transformers` with the desired set of libraries in `test/libs` or `test/libs/tsh-test-ghc/libraries`. * Smaller size of the shared object. `tree-sitter generate` produces a `haskell.so` with a size of 4.4MB for the old grammar, and 3.0MB for the new one. * Significantly faster time to generate, and slightly faster build. On my machine, generation takes 9.34s vs 2.85s, and compiling takes 3.75s vs 3.33s. * All terminals now have proper text nodes when possible, like the `.` in modules. Fixes #102, #107, #115 (partially?). * Semicolons are now forced after newlines even if the current parse state doesn't allow them, to fail alternative interpretations in GLR conflicts that sometimes produced top-level expression splices for valid (and invalid) code. Fixes #89, #105, #111. * Comments aren't pulled into preceding layouts anymore. Fixes #82, #109. (Can probably still be improved with a few heuristics for e.g. postfix haddock) * Similarly, whitespace is kept out of layout-related nodes as much as possible. Fixes #74. * Hashes can now be operators in all situations, without sacrificing unboxed tuples. Fixes #108. * Expression quotes are now handled separately from quasiquotes and their contents parsed properly. Fixes #116. * Explicit brace layouts are now handled correctly. Fixes #92. * Function application with multiple block arguments is handled correctly. * Unicode categories for identifiers now match GHC, and the full unicode character set is supported for things like prefix operator detection. * Haddock comments have dedicated nodes now. * Use named precedences instead of closely replicating the GHC parser's productions. * Different layouts are tracked and closed with their special cases considered. In particular, multi-way if now has layout. * Fixed CPP bug where mid-line `#endif` would be false positive. * CPP only matches legal directives now. * Generally more lenient parsing than GHC, and in the presence of errors: * Missing closing tokens at EOF are tolerated for: * CPP * Comment * TH Quotation * Multiple semicolons in some positions like `if/then` * Unboxed tuples and sums are allowed to have arbitrary numbers of filled positions * List comprehensions can have multiple sets of qualifiers (`ParallelListComp`). * Deriving clauses after GADTs don't require layout anymore. * Newtype instance heads are working properly now. * Escaping newlines in comments and cpp works now. Escaping newlines on regular lines won't be implemented. * One remaining issue is that qualified left sections that contain infix ops are broken: `(a + a A.+)` I haven't managed to figure out a good strategy for this – my suspicion is that it's impossible to correctly parse application, infix and negation without lexing all qualified names in the scanner. I will try that out at some point, but for now I'm planning to just accept that this one thing doesn't work. For what it's worth, none of the codebases I use for testing contain this construct in a way that breaks parsing. * Repo now includes a Haskell program that generates C code for classifying characters as belonging to some sets of Unicode categories, using bitmaps. I might need to change this to write them all to a shared file, so the set of source files stays the same.
* Parses the GHC codebase! I'm using a trimmed set of the source directories of the compiler and most core libraries in [this repo](https://github.com/tek/tsh-test-ghc). This used to break horribly in many files because explicit brace layouts weren't supported very well. * Faster in most cases! Here are a few simple benchmarks to illustrate the difference, not to be taken _too_ seriously, using the test codebases in `test/libs`: Old: ``` effects: 32ms postgrest: 91ms ivory: 224ms polysemy: 84ms semantic: 1336ms haskell-language-server: 532ms flatparse: 45ms ``` New: ``` effects: 29ms postgrest: 64ms ivory: 178ms polysemy: 70ms semantic: 692ms haskell-language-server: 390ms flatparse: 36ms ``` GHC's `compiler` directory takes 3000ms, but is among the fastest repos for per-line and per-character times! To get more detailed info (including new codebases I added, consisting mostly of core libraries), run `test/parse-libs`. I also added an interface for running `hyperfine`, exposed as a Nix app – execute `nix run .#bench-libs -- stm mtl transformers` with the desired set of libraries in `test/libs` or `test/libs/tsh-test-ghc/libraries`. * Smaller size of the shared object. `tree-sitter generate` produces a `haskell.so` with a size of 4.4MB for the old grammar, and 3.0MB for the new one. * Significantly faster time to generate, and slightly faster build. On my machine, generation takes 9.34s vs 2.85s, and compiling takes 3.75s vs 3.33s. * All terminals now have proper text nodes when possible, like the `.` in modules. Fixes #102, #107, #115 (partially?). * Semicolons are now forced after newlines even if the current parse state doesn't allow them, to fail alternative interpretations in GLR conflicts that sometimes produced top-level expression splices for valid (and invalid) code. Fixes #89, #105, #111. * Comments aren't pulled into preceding layouts anymore. Fixes #82, #109. (Can probably still be improved with a few heuristics for e.g. postfix haddock) * Similarly, whitespace is kept out of layout-related nodes as much as possible. Fixes #74. * Hashes can now be operators in all situations, without sacrificing unboxed tuples. Fixes #108. * Expression quotes are now handled separately from quasiquotes and their contents parsed properly. Fixes #116. * Explicit brace layouts are now handled correctly. Fixes #92. * Function application with multiple block arguments is handled correctly. * Unicode categories for identifiers now match GHC, and the full unicode character set is supported for things like prefix operator detection. * Haddock comments have dedicated nodes now. * Use named precedences instead of closely replicating the GHC parser's productions. * Different layouts are tracked and closed with their special cases considered. In particular, multi-way if now has layout. * Fixed CPP bug where mid-line `#endif` would be false positive. * CPP only matches legal directives now. * Generally more lenient parsing than GHC, and in the presence of errors: * Missing closing tokens at EOF are tolerated for: * CPP * Comment * TH Quotation * Multiple semicolons in some positions like `if/then` * Unboxed tuples and sums are allowed to have arbitrary numbers of filled positions * List comprehensions can have multiple sets of qualifiers (`ParallelListComp`). * Deriving clauses after GADTs don't require layout anymore. * Newtype instance heads are working properly now. * Escaping newlines in comments and cpp works now. Escaping newlines on regular lines won't be implemented. * One remaining issue is that qualified left sections that contain infix ops are broken: `(a + a A.+)` I haven't managed to figure out a good strategy for this – my suspicion is that it's impossible to correctly parse application, infix and negation without lexing all qualified names in the scanner. I will try that out at some point, but for now I'm planning to just accept that this one thing doesn't work. For what it's worth, none of the codebases I use for testing contain this construct in a way that breaks parsing. * Repo now includes a Haskell program that generates C code for classifying characters as belonging to some sets of Unicode categories, using bitmaps. I might need to change this to write them all to a shared file, so the set of source files stays the same.
GHC with
TemplateHaskell
allows typed quotations and splices, writtenand
or just
Looks like the tree-sitter parser currently handles this incorrectly, breaking syntax highlighting in rather substantial ways for files containing these constructs.
I looked briefly whether there's a very quick fix, but unfortunately it looks like this requires adapting the scanner.
The text was updated successfully, but these errors were encountered: