Incorrect parse due to top-level splices #111

Open
wenkokke opened this issue Jan 24, 2024 · 13 comments

Comments

@wenkokke
Contributor

wenkokke commented Jan 24, 2024

The following piece of Haskell code is SOMETIMES incorrectly parsed as containing a series of top-level splices.

id :: a -> a
id x = x

const x y = x

fib :: Integer -> Integer
fib 0 = 0
fib 1 = 1
fib n = fib (n-1) + fib (n-2)

Specifically, the common parse is as follows:

(haskell
  (signature
    name: (variable)
    type: "::"
    type: (fun
      (type_name
        (type_variable)
      )
      "->"
      (type_name
        (type_variable)
      )
    )
  )
  (function
    name: (variable)
    patterns: (patterns
      (pat_name
        (variable)
      )
    )
    "="
    rhs: (exp_name
      (variable)
    )
  )
  (function
    name: (variable)
    patterns: (patterns
      (pat_name
        (variable)
      )
      (pat_name
        (variable)
      )
    )
    "="
    rhs: (exp_name
      (variable)
    )
  )
  (signature
    name: (variable)
    type: "::"
    type: (fun
      (type_name
        (type)
      )
      "->"
      (type_name
        (type)
      )
    )
  )
  (function
    name: (variable)
    patterns: (patterns
      (pat_literal
        (integer)
      )
    )
    "="
    rhs: (exp_literal
      (integer)
    )
  )
  (function
    name: (variable)
    patterns: (patterns
      (pat_literal
        (integer)
      )
    )
    "="
    rhs: (exp_literal
      (integer)
    )
  )
  (function
    name: (variable)
    patterns: (patterns
      (pat_name
        (variable)
      )
    )
    "="
    rhs: (exp_infix
      (exp_apply
        (exp_name
          (variable)
        )
        (exp_parens
          "("
          (exp_infix
            (exp_name
              (variable)
            )
            (operator)
            (exp_literal
              (integer)
            )
          )
          ")"
        )
      )
      (operator)
      (exp_apply
        (exp_name
          (variable)
        )
        (exp_parens
          "("
          (exp_infix
            (exp_name
              (variable)
            )
            (operator)
            (exp_literal
              (integer)
            )
          )
          ")"
        )
      )
    )
  )
)

However, in some circumstances it is parsed to the following very incorrect syntax tree:

(haskell
  (top_splice
    (exp_name
      (variable)
    )
  )
  (signature
    name: (variable)
    type: "::"
    type: (fun
      (type_name
        (type_variable)
      )
      "->"
      (type_name
        (type_variable)
      )
    )
  )
  (top_splice
    (exp_name
      (variable)
    )
  )
  (function
    name: (variable)
    "="
    rhs: (exp_name
      (variable)
    )
  )
  (function
    name: (variable)
    patterns: (patterns
      (pat_name
        (variable)
      )
      (pat_name
        (variable)
      )
    )
    "="
    rhs: (exp_name
      (variable)
    )
  )
  (top_splice
    (exp_name
      (variable)
    )
  )
  (signature
    name: (variable)
    type: "::"
    type: (fun
      (type_name
        (type)
      )
      "->"
      (type_name
        (type)
      )
    )
  )
  (top_splice
    (exp_name
      (variable)
    )
  )
  (function
    pattern: (pat_literal
      (integer)
    )
    "="
    rhs: (exp_literal
      (integer)
    )
  )
  (top_splice
    (exp_name
      (variable)
    )
  )
  (function
    pattern: (pat_literal
      (integer)
    )
    "="
    rhs: (exp_literal
      (integer)
    )
  )
  (top_splice
    (exp_name
      (variable)
    )
  )
  (function
    name: (variable)
    "="
    rhs: (exp_infix
      (exp_apply
        (exp_name
          (variable)
        )
        (exp_parens
          "("
          (exp_infix
            (exp_name
              (variable)
            )
            (operator)
            (exp_literal
              (integer)
            )
          )
          ")"
        )
      )
      (operator)
      (exp_apply
        (exp_name
          (variable)
        )
        (exp_parens
          "("
          (exp_infix
            (exp_name
              (variable)
            )
            (operator)
            (exp_literal
              (integer)
            )
          )
          ")"
        )
      )
    )
  )
)

This bug is showing up in the test suite for the Haskell support for cursorless-dev/cursorless, and @pokey and I are working to determine exactly when the bug shows up. However, even if we cannot nail down exactly when the bug shows up, it might be worthwhile ruling the second parse out entirely. Perhaps by anchoring top-level function definitions to the first column, or forcing a line-break after a top-level splice?

@tek
Contributor

tek commented Jan 24, 2024

I'm working on something for this, should be ready in a week or two

@wenkokke
Contributor Author

> I'm working on something for this, should be ready in a week or two

Wow, that was a quick response. Amazing!

@wenkokke
Contributor Author

@tek Do you know why it happens?

@tek
Contributor

tek commented Jan 24, 2024

> Wow, that was a quick response. Amazing!

😅 I happened to refresh the notifications page right when you posted

> @tek Do you know why it happens?

Because there are many declared conflicts in the grammar, which cause tree-sitter to run two (or more) parallel attempts at parsing a sequence, one of which doesn't contain the _layout_semicolon symbol that the scanner emits when indentation should force a new decl.
The choice is then made based on dynamic precedence scores, which aren't well balanced in the grammar (and a bit mysterious).
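
To picture what the layout semicolon contributes, here is a minimal sketch in JavaScript. It is a toy model and not the scanner from tree-sitter-haskell: `layoutSemicolons` is a hypothetical helper, and it assumes the only layout in play is the top-level one at column 0. Under that assumption, each non-blank line that starts in column 0 after the first one begins a new declaration, so a `;` token is placed in front of it.

```javascript
// Toy model of the Haskell layout rule at the top level only.
// Assumption: the top-level layout context is column 0, so every non-blank
// line after the first that starts in column 0 begins a new declaration.
// `layoutSemicolons` is a hypothetical helper, not part of tree-sitter-haskell.
function layoutSemicolons(source) {
  const tokens = [];
  let seenDecl = false;
  for (const line of source.split("\n")) {
    if (line.trim() === "") continue; // blank lines carry no layout information
    const indent = line.length - line.trimStart().length;
    // Same column as the enclosing layout: separate from the previous decl.
    if (indent === 0 && seenDecl) tokens.push(";");
    tokens.push(line.trim());
    seenDecl = true;
  }
  return tokens;
}

const tokens = layoutSemicolons("id :: a -> a\nid x = x\n\nconst x y = x\n");
console.log(tokens.join(" "));
// prints "id :: a -> a ; id x = x ; const x y = x"
```

A parse branch that never sees these `;` tokens has nothing forcing a declaration boundary at the newline, which is the branch described above as producing the splices.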

@wenkokke
Contributor Author

wenkokke commented Feb 8, 2024

@tek Any updates on this?

@tek
Contributor

tek commented Feb 8, 2024

still a lot of work to do to get this to release, but I've got a week off so the chances are good 😁

@wenkokke
Contributor Author

any suggestions for workarounds I could use in the meantime?

@tek
Contributor

tek commented Feb 20, 2024

other than using explicit semicolons, don't think it's that simple 😕

@wenkokke
Contributor Author

my current idea for a workaround is to fork and remove splices entirely, and use that fork in the meantime

@tek
Contributor

tek commented Feb 20, 2024

oh, you could try giving the top_splice rule a penalty:

top_splice: $ => prec.dynamic(-100, $._exp_infix),

not sure if that isn't too late in the reduction chain though.
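
As a rough illustration of why a penalty could help, the sketch below models how an ambiguity is settled by dynamic precedence. The tree-sitter documentation for `prec.dynamic` says that the candidate with the highest total dynamic precedence is selected. The node shapes, the helper names `totalDynamicPrecedence` and `pickParse`, and the score table are hypothetical; only `top_splice` and the -100 value come from the suggestion above.

```javascript
// Toy model: each candidate tree is scored by summing the dynamic precedence
// of its nodes, and the candidate with the highest total is kept.
// Node shapes and helper names are hypothetical, for illustration only.
function totalDynamicPrecedence(node, scores) {
  const own = scores[node.type] ?? 0;
  const children = node.children ?? [];
  return children.reduce(
    (sum, child) => sum + totalDynamicPrecedence(child, scores),
    own
  );
}

function pickParse(candidates, scores) {
  return candidates.reduce((best, tree) =>
    totalDynamicPrecedence(tree, scores) > totalDynamicPrecedence(best, scores)
      ? tree
      : best
  );
}

// Two readings of `id x = x`: a single function, or a splice then a function.
const asFunction = { type: "haskell", children: [{ type: "function" }] };
const asSplice = {
  type: "haskell",
  children: [{ type: "top_splice" }, { type: "function" }],
};

// The penalty from the suggestion above; every other node scores 0 here.
const penalised = { top_splice: -100 };
console.log(pickParse([asSplice, asFunction], penalised) === asFunction); // prints true
```

This model compares finished trees only. It does not capture the caveat raised above, that in the real parser the penalty may be applied too late in the reduction chain to affect the choice.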

@tek
Contributor

tek commented Mar 15, 2024

@wenkokke prerelease for you to testdrive: https://github.com/tek/tree-sitter-haskell

tek added a commit that referenced this issue Mar 24, 2024
* Parses the GHC codebase!

  I'm using a trimmed set of the source directories of the compiler and most core libraries in
  [this repo](https://github.com/tek/tsh-test-ghc).

  This used to break horribly in many files because explicit brace layouts weren't supported very well.

* Faster in most cases!
  Here are a few simple benchmarks to illustrate the difference, not to be taken _too_ seriously, using the test
  codebases in `test/libs`:

  Old:
  ```
  effects: 32ms
  postgrest: 91ms
  ivory: 224ms
  polysemy: 84ms
  semantic: 1336ms
  haskell-language-server: 532ms
  flatparse: 45ms
  ```

  New:
  ```
  effects: 29ms
  postgrest: 64ms
  ivory: 178ms
  polysemy: 70ms
  semantic: 692ms
  haskell-language-server: 390ms
  flatparse: 36ms
  ```

  GHC's `compiler` directory takes 3000ms, but is among the fastest repos for per-line and per-character times!
  To get more detailed info (including new codebases I added, consisting mostly of core libraries), run
  `test/parse-libs`.
  I also added an interface for running `hyperfine`, exposed as a Nix app – execute
  `nix run .#bench-libs -- stm mtl transformers` with the desired set of libraries in `test/libs` or
  `test/libs/tsh-test-ghc/libraries`.

* Smaller size of the shared object.

  `tree-sitter generate` produces a `haskell.so` with a size of 4.4MB for the old grammar, and 3.0MB for the new one.

* Significantly faster time to generate, and slightly faster build.

  On my machine, generation takes 9.34s vs 2.85s, and compiling takes 3.75s vs 3.33s.

* All terminals now have proper text nodes when possible, like the `.` in modules.
  Fixes #102, #107, #115 (partially?).

* Semicolons are now forced after newlines even if the current parse state doesn't allow them, to fail alternative
  interpretations in GLR conflicts that sometimes produced top-level expression splices for valid (and invalid) code.
  Fixes #89, #105, #111.

* Comments aren't pulled into preceding layouts anymore.
  Fixes #82, #109.
  (Can probably still be improved with a few heuristics for e.g. postfix haddock)

* Similarly, whitespace is kept out of layout-related nodes as much as possible.
  Fixes #74.

* Hashes can now be operators in all situations, without sacrificing unboxed tuples.
  Fixes #108.

* Expression quotes are now handled separately from quasiquotes and their contents parsed properly.
  Fixes #116.

* Explicit brace layouts are now handled correctly.
  Fixes #92.

* Function application with multiple block arguments is handled correctly.

* Unicode categories for identifiers now match GHC, and the full unicode character set is supported for things like
  prefix operator detection.

* Haddock comments have dedicated nodes now.

* Use named precedences instead of closely replicating the GHC parser's productions.

* Different layouts are tracked and closed with their special cases considered.
  In particular, multi-way if now has layout.

* Fixed CPP bug where mid-line `#endif` would be a false positive.

* CPP only matches legal directives now.

* Generally more lenient parsing than GHC, and in the presence of errors:
  * Missing closing tokens at EOF are tolerated for:
    * CPP
    * Comment
    * TH Quotation
  * Multiple semicolons in some positions like `if/then`
  * Unboxed tuples and sums are allowed to have arbitrary numbers of filled positions

* List comprehensions can have multiple sets of qualifiers (`ParallelListComp`).

* Deriving clauses after GADTs don't require layout anymore.

* Newtype instance heads are working properly now.

* Escaping newlines in comments and cpp works now.
  Escaping newlines on regular lines won't be implemented.

* One remaining issue is that qualified left sections that contain infix ops are broken: `(a + a A.+)`
  I haven't managed to figure out a good strategy for this – my suspicion is that it's impossible to correctly parse
  application, infix and negation without lexing all qualified names in the scanner.
  I will try that out at some point, but for now I'm planning to just accept that this one thing doesn't work.
  For what it's worth, none of the codebases I use for testing contain this construct in a way that breaks parsing.

* Repo now includes a Haskell program that generates C code for classifying characters as belonging to some sets of
  Unicode categories, using bitmaps.
  I might need to change this to write them all to a shared file, so the set of source files stays the same.
tek added a commit that referenced this issue Mar 30, 2024
tek added a commit that referenced this issue Mar 30, 2024
tek added a commit that referenced this issue Mar 30, 2024
tek added a commit that referenced this issue Apr 2, 2024
tek added a commit that referenced this issue Apr 2, 2024
* Parses the GHC codebase!

  I'm using a trimmed set of the source directories of the compiler and most core libraries in
  [this repo](https://github.com/tek/tsh-test-ghc).

  This used to break horribly in many files because explicit brace layouts weren't supported very well.

* Faster in most cases!
  Here are a few simple benchmarks to illustrate the difference, not to be taken _too_ seriously, using the test
  codebases in `test/libs`:

  Old:
  ```
  effects: 32ms
  postgrest: 91ms
  ivory: 224ms
  polysemy: 84ms
  semantic: 1336ms
  haskell-language-server: 532ms
  flatparse: 45ms
  ```

  New:
  ```
  effects: 29ms
  postgrest: 64ms
  ivory: 178ms
  polysemy: 70ms
  semantic: 692ms
  haskell-language-server: 390ms
  flatparse: 36ms
  ```

  GHC's `compiler` directory takes 3000ms, but is among the fastest repos for per-line and per-character times!
  To get more detailed info (including new codebases I added, consisting mostly of core libraries), run
  `test/parse-libs`.
  I also added an interface for running `hyperfine`, exposed as a Nix app – execute
  `nix run .#bench-libs -- stm mtl transformers` with the desired set of libraries in `test/libs` or
  `test/libs/tsh-test-ghc/libraries`.

* Smaller size of the shared object.

  `tree-sitter generate` produces a `haskell.so` with a size of 4.4MB for the old grammar, and 3.0MB for the new one.

* Significantly faster time to generate, and slightly faster build.

  On my machine, generation takes 9.34s vs 2.85s, and compiling takes 3.75s vs 3.33s.

* All terminals now have proper text nodes when possible, like the `.` in modules.
  Fixes #102, #107, #115 (partially?).

* Semicolons are now forced after newlines even if the current parse state doesn't allow them, to fail alternative
  interpretations in GLR conflicts that sometimes produced top-level expression splices for valid (and invalid) code.
  Fixes #89, #105, #111.

* Comments aren't pulled into preceding layouts anymore.
  Fixes #82, #109.
  (Can probably still be improved with a few heuristics for e.g. postfix haddock)

* Similarly, whitespace is kept out of layout-related nodes as much as possible.
  Fixes #74.

* Hashes can now be operators in all situations, without sacrificing unboxed tuples.
  Fixes #108.

* Expression quotes are now handled separately from quasiquotes, and their contents are parsed properly.
  Fixes #116.

* Explicit brace layouts are now handled correctly.
  Fixes #92.

* Function application with multiple block arguments is handled correctly.

* Unicode categories for identifiers now match GHC, and the full Unicode character set is supported for things like
  prefix operator detection.

* Haddock comments have dedicated nodes now.

* Use named precedences instead of closely replicating the GHC parser's productions.

* Different layouts are tracked and closed with their special cases considered.
  In particular, multi-way if now has layout.

* Fixed a CPP bug where a mid-line `#endif` would be a false positive.

* CPP only matches legal directives now.

* Generally more lenient parsing than GHC, and in the presence of errors:
  * Missing closing tokens at EOF are tolerated for:
    * CPP
    * Comment
    * TH Quotation
  * Multiple semicolons in some positions like `if/then`
  * Unboxed tuples and sums are allowed to have arbitrary numbers of filled positions

* List comprehensions can have multiple sets of qualifiers (`ParallelListComp`).

* Deriving clauses after GADTs don't require layout anymore.

* Newtype instance heads are working properly now.

* Escaping newlines in comments and CPP works now.
  Escaping newlines on regular lines won't be implemented.

* One remaining issue is that qualified left sections that contain infix ops are broken: `(a + a A.+)`.
  I haven't managed to figure out a good strategy for this – my suspicion is that it's impossible to correctly parse
  application, infix and negation without lexing all qualified names in the scanner.
  I will try that out at some point, but for now I'm planning to just accept that this one thing doesn't work.
  For what it's worth, none of the codebases I use for testing contain this construct in a way that breaks parsing.

* Repo now includes a Haskell program that generates C code for classifying characters as belonging to some sets of
  Unicode categories, using bitmaps.
  I might need to change this to write them all to a shared file, so the set of source files stays the same.
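
The bitmap approach mentioned in the last item can be illustrated with a minimal sketch. This is not the generator from the repo: the category set (`LowercaseLetter`), the `0x400` code point limit, and the names `inSet`, `bitmap`, `member`, `renderC` and `lower_bitmap` are all assumptions made up for illustration. It packs set membership into 64-bit words, renders them as a C array, and mirrors the lookup in Haskell.

```haskell
import Data.Bits (setBit, shiftR, testBit, (.&.))
import Data.Char (GeneralCategory (..), chr, generalCategory, ord)
import Data.List (foldl', intercalate)
import Data.Word (Word64)
import Numeric (showHex)

-- Hypothetical category set: lowercase letters only.
inSet :: Char -> Bool
inSet c = generalCategory c == LowercaseLetter

-- Number of code points covered by this sketch (an assumption; a real table covers far more).
limit :: Int
limit = 0x400

-- One Word64 per 64 code points: bit i of word w is set iff code point 64*w+i is in the set.
bitmap :: [Word64]
bitmap =
  [ foldl' setBit 0 [i | i <- [0 .. 63], inSet (chr (64 * w + i))]
  | w <- [0 .. limit `div` 64 - 1]
  ]

-- Lookup in the style generated C code could use: pick the word, then test the bit.
member :: Char -> Bool
member c
  | n >= limit = False
  | otherwise = testBit (bitmap !! (n `shiftR` 6)) (n .&. 63)
  where
    n = ord c

-- Render the bitmap as a C array definition (the array name is made up).
renderC :: String
renderC =
  "static const uint64_t lower_bitmap[] = {\n  "
    ++ intercalate ",\n  " ["0x" ++ showHex w "" ++ "ULL" | w <- bitmap]
    ++ "\n};"

main :: IO ()
main = do
  putStrLn renderC
  -- Check a lowercase ASCII letter, an uppercase one, and Greek lambda (U+03BB).
  print (map member "aZ\955")
```

A plain list with `!!` keeps the sketch free of extra dependencies; generated C would index the array directly, which is where the constant-time lookup comes from.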
tek added a commit that referenced this issue May 4, 2024
tek added a commit that referenced this issue May 4, 2024
tek added a commit that referenced this issue May 4, 2024
tek added a commit that referenced this issue May 4, 2024
amaanq pushed a commit that referenced this issue May 4, 2024
tek added a commit that referenced this issue May 4, 2024
amaanq pushed a commit that referenced this issue May 5, 2024
* Parses the GHC codebase!

  I'm using a trimmed set of the source directories of the compiler and most core libraries in
  [this repo](https://github.com/tek/tsh-test-ghc).

  This used to break horribly in many files because explicit brace layouts weren't supported very well.

* Faster in most cases!
  Here are a few simple benchmarks to illustrate the difference, not to be taken _too_ seriously, using the test
  codebases in `test/libs`:

  Old:
  ```
  effects: 32ms
  postgrest: 91ms
  ivory: 224ms
  polysemy: 84ms
  semantic: 1336ms
  haskell-language-server: 532ms
  flatparse: 45ms
  ```

  New:
  ```
  effects: 29ms
  postgrest: 64ms
  ivory: 178ms
  polysemy: 70ms
  semantic: 692ms
  haskell-language-server: 390ms
  flatparse: 36ms
  ```

  GHC's `compiler` directory takes 3000ms, but is among the fastest repos for per-line and per-character times!
  To get more detailed info (including new codebases I added, consisting mostly of core libraries), run
  `test/parse-libs`.
  I also added an interface for running `hyperfine`, exposed as a Nix app – execute
  `nix run .#bench-libs -- stm mtl transformers` with the desired set of libraries in `test/libs` or
  `test/libs/tsh-test-ghc/libraries`.

* Smaller size of the shared object.

  `tree-sitter generate` produces a `haskell.so` with a size of 4.4MB for the old grammar, and 3.0MB for the new one.

* Significantly faster time to generate, and slightly faster build.

  On my machine, generation takes 9.34s vs 2.85s, and compiling takes 3.75s vs 3.33s.

* All terminals now have proper text nodes when possible, like the `.` in modules.
  Fixes #102, #107, #115 (partially?).

* Semicolons are now forced after newlines even if the current parse state doesn't allow them, to fail alternative
  interpretations in GLR conflicts that sometimes produced top-level expression splices for valid (and invalid) code.
  Fixes #89, #105, #111.

* Comments aren't pulled into preceding layouts anymore.
  Fixes #82, #109.
  (Can probably still be improved with a few heuristics for e.g. postfix haddock)

* Similarly, whitespace is kept out of layout-related nodes as much as possible.
  Fixes #74.

* Hashes can now be operators in all situations, without sacrificing unboxed tuples.
  Fixes #108.

* Expression quotes are now handled separately from quasiquotes and their contents parsed properly.
  Fixes #116.

* Explicit brace layouts are now handled correctly.
  Fixes #92.

* Function application with multiple block arguments is handled correctly.

* Unicode categories for identifiers now match GHC, and the full unicode character set is supported for things like
  prefix operator detection.

* Haddock comments have dedicated nodes now.

* Use named precedences instead of closely replicating the GHC parser's productions.

* Different layouts are tracked and closed with their special cases considered.
  In particular, multi-way if now has layout.

* Fixed a CPP bug where a mid-line `#endif` would be a false positive.

* CPP only matches legal directives now.

* Parsing is generally more lenient than GHC, including in the presence of errors:
  * Missing closing tokens at EOF are tolerated for:
    * CPP
    * Comments
    * TH quotations
  * Multiple semicolons are tolerated in some positions, like `if/then`
  * Unboxed tuples and sums are allowed to have arbitrary numbers of filled positions

* List comprehensions can have multiple sets of qualifiers (`ParallelListComp`).

* Deriving clauses after GADTs don't require layout anymore.

* Newtype instance heads are working properly now.

* Escaping newlines in comments and CPP works now.
  Escaping newlines on regular lines won't be implemented.

* One remaining issue is that qualified left sections that contain infix ops are broken, e.g. `(a + a A.+)`.
  I haven't managed to figure out a good strategy for this – my suspicion is that it's impossible to correctly parse
  application, infix and negation without lexing all qualified names in the scanner.
  I will try that out at some point, but for now I'm planning to just accept that this one thing doesn't work.
  For what it's worth, none of the codebases I use for testing contain this construct in a way that breaks parsing.

* The repo now includes a Haskell program that generates C code for classifying characters as belonging to some sets of
  Unicode categories, using bitmaps.
  I might need to change this to write them all to a shared file, so the set of source files stays the same.
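
For readers who want to see what a few of the constructs named in the list above look like in source, here is a small, self-contained module. It is only an illustrative sketch: all names in it (`labelled`, `describe`, `inOrder`, `addTwice`) are made up for this example and are not taken from the repository's tests. It covers a parallel list comprehension, a nested multi-way `if` that relies on layout, an application with two `do` block arguments, and the qualified left section that is described above as still being misparsed.

```haskell
{-# LANGUAGE BlockArguments   #-}
{-# LANGUAGE MultiWayIf       #-}
{-# LANGUAGE ParallelListComp #-}
module Main (main) where

-- An explicit Prelude import disables the implicit one, so import it twice:
-- once unqualified, once qualified as `A` to get a qualified operator `A.+`.
import Prelude
import qualified Prelude as A

-- ParallelListComp: the two qualifier groups separated by `|` are consumed
-- in lockstep (like zip), not nested as in an ordinary comprehension.
labelled :: [(Int, Char)]
labelled = [ (n, c) | n <- [1 ..] | c <- "abc" ]

-- MultiWayIf with layout: the inner guards are aligned with each other, and
-- the less-indented final guard belongs to the outer `if`.
describe :: Int -> Int -> String
describe x y =
  if | x > 0 -> if | y > 0     -> "both positive"
                   | otherwise -> "only x positive"
     | otherwise -> "x not positive"

-- Hypothetical helper that takes two actions, used to demonstrate an
-- application with multiple block arguments.
inOrder :: IO () -> IO () -> IO ()
inOrder first second = first >> second

-- Qualified left section whose operand contains an infix operator.
-- GHC reads this as \x -> (a + a) A.+ x.
addTwice :: Int -> Int -> Int
addTwice a = (a + a A.+)

main :: IO ()
main = do
  print labelled
  putStrLn (describe 1 0)
  -- BlockArguments: both `do` blocks are passed directly as arguments.
  inOrder
    do putStrLn "first block"
    do putStrLn "second block"
       putStrLn "still second block"
  print (addTwice 2 3)
```

A file like this should compile with GHC as-is. Running a parser over it and checking the resulting tree for `ERROR` nodes is one way to see which of these constructs a given grammar version handles.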
@pokey

pokey commented Jun 14, 2024

As of vscode-parse-tree version 0.31.0, we are now shipping with tree-sitter-haskell version a50070d, which I believe should include the fix, but @tek lmk if I'm wrong.

Fwiw here is the parse I get for the code snippet at the top of this issue:

(haskell
  declarations: (declarations
    (signature
      name: (variable)
      "::"
      type: (function
        parameter: (variable)
        arrow: "->"
        result: (variable)
      )
    )
    (function
      name: (variable)
      patterns: (patterns
        (variable)
      )
      match: (match
        "="
        expression: (variable)
      )
    )
    (function
      name: (variable)
      patterns: (patterns
        (variable)
        (variable)
      )
      match: (match
        "="
        expression: (variable)
      )
    )
    (signature
      name: (variable)
      "::"
      type: (function
        parameter: (name)
        arrow: "->"
        result: (name)
      )
    )
    (function
      name: (variable)
      patterns: (patterns
        (literal
          (integer)
        )
      )
      match: (match
        "="
        expression: (literal
          (integer)
        )
      )
    )
    (function
      name: (variable)
      patterns: (patterns
        (literal
          (integer)
        )
      )
      match: (match
        "="
        expression: (literal
          (integer)
        )
      )
    )
    (function
      name: (variable)
      patterns: (patterns
        (variable)
      )
      match: (match
        "="
        expression: (infix
          left_operand: (apply
            function: (variable)
            argument: (parens
              "("
              expression: (infix
                left_operand: (variable)
                operator: (operator)
                right_operand: (literal
                  (integer)
                )
              )
              ")"
            )
          )
          operator: (operator)
          right_operand: (apply
            function: (variable)
            argument: (parens
              "("
              expression: (infix
                left_operand: (variable)
                operator: (operator)
                right_operand: (literal
                  (integer)
                )
              )
              ")"
            )
          )
        )
      )
    )
  )
)

@wenkokke I believe you were seeing failures only when running tests on your cursorless haskell branch? I haven't tried this new version there, so I can't verify whether the fix has worked.

@tek
Contributor

tek commented Jun 14, 2024

yep this should be resolved
