Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make peg$computePosDetails a little faster #439

Merged
merged 3 commits into from
Dec 9, 2023

Conversation

markw65
Copy link

@markw65 markw65 commented Sep 10, 2023

Some grammars seem to spend a lot of time in peg$computePosDetails. examples/xml.peggy is one example. According to the VSCode profiler, it's the single most expensive self-time function, and with typical data seems to take about 20% of the total time.

This change takes advantage of the fact that generally, calls to peg$computePosDetails that don't hit in the cache will actually be beyond the last element (because in general parsing proceeds from the start of the file to the end), so we can use peg$posDetailsCache.length to find the previous element, rather than walking backwards testing each element.

In my tests, I get a 5-10% speedup (depending on the input) for the parser generated from examples/xml.peggy. Other grammars I've tested get significantly smaller wins, but I can generally still see an observable win in the profiler.

Copy link
Member

@Mingun Mingun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!

@markw65
Copy link
Author

markw65 commented Sep 13, 2023

This doesn't look quite right to me.

It's explicitly checking for \n characters as end of line. That should work for both \n (linux) and \r\n (windows), but won't work for \r (MacOS prior to 10.x), or the other Unicode line separators \u2028 and \u2029.

So eg examples/javascript.pegjs has

LineTerminator
  = [\n\r\u2028\u2029]

LineTerminatorSequence "end of line"
  = "\n"
  / "\r\n"
  / "\r"
  / "\u2028"
  / "\u2029"

So it will correctly parse files that use \r (only), but will report incorrect locations for them.

I thought about switching to a regex in that loop, which would make it easy to have a default such as /\r\n|[\r\n\u2028\u2029]/, and would allow an easy override via the options, but doing so made the loop much much slower (I also tried things like input.slice(p, pos).split(/\r\n|[\r\n\u2028\u2029/), and matchAll and exec which seem like they might simplify things, but everything I tried was even slower.

The one thing that does seem to work is a switch:

      "      while (p < pos) {",
      "        switch (input.charCodeAt(p++)) {",
      "          case 13:",
      "            if (input.charCodeAt(p) === 10) {",
      "              break;",
      "            }",
      "          case 10:",
      "          case 0x2028:",
      "          case 0x2029:",
      "            details.line++;",
      "            details.column = 1;",
      "            break;",
      "          default:",
      "            details.column++;",
      "        }",
      "      }",

That seems to be just as fast as the current if/else (maybe a little faster, but it's noisy enough that I'm not sure). It's not as easy to make configurable, but thats probably not necessary anyway. Does that seem worth adding?

@Mingun
Copy link
Member

Mingun commented Sep 13, 2023

If I remember project history correctly, previously this code processes all kinds of line ending, but later David change that. So you can look at blame to find the reasoning

@hildjj
Copy link
Contributor

hildjj commented Dec 9, 2023

This one next, please.

@hildjj hildjj merged commit f2c271f into peggyjs:main Dec 9, 2023
9 checks passed
@markw65 markw65 deleted the faster-pos-details branch December 10, 2023 20:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants