Make peg$computePosDetails a little faster #439

markw65 · 2023-09-10T17:41:23Z

Some grammars seem to spend a lot of time in peg$computePosDetails. examples/xml.peggy is one example. According to the VSCode profiler, it's the single most expensive self-time function, and with typical data seems to take about 20% of the total time.

This change takes advantage of the fact that generally, calls to peg$computePosDetails that don't hit in the cache will actually be beyond the last element (because in general parsing proceeds from the start of the file to the end), so we can use peg$posDetailsCache.length to find the previous element, rather than walking backwards testing each element.

In my tests, I get a 5-10% speedup (depending on the input) for the parser generated from examples/xml.peggy. Other grammars I've tested get significantly smaller wins, but I can generally still see an observable win in the profiler.

Mingun

Nice catch!

markw65 · 2023-09-13T15:33:38Z

This doesn't look quite right to me.

It's explicitly checking for \n characters as end of line. That should work for both \n (linux) and \r\n (windows), but won't work for \r (MacOS prior to 10.x), or the other Unicode line separators \u2028 and \u2029.

So eg examples/javascript.pegjs has

LineTerminator
  = [\n\r\u2028\u2029]

LineTerminatorSequence "end of line"
  = "\n"
  / "\r\n"
  / "\r"
  / "\u2028"
  / "\u2029"

So it will correctly parse files that use \r (only), but will report incorrect locations for them.

I thought about switching to a regex in that loop, which would make it easy to have a default such as /\r\n|[\r\n\u2028\u2029]/, and would allow an easy override via the options, but doing so made the loop much much slower (I also tried things like input.slice(p, pos).split(/\r\n|[\r\n\u2028\u2029/), and matchAll and exec which seem like they might simplify things, but everything I tried was even slower.

The one thing that does seem to work is a switch:

      "      while (p < pos) {",
      "        switch (input.charCodeAt(p++)) {",
      "          case 13:",
      "            if (input.charCodeAt(p) === 10) {",
      "              break;",
      "            }",
      "          case 10:",
      "          case 0x2028:",
      "          case 0x2029:",
      "            details.line++;",
      "            details.column = 1;",
      "            break;",
      "          default:",
      "            details.column++;",
      "        }",
      "      }",

That seems to be just as fast as the current if/else (maybe a little faster, but it's noisy enough that I'm not sure). It's not as easy to make configurable, but thats probably not necessary anyway. Does that seem worth adding?

Mingun · 2023-09-13T15:52:48Z

If I remember project history correctly, previously this code processes all kinds of line ending, but later David change that. So you can look at blame to find the reasoning

hildjj · 2023-12-09T16:12:19Z

This one next, please.

Mingun approved these changes Sep 10, 2023

View reviewed changes

markw65 added 3 commits December 9, 2023 08:20

Make peg$computePosDetails a little faster

a9c2cc1

Add a CHANGELOG entry

7575bc0

Build Artifacts

7c1c759

markw65 force-pushed the faster-pos-details branch from 776427b to 7c1c759 Compare December 9, 2023 16:26

hildjj merged commit f2c271f into peggyjs:main Dec 9, 2023
9 checks passed

markw65 deleted the faster-pos-details branch December 10, 2023 20:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make peg$computePosDetails a little faster #439

Make peg$computePosDetails a little faster #439

markw65 commented Sep 10, 2023

Mingun left a comment

markw65 commented Sep 13, 2023

Mingun commented Sep 13, 2023

hildjj commented Dec 9, 2023

Make peg$computePosDetails a little faster #439

Make peg$computePosDetails a little faster #439

Conversation

markw65 commented Sep 10, 2023

Mingun left a comment

Choose a reason for hiding this comment

markw65 commented Sep 13, 2023

Mingun commented Sep 13, 2023

hildjj commented Dec 9, 2023