Avoid double extraction of substrings in various MATCH_ bytecodes #427

markw65 · 2023-08-21T16:51:31Z

MATCH_* opcodes typically do something like

if (<test>(input.substr(peg$currPos, length))) {
  sN = input.substr(peg$currPos, length);
  ...
} else {
  sN = peg$FAILED;
  ...
}

where we extract the same substring twice.

compileInputChunkCondition will convert that to

sN = input.substr(peg$currPos, length);
if (<test>(sN)) {
  ...
} else {
  sN = peg$FAILED;
  ...
}

avoiding the overhead

markw65 · 2023-08-21T16:53:42Z

When stacked on #425 this makes the parser generated from examples/xml.peggy an additional 5% faster.

hildjj · 2023-08-21T17:45:23Z

Did you check coverage over the changed bits of code with just the one test added?

I need to add codecov.io to the CI action.

hildjj · 2023-08-21T17:50:07Z

Looks like line 584 might need a test.

markw65 · 2023-08-21T18:24:39Z

Looks like line 584 might need a test

I think thats probably true - the way bytecode is currently generated, I don't see a way to get an ACCEPT_N with a length greater than 1 which doesn't also hit the skipAccept condition.

The only way I can think of to hit that line would be to add an option to disable the transformation, and run a test with it disabled...

But how are you seeing that? npm test coverage doesn't mention that line for me...

hildjj · 2023-08-21T19:34:50Z

I downloaded the PR with gh pr checkout 427, ran npm build, and opened coverage/lcov-report/index.html in a browser.

markw65 · 2023-08-21T23:35:15Z

Ok, so yes, that line seems to be uncovered; and I've confirmed that the current byte code generator never generates a sequence that can hit that line.

In fact, the only thing that can hit the single character branch on the next line is MATCH_ANY.

So I've added a test to plugin-api.spec.js which inserts a munge-the-bytecode pass between generateBytecode and generateJS. It changes the bytecode so that compileInputChunkCondition doesn't see its expected pattern, and as a result we get full coverage (the test also verifies that the compileInputChunkCondition didn't fire by examining the generated parser).

markw65 · 2023-08-24T14:51:51Z

I think this is ready to go.

Mingun

I think, that some useful improvements could be made

Mingun · 2023-08-24T15:27:36Z

test/api/plugin-api.spec.js

+            expect(bc[0]).to.equal(19 /* MATCH_STRING_IC */);
+            bc.splice(4, 0, 5 /* PUSH_CUR_POS */, 6 /* POP */);


It is better to use real constants instead of comments. Import opcodes and

Suggested change

expect(bc[0]).to.equal(19 /* MATCH_STRING_IC */);

bc.splice(4, 0, 5 /* PUSH_CUR_POS */, 6 /* POP */);

expect(bc[0]).to.equal(op.MATCH_STRING_IC);

bc.splice(4, 0, op.PUSH_CUR_POS, op.POP);

I agree - all the examples I looked at seem to use the numbers though. Will fix.

Mingun · 2023-08-24T15:30:04Z

lib/compiler/passes/generate-js.js

+          cond = cond.replace("$$$.charCodeAt(0)", "input.charCodeAt(peg$currPos)");
+        }
+        compileCondition(
+          cond.replace("$$$", inputChunk),
+          argCount
+        );


It would better just accept a transformation function instead of replacing special placeholder, something like (inputChunk) => /*condition*/

Mingun · 2023-08-24T15:33:24Z

lib/compiler/passes/generate-js.js

+          : "input.substr(peg$currPos, " + inputChunkLength + ")";
+        if (bc[ip + baseLength] === op.ACCEPT_N
+              && bc[ip + baseLength + 1] === inputChunkLength) {
+          skipAccept = true;


I think, it should be possible just consume 2 opcodes at once instead of introducing this flag. If you recognize pattern, just pretend that you have a new long opcode and process it

I need to think about this a bit... it's not just two bytecodes; it's a MATCH bytecode followed by a then clause which starts with ACCEPT_N, and an else clause. So the ACCEPT_N is handled in a recursive call to compile.

But yes, presumably I can make compileCondition spit out the increment to currPos, and then skip over the ACCEPT_N.

I'm going to push back on this one. The only way I see to do it is to put a modified copy of compileCondition into compileInputChunkCondition. Or maybe pass in some extra options to tell it to push a particular string at the beginning the then block, and then skip the first two bytecodes. Either way, I think the current solution is cleaner.

If you have other ideas, or would like me to do one of the above, let me know.

Or maybe pass in some extra options to tell it to push a particular string at the beginning the then block, and then skip the first two bytecodes.

Yes, that is what I mean. Basically you pretend that instead of this opcode:

[ op.MATCH_STRING, 0, // string 0 if_len, else_len, op.ACCEPT_N, 2, ..., // if part ... // else part ]

you process this opcode:

[ /* virtual opcode */, if_len-2, else_len, ..., // if part ... // else part ]

So it seems that all that you need it is correctly calculate IF part to pass to recursion. It seems that you just need add a new parameter to compileCondition:

-function compileCondition(cond, argCount) { +function compileCondition(cond, argCount, thenSkip) { - const thenLength = bc[ip + baseLength - 2]; + const thenLength = bc[ip + baseLength - 2] - thenSkip; () => { - ip += baseLength; + ip += baseLength + thenSkip; thenCode = compile(bc.slice(ip, ip + thenLength)); ip += thenLength; }, }

Pass 0 is all existing invocations and 2 (or whatever) in this new code (which is on line 391 currently)

That doesn't work, because It's only supposed to skip a part of the ACCEPT_N.

You also need to tell it to increment peg$currPos by the correct amount at the beginning of the then, and to bump stack.sp, to get things properly balanced...

And passing in a flag to do all that seems like way too much magic. I suppose it could be a callback, but even that feels much worse to me than the current solution.

Well, ok. Strictly speaking, there are limited count of MATCH + ACCEPT combinations, so instead of repeating it over and over they could be splitted into functions that would be called by the parser. But I'm unsure how this will affect performance

I'm actually working on some bytecode optimizations that will produce lots more MATCH combinations by pulling in subsequent bytecode.

That is to say that where currently we get

MATCH ACCEPT ELSE FAIL IF_FAILED // or IF_NOT_FAILED DO_FAILED_THINGS ELSE DO_MATCH_THINGS

My optimization converts that to

MATCH ACCEPT DO_MATCH_THINGS ELSE FAIL DO_FAILED_THINGS

This both saves the second conditional, and in many cases allows other cleanups - often it can avoid assigning the result of the match to anything, or assigning peg$FAILED to anything.

So I think this approach (which continues to work, and can be expanded after my next pull request), is the way to go, rather than pulling out MATCH_ACCEPT patterns.

Thinking about it, if you really hate the skipAccept flag, I think I can make a callback work fairly cleanly:

ip += baseLength; thenCode = (callback || compile)(bc.slice(ip, ip + thenLength)); ip += thenLength;

Then existing callers of compileCondition can be left unchanged, and compileInputChunkCondition can pass in a suitable callback to emulate the bits of accept that it needs to do, and skip over the first couple of bytecodes.

Let me know if you'd like me to do something like that.

No, this is not necessary, but if you like to try this approach I also wouldn't be against

I tried, and I mostly like it, so I've pushed the update. The only downside is that now the code to handle ACCEPT_N is split into two places - the actual ACCEPT_N, and the cut down version handled by the callback to compileCondition.

But it avoids the statefulness of skipAccept, so I think its a good change on balance.

markw65 · 2023-08-24T17:03:48Z

There's a new coverage issue, sorry. I'll fix and update.

Mingun

I would replace using of replace on text strings which we should treat here as opaque tokens. Other that that it is OK

Mingun · 2023-08-28T16:34:10Z

lib/compiler/passes/generate-js.js

+              inputChunkLength > 1
+                ? "peg$currPos += " + inputChunkLength + ";"
+                : "peg$currPos++;"


Of course inputChunkLength wouldn't be zero, but for symmetry with the code above I think it would be better to invert condition:

Suggested change

inputChunkLength > 1

? "peg$currPos += " + inputChunkLength + ";"

: "peg$currPos++;"

inputChunkLength === 1

? "peg$currPos++;"

: "peg$currPos += " + inputChunkLength + ";"

Mingun · 2023-08-28T16:43:49Z

lib/compiler/passes/generate-js.js

+                inputChunk = inputChunk.includes(".charAt(")
+                  ? inputChunk.replace(".charAt(", ".charCodeAt(")
+                  : `${inputChunk}.charCodeAt(0)`;
+                return `${inputChunk} === ${literal.charCodeAt(0)}`;


It seems, we could always generate inputChunk as containing .charCodeAt and this hackish replace wouldn't be needed? Or this is not the case for ignored case literals? Then you can pass a ignoreCase parameter to compileInputChunkCondition and generate different code depending on it.

Or pass in a parameter either "input.charAt(peg$currPos)" or "input.charCodeAt(peg$currPos)" if this will be more readable (it will be used only if literal has length of 1)

The point is that if we use the inputChunk twice, we want to do

s0 = input.charAt(peg$currPos); if (s0.charCodeAt(0)) { ... }

but if we only use it once, we want to do:

if (input.charCodeAt(peg$currPos)) { ... }

rather than

if (input.charAt(peg$currPos).charCodeAt(0)) { ... }

Currently, we never generate byte code that uses the input twice when we're extracting a single, case sensitive character; but thats because there's already an optimization in generate-bytecode to avoid using it twice whenever possible as an optimization.

The way I originally wrote this, with a template string, meant that the code that chose inputChunk, and the code that matched the charCodeAt(0); was all self contained in compileInputChunk. Now compileInputChunk chooses inputChunk, but the callback has to make the decision about charAt vs charCodeAt, so yes, I agree it all looks a bit magical...

I could pass a flag back to the callback saying whether the optimization had happened or not, so the callback could then choose to ignore inputChunk when the optimization didn't happen. Would that work?

Mingun · 2023-08-28T16:46:49Z

test/api/plugin-api.spec.js

@@ -166,5 +167,45 @@ describe("plugin API", () => {
      expect(parser.parse("x", { startRule: "b" })).to.equal("x");
      expect(parser.parse("x", { startRule: "c" })).to.equal("x");
    });
+
+    it("can munge the bytecode", () => {


Well... I learned a new word :)

markw65 · 2023-08-30T18:28:48Z

Anything else to do here?

markw65 · 2023-09-07T14:28:00Z

I've rebased this over #425 (thanks!) resolved the CHANGELOG.md "conflict", and rebuilt the build artifacts.

hildjj · 2023-09-07T15:02:28Z

Great, thanks. @Mingun final checks, please?

compare to peggyjs/peggy#427

hildjj mentioned this pull request Aug 21, 2023

Add a pass to simplify single-character choices #425

Merged

markw65 force-pushed the optimize-substrings branch from fe01e6f to 894ee5b Compare August 21, 2023 23:30

Mingun suggested changes Aug 24, 2023

View reviewed changes

markw65 force-pushed the optimize-substrings branch from dfbd231 to 8de3123 Compare August 24, 2023 17:34

Mingun reviewed Aug 28, 2023

View reviewed changes

markw65 mentioned this pull request Sep 6, 2023

Bytecode optimizer #429

Closed

markw65 added 7 commits September 7, 2023 07:15

Avoid double extraction of substrings in various MATCH_ bytecodes

4b71a04

Requested changes

e92d4e0

More bytecode munging for coverage

e62bc82

Drop skipAccept in favor of a thenFn

f0668ba

Add a CHANGELOG entry

8dda63b

Fixes from review

b8c14fc

Build Artifacts

181e338

markw65 force-pushed the optimize-substrings branch from 6abb00b to 181e338 Compare September 7, 2023 14:26

Mingun approved these changes Sep 7, 2023

View reviewed changes

hildjj merged commit f323563 into peggyjs:main Sep 7, 2023
9 checks passed

markw65 deleted the optimize-substrings branch September 7, 2023 15:53

MarcelBolten added a commit to MarcelBolten/phpeggy that referenced this pull request Sep 28, 2023

Avoid double extraction of substrings in various MATCH_ bytecodes

75af0e4

compare to peggyjs/peggy#427

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoid double extraction of substrings in various MATCH_ bytecodes #427

Avoid double extraction of substrings in various MATCH_ bytecodes #427

markw65 commented Aug 21, 2023

markw65 commented Aug 21, 2023

hildjj commented Aug 21, 2023

hildjj commented Aug 21, 2023

markw65 commented Aug 21, 2023

hildjj commented Aug 21, 2023

markw65 commented Aug 21, 2023

markw65 commented Aug 24, 2023

Mingun left a comment

Mingun Aug 24, 2023

markw65 Aug 24, 2023

Mingun Aug 24, 2023

Mingun Aug 24, 2023

markw65 Aug 24, 2023

markw65 Aug 24, 2023

Mingun Aug 24, 2023 •

edited

Loading

markw65 Aug 24, 2023

Mingun Aug 24, 2023

markw65 Aug 24, 2023

markw65 Aug 24, 2023

Mingun Aug 24, 2023

markw65 Aug 24, 2023

markw65 commented Aug 24, 2023

Mingun left a comment

Mingun Aug 28, 2023

Mingun Aug 28, 2023

markw65 Aug 28, 2023

Mingun Aug 28, 2023

markw65 commented Aug 30, 2023

markw65 commented Sep 7, 2023

hildjj commented Sep 7, 2023

		expect(bc[0]).to.equal(19 /* MATCH_STRING_IC */);
		bc.splice(4, 0, 5 /* PUSH_CUR_POS /, 6 / POP */);

Avoid double extraction of substrings in various MATCH_ bytecodes #427

Avoid double extraction of substrings in various MATCH_ bytecodes #427

Conversation

markw65 commented Aug 21, 2023

markw65 commented Aug 21, 2023

hildjj commented Aug 21, 2023

hildjj commented Aug 21, 2023

markw65 commented Aug 21, 2023

hildjj commented Aug 21, 2023

markw65 commented Aug 21, 2023

markw65 commented Aug 24, 2023

Mingun left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Mingun Aug 24, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

markw65 commented Aug 24, 2023

Mingun left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

markw65 commented Aug 30, 2023

markw65 commented Sep 7, 2023

hildjj commented Sep 7, 2023

Mingun Aug 24, 2023 •

edited

Loading