Ideas for gen lexer supporting "give back" for quest #276
Want to add a failing test case in a PR?
I'm very tired, but instead of rolling back, perhaps if the element preceding the
In your example 1, the Edit: I think
Oh hah, so interestingly the compiler optimises this away completely. This:
Is optimised to this:
Thanks for brainstorming about this! Note that I was giving a simple example here, but in general it gets harder if the thing operated on by the quest operator is a really complex expression. Imagine a pattern like

I want to see if I can hack out approach number 2 just to benchmark it and see how it compares to the current code. I believe the complexity/efficiency could be very similar to the current codegen lexer when there are no quest operators, because it could still use the optimized case where an
Yeah, I'm aware, it was a typo. The comment before contained the expression in question.

As for the approach, in general I prefer to do the simplest thing possible unless it's clear that it won't be viable. In this case I think it's exceedingly unlikely that there will be complex expressions involving multiple quest operators.

To be honest, the generated code is super naive. I initially wanted to generate a table-driven DFA matcher like re2dfa, but that would have been quite a bit more work than the current implementation. Even more ideally I would have just used that code directly, but unfortunately it is GPL3 :(

All that is to say, I don't want to invest a significant amount of effort in this implementation. If a lot of work is to be put in, it would be better to put it into a DFA-based implementation.
That should probably be true!! In my case maybe I'm doing too much in the tokenizer vs the parser, LOL :) I have a regular expression for a

However, it is actually rare that the quest operator in this expression ever needs to "give back" to match accurately (later parts of the pattern match on distinct literals past the quest operator, so being fully greedy is normally perfectly OK and the fast thing to do) -- it would not be good to give up the huge performance and simplicity of a "pure greedy" quest operator implementation just for the unlikely case that one of these might need to be non-greedy.

Maybe for now it's just better to document that the quest operator does not give back in the generated lexer, and let users manually adjust their expressions accordingly? I will try to adjust my expressions manually to avoid relying on any quest operator giving back, and see if that allows my test suite to pass fully.

As an aside, it looks like the license for re2dfa doesn't put a license on generated code, only on the generator itself. Now that the participle lexer generator is separated from the core of the library, it might be possible to use re2dfa: the generator would first generate an input file with all of the relevant tokens, feed that to re2dfa, and then include the code re2dfa generates in its own output. That would avoid any participle code being GPL'd, and the output would also be free of license constraints, while still building on re2dfa. re2dfa would probably need some enhancements (maybe adding the notion of build tags, etc.). To your point, though, lexer patterns should be simple, so all this complexity would perhaps be overkill.

I'm super close to my test suite fully passing with the generated lexer, so I'm hoping to start contributing some energy towards the generated parser next :)
😳 wow...I would love to see that if you don't mind.
Awesome! That will require some thought, I think; I'm still not sure what the best approach is, to be honest.
Sure, I'll DM you the regex :)

With handmade changes to the lexer regex in a couple of places to eliminate any "non-greedy" quest operator, my lexer/parser test suite is now passing (about 300 tests)! I agree that most lexers should have simple regexps, so clearly documenting that the generated lexer is fully greedy for quest operators seems like the way to go, vs. something more complex, harder to maintain, and slower. I'll submit a PR to suggest some additions to the README about that.

Including the optimizations you committed today, the generated lexer is seeing an 8x speedup in lexing, still with zero allocs! (The 4 allocs in the genlexer benchmark are, I believe, just to allocate the initial lexer state.)
This is a huge improvement and definitely good enough for my app.

The approach taken by @petee-d for a generated parser, at least for my use case, gave an overall application speedup of 10x-15x and reduced allocations 15x-30x (some details given in comments on #213). There are the ergonomic issues raised in #264, and rebase prep work was still in progress in #263, but the general algorithm there seems quite fast if those can be resolved. I haven't taken a detailed look at the code yet to understand the approach.
After #275 my lexer test suite was still failing. It looks like it's because the `Quest` regex operator (and related) doesn't support "giving back". For example, imagine you have the pattern for matching a 3- or 4-capital-letter time zone: `[A-Z][A-Z][A-Z]?T` -- the generated lexer currently will not match `EST`, because the current quest operator implementation is fully greedy (it never gives back). The generated code will match `EST` with the first three operations, but that advances the position past the ending `T`, and thus the last literal `T` does not match. It doesn't backtrack to try the case where the `[A-Z]?` term was never matched, which would have allowed the pattern to match.

The current generated code looks something like:
It seems the implementation to support "give back" could be pretty complicated, so I wanted to discuss possible directions.
Ideas:
1. Expand quests into alternates: `[A-Z][A-Z][A-Z]?T` would be rewritten internally to `[A-Z][A-Z][A-Z]T|[A-Z][A-Z]T` before going through codegen. This could quickly lead to an exponential explosion in the number of alternates for any reasonably complex regex, but the generated code itself would never have to "backtrack". It's simple to implement but could be horribly slow, so it doesn't seem like a good option.

2. Change each generated match function (like the `l3` func above) to return a list of positions rather than a single position (or an empty list instead of -1). The caller would then explore all of the possibilities starting from each returned position (or, on an empty list, stop and return an empty list right there). It feels like this direction might work, but it would require some memory allocation for the returned lists (maybe there is a way to optimize by passing in a shared list or using an allocation pool). A sketch of this might look like:

Probably this method could be optimized to recognize when a given function can only ever return a single position and, if so, avoid the allocation overhead. E.g., it could look more like:
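The sketch for idea 2 was also lost in this copy; a hedged reconstruction of the multi-position approach for `[A-Z][A-Z][A-Z]?T` (all names hypothetical) could look like this, where each term returns every position it could end at, greedy branch first, and the driver explores them all:

```go
package main

import "fmt"

// capital matches one [A-Z]; a single-possibility term still returns a list.
func capital(s string, p int) []int {
	if p < len(s) && s[p] >= 'A' && s[p] <= 'Z' {
		return []int{p + 1}
	}
	return nil // empty list instead of -1
}

// capitalOpt matches [A-Z]? and returns both branches, consuming branch
// first so greedy behaviour is preferred.
func capitalOpt(s string, p int) []int {
	var out []int
	if p < len(s) && s[p] >= 'A' && s[p] <= 'Z' {
		out = append(out, p+1)
	}
	return append(out, p) // zero-width alternative last
}

// literalT matches the literal T.
func literalT(s string, p int) []int {
	if p < len(s) && s[p] == 'T' {
		return []int{p + 1}
	}
	return nil
}

// match explores every combination of candidate positions, so the quest
// operator can "give back" when the greedy branch kills the final literal.
func match(s string) int {
	for _, p1 := range capital(s, 0) {
		for _, p2 := range capital(s, p1) {
			for _, p3 := range capitalOpt(s, p2) {
				for _, p4 := range literalT(s, p3) {
					return p4 // first full match wins
				}
			}
		}
	}
	return -1
}

func main() {
	fmt.Println(match("ESTT")) // 4: greedy branch succeeds
	fmt.Println(match("EST"))  // 3: quest gives back the optional capital
}
```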
This variant is nice because the caller can still use `p` regardless of whether it's in a loop (the previous function had multiple possibilities) or just an if statement (the previous function had just one possibility). With an allocation pool for the matchedPositions array, maybe it wouldn't actually be that bad (memory complexity would be related to the number of open "branches" at any given time).

What do you think? Any ideas on better directions?