
Suggestion: Sort literal tokens by literal length #21

Closed
RodneyMKay opened this issue Sep 19, 2023 · 4 comments

Comments


RodneyMKay commented Sep 19, 2023

With the current behavior, the following code throws an unexpected EOF error:

val testGrammar = object : Grammar<TokenMatch> {
    val single by literalToken("<")
    val double by literalToken("<<")
    
    override val root = single or double
}

testGrammar.parse("<<")

In this example, the second token is always unreachable. Since the lexer currently tries the literal tokens in the order they are defined, the user can easily work around this by swapping the definitions of single and double:

val testGrammar = object : Grammar<TokenMatch> {
    val double by literalToken("<<")
    val single by literalToken("<")
    
    override val root = single or double
}

testGrammar.parse("<<")

My proposal would be to sort the literal tokens by their length before attempting to parse them. If the lexer attempts longer tokens first, it will always produce the correct result, and users can no longer define literal tokens that are unreachable during lexing.
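The proposed behavior can be sketched with a small helper (illustrative only, not the Parsus API): sort the literal candidates by length, descending, and take the first one that matches at the current position, so that "<<" is always tried before "<".

```kotlin
// Hypothetical sketch of length-priority matching; names are illustrative,
// not part of the Parsus API.
fun matchLiteral(input: String, pos: Int, literals: List<String>): String? =
    literals.sortedByDescending { it.length }          // longest literals first
        .firstOrNull { input.startsWith(it, pos) }     // first one that matches wins
```

With this ordering, the declaration order of single and double no longer matters for correctness.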

I'm not sure whether you intended for users to control the parse order through the order of the definitions. One downside of this suggestion is potential performance overhead: users could no longer put more frequently occurring tokens at the top of their grammar to speed up matching for those tokens. A compromise could be a more elaborate algorithm that moves unreachable tokens in front of their reachable counterparts when the lexer is initialized. This would preserve control over most of the parse order while eliminating the unreachable tokens.

The ultimate decision is not straightforward, but I personally think that the usability benefits for people new to the library outweigh the drawbacks for experienced users.

alllex (Owner) commented Oct 3, 2023

This is a valid problem. Though I think sorting only the literal tokens is at best a partial solution.

If grammars consisted solely of literal tokens, this would probably work. However, as soon as regex-based tokens come into play, length loses its meaning as a priority measure. Furthermore, Parsus allows defining algorithmic tokens (hand-written token parsers) implemented with arbitrary code.

In a more practical sense, accidentally defining a regex-based token regexToken("[a-z]+") before a literal token literalToken("oops") that it would match leads to the same problem. A similar thing can happen with a pair of regex-based tokens.

Unreachable tokens are most likely a sign of a bug in the grammar definition. There is usually no need to declare redundant tokens. It would be more helpful to point unreachable tokens out to the user, if they can be detected.
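For the literal-only case, one possible best-effort check can be sketched as follows (an illustrative sketch, not Parsus code): with in-order, first-match lexing, a literal token can never match if some earlier literal is a prefix of it.

```kotlin
// Hypothetical detection of unreachable literal tokens under in-order,
// first-match lexing. A literal is shadowed whenever an earlier literal
// is a prefix of it, because the earlier one matches first at that position.
fun findUnreachable(literalsInOrder: List<String>): List<String> =
    literalsInOrder.filterIndexed { i, lit ->
        literalsInOrder.take(i).any { earlier -> lit.startsWith(earlier) }
    }
```

Regex-based and algorithmic tokens make this harder, which is why such a validation could only ever be best-effort.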

I was thinking about this, and the current solution I have in mind is a best-effort validation of the grammar on initialization, failing if there are unreachable tokens. This validation would be enabled by default, with the ability to opt out just in case. In addition, the token-creating functions should take an extra parameter that lets users override a token's priority without forcing them to rearrange class members. Effectively, users would be able to define the sorting order themselves.
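A rough sketch of how such a priority parameter could interact with declaration order (names and signatures are hypothetical, not the actual Parsus API): tokens are sorted by explicit priority first, falling back to the order in which they were declared.

```kotlin
// Hypothetical model of token ordering with an explicit priority override.
// Higher priority wins; ties fall back to declaration order.
data class LiteralSpec(val literal: String, val priority: Int = 0, val declarationIndex: Int)

fun lexingOrder(specs: List<LiteralSpec>): List<String> =
    specs.sortedWith(
        compareByDescending<LiteralSpec> { it.priority }
            .thenBy { it.declarationIndex }
    ).map { it.literal }
```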

This is very similar to what you describe as a more "elaborate algorithm". The algorithm itself would stay simple, to remain straightforward to users, while giving them more control.

alllex (Owner) commented Oct 3, 2023

Another approach would be to let the parser dictate the tokenization.

This means there would be no separate tokenization process, lazy or otherwise. Instead, we would simply try to parse whatever token the parser expects next.

This way, users would be able to control priority on the use-site. In your example, it would be by putting the double before the single:

override val root = double or single
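The use-site ordering can be illustrated with a toy parser-combinator sketch (independent of Parsus, names are illustrative): each branch of `or` matches its own token directly against the input, in the order written, so `double or single` consumes "<<" while `single or double` stops after "<".

```kotlin
// Toy sketch of parser-driven tokenization: no separate lexer pass,
// each parser matches its token at the current position.
typealias P = (String, Int) -> Int?  // returns the new position, or null on failure

fun lit(s: String): P = { input, pos ->
    if (input.startsWith(s, pos)) pos + s.length else null
}

// "or" tries its left branch first, then the right one, in use-site order.
infix fun P.or(other: P): P = { input, pos -> this(input, pos) ?: other(input, pos) }
```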

alllex (Owner) commented Oct 4, 2023

Implementing the latter approach proved to be easier than expected, though performance has not been considered yet.

Given the other benefits, I am seriously considering switching to it for the mainline.

#23

RodneyMKay (Author) commented

That's actually really cool! I thought about solving the issue this way as well, but figured that was way outside the scope of this library. Well, turns out I was wrong. Thank you very much for your work! ♥️

PS: I'll close this issue as I'd consider it solved (:
