Wildcard regex token after literal token does not work #22

Closed
EndzeitBegins opened this issue Oct 3, 2023 · 2 comments

Comments

@EndzeitBegins

EndzeitBegins commented Oct 3, 2023

I read about parsus a while ago and wanted to incorporate it into a multi-platform side-project of mine.
Sadly, I encountered the following behavior. I'm unsure whether I'm using the library wrong or I've encountered a bug, so any help is appreciated.

Basically I want to parse a string consisting of a limited character set, followed by a literal `:` and then ending in arbitrary text. My actual use case is a little more advanced, but this is the minimal subset which reproduces the problem.

object ProceduralExampleGrammar : Grammar<Pair<String, String>>() {
    private val firstRegex by regexToken("""[A-Za-z0-9-]+""")
    private val literal by literalToken(":")
    private val secondRegex by regexToken(""".+""")

    override val root by parser {
        val first = firstRegex()
        literal()
        val second = secondRegex()
        Pair(first.text, second.text)
    }
}

fun main() {
    val parseResult = ProceduralExampleGrammar.parse("FOO:BAR")
    val pair = parseResult.getOrThrow()
}

The same behavior can be observed when using the combinator syntax.

object CombinatorExampleGrammar : Grammar<Pair<String, String>>() {
    private val firstRegex by regexToken("""[A-Za-z0-9-]+""")
    private val literal by literalToken(":")
    private val secondRegex by regexToken(""".+""")

    override val root by firstRegex * -literal * secondRegex map { (first, second) ->
        Pair(first.text, second.text)
    }
}

fun main() {
    val parseResult = CombinatorExampleGrammar.parse("FOO:BAR")
    val pair = parseResult.getOrThrow()
}

Am I using parsus wrong? Or have I stumbled upon a bug?

@alllex
Owner

alllex commented Oct 3, 2023

I presume you get the following error:

ParseException(MismatchedToken(expected=RegexToken(secondRegex [.+]), found=TokenMatch(token=RegexToken(firstRegex [[A-Za-z0-9-]+]), offset=4, length=3)))

Parsus uses a traditional approach to parsing in general. It is split into two streams of work: tokenization and bottom-up parsing. The tokenization is the first step and it is done independently from the later parsing step.

During tokenization (or lexing), a lexer takes all the tokens in the grammar (in their priority order) and tries to convert the raw stream of characters into a stream of tokens. This is done lazily, but the result is always deterministic, given the set of tokens.

In your example, it works like this:

1. Try to match `firstRegex` at position `F` -- matches `FOO`.
2. Continue and try to match `firstRegex` at position `:` -- does not match.
3. Try to match `literal` at position `:` -- matches `:`.
4. Continue and try to match `firstRegex` at position `B` -- matches `BAR`.
5. Tokenization complete.

In short, the BAR string matches the first regex, because it is also very permissive, but it's not what your parser expects when it calls secondRegex(). Thus, the error.
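To make the steps above concrete, here is a rough sketch of a priority-ordered lexer in plain Kotlin. It is a simplified model, not Parsus's actual implementation: `TokenDef`, `Match`, and `tokenize` are hypothetical names made up for illustration, and it assumes the first token (in declaration order) that matches at the current position wins.

```kotlin
// Hypothetical stand-ins for illustration only; these are NOT Parsus types.
data class TokenDef(val name: String, val regex: Regex)
data class Match(val token: String, val text: String, val offset: Int)

// Simplified model: at each position, try the tokens in declaration (priority)
// order and take the first one that produces a non-empty match.
fun tokenize(input: String, tokens: List<TokenDef>): List<Match> {
    val result = mutableListOf<Match>()
    var pos = 0
    while (pos < input.length) {
        val hit = tokens.firstNotNullOfOrNull { t ->
            t.regex.matchAt(input, pos)
                ?.takeIf { it.value.isNotEmpty() }
                ?.let { Match(t.name, it.value, pos) }
        } ?: error("No token matches at offset $pos")
        result += hit
        pos += hit.text.length
    }
    return result
}

// Same three tokens, in the same order, as the grammar from the question.
val exampleTokens = listOf(
    TokenDef("firstRegex", Regex("[A-Za-z0-9-]+")),
    TokenDef("literal", Regex(":")),
    TokenDef("secondRegex", Regex(".+")),
)

fun main() {
    // The lexer never consults the parser, so BAR is claimed by firstRegex.
    tokenize("FOO:BAR", exampleTokens).forEach(::println)
}
```

In this model, `BAR` comes out as a `firstRegex` match at offset 4 with length 3, which lines up with the `TokenMatch` reported in the error message above.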

The workaround is to introduce an intermediate parser that tries both firstRegex and secondRegex, but just returns the result.

private val firstOrSecond by firstRegex or secondRegex

override val root by parser {
    val first = firstRegex()
    literal()
    val second = firstOrSecond()
    Pair(first.text, second.text)
}

I would have to see if it makes sense to abandon the traditional tokenization approach to make such use-cases easier. Though, it's a deep topic.

@EndzeitBegins
Author

EndzeitBegins commented Oct 4, 2023

Hey @alllex, thank you for the fast response! Yes, the error I've got was along the lines of what you've presumed.
Thanks to the knowledge you've provided, I've got my example working now.

Thanks to your explanation, I think I can work with parsus now. However, I have to say that it behaves quite differently from what I expected based on my initial intuition after glancing at its API. On the other hand, I haven't worked much with tokenization and custom grammars, so there's that.

I've seen that you got started with #23 already, wow! From my understanding, this would change parsus in a way that it behaves more like I've anticipated. If that's the case, I'm looking forward to it.

As this issue is mostly based on a misunderstanding of parsus on my side, and not an actual bug, I'll close the issue. Thanks for the quick reply once more!
