Wildcard regex token after literal token does not work #22

EndzeitBegins · 2023-10-03T13:46:19Z

I read about parsus a while ago and wanted to incorporate it into a multi-platform side-project of mine.
Sadly, I encountered the following behavior. I'm unsure whether I'm using the library wrong or I've encountered a bug, so any help is appreciated.

Basically I want to parse a string consisting of a limited character set, followed by a literal : and then ending in arbitrary text. My actual use case is a little more advanced, but this is the minimal subset which reproduces the problem.

object ProceduralExampleGrammar : Grammar<Pair<String, String>>() {
    private val firstRegex by regexToken("""[A-Za-z0-9-]+""")
    private val literal by literalToken(":")
    private val secondRegex by regexToken(""".+""")

    override val root by parser {
        val first = firstRegex()
        literal()
        val second = secondRegex()
        Pair(first.text, second.text)
    }
}

fun main() {
    val parseResult = ProceduralExampleGrammar.parse("FOO:BAR")
    val pair = parseResult.getOrThrow()
}

The same behavior can be observed when using the combinator syntax.

object CombinatorExampleGrammar : Grammar<Pair<String, String>>() {
    private val firstRegex by regexToken("""[A-Za-z0-9-]+""")
    private val literal by literalToken(":")
    private val secondRegex by regexToken(""".+""")

    override val root by firstRegex * -literal * secondRegex map { (first, second) ->
        Pair(first.text, second.text)
    }
}

fun main() {
    val parseResult = CombinatorExampleGrammar.parse("FOO:BAR")
    val pair = parseResult.getOrThrow()
}

Am I using parsus wrong? Or have I stumbled upon a bug?

alllex · 2023-10-03T19:55:44Z

I presume you get the following error:

ParseException(MismatchedToken(expected=RegexToken(secondRegex [.+]), found=TokenMatch(token=RegexToken(firstRegex [[A-Za-z0-9-]+]), offset=4, length=3)))

Parsus uses a traditional approach to parsing in general. It is split into two phases: tokenization and parsing. Tokenization is the first step, and it is done independently of the later parsing step.

During tokenization (or lexing), a lexer takes all the tokens in the grammar (in their priority order) and tries to convert the raw stream of characters into a stream of tokens. This is done lazily, but the result is always deterministic, given the set of tokens.

In your example, it works like this:

1. Try to match `firstRegex` at position `F` -- matches `FOO`.
2. Continue and try to match `firstRegex` at position `:` -- does not match.
3. Try to match `literal` at position `:` -- matches `:`.
4. Continue and try to match `firstRegex` at position `B` -- matches `BAR`.
5. Tokenization complete.

In short, the lexer turns the BAR string into a firstRegex token, because firstRegex is tried first and is permissive enough to match it. That is not what your parser expects when it calls secondRegex(), hence the error.
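
The lexing steps above can be sketched as a small standalone lexer. This is only an illustration of priority-ordered tokenization, not Parsus' actual implementation: `TokenDef`, `Match`, and `tokenize` are hypothetical names, and it assumes Kotlin 1.7.20 or later for `Regex.matchAt`.

```kotlin
// Hypothetical stand-in for a token definition; not a Parsus type.
data class TokenDef(val name: String, val regex: Regex)

// Hypothetical stand-in for a matched token; not a Parsus type.
data class Match(val name: String, val offset: Int, val text: String)

// At each offset, try the tokens in list order (their priority order).
// The first token that matches a non-empty string at that offset wins.
fun tokenize(input: String, tokens: List<TokenDef>): List<Match> {
    val result = mutableListOf<Match>()
    var offset = 0
    while (offset < input.length) {
        val match = tokens.firstNotNullOfOrNull { def ->
            def.regex.matchAt(input, offset)
                ?.takeIf { it.value.isNotEmpty() }
                ?.let { Match(def.name, offset, it.value) }
        } ?: error("No token matches at offset $offset")
        result += match
        offset += match.text.length
    }
    return result
}

fun main() {
    // Same three tokens as in the grammar, in the same order.
    val tokens = listOf(
        TokenDef("firstRegex", Regex("[A-Za-z0-9-]+")),
        TokenDef("literal", Regex.fromLiteral(":")),
        TokenDef("secondRegex", Regex(".+")),
    )
    tokenize("FOO:BAR", tokens).forEach(::println)
    // Match(name=firstRegex, offset=0, text=FOO)
    // Match(name=literal, offset=3, text=:)
    // Match(name=firstRegex, offset=4, text=BAR)
}
```

Note that `secondRegex` never gets a chance at `BAR`: the lexer knows nothing about what the parser will ask for next, so the earlier token always wins where both match.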

The workaround is to introduce an intermediate parser that tries both firstRegex and secondRegex and simply returns whichever one matched.

private val firstOrSecond by firstRegex or secondRegex

override val root by parser {
    val first = firstRegex()
    literal()
    val second = firstOrSecond()
    Pair(first.text, second.text)
}

I would have to see if it makes sense to abandon the traditional tokenization approach to make such use-cases easier. Though, it's a deep topic.

EndzeitBegins · 2023-10-04T18:07:00Z

Hey @alllex, thank you for the fast response! Yes, the error I've got was along the lines of what you've presumed.
Thanks to the knowledge you've provided, I've got my example working now.

Thanks to the explanation you gave, I think I can work with parsus now. However, I have to say that it behaves quite differently from what I expected based on my initial intuition from glancing at its API. On the other hand, I haven't worked a lot with tokenization and custom grammars, so there's that.

I've seen that you got started with #23 already, wow! From my understanding, this would change parsus in a way that it behaves more like I've anticipated. If that's the case, I'm looking forward to it.

As this issue is mostly based on a misunderstanding of parsus on my side, and not an actual bug, I'll close the issue. Thanks for the quick reply once more!

alllex mentioned this issue Oct 4, 2023

Scannerless parsing #23

Merged

EndzeitBegins closed this as completed Oct 4, 2023
