Swallowing up text in the parser #122

Open
stuartlangridge opened this issue Jul 6, 2020 · 5 comments

@stuartlangridge
stuartlangridge commented Jul 6, 2020

  • parglare version: master
  • Python version: 3.8.2
  • Operating System: Ubuntu 20.04

I have a document which contains a heading, which is a quoted string, and then a series of "sentences" which end with a "." and may have newlines in. I'd like to parse the document into Heading and Sentences. I tried to do it this way:

import parglare

grammar = r"""
Document: Heading Body;

Heading: QuotedString;
Body: Anything;

Sentence: Anything DOT;

terminals

QuotedString: /"(?P<qs>.*?)"/;
Anything: /.*/;
DOT: ".";
"""

text = """

"This is the heading"

This is sentence one.
This is sentence two
which has newlines in.
"""

g = parglare.Grammar.from_string(grammar)
p = parglare.Parser(g, debug=True)
result = p.parse(text)

However, this fails with parglare.exceptions.ParseError: Error at 6:0:"ence one.\n **> This is se" => Expected: DOT but found <Anything(This is sentence two)>.

All I care about is the Heading and splitting the Body into separate sentences, but I can't work out how to do that. What's the best way to express this in a parglare grammar? The sentences can contain any characters at all; I don't need a structure or parsing for them at this stage, just a list such as ["This is sentence one.", "This is sentence two which has newlines in."] as the result.

(Apologies if this isn't actually an issue, but I hope it's the best place to ask questions about parglare. I'm happy to ask it somewhere else if that's better.)

@igordejanovic
Owner

In Python regexes, . does not match newlines by default, so .* stops at the end of the current line. To change that, you can use the (?s) inline flag (see re.DOTALL in the Python docs). So your grammar will work correctly with this:

Anything: /(?s).*/;
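
To see the effect of the flag in isolation, here is a minimal sketch using only Python's re module, with no parglare involved. The sample string and variable names are made up for illustration:

```python
import re

# Illustrative sample: one "sentence" spread over two lines.
sample = "This is sentence two\nwhich has newlines in."

# Without DOTALL, "." does not match "\n", so ".*" stops at the line end.
plain = re.match(r".*", sample).group()

# With the inline (?s) flag, "." also matches "\n",
# so ".*" runs to the end of the string.
dotall = re.match(r"(?s).*", sample).group()

print(repr(plain))
print(repr(dotall))
```

The first match ends at the newline, while the second covers the whole string, which is why the original Anything terminal stopped after one line.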

@igordejanovic
Owner

BTW, here is the right place to ask questions about parglare.

@stuartlangridge
Author

Ah, now, I tried (?s) (this bug report was originally going to mention DOTALL until I actually read the re documentation and discovered the inline (?s) version, which I didn't know existed :-)) but when I tried it I still got errors, presumably because I don't quite understand it. Example:

import parglare

grammar = r"""
Program: al=AuthorLine sentences=Sentences;
AuthorLine: title=Identifier "by" author=Identifier DOT;

Sentences: Sentence*;
Sentence: Anything DOT;
Identifier: IdentifierWord*;

terminals

IdentifierWord: /\w+/;
DOT: ".";
Anything: /(?s).*?/;
"""

text = """
Program by Stuart.

This is sentence one.
This is sentence two
which has newlines in.
"""

g = parglare.Grammar.from_string(grammar)
p = parglare.Parser(g, debug=True)
result = p.parse(text)

This fails with error:
parglare.exceptions.ParseError: Error at 4:0:" Stuart.\n\n **> This is se" => Expected: Anything or STOP but found <IdentifierWord(This)>

I don't know how to tell parglare "just swallow up the rest of the document, I don't care about parsing it", or "please only detect an IdentifierWord in the context of an AuthorLine and once you've got the AuthorLine, stop parsing" -- I can't boost or decrease the relevance of IdentifierWord with {1} or {99} because it's a terminal, and even then I want to boost it while parsing an AuthorLine and decrease it when not, which I don't understand how to do. Maybe I'm attacking this problem completely the wrong way?

@igordejanovic
Owner

The problem is that Anything collects... well, anything, even dots :) so the Sentence rule never matches, as it expects a DOT after Anything. You can do this:

Anything: /(?s)[^\.]*/;

which means Anything matches any run of characters except a dot.
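
The difference between these patterns can be checked with plain re, outside parglare. This is only a sketch: the body string is taken from the example text in this thread, the variable names are illustrative, and it assumes a dot only ever ends a sentence (a sentence containing a dot would be split early).

```python
import re

# Body text from the example, after the heading line.
body = "This is sentence one.\nThis is sentence two\nwhich has newlines in.\n"

# Greedy (?s).* swallows everything, including both dots,
# so nothing is left for a following DOT to match.
greedy = re.match(r"(?s).*", body).group()

# Non-greedy (?s).*? on its own is already satisfied by the empty string.
lazy = re.match(r"(?s).*?", body).group()

# A negated character class stops right before the first dot.
# It matches newlines even without (?s).
no_dot = re.match(r"[^.]*", body).group()

# Each sentence is then "anything except a dot", followed by a dot.
sentences = [s.strip() for s in re.findall(r"[^.]*\.", body)]

print(sentences)
```

The same idea carries over to the grammar: once Anything cannot consume a dot, the DOT terminal that follows it in Sentence has something to match.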

Another feature you might find useful, depending on what you are trying to achieve, is incomplete parsing.

@stuartlangridge
Author

Incomplete parsing looks like exactly what I want! Thank you!
