Replies: 5 comments
-
Have you seen this? |
Beta Was this translation helpful? Give feedback.
-
Yes, but the page does not mention anything about delaying error handling to later phases. I know how I can handle parse errors, I'm more concerned whether they should be parse errors if I can make the grammar more flexible by accepting wider set of tokens and then issue errors in later phases. |
Beta Was this translation helpful? Give feedback.
-
The questions you are asking shows that you are moving in the right direction. Unfortunately, I cannot give you a silver bullet, the topic is complex and any solutions are use-case based. You probably should reach out to researches (or researchers) in this area.
That is a both a strong and a weak point of lexer-based parsers, but I seems like you used word
For future references, I know this as recovery points.
As you already noticed, this is what you get by default when you implement a parser, as it is intrinsically a validating parser. In most use-cases, when you are dealing with machine generated content, it is all what you need.
I think you gone a little bit too far here, almost turning a parser into a lexer/tokenizer, but there is middle ground. You will see what I mean in a moment.
Surely, you code in B will contain A somewhere after, in an AST visitation step. It is also will not handle any input, making this solution questionable. There already was an 'alternative' parser, so I am not sure why you have not considered yourself adding an error recovery branch here. auto const color_statement =
"SetColor" >> (int_ >> int_ >> int_ >> -int_ | lexeme["0x" >> hex_char_table] | error_handling_branch) >> eol; I do not know the details of your grammar, but In case if you really need to delay an error reporting until later, there is a more advanced technique that Clang does in its new way to present incomplete parsing information https://clang.llvm.org/docs/InternalsManual.html#recovery-ast. |
Beta Was this translation helpful? Give feedback.
-
Wow, I haven't expected a response after so much time but nonetheless I am satisfied.
This is a very good observation, I haven't noticed it.
This is the thing that I have missed. I have considered more loose grammars (which more or less end up as tokenizers, wasting Spirit potential) and more strict grammars (which stop on smallest error but complicate any diagnostics production). I have thought about "broken-parsed-element" thing but I haven't realized that in such cases I should advance the parser to some recovery point.
This actually seems to be a silver bullet: insert parser recoveries as last variant alternatives. Alternatives within the grammar, due to their nature will usually have some recovery point close to them which makes it a very good place to recover from. And if something fails elsewhere, it will move upward the grammar untill it reaches an alternative. Thus, I feel like with this approach there is a very high chance of producing high-quality errors with very high chance to be able to continue parsing.
This is a lot of info but I got the point: insert a broken-parsed-element that will contain some state, potentially an entire sub-AST but the whole thing is considered an error because it is the broken-parsed-element variant alternative. Thanks a lot! |
Beta Was this translation helpful? Give feedback.
-
Web searching for:
computer science LL1 parser error recovery using follow sets
shows a lot of hits. This one:
https://www.cs.clemson.edu/course/cpsc827/material/LLk/LL%20Error%20Recovery.pdf
mentions FOLLOW(A) sets for a non-terminal, A, as, what I assume,
amounts to a "recovery point". Unfortunately, the only algorithm I know
of requires traversing, recursively, the grammar to calculate the
FOLLOW(A) set for each non-terminal, A. Maybe spirit-x3 can be
adapted to do that, but I'd think it would be prohibitively
compile-time expensive. It could maybe be done in a separate program
run and the calculated FOLLOW(A)'S then somehow #include'd in another
spirit program, say x3-follow, or some similar name, but that's very
speculative. Of course you could "eye-ball" the FOLLOW(A) for any
give non-terminal, A, but if A is complicated, that would likely be
very error prone.
Sorry can't be more help.
…-Larry
On 3/15/2021 5:40 PM, Xeverous wrote:
Wow, I haven't expected a response after so much time but nonetheless
I am satisfied.
Surely, you code in B will contain A somewhere after, in an AST
visitation step.
This is a very good observation, I haven't noticed it.
There already was an 'alternative' parser, so I am not sure why
you have not considered yourself adding an error recovery branch here.
This is the thing that I have missed. I have considered more loose
grammars (which more or less end up as tokenizers, wasting Spirit
potential) and more strict grammars (which stop on smallest error but
complicate any diagnostics production). I have thought about
"broken-parsed-element" thing but I haven't realized that in such
cases I should advance the parser to some recovery point.
advance the parser to an recovery point, report an parsing error,
and if your AST node value for SetColor already a variant, will
set it to some kind of an 'invalid' state.
This actually seems to be a silver bullet: *insert parser recoveries
as last variant alternatives*. Alternatives within the grammar, due to
their nature will usually have some recovery point close to them which
makes it a very good place to recover from. And if something fails
elsewhere, it will move upward the grammar untill it reaches an
alternative. Thus, I feel like with this approach there is a very high
chance of producing high-quality errors with very high chance to be
able to continue parsing.
In case if you really need to delay an error reporting until
later, there is a more advanced technique that Clang does in its
new way to present incomplete parsing information
This is a lot of info but I got the point: insert a
broken-parsed-element that will contain some state, potentially an
entire sub-AST but the whole thing is considered an error because it
is the broken-parsed-element variant alternative.
Thanks a lot!
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#630 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAEFBAZSSB3DIGDZVC2WVG3TD2EGTANCNFSM4UG32D7A>.
|
Beta Was this translation helpful? Give feedback.
-
I have been using X3 for a while and I have trouble deciding where to handle invalid input so I'm asking for recommendations - perhaps the library authors have more experience. I haven't found any guidelines for this in the documentation. I think problems like this one (and many general recommendations) could be a worthwhile addition to the documentation - the library is focused on a very narrow topic and I predict not all library users are very knowledgeable in this area.
The problem is as follows: the input contains invalid tokens, but they can be someway dealt with. For example, if a C++ compiler sees
int x = foo();
and there is no definition offoo
, it can issue an error and still continue because it knows the type ofx
so further code is not a problem. At worst punctuation problems it can skip to;
and assume that a statement did not exist.I'm having similar issue: I have a lot of keyword-like elements in the grammar but I'm not sure if:
A) Should I make the grammar very precise and fail the parsing upon any slight error in the grammar? (That is, keep it as a parse error)
B) Should I make the grammar more flexible by allowing all kinds of literals that my "language" supports and then output errors in code that traverses the complete AST? (That is, turn parsing errors into semantic errors by making the grammar more flexible)
For example, let's say that in some text language format such lines are valid:
A (precise grammar version)
B (loose grammar with more semantic analysis required)
In the case of A, parsing will fail much more often on any input problem and there will be many complex subgrammars for each "command".
In the case of B, parsing will often succeed; even if the input had something like
SetColor 255 255
orSetColor true
but there will be very few complex grammars as most will utilize the+token >> eol
part, making the parser simpler but requiring more code to work with the AST.Is there any general recommendation for this type of problems or it is so case-specific that I better use my own judgement? Handling semantic errors is much easier for me than syntaxic ones, and in case of semantic errors it's much more likely to make the code go forward because fixing/faking semantic state is way less complex than fixing parser's position and adjacent AST nodes.
Last thing, if B is better, how about such grammars?
In this extreme case of flexibility, the program can accept a huge variety of invalid inputs and the code that works with complete AST will always error upon
broken_literal
but then we have the highest likehood of being able to continue. I have no idea whether always-invalid-things such asbroken_literal
are a good design though.Beta Was this translation helpful? Give feedback.
All reactions