Imported Lexer Grammar is lower priority #2209

Open
kasbah opened this issue Jan 24, 2018 · 19 comments

kasbah commented Jan 24, 2018

Example.g4

grammar Example;
import Example2;

main: NUMBER;

UNKNOWN: . ;

Example2.g4

lexer grammar Example2;

NUMBER: [0-9];

Running:

echo "10 abc" | grun Example main -diagnostics -tree -tokens
[@0,0:0='1',<UNKNOWN>,1:0]
[@1,1:1='0',<UNKNOWN>,1:1]
[@2,2:2=' ',<UNKNOWN>,1:2]
[@3,3:3='a',<UNKNOWN>,1:3]
[@4,4:4='b',<UNKNOWN>,1:4]
[@5,5:5='c',<UNKNOWN>,1:5]
[@6,6:6='\n',<UNKNOWN>,1:6]
[@7,7:6='<EOF>',<EOF>,2:0]
line 1:0 mismatched input '1' expecting NUMBER
(main 1 0   a b c \n)

Moving the UNKNOWN rule into Example2:

--- a/Example.g4
+++ b/Example.g4
@@ -3,4 +3,3 @@ import Example2;
 
 main: NUMBER;
 
-UNKNOWN: . ;

--- a/Example2.g4
+++ b/Example2.g4
@@ -1,3 +1,5 @@
 lexer grammar Example2;
 
 NUMBER: [0-9];
+
+UNKNOWN: . ;

Results in the expected output:

[@0,0:0='1',<NUMBER>,1:0]
[@1,1:1='0',<NUMBER>,1:1]
[@2,2:2=' ',<UNKNOWN>,1:2]
[@3,3:3='a',<UNKNOWN>,1:3]
[@4,4:4='b',<UNKNOWN>,1:4]
[@5,5:5='c',<UNKNOWN>,1:5]
[@6,6:6='\n',<UNKNOWN>,1:6]
[@7,7:6='<EOF>',<EOF>,2:0]
(main 1)

kasbah commented Jan 24, 2018

Files: import-test-case.zip

peter-takacs added a commit to software-engineering-amsterdam/endless-ql that referenced this issue Mar 13, 2018

ericvergnaud commented Dec 25, 2022

@parrt I'm bumping into this issue on a very large lexer import (500+ token types).

What's happening is that when using tokenVocab option, lexer rules in the main grammar not only supersede rules in the imported grammar (which is expected and desirable), they also take precedence. For some reason, this does not happen when not using tokenVocab option.

As an example, given the following grammars:

ImportedLexer.g4

lexer grammar ImportedLexer;

IMPORTED: 'imported';

ImportedParser.g4

parser grammar ImportedParser;

options { tokenVocab = ImportedLexer; }

some_rule: IMPORTED;

MainLexer.g4

lexer grammar MainLexer;

import ImportedLexer;

IDENTIFIER: [a-zA-Z]+;

MainParser.g4

parser grammar MainParser;

options { tokenVocab = MainLexer; }

import ImportedParser;

start_rule: IMPORTED IDENTIFIER;

The generated lexer will tokenize the input "imported stuff" as [ IDENTIFIER, IDENTIFIER ], instead of the expected [ IMPORTED, IDENTIFIER ].

I can fix this if you agree with the proposed (latter) behavior.

Let me know what you think.

@ericvergnaud

@parrt I've fixed the issue; PR #4044 works for me.
This is a breaking change for developers using composite lexers.
However, I suspect that most developers affected by this change will welcome it.


parrt commented Dec 25, 2022

Interesting. I definitely remember thinking about how the rules should be overridden, but the tokenVocab thing might be something we didn't think about. I guess my first thought is that tokenVocab should be completely ignored in anything that is imported, such as:

parser grammar ImportedParser
options { tokenVocab = ImportedLexer; }  <=== ignore

My argument for it being ignored is that the main grammar is what should dictate what the overall sequence of token definitions is...


ericvergnaud commented Dec 26, 2022

Tbh I don't think ignoring tokenVocab=ImportedLexer is causing the issue; rather it's the logic for processing imported grammars. Imported grammar rules are appended rather than prepended, which imho is counter-intuitive, and kinda makes it mandatory to define greedy lexer rules (such as IDENTIFIER) in imported grammars rather than in the main grammar. The tokenVocab option is mandatory anyway because without it the tokens used in the parser grammar are considered unknown. See the tentative fix in PR #4044.
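
To illustrate with the MainLexer/ImportedLexer example above (this is just a reading of the current merge behaviour, not generated output), the composite lexer effectively ends up ordered like this:

MainLexer.g4 (effective merged form)

lexer grammar MainLexer;

IDENTIFIER: [a-zA-Z]+;   // main-grammar rule comes first
IMPORTED: 'imported';    // imported rule is appended, so IDENTIFIER wins the match for 'imported'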

@ericvergnaud

The 'KeywordVSIDOrder' test, which I just changed to make the tests pass in #4044, seems meant to verify the precedence intent. It's the same pattern as above.
It seems that your thinking was that an imported grammar would be a good place for rules that are frequently reused across grammars (such as ID). The unfortunate consequence is that it goes directly against the "the main grammar is what should dictate the overall sequence of token definitions" proposal (which I support), because the main grammar no longer controls the precedence of greedy lexer rules. In an extreme example, if A imports B and B imports C, greedy rules have to be at the end of C...
You might hate me for suggesting this, but we could enrich the 'import X' statement to specify whether the imported rules should be prepended or appended. This would have the benefit of not breaking any existing grammar.
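
For illustration only, such an enriched statement might look like this (hypothetical syntax, nothing ANTLR accepts today):

MainLexer.g4

lexer grammar MainLexer;

import prepend ImportedLexer;   // hypothetical modifier: place the imported rules before the ones below

IDENTIFIER: [a-zA-Z]+;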


mike-lischke commented Dec 26, 2022

Viewing this issue from a C(++) developer's angle, it would make more sense to treat the import like an include, that is, the import statement is actually replaced with the imported rules + tokens. This way you can define the order of precedence yourself. If the import statement is at the end of the main grammar, the imported rules are appended; otherwise they are inserted at whatever position the import statement appears.

Overriding rules could be like this: the last occurrence of a rule supersedes any previous definition. This way you can have, say, an ID rule in an imported grammar (which might be used in other grammars as well), but override it in a specific main grammar by redefining it after the import. But you could do the reverse too: have an ID rule in the main grammar and redefine it in an imported grammar, if the import is done after the ID rule definition.

For the tokenVocab it's the same. This vocabulary only defines a mapping from a token rule name to a token value (number). If such a name is defined again (either in a following import or by a rule with the same name), this mapping is redefined with the new token value (if imported from another grammar's tokenVocab) or by the rule (re)definition (in either an imported grammar or the main grammar).

Important for this approach to work is that imports are resolved recursively. First, all imported grammars must be resolved with their imports, tokenVocabs and rules, and only after that can they be imported into another grammar (where the merge happens again at the next higher level).

Obviously, this approach would require enhancing the ANTLR4 syntax to allow imports at any position at the grammar level.
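
As a sketch of those semantics, assuming a hypothetical ANTLR that accepts import anywhere and applies last-occurrence-wins (grammar names made up):

CommonLexer.g4

lexer grammar CommonLexer;

ID: [a-zA-Z]+;

MainLexer.g4

lexer grammar MainLexer;

KEYWORD: 'keyword';   // defined before the import, so it keeps precedence over ID

import CommonLexer;   // hypothetical in-place import: ID is inserted right here

ID: [a-zA-Z_]+;       // last occurrence wins: this redefinition would supersede CommonLexer's ID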

@mike-lischke

Hmm, tokenVocab merging is tricky and must be done differently, or should I say: prohibited?

If token values from imported grammars can override token values in the main tokenVocab, then we may break rules before the import statement that use the original token value.

In such a case ANTLR4 should at least issue a warning that a token value is being shadowed and might produce unexpected results, if not prohibit such cases entirely. If they are not prohibited, a possible solution would be to import a grammar before any rule that uses a token value which is being shadowed/replaced.


ericvergnaud commented Dec 26, 2022

I like Mike's suggestion, which doesn't require any new keyword. It's still a breaking change though, because currently imports at the beginning of a grammar behave as if they were at the end.
At this point, I don't think we need to change anything with tokenVocab behavior, except maybe document it for the imported grammars edge case.

@ericvergnaud

A non-breaking change would be to add an include clause, which, as suggested by Mike, could be located anywhere...

@mike-lischke

> I like Mike's suggestion, which doesn't require any new keyword. It's still a breaking change though, because currently imports at the beginning of a grammar behave as if they were at the end.

Right, so I'm thinking about whether we could add a new grammar option to enable the new import semantics. At a later point in time we can make it default to true, and even later remove it, once we can be sure most existing grammars have been migrated. Though I'm not sure that effort is really necessary, given that most grammars do not use imports.

@RossPatterson

The ANTLR book is pretty clear that imported files don't override existing rules. From "Grammar imports", page 257:

> ANTLR treats imported grammars very much like object-oriented programming languages treat superclasses. A grammar inherits all of the rules, tokens specifications, and named actions from the imported grammar. Rules in the "main grammar" override rules from imported grammars to implement inheritance.

> Think of import as more like a smart include statement (which does not include rules that are already defined). The result of all imports is a single combined grammar; the ANTLR code generator sees a complete grammar and has no idea there were imported grammars.
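
A minimal illustration of that override behaviour, with made-up grammar names:

CommonLexemes.g4

lexer grammar CommonLexemes;

ID: [a-z]+;
WS: [ \t\r\n]+ -> skip;

MyLexer.g4

lexer grammar MyLexer;

import CommonLexemes;

// Same name as in CommonLexemes, so this definition is the one used;
// WS is inherited unchanged from the import.
ID: [a-zA-Z_] [a-zA-Z_0-9]*;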

@RossPatterson

All options in the imported file are currently ignored - when you include an options statement in one, you get a warning:

warning(109): RexxParser.g4:3:0: options ignored in imported grammar RexxParser


ericvergnaud commented Dec 26, 2022 via email


parrt commented Dec 29, 2022

> All options in the imported file are currently ignored - when you include an options statement in one, you get a warning:

Ah, so this seems to be what I suggested at the top: to ignore.


parrt commented Dec 29, 2022

Agreed, what’s less clear is how to deal with the tokens from tokenVocab no longer being defined

Yeah, I can check that part out once we figure out the precedence issue.

I just realized that we really need two different concepts, one for importing parser rules and one for importing lexer rules. For parsers, it's obvious to me that, like inheritance, we want to override imported parser rules. For lexers, however, we not only want to replace rules but we also need a way to specify order, because introducing a new lexer rule can affect recognition of others, unlike in the parser.

BUT, I think we have a catch-22 situation where I can define a situation where imported tokens should be given precedence and another case where they should not be. In Eric's case, he is importing a keyword token that should be given precedence over the identifier rule in the main lexer. But what about the case where the imported grammar has the identifier rule and we just want to add some more keywords in the main lexer? In that situation, we want it to behave as it does now, putting the imported lexical rules at the bottom of the grammar.
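
For concreteness, that second case (which the current append-at-the-bottom behaviour handles correctly) would look roughly like this, with illustrative names:

BaseLexer.g4

lexer grammar BaseLexer;

ID: [a-zA-Z]+;

KeywordLexer.g4

lexer grammar KeywordLexer;

import BaseLexer;

WHILE: 'while';   // main-grammar keywords come first, so they beat the appended ID on 'while'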

OK, I think you've convinced me that we have a situation we cannot handle, which might or might not be the more common case (importing keywords). I can also see wanting to pull in common lexical rules like numbers and identifiers, which is no doubt what I was thinking when I defined the import mechanism.

So, changing the behavior to flip it simply breaks the other case, which is also unsatisfying.

I have not been paying attention for a decade really... Does this case come up very often? Are we trying to solve a problem that's just not that common?


parrt commented Dec 29, 2022

If it's true that it's a common problem (which I'm hoping people from the mailing list will see after the holidays and jump in here), then we need something like @mike-lischke's solution where we get to specify the location. Rather than break the existing stuff, it would make sense to deprecate it for lexers and introduce a variant that is location-specific, which would mean enhancing the grammar to allow import anywhere in a lexer grammar.

Let's wait to hear from others about just how serious this problem is. Obviously the current mechanism doesn't handle both cases but how often do people really do lexer imports? No idea actually. haha. Can't they just copy and paste lexer rules so we can avoid breaking stuff?

@ericvergnaud

I guess the problem is common enough to trigger this conversation?
It's actually not the first time I've bumped into this, although it's the first time I've gained clarity about the issue.
At a minimum, the current behavior is counter-intuitive.

Imho, splitting large lexer grammars into smaller reusable segments is good practice towards quality and readability, and by supporting local imports, we (antlr) could then provide a set of reusable constructs: integer, float, string, multi-line comment, ws and id being some that immediately come to mind.

FWIW my workaround this time has been to move the ID rule to the end of the imported lexer, which goes against my need (isolating some reusable keywords) and stinks like a hack.

Forget about my PR, which is definitely breaking and goes against the original intent of imports.

I'd suggest creating an option, lexerImportMode, accepting the values 'prepend', 'append' (the default) and 'inline'.
This wouldn't break anything and requires minimal changes in the tool (I'm happy to submit a PR).
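
A sketch of how that hypothetical option might look in a grammar (nothing here exists in ANTLR today):

MainLexer.g4

lexer grammar MainLexer;

options { lexerImportMode = prepend; }   // hypothetical option: imported rules go before the rules below

import ImportedLexer;

IDENTIFIER: [a-zA-Z]+;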


parrt commented Dec 31, 2022

> I guess the problem is common enough to trigger this conversation?

Well, it only takes one to trigger ;) For now, I think my pref is to leave it as-is and ask users to adapt.
