Imported Lexer Grammar is lower priority #2209
Files: import-test-case.zip
@parrt I'm bumping into this issue on a very large lexer import (500+ token types). What's happening is that imported lexer rules end up with lower priority than the rules of the importing grammar. As an example, given the following grammars:
The generated lexer will tokenize the input "imported stuff" as [ IDENTIFIER, IDENTIFIER ] instead of the expected [ IMPORTED, IDENTIFIER ]. I can fix this if you agree with the proposed (latter) behavior. Let me know what you think.
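The grammars from the attached test case aren't reproduced inline. A minimal sketch consistent with the description (the grammar names are assumed; only the token names IMPORTED and IDENTIFIER come from the comment) might look like:

```antlr
// ImportedLexer.g4 (assumed name): defines the keyword
lexer grammar ImportedLexer;
IMPORTED : 'imported' ;
```

```antlr
// MainLexer.g4 (assumed name): imports the keyword lexer
lexer grammar MainLexer;
import ImportedLexer;
IDENTIFIER : [a-z]+ ;
WS         : [ \t\r\n]+ -> skip ;
```

Because imported rules are effectively appended after the main grammar's rules, IDENTIFIER is defined before IMPORTED in the combined lexer and wins the equal-length match for "imported".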
Interesting. I definitely remember thinking about how the rules should be overridden, but the tokenVocab thing might be something we didn't think about. I guess my first thought is that tokenVocab should be completely ignored in anything that is imported.
My argument for ignoring it is that the main grammar is what should dictate the overall sequence of token definitions...
Tbh I don't think ignoring tokenVocab is the answer.
The 'KeywordVSIDOrder' test, which I just changed to make the tests pass in #4044, seems meant to verify the intended precedence. It follows the same pattern as above.
Viewing this issue from a C(++) developer angle, it would make more sense to treat the import like an include, that is, the import statement is actually replaced with the imported rules + tokens. This way you can define the precedence order yourself: if the import statement is at the end of the main grammar, the imported rules are appended; otherwise they are inserted at whatever position the import statement is.

Overriding rules could work like this: the last occurrence of a rule supersedes any previous definition. This way you can have, say, an ID rule in an imported grammar (which might be used in other grammars as well), but override it in a specific main grammar by redefining it after the import. You could do the reverse too: have an ID rule in the main grammar and redefine it in an imported grammar, if the import comes after the ID rule definition.

For the tokenVocab it's the same. This vocabulary only defines a mapping of a token rule name to a token value (number). If such a name is defined again (either in a following import or by a rule with the same name), the mapping is redefined with the new token value (if imported from another grammar's tokenVocab) or with the rule (re)definition (in either an imported grammar or the main grammar).

For this approach to work, imports must be resolved recursively. First, all imported grammars must be resolved with their imports, tokenVocabs and rules, and only then can they be imported into another grammar (where the merge happens again at the next higher level). Obviously, this approach would require enhancing the ANTLR4 syntax to allow imports at any position at the grammar level.
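The "last occurrence supersedes any previous definition" merge described above can be sketched as a small helper. This is a toy model, not ANTLR's resolver; the rule names and bodies are illustrative:

```python
def resolve(rule_defs):
    """Include-style import resolution sketch: rule_defs is the flattened
    list of (name, body) pairs in textual order, with each import statement
    already replaced by the imported grammar's rules. The last occurrence
    of a name supersedes earlier ones and determines its final position."""
    bodies = {}
    order = []
    for name, body in rule_defs:
        if name in bodies:
            order.remove(name)   # redefinition: drop the earlier slot
        bodies[name] = body
        order.append(name)
    return [(n, bodies[n]) for n in order]

# Hypothetical main grammar that imports a common lexer, then overrides ID
# after the import statement:
flattened = [
    ("IMPORTED", "'imported'"),   # from the imported grammar
    ("ID", "[a-z]+"),             # from the imported grammar
    ("ID", "[a-zA-Z_]+"),         # redefinition in the main grammar
]
print(resolve(flattened))
# [('IMPORTED', "'imported'"), ('ID', '[a-zA-Z_]+')]
```

The main grammar's ID wins and sits after IMPORTED, so the keyword keeps precedence.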
Hmm, tokenVocab merging is tricky and must be done differently, or should I say: prohibited? If token values from imported grammars can override token values in the main tokenVocab, then we may break rules before the import statement that use the original token value. In such a case ANTLR4 should at least issue a warning that a token value is being shadowed and might produce unexpected results, if not prohibit such cases entirely. If not prohibited, a possible solution would be to place the import before any rule that uses a token value which is being shadowed/replaced.
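The shadowing hazard can be made concrete with a toy merge of name-to-value maps (a sketch under assumed semantics; `merge_token_vocab` and the token names are hypothetical, not ANTLR API):

```python
def merge_token_vocab(current, imported):
    """Merge an imported tokenVocab (name -> numeric token type) into the
    current one, collecting a warning whenever an existing value is
    remapped -- rules compiled before the import may rely on the old value."""
    warnings = []
    for name, value in imported.items():
        if name in current and current[name] != value:
            warnings.append(f"token {name} remapped {current[name]} -> {value}")
        current[name] = value
    return current, warnings

vocab = {"ID": 1, "INT": 2}
merged, warns = merge_token_vocab(vocab, {"ID": 7, "KW": 3})
print(merged)  # {'ID': 7, 'INT': 2, 'KW': 3}
print(warns)   # ['token ID remapped 1 -> 7']
```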
I like Mike's suggestion, which doesn't require any new keyword. It's still a breaking change, though, because currently imports at the beginning of a grammar behave as if they were at the end.
A non-breaking change would be to add an option instead.
Right, so I'm wondering whether we could add a new grammar option to enable the new import semantics. At a later point we can make it default to true, and even later remove it once we can be sure most existing grammars have been migrated. Though I'm not sure that effort is really necessary, given that most grammars do not use imports.
The Antlr book is pretty clear that rules in the main grammar override imported rules.
All options in the imported file are currently ignored - when you include an options statement in one, you get a warning:

warning(109): RexxParser.g4:3:0: options ignored in imported grammar RexxParser
Agreed, what's less clear is how to deal with the tokens from tokenVocab no longer being defined.
Ah, so this seems to be what I suggested at the top: ignore it.
Yeah, I can check that part out once we figure out the precedence issue. I just realized that we really need two different concepts: one for importing parser rules and one for importing lexer rules. For parsers, it's obvious to me that, like inheritance, we want to override imported parser rules. For lexers, however, we not only want to replace rules but also need a way to specify order, because introducing a new lexer rule can affect recognition of others, unlike in the parser.

BUT, I think we have a catch-22: I can define a situation where imported tokens should be given precedence and another where they should not. In Eric's case, he is importing keyword tokens that should be given precedence over the identifier rule in the main lexer. But what about the case where the imported grammar has the identifier rule and we just want to add some more keywords in the main lexer? In that situation, we want it to behave as it does now, putting the imported lexical rules at the bottom of the grammar.

OK, I think you've convinced me that we have a situation we cannot handle, which might or might not be the more common case (importing keywords). I can also see wanting to pull in common lexical rules like numbers and identifiers, which is no doubt what I was thinking when I defined the import mechanism. So, simply flipping the behavior breaks the other case, which is also unsatisfying. I have not been paying attention for a decade, really... Does this case come up very often? Are we trying to solve a problem that's just not that common?
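The two orderings can be simulated with a toy maximal-munch tokenizer (a sketch, not ANTLR's actual matcher; it models the relevant behavior that the longest match wins and ties go to the rule defined first):

```python
import re

def tokenize(text, rules):
    """Maximal munch over (name, regex) rules: longest match wins,
    ties broken by rule order -- the ordering-sensitivity at issue."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best = None
        for name, pattern in rules:
            m = re.match(pattern, text[pos:])
            # Strictly longer only, so earlier rules win ties.
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())
        tokens.append(best[0])
        pos += len(best[1])
    return tokens

# Current behavior: imported rules land after the main grammar's rules,
# so IDENTIFIER shadows the equally long keyword match.
current  = [("IDENTIFIER", r"[a-z]+"), ("IMPORTED", r"imported")]
# Proposed behavior: the imported keyword takes precedence.
proposed = [("IMPORTED", r"imported"), ("IDENTIFIER", r"[a-z]+")]

print(tokenize("imported stuff", current))   # ['IDENTIFIER', 'IDENTIFIER']
print(tokenize("imported stuff", proposed))  # ['IMPORTED', 'IDENTIFIER']
```

Swap the roles (identifier imported, keywords local) and the preferences invert, which is exactly the catch-22 described above.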
If it's true that it's a common problem (which I'm hoping people from the mailing list will see after the holidays and jump in here), then we need something like @mike-lischke's solution where we get to specify the location. Rather than break the existing mechanism, it would make sense to deprecate it for lexers and introduce a location-specific variant, which would mean enhancing the grammar to allow imports at arbitrary positions. Let's wait to hear from others about just how serious this problem is. Obviously the current mechanism doesn't handle both cases, but how often do people really do lexer imports? No idea actually, haha. Can't they just copy and paste lexer rules so we can avoid breaking stuff?
I guess the problem is common enough to trigger this conversation? Imho, splitting large lexer grammars into smaller reusable segments is good practice for quality and readability, and by supporting local imports, we (ANTLR) could then provide a set of reusable constructs: integer, float, string, multi-line comment, whitespace and identifier being some that immediately come to mind. FWIW, my workaround this time has been to move the ID rule to the end of the imported lexer, which goes against my need (isolating some reusable keywords) and stinks like a hack. Forget about my PR, which is definitely breaking and goes against the original intent of imports. I'd suggest creating an option, lexerImportMode, accepting the values 'prepend', 'append' (the default) and 'inline'.
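The proposed lexerImportMode option (purely hypothetical at this point; the function and rule names below are illustrative) would amount to three ways of splicing the imported rule list into the main one:

```python
def apply_import(main_rules, imported_rules, mode, at_index=None):
    """Sketch of the proposed lexerImportMode values:
    'append'  -- today's behavior, imported rules go last;
    'prepend' -- imported rules take precedence over all main rules;
    'inline'  -- imported rules land where the import statement sits."""
    if mode == "append":
        return main_rules + imported_rules
    if mode == "prepend":
        return imported_rules + main_rules
    if mode == "inline":
        return main_rules[:at_index] + imported_rules + main_rules[at_index:]
    raise ValueError(f"unknown lexerImportMode: {mode}")

main = ["KW_IF", "IDENTIFIER"]
imported = ["KW_IMPORTED"]
print(apply_import(main, imported, "append"))     # ['KW_IF', 'IDENTIFIER', 'KW_IMPORTED']
print(apply_import(main, imported, "prepend"))    # ['KW_IMPORTED', 'KW_IF', 'IDENTIFIER']
print(apply_import(main, imported, "inline", 1))  # ['KW_IF', 'KW_IMPORTED', 'IDENTIFIER']
```

Only 'inline' lets a grammar put imported keywords above its own identifier rule while keeping other imports below it.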
Well, it only takes one to trigger it ;) For now, I think my preference is to leave things as-is and ask users to adapt.
Example.g4
Example2.g4
Running:
Moving the UNKNOWN rule into Example2:
Results in the expected output: