Lexical specification? #3
Comments
I believe the plan is to reuse the lexer from rustc (`librustc_lexer`).
You can bypass the issues with a lexical specification by using a scannerless grammar. Sadly, any correct and complete Rust lexical grammar or scannerless grammar is naturally context-sensitive. @eternaleye has previously suggested using RCGs ("range concatenation grammars") for that purpose, and has even given an example of a raw-string grammar (sadly I don't have a link on hand).
I think librustc_lexer has reached the state where it makes sense to provide a specification for it (i.e., it's not rustc-specific anymore). I think doing the actual spec work is the only thing that's left here; there are no other blockers. I wouldn't do the spec work myself, but I'll be happy to share my thoughts about it 😜

**What to spec?**

The minimal lexer API is

```rust
struct Token {
    kind: TokenKind, // field-less enum
    len: usize,      // len > 0;
    // sum of all lengths is equal to the input text length;
    // each partial sum is a char_boundary;
}
```

That is, a sequence of Unicode code points is either broken up into segments, or it contains an (unspecified) lexer error. The fact that the encoding is UTF-8 belongs to the spec layer below.
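To make those invariants concrete, here's a quick sketch (illustrative only, using the `Token` type above) of checking a token stream against them:

```rust
// Illustrative only: verify the segmentation invariants stated above.
fn check_invariants(text: &str, tokens: &[Token]) {
    let mut offset = 0;
    for token in tokens {
        // Every token is non-empty.
        assert!(token.len > 0);
        offset += token.len;
        // Each partial sum of lengths lands on a char boundary,
        // so no token ever splits a code point.
        assert!(text.is_char_boundary(offset));
    }
    // The lengths sum to the input length: no gaps, no overlap.
    assert_eq!(offset, text.len());
}
```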
**What are the tokens?**

Some more words on split vs join: because the lexer produces whitespace tokens, the split model comes for free. Really, the join model in rustc is wrong, and we should just fix it: rust-lang/rust#63689

**Tests**

Each test is a pair of text files: an input file and a file with the expected tokens. It's important to provide a minimal testsuite. That is, using code-coverage tools, provide a small set of test cases that hits all the branches of rustc_lexer.

**Declarative impl**

It should be pretty easy to provide an obviously-correct declarative implementation to check the hand-written one against.

**Organization**

It would be best if the tests, the declarative impl, and the harness that runs rustc_lexer and the declarative impl against each other all lived together in one repository.

**Future plans**

It would be sweet to replace the hand-crafted rustc_lexer with a generated state machine!
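To make the testing setup above concrete, here's a sketch of what such a harness could look like (all function names are placeholders, and the `.tokens` format shown is hypothetical):

```rust
use std::path::Path;

// Placeholder entry points for the two implementations under test.
fn declarative_lex(_text: &str) -> Vec<Token> { todo!() }
fn rustc_lexer_tokens(_text: &str) -> Vec<Token> { todo!() }

// Render tokens in a hypothetical `kind len` line format for `.tokens` files;
// assumes `TokenKind: Debug`.
fn render(tokens: &[Token]) -> String {
    tokens.iter().map(|t| format!("{:?} {}\n", t.kind, t.len)).collect()
}

// Run one test case: both implementations must match the expected token file.
fn check_test_case(input: &Path) {
    let text = std::fs::read_to_string(input).unwrap();
    let expected = std::fs::read_to_string(input.with_extension("tokens")).unwrap();
    assert_eq!(render(&declarative_lex(&text)), expected);
    assert_eq!(render(&rustc_lexer_tokens(&text)), expected);
}
```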
The number token is an interesting case, because it has a lookahead quality to it, to prevent being greedy (e.g. `1..2` must lex as `1`, `..`, `2` rather than eating `1.` as a float). It was mentioned that it may have made sense to break integer literals into parts with a joint flag, the way punctuation is; perhaps it would make sense for the lexer specification to do the same? (Unfortunately, that means it would be different from the implementation in rustc_lexer.)

Anyway, here's an informal transcription of rustc_lexer as a PCRE regex:

```
(*UTF8)(?Axus)(?(DEFINE)
# begin unicode monkeypatch
(?P<Pattern_White_Space> [\t\n\v\f\r\x20\x85\x{200e}\x{200f}\x{2028}\x{2029}])
(?P<XID_Start> [_\p{L}\p{Nl}]) # approximate
(?P<XID_Continue> [\pL\p{Nl}\p{Mn}\p{Mc}\p{Nd}\p{Pc}]) # approximate
# end unicode monkeypatch
# begin shared internals
(?P<Single_Quoted_String>
(?(?=.') .'
| (?:(?!['/]|\n[^']|\z) (?>\\\\|.))* '?
)
)
(?P<Double_Quoted_String> (?>\\[\\"]|[^"]) "? )
(?P<Raw_Double_Quoted_String> (?P<hashes>\#*) .*? (?:(?P=hashes)"|\z) )
# end shared internals
(?P<LineComment> // [^\n]* )
(?P<BlockComment>
/\*
(?P<_BlockComment_Inner>
(?: (?!(?:/\*|\*/)) . )*
(?:(?P>BlockComment) (?P>_BlockComment_Inner))?
)
(?:\*/|\z)
)
(?P<Whitespace> (?P>Pattern_White_Space)+ )
(?P<Ident> (?P>XID_Start) (?P>XID_Continue)* )
  (?P<RawIdent> r\# (?P>Ident) )
(?P<Literal_Int>
(?: (?:0[bo])? [0-9]+ | 0x[_0-9a-fA-F]* )
(?! \. (?! \.|(?P>XID_Start) ) [_0-9]* )
(?! [eE] )
(?P>Ident)?
)
(?P<Literal_Float>
(?: (?:0[bo])? [0-9]+ | 0x[_0-9a-fA-F]* )
(?: \. (?! \.|(?P>XID_Start) ) )?
(?: [eE] [+-]? [_0-9]* )?
(?P>Ident)?
)
(?P<Literal_Char>
'
(?(?=.'|(?!(?P>XID_Start)|[0-9]))
(?(?=[^\\]')
. '
| (?P>Single_Quoted_String)
)
(?P>Ident)?
| . (?P>XID_Continue)* '
)
)
(?P<Literal_Byte> b' (?P>Single_Quoted_String) (?P>Ident)?)
(?P<Literal_Str> " (?P>Double_Quoted_String) (?P>Ident)?)
(?P<Literal_ByteStr> b" (?P>Double_Quoted_String) (?P>Ident)? )
(?P<Literal_RawStr> r (?:[#"]) (?P>Raw_Double_Quoted_String) (?P>Ident)? )
(?P<Literal_RawByteStr> br (?:[#"]) (?P>Raw_Double_Quoted_String) (?P>Ident)? )
(?P<Lifetime> ' (?:(?P>XID_Start)|[0-9]) (?P>XID_Continue)* (?!') )
(?P<Semi> ;)
(?P<Comma> ,)
(?P<Dot> \.)
(?P<OpenParen> \()
(?P<CloseParen> \))
(?P<OpenBrace> \{)
(?P<CloseBrace> \})
(?P<OpenBracket> \[)
(?P<CloseBracket> \])
(?P<At> @)
(?P<Pound> \#)
(?P<Tilde> ~)
(?P<Question> \?)
(?P<Colon> :)
(?P<Dollar> \$)
(?P<Eq> =)
(?P<Not> !)
(?P<Lt> <)
(?P<Gt> >)
(?P<Minus> -)
(?P<And> &)
(?P<Or> \|)
(?P<Plus> \+)
(?P<Star> \*)
(?P<Slash> /)
(?P<Caret> \^)
(?P<Percent> %)
  (?P<Unknown> .)
)
# And the actual lexer
(?> (?P<line_comment>(?P>LineComment))
| (?P<block_comment>(?P>BlockComment))
| (?P<whitespace>(?P>Whitespace))
| (?P<ident>(?P>Ident))
| (?P<raw_ident>(?P>RawIdent))
| (?P<literal_int>(?P>Literal_Int))
| (?P<literal_float>(?P>Literal_Float))
| (?P<literal_char>(?P>Literal_Char))
| (?P<literal_str>(?P>Literal_Str))
| (?P<literal_bytestr>(?P>Literal_ByteStr))
| (?P<literal_rawstr>(?P>Literal_RawStr))
| (?P<literal_rawbytestr>(?P>Literal_RawByteStr))
| (?P<lifetime>(?P>Lifetime))
| (?P<semi>(?P>Semi))
| (?P<comma>(?P>Comma))
| (?P<openparen>(?P>OpenParen))
| (?P<closeparen>(?P>CloseParen))
| (?P<openbrace>(?P>OpenBrace))
| (?P<openbracket>(?P>OpenBracket))
| (?P<closebracket>(?P>CloseBracket))
| (?P<at>(?P>At))
| (?P<pound>(?P>Pound))
| (?P<tilde>(?P>Tilde))
| (?P<question>(?P>Question))
| (?P<colon>(?P>Colon))
| (?P<dollar>(?P>Dollar))
| (?P<eq>(?P>Eq))
| (?P<not>(?P>Not))
| (?P<lt>(?P>Lt))
| (?P<gt>(?P>Gt))
| (?P<minus>(?P>Minus))
| (?P<and>(?P>And))
| (?P<or>(?P>Or))
| (?P<plus>(?P>Plus))
| (?P<star>(?P>Star))
| (?P<slash>(?P>Slash))
| (?P<caret>(?P>Caret))
| (?P<percent>(?P>Percent))
| (?P<unknown>(?P>Unknown))
)
```

(I am sorry: this is a monster of a regex. I have put every effort into porting it correctly, but because it's so stupidly complex, I can't guarantee that it's correct.)
@eddyb well, that uses PCRE features, so the regex crate (rightly) refuses to compile it. Regex101 supports PCRE regexes, and if you click on the link, you'll see that the test case of rustc_lexer is there, being lexed correctly (except for a missing call to the close-brace pattern... whoops).
Oh, I see, you're using named captures (`(?P<…>`).
(Named captures aren't the issue; recursion and backreferences are.)

With that, I have a much more reasonable regex to propose as the lexical specification, as described by @matklad. This is a regular regex, accepted by the regex crate. It does make some simplifying assumptions over rustc_lexer:

```
(?x)
(?P<line_comment> // [^\n]* )
| (?P<block_comment> /\* ) # parse_block_comment
| (?P<whitespace> \p{Pattern_White_Space}+ )
| (?P<ident> (?:r\#)? \p{XID_Start} \p{XID_Continue}* )
| (?P<lifetime> ' \p{XID_Start} \p{XID_Continue}* )
| (?P<binary_int> 0b [01]+ )
| (?P<octal_int> 0o [0-7]+ )
| (?P<decimal_int> [0-9]+ )
| (?P<hexadecimal_int> 0x [0-9a-fA-F]+ )
| (?P<character> ' (?: [^\\'] | \\ ['"] | \\ [nrt\\0] | \\ x [0-7][0-9a-fA-F] | \\ u \{[0-9a-fA-F]{1,6}} ) ' )
| (?P<string> " (?: [^\\"] | \\ ['"] | \\ [nrt\\0] | \\ x [0-7][0-9a-fA-F] | \\ u \{[0-9a-fA-F]{1,6}} )* " )
| (?P<raw_string> r \#* " ) # parse_raw_string
| (?P<byte> b ' (?: [^\\'] | \\ ['"] | \\ [nrt\\0] | \\ x [0-9a-fA-F]{2} ) ' )
| (?P<byte_string> b " (?: [^\\"] | \\ ['"] | \\ [nrt\\0] | \\ x [0-9a-fA-F]{2} ) " )
| (?P<raw_byte_string> b r \#* " ) # parse_raw_string
| (?P<punct> [;,.(){}\[\]@\#~?:$=!<>\-&|+*/^%] )
```
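To make the intended use concrete, here's a sketch (illustrative, not part of the proposal) of driving an abridged version of this regex with the regex crate, treating "no match at the current position" as a lexer error:

```rust
use regex::Regex;

fn main() {
    // Abridged to a few arms of the proposed regex; the full version works the same way.
    let spec = Regex::new(
        r"(?x)
          (?P<line_comment> // [^\n]* )
        | (?P<whitespace> \p{Pattern_White_Space}+ )
        | (?P<ident> (?:r\#)? \p{XID_Start} \p{XID_Continue}* )
        | (?P<decimal_int> [0-9]+ )
        | (?P<punct> [;,.(){}\[\]@\#~?:$=!<>\-&|+*/^%] )",
    )
    .unwrap();

    let text = "let x = 42; // the answer";
    let mut pos = 0;
    while pos < text.len() {
        // The spec regex must match at exactly the current position;
        // a gap before the match, or no match at all, is a lexer error.
        match spec.captures(&text[pos..]) {
            Some(caps) if caps.get(0).unwrap().start() == 0 => {
                let m = caps.get(0).unwrap();
                // The arm that matched is the one named group that participated.
                let kind = spec
                    .capture_names()
                    .flatten()
                    .find(|name| caps.name(name).is_some())
                    .unwrap();
                println!("{:>12}: {:?}", kind, m.as_str());
                pos += m.end();
            }
            _ => {
                println!("lexer error at byte {pos}");
                break;
            }
        }
    }
}
```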
Further notes on this regex to transform it into a lexical specification:

(I would write the specification function matchers in pseudocode, LaTeX algorithm/algorithmic style, but we all speak Rust here.)

If this looks appropriate as the lowest-level lexer specification, I can and am willing to do the work to pull it together. The obvious downside is that this would degrade errors 🙁, because the specification lexer should reject all malformed tokens, while a useful lexer should accept a superset of well-formed tokens and just make sure to error on the malformed ones. (For example, single-quote strings and digit lifetimes are accepted by rustc_lexer but rejected here.)

(Edit: I haven't actually specifically tested what characters are allowed in a character literal; the spec above allows anything other than an unescaped backslash or quote.)

Edit 2: this isn't quite perfect. What breaks it? So who's wrong here, rustc or the proposed spec? Or should this be an error in the "lexer cooking" phase? (What was the behavior previously?)
This is very much a benefit, and I'd even say a requirement for a spec lexer. The prop that we want to check is:

```rust
forall text: &str.
match (speck_lexer(text), rustc_lexer(text)) {
    (Ok(speck_tokens), Ok(rustc_tokens)) => speck_tokens == rustc_tokens,
(Err(()), Err(_)) => true,
(Ok(_), Err(_)) | (Err(()), Ok(_)) => false
}
```
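For what it's worth, that property is directly expressible as a proptest; here's a sketch (the two lexer entry points and the `Token` type are stand-ins):

```rust
use proptest::prelude::*;

#[derive(Debug, PartialEq)]
struct Token { kind: u8, len: usize } // stand-in token type
struct LexError;

// Stand-ins for the spec lexer and a rustc_lexer-based lexer.
fn speck_lexer(_text: &str) -> Result<Vec<Token>, ()> { todo!() }
fn rustc_lexer_tokens(_text: &str) -> Result<Vec<Token>, LexError> { todo!() }

proptest! {
    // Differential test: both lexers must agree on every input string.
    #[test]
    fn spec_agrees_with_rustc(text in "\\PC*") {
        let agree = match (speck_lexer(&text), rustc_lexer_tokens(&text)) {
            (Ok(s), Ok(r)) => s == r,
            (Err(()), Err(_)) => true,
            _ => false,
        };
        prop_assert!(agree);
    }
}
```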
Curious, do we have tests for this in rust-lang/rust? If not, yay for speck finding uncovered shady areas!
There are some tests for this corner case: https://github.com/rust-lang/rust/blob/master/src/test/ui/parser/lex-bad-numeric-literals.rs

But the problem is, all of these cases do error, but that's because they're invalid code, not necessarily because they're invalid tokens.

Digging into this further, I've set up a simple proc macro to let us see the token stream rustc produces:

```rust
// lib.proc-macro = true
use proc_macro::TokenStream;

#[proc_macro]
pub fn tokenize(tokens: TokenStream) -> TokenStream {
panic!("TOKENS = {:#?}", tokens);
}
```

The fact that this panics is very important, as apparently rustc emits what look like lexer errors for bad number literals (e.g. `0o1.0`) but still hands the macro a token stream:

```rust
tokenize!(
0o1.0
0o2f32
0o3.0f32
0o4e4
0o5.0e5
0o6e6f32
0o7.0e7f64
0x8.0e+9
0x9.0e-9
0o
1e+
0x539.0
9900000000000000000000000000999999999999999999999999999999
9900000000000000000000000000999999999999999999999999999999
0x
0xu32
0ou32
0bu32
0b
0o123f64
0o123.456
0b101f64
0b111.101
);
```

(spans snipped, errors manually compressed)

So there are a few philosophical questions here:
Now I think any lexer specification is going to need to be three parts:
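Roughly, the shape I have in mind looks something like this sketch (all names are mine and purely illustrative, building on the raw-lexing and cooking stages discussed below):

```rust
// Illustrative shape only, not a draft of the actual spec.
enum RawKind { Ident, Whitespace, Unknown /* … */ }
enum CookedKind { Ident, Whitespace /* validated & joined kinds … */ }
struct RawToken { kind: RawKind, len: usize }
struct CookedToken { kind: CookedKind, len: usize }
struct LexError;

// 1. Raw lexing segments the input and never fails;
//    unrecognized characters just become `Unknown` tokens.
fn raw_lex(_text: &str) -> Vec<RawToken> { todo!() }

// 2. Cooking validates and joins raw tokens (escapes, numeric suffixes, …)
//    and is where malformed tokens are rejected.
fn cook(_text: &str, _raw: Vec<RawToken>) -> Result<Vec<CookedToken>, LexError> { todo!() }

// 3. The grammar then consumes the cooked tokens as its terminals.
```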
I'm going to make a rough draft of a specification matching this three-step process, as well as a canonical implementation to test against.

Edit: whoops, already found an ICE! rust-lang/rust#70787
I'd avoid that in the final spec, as a matter of making the lexical specification more readable (and so that it gives references for the gluing of the lexical spec into the overall grammar).
Hmm, are underscores not handled here -- is this part of the cooking?
Some denotational semantics perhaps ("function matchers") for the spec (using Rust for conversation is all fine though)? Using Rust as part of the meta-language for specification makes things a bit loopy, which is probably sub-optimal. Also, doc comments seem to be interpreted as comments in your lexical spec?
That, honestly, was me overlooking them. (Underscores are valid in number literals, e.g. `1_000`.)
This is how I've been going all over the place, so here are some concrete questions for the other involved parties:
Having mulled on it a little bit more, I think that these cases where rustc is trying to continue after already emitting an error diagnostic should be considered non-standard attempts to provide better error messages, and specified as errors.
:)
This seems like the simpler choice to me in terms of avoiding logic duplication. In general, I'm in favor of the "more languages / IRs and compositions between them"-approach for better separation of concerns.
I think we do try to recover, but I'm not sure if we do so on malformed token streams specifically. You may find rust-lang/rust#70074 relevant.
Recovering + erroring vs. aborting immediately on the first error can be thought of as the same thing from the POV of "is this a valid program?", so I think it's not "non-standard", but also not relevant to the spec. (In general, we can think of compilation as being under some error monad.)
It's a very raw draft, but here's a draft of a "raw lexer" specification. Based on experimentation, it seems like the "tricky parts" that require cooking are:
The stages I'm currently working with are:
The lexical specification draft now lives in https://github.com/CAD97/rust-lexical-spec. I'd appreciate any more eyes on it while it's still a draft; I'll probably PR it to this repository once I've got a reference impl for the whole path and it tests against rustc_lexer.
What do you think of using pest, "The Elegant Parser", and a parsing expression grammar (PEG)?
Should the group produce a separate lexical specification?
I think it would be useful, but I have no idea what it would look like.
There are some complexities, like the token splitting rustc performs (`<<` into `<` `<`), that I don't know how to express in a formal grammar. Also, weak keywords. AIUI, raw strings also introduce complexities.

I would like to hear if others think this would be useful, how it would work, what the complications might be, etc.
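For example (illustration mine), the same two characters must sometimes be one token and sometimes two:

```rust
use std::ops::Not;

fn main() {
    // Here `<<` is a single shift-operator token…
    let _shift = 1u8 << 2;

    // …but here the lexed `<<` must split into `<` `<`
    // (and likewise `>>` into `>` `>`) for the types to parse:
    let _nested: Vec<<u8 as Not>::Output> = Vec::new();
    let _matrix: Vec<Vec<u8>> = Vec::new();
}
```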
Also relevant: https://internals.rust-lang.org/t/pre-pre-rfc-canonical-lexer-specification/4099