RFC: Reduce lexer to a regular language #2755
Labels
A-frontend: Area: Compiler frontend (errors, parsing and HIR)
A-grammar: Area: The grammar of Rust
A-syntaxext: Area: Syntax extensions
C-cleanup: Category: PRs that clean code up or issues documenting cleanup.
E-easy: Call for participation: Easy difficulty. Experience needed to fix: not much. Good first issue.
The lexer presently affords one real and one planned form of recursive token. These mean that our "tokens" are not actually describable by a regular language. We discussed this at some length on IRC today and came up with solutions for both cases, so I would like to reduce the lexer back down to "just regular".
The cases are:
The first case is nested block comments: a block comment starts with `/*` and then consumes a balanced set of possibly-nested `/*` and `*/` pairs. These exist for only one reason, which is to be able to comment out a region of a file that already contains a comment. The solution we arrived at is to differentiate the problem of "commenting for the sake of writing some non-Rust text like docs or such" from "commenting in order to disable code". For the former case, we'll maintain non-balanced block comments (described by a shortest-match regexp), and for the latter case we'll introduce a syntax extension called `#ignore(...)` that just discards its token tree (including any block comments, which are then just single tokens). The corner case is that you won't be able to comment out blocks that contain mixtures of both other block comments and random non-token lexemes, but that's far less common and (imo) worth sacrificing.
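For concreteness, here is a minimal sketch, not from the issue itself, of what a shortest-match block comment looks like, using Rust's `regex` crate as a stand-in for whatever engine the lexer would actually use. The comment simply ends at the first `*/`, so no nesting count is needed and the rule stays regular:

```rust
use regex::Regex;

fn main() {
    // `(?s)` lets `.` match newlines; `.*?` is the shortest (lazy) match,
    // so the comment ends at the first `*/` with no nesting.
    let block_comment = Regex::new(r"(?s)/\*.*?\*/").unwrap();

    let src = "/* outer /* inner */ still code here */";
    let m = block_comment.find(src).unwrap();

    // The match stops at the first `*/`; the rest of the line is left
    // for the lexer to re-scan as ordinary tokens.
    assert_eq!(m.as_str(), "/* outer /* inner */");
}
```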
The second case is the planned `q{...}` string-quoting form, which would consume balanced `{` and `}` pairs the same way token-tree brackets do. Thinking about this in the cold light of the question "is it enough of a feature to require the lexer to be non-regular?", though, I have to say no. Python-like raw strings are probably adequate -- or possibly `q{...}` quotes without automatic balancing -- and there's nothing really stopping a syntax extension from picking apart a string-literal token provided this way. I no longer think it's worth the complexity cost.
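Similarly, a `q{...}` form without automatic balancing is just another shortest-match rule. The sketch below (same assumptions as above) shows that the inner brackets carry no structure at the lexer level, which is exactly what keeps the token regular:

```rust
use regex::Regex;

fn main() {
    // Shortest match: the literal ends at the first `}`, so the inner
    // `{` is not tracked and no bracket-counting machinery is needed.
    let q_string = Regex::new(r"(?s)q\{.*?\}").unwrap();

    let src = "q{ not { balanced } trailing }";
    let m = q_string.find(src).unwrap();

    // Ends at the first `}`, ignoring both the inner `{` and the later `}`.
    assert_eq!(m.as_str(), "q{ not { balanced }");
}
```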
So given that, it should take only a couple of patches to the lexer to get it back under the "regular" threshold, and possibly at that point we could drop in an actual regexp definition of our tokens (binding to an existing re engine, or writing our own, I don't care; it should be a linear-time one in any case, something like http://code.google.com/p/re2/ or a clone, if you feel like doing the exercise in Rust).
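For a rough sense of that endpoint, here is a small table-driven sketch (again using the `regex` crate, which, like RE2, guarantees linear-time matching via finite automata) in which every token class is a single anchored regexp tried at the current position. The token names and rule set are illustrative, not a proposal:

```rust
use regex::Regex;

fn main() {
    // One anchored regexp per token class, tried in order at the current
    // position; the comment rule is the shortest-match form from above.
    let rules: Vec<(&str, Regex)> = vec![
        ("COMMENT", Regex::new(r"^(?s)/\*.*?\*/").unwrap()),
        ("IDENT",   Regex::new(r"^[A-Za-z_][A-Za-z0-9_]*").unwrap()),
        ("NUMBER",  Regex::new(r"^[0-9]+").unwrap()),
        ("OP",      Regex::new(r"^[=+\-*/;]").unwrap()),
        ("WS",      Regex::new(r"^\s+").unwrap()),
    ];

    let mut rest = "let x = 42; /* the answer */";
    while !rest.is_empty() {
        // First rule that matches at the current position wins.
        let (name, m) = rules
            .iter()
            .find_map(|(name, re)| re.find(rest).map(|m| (*name, m)))
            .expect("no rule matched: lexical error");
        if name != "WS" {
            println!("{:<7} {:?}", name, m.as_str());
        }
        rest = &rest[m.end()..];
    }
}
```

A real implementation would presumably compile all the alternatives into one automaton rather than trying each pattern separately, but the token definitions themselves would be exactly such regular expressions.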