Started with the new lexer implementation #432

Closed
wants to merge 2 commits into from

Conversation

Razican
Member

@Razican Razican commented May 31, 2020

This Pull Request fixes #294.

It changes the following:

  • The lexer can now be created with anything that implements Read. Ideally, we should use a Cursor<String> when reading input from the user (for the console, for example), or a buffered reader when reading from files.
  • Adds stream lexing, which can be used with code streams that are not yet completely available.
  • Adds stream parsing. This means that the parser does not need to wait for all the tokens to be lexed before parsing starts.
  • Adds goal symbols, which means we can correctly tell a division operator / apart from a regular expression literal starting with /.
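
To illustrate the first two points, here is a minimal sketch of a lexer that is generic over Read and hands out tokens lazily through Iterator, so a consumer can start work before the whole input has been read. The Token type, the Lexer layout, and the lex_str helper are all hypothetical and heavily simplified; they are not the types used in this PR.

```rust
use std::io::{self, Bytes, Cursor, Read};
use std::iter::Peekable;

/// Hypothetical, heavily simplified token type (not the real `Token`).
#[derive(Debug, PartialEq)]
enum Token {
    Number(u64),
    Punct(char),
}

/// Sketch of a lexer that pulls bytes on demand from any `Read` source.
struct Lexer<R: Read> {
    bytes: Peekable<Bytes<R>>,
}

impl<R: Read> Lexer<R> {
    fn new(reader: R) -> Self {
        Self { bytes: reader.bytes().peekable() }
    }
}

impl<R: Read> Iterator for Lexer<R> {
    type Item = io::Result<Token>;

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            // `?` ends the iteration once the reader is exhausted.
            let byte = match self.bytes.next()? {
                Ok(b) => b,
                Err(e) => return Some(Err(e)),
            };
            match byte {
                b' ' | b'\t' | b'\n' | b'\r' => continue,
                b'0'..=b'9' => {
                    let mut value = u64::from(byte - b'0');
                    // Consume following digits; stop without consuming anything else.
                    // Overflow is not handled in this sketch.
                    while let Some(Ok(d)) = self.bytes.peek() {
                        if !d.is_ascii_digit() {
                            break;
                        }
                        value = value * 10 + u64::from(*d - b'0');
                        self.bytes.next();
                    }
                    return Some(Ok(Token::Number(value)));
                }
                other => return Some(Ok(Token::Punct(other as char))),
            }
        }
    }
}

/// Convenience helper for in-memory input such as a console line.
fn lex_str(src: &str) -> io::Result<Vec<Token>> {
    Lexer::new(Cursor::new(src.to_owned())).collect()
}
```

For file input, the same Lexer::new call would take a std::io::BufReader wrapped around the file instead of a Cursor, since reading byte by byte from an unbuffered file is slow.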

Note that this is still a WIP. So far I have only laid the groundwork with the new cursor for the lexer, but I wanted to open the PR now in order to get benchmarks soon and to receive feedback.

@Razican Razican added performance Performance related changes and issues parser Issues surrounding the parser lexer Issues surrounding the lexer labels May 31, 2020
@Razican Razican mentioned this pull request Jun 2, 2020
@Lan2u

Lan2u commented Jun 9, 2020

Have been looking through this / how the lexer used to work and I think I have a basic understanding of where stuff is. Is there an area within the new lexer that I could look into working on?

@Razican
Member Author

Razican commented Jun 10, 2020

> Have been looking through this / how the lexer used to work and I think I have a basic understanding of where stuff is. Is there an area within the new lexer that I could look into working on?

Basically, porting the old lexer to the new architecture would be nice. You can create PRs to this branch. If you find you need something new from the cursor, let me know, and I can add it.

I might have some time this week to finish the unimplemented functions in the cursor. Then we need to have some extra logic to use the goal symbols.
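
As a rough sketch of what that extra logic could look like: the parser tells the lexer which goal symbol applies, and the lexer uses it to decide how to read a leading /. InputElementDiv and InputElementRegExp are goal symbols from the ECMAScript specification; the InputElement enum, SlashToken, and lex_slash below are hypothetical names used only for illustration, and character classes such as [/] are deliberately not handled.

```rust
/// The two lexical goals relevant to `/` (after the ECMAScript spec's
/// InputElementDiv and InputElementRegExp). Hypothetical enum.
#[derive(Debug, Clone, Copy, PartialEq)]
enum InputElement {
    Div,
    RegExp,
}

/// Hypothetical result type for a token that started with `/`.
#[derive(Debug, PartialEq)]
enum SlashToken {
    Div,
    DivAssign,
    RegExpBody(String),
}

/// Lexes the text that follows a leading `/`, according to the goal symbol.
/// Returns `None` for an unterminated regular expression literal.
fn lex_slash(rest: &str, goal: InputElement) -> Option<SlashToken> {
    match goal {
        InputElement::Div => Some(if rest.starts_with('=') {
            SlashToken::DivAssign
        } else {
            SlashToken::Div
        }),
        InputElement::RegExp => {
            let mut body = String::new();
            let mut chars = rest.chars();
            while let Some(c) = chars.next() {
                match c {
                    // An unescaped `/` closes the literal.
                    '/' => return Some(SlashToken::RegExpBody(body)),
                    // Keep the escape and whatever character it escapes.
                    '\\' => {
                        body.push(c);
                        body.push(chars.next()?);
                    }
                    // A regex literal cannot span lines.
                    '\n' | '\r' | '\u{2028}' | '\u{2029}' => return None,
                    _ => body.push(c),
                }
            }
            None
        }
    }
}
```

The same input therefore lexes differently depending on the goal: after an operand the parser would ask for InputElement::Div, while at the start of an expression it would ask for InputElement::RegExp.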

@Lan2u

Lan2u commented Jun 10, 2020

I have started by working my way through the old lex() function and moving the code across for each of the token types.

Moved across so far (if something wasn't implemented before, I haven't implemented it yet; TODOs etc. remain):

  • String Literal (already done)
  • Template Literal (in progress, as of 11/06/2020)
  • Numerical Literal (in progress, as of 11/06/2020)
  • Single line comment
  • Keyword/Identifier
  • Punctuation
  • Operators
  • Misc
  • Regex (in progress, as of 12/06/2020)

@Lan2u

Lan2u commented Jun 10, 2020

When it comes to matching the start of a token, it would be nice to keep the characters matched on in the same file that the lexing of that token is done in. For example, in:

    let token = match next_chr {
        '\r' | '\n' | '\u{2028}' | '\u{2029}' => Ok(Token::new(
            TokenKind::LineTerminator,
            Span::new(start, self.cursor.pos()),
        )),
        '"' | '\'' => StringLiteral::new(next_chr).lex(&mut self.cursor, start),
        TemplateLiteral::BEGIN_CHR => TemplateLiteral::new().lex(&mut self.cursor, start),
        _ => unimplemented!(),
    };

I think it would be cleaner to move the '"' | '\'' for StringLiteral into the string file. This could be done by having a Literal::BeginChr(c) which is called for each literal type until one returns true, indicating it can start lexing. This obviously might come with some performance hit so there might be a better way - macros? - Ideally something like in C:

    #define STRING_LITERAL_CHECKS '"' | '\''
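
A Rust counterpart to that define could be a macro_rules! macro that expands to a pattern, so each literal's start characters live beside its lexer while the dispatch stays a plain match with no runtime cost. This is only a sketch: the macro names, the Start enum, and classify are hypothetical, and the parenthesised or-patterns need Rust 1.53 or later.

```rust
/// Hypothetical pattern macro that would live in the string-literal file.
macro_rules! string_literal_start {
    () => {
        ('"' | '\'')
    };
}

/// Hypothetical pattern macro for ECMAScript line terminators.
macro_rules! line_terminator {
    () => {
        ('\r' | '\n' | '\u{2028}' | '\u{2029}')
    };
}

/// Which kind of token a start character introduces (illustrative only).
#[derive(Debug, PartialEq)]
enum Start {
    LineTerminator,
    StringLiteral,
    TemplateLiteral,
    Other,
}

/// Dispatches on the first character using the pattern macros above.
fn classify(next_chr: char) -> Start {
    match next_chr {
        line_terminator!() => Start::LineTerminator,
        string_literal_start!() => Start::StringLiteral,
        '`' => Start::TemplateLiteral,
        _ => Start::Other,
    }
}
```

Because the macros expand at compile time, the generated match is the same as the hand-written one; the trade-off is that the patterns are now one step removed from the place where they are matched.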

Co-authored-by: Iban Eguia <razican@protonmail.ch>
@Lan2u

Lan2u commented Jun 11, 2020

I see the cursor gets ASCII bytes - what if Unicode is used?

@Razican
Member Author

Razican commented Jun 12, 2020

> This obviously might come with some performance hit so there might be a better way - macros? - Ideally something like in C:
> #define STRING_LITERAL_CHECKS '"' | '\''

I see that you created some macros in the PR. I think that's the way to go for now. We'll see whether this becomes too difficult to maintain, or can be improved, in the future.

> I see the cursor gets ASCII bytes - what if Unicode is used?

The cursor goes through bytes, regardless of whether they are ASCII or not. A wrapper then converts them to Unicode characters where needed.
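
A minimal sketch of such a wrapper, assuming the input is UTF-8: it reads one lead byte, works out from it how many bytes the character occupies, reads the rest, and validates the sequence. The next_char function is a hypothetical helper and not the cursor API in this branch.

```rust
use std::io::{self, Read};

/// Builds the error returned for malformed input.
fn invalid_utf8() -> io::Error {
    io::Error::new(io::ErrorKind::InvalidData, "invalid UTF-8 sequence")
}

/// Reads one Unicode scalar value from a byte-oriented reader.
/// Returns `Ok(None)` at end of input. Hypothetical helper.
fn next_char<R: Read>(reader: &mut R) -> io::Result<Option<char>> {
    let mut buf = [0u8; 4];

    // Read the lead byte; zero bytes read means end of input.
    if reader.read(&mut buf[..1])? == 0 {
        return Ok(None);
    }

    // The lead byte determines the length of the encoded character.
    let len = match buf[0] {
        0x00..=0x7F => 1,
        0xC0..=0xDF => 2,
        0xE0..=0xEF => 3,
        0xF0..=0xF7 => 4,
        _ => return Err(invalid_utf8()),
    };

    // Read the continuation bytes (none for ASCII).
    reader.read_exact(&mut buf[1..len])?;

    // `from_utf8` rejects overlong and otherwise malformed sequences.
    let s = std::str::from_utf8(&buf[..len]).map_err(|_| invalid_utf8())?;
    Ok(s.chars().next())
}
```

ASCII input takes the one-byte path, so the common case stays cheap, while multi-byte characters such as the U+2028 line separator are still decoded correctly.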

@Lan2u Lan2u mentioned this pull request Jun 12, 2020
@Lan2u

Lan2u commented Jun 12, 2020

@Razican is it possible to allow putting tokens back onto the cursor? It would be useful for handling cases like regex (or alternatively give the option to peek more than a single character ahead).

@Razican
Member Author

Razican commented Jun 12, 2020

> @Razican is it possible to allow putting tokens back onto the cursor? It would be useful for handling cases like regex (or alternatively give the option to peek more than a single character ahead).

Yep, we should be able to peek at most 4 characters ahead. Maybe I will have time during the weekend to implement that in the cursor.
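
One way a bounded lookahead like this could be sketched is with a small queue in front of the character source. PeekCursor, peek_nth, and next_char are hypothetical names, and the choice of a VecDeque is an assumption for illustration, not the design used in this branch.

```rust
use std::collections::VecDeque;

/// Maximum lookahead, following the "at most 4 characters" above.
const MAX_PEEK: usize = 4;

/// Hypothetical cursor with a small lookahead buffer.
struct PeekCursor<I: Iterator<Item = char>> {
    source: I,
    buffered: VecDeque<char>,
}

impl<I: Iterator<Item = char>> PeekCursor<I> {
    fn new(source: I) -> Self {
        Self {
            source,
            buffered: VecDeque::with_capacity(MAX_PEEK),
        }
    }

    /// Looks at the character `n` positions ahead (0 is the next one)
    /// without consuming anything. Returns `None` past the end of input.
    fn peek_nth(&mut self, n: usize) -> Option<char> {
        assert!(n < MAX_PEEK, "can peek at most {} characters ahead", MAX_PEEK);
        // Fill the buffer only as far as needed.
        while self.buffered.len() <= n {
            self.buffered.push_back(self.source.next()?);
        }
        self.buffered.get(n).copied()
    }

    /// Consumes the next character, draining the lookahead buffer first.
    fn next_char(&mut self) -> Option<char> {
        self.buffered.pop_front().or_else(|| self.source.next())
    }
}
```

With lookahead available, a lexer can inspect what follows a / before committing to a token, which avoids the need to put tokens back onto the cursor.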

@jasonwilliams
Member

Is this PR superseded by #486?

@Lan2u

Lan2u commented Jul 4, 2020

> Is this PR superseded by #486?

I think so, unless @Razican has local changes?

@Razican
Member Author

Razican commented Jul 4, 2020

I have no further local changes, we can close this :)

@Razican Razican closed this Jul 4, 2020
@Razican Razican deleted the new_lexer branch July 9, 2020 14:01
Successfully merging this pull request may close these issues.

The lexer doesn't take into account goal symbols