TASK: Implement alternative approach to lexical analysis #34

Draft
wants to merge 19 commits into base: main
Conversation

@grebaldi (Member) commented Aug 9, 2023

solves: #3, #33

This PR introduces a Lexer class that implements a very different approach to lexical analysis than the existing Tokenizer:

  • it uses generalized matchers (see the sketch after this list)
    • (most) matchers are stateless and can tell whether a given character at a given offset matches their specification
      • the only exception is Sequence, which uses state to switch between multiple subsequent matchers
  • it works on demand
    • parsers specify which token types they expect next; the lexer then differentiates only within that given set
  • it handles multi-byte characters
    • multi-byte characters are interpreted correctly and counted as one character each
    • this is achieved without the use of expensive mb_* functions
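
To make the matcher idea a bit more concrete, here is a minimal sketch of what a stateless matcher could look like; the interface and class names below are illustrative assumptions, not the actual API introduced in this PR.

```php
<?php

declare(strict_types=1);

// Sketch only: a generalized matcher is asked, character by character, whether
// a given character at a given offset still satisfies its specification.
interface Matcher
{
    public function matches(string $character, int $offset): bool;
}

// A stateless matcher: the decision depends only on the character and its
// offset, so a single instance can be reused for every lexer run.
final class NumberMatcher implements Matcher
{
    public function matches(string $character, int $offset): bool
    {
        return ctype_digit($character);
    }
}
```

Under this scheme a Sequence matcher cannot stay stateless, since it has to remember which of its sub-matchers is currently active, which is why it is called out as the exception above.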

I focused somewhat on memory efficiency and expect this implementation to use less memory than the Tokenizer.

Token types have also changed to a more rigid set. This will simplify some of the parser implementations later on.

As of right now, I'm not too sure about this approach and expect things to break once I turn to the parser implementations. I'm also not entirely sure whether I've missed something in the multi-byte character handling (it seems too easy to me 😅). It'll require more tests further down the line to be on the safe side. For now, all of this is just an experiment.
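
For context on the multi-byte handling: in UTF-8, the leading byte of every code point already encodes how many bytes the code point occupies, so characters can be counted and sliced with plain byte-oriented string functions. The snippet below is a rough sketch of that idea under the assumption of valid UTF-8 input; it is not the PR's actual code.

```php
<?php

declare(strict_types=1);

// Determine the byte length of the UTF-8 code point starting at $byteOffset
// by inspecting its leading byte (assumes well-formed UTF-8 input).
function utf8CharacterLengthAt(string $source, int $byteOffset): int
{
    $byte = ord($source[$byteOffset]);

    return match (true) {
        $byte < 0x80 => 1,            // 0xxxxxxx: ASCII
        ($byte & 0xE0) === 0xC0 => 2, // 110xxxxx: 2-byte sequence
        ($byte & 0xF0) === 0xE0 => 3, // 1110xxxx: 3-byte sequence
        default => 4,                 // 11110xxx: 4-byte sequence
    };
}

// Iterate one code point at a time without any mb_* calls; each multi-byte
// character counts as a single character.
$source = 'a😅b';
$characters = [];
for ($byteOffset = 0; $byteOffset < strlen($source);) {
    $length = utf8CharacterLengthAt($source, $byteOffset);
    $characters[] = substr($source, $byteOffset, $length);
    $byteOffset += $length;
}
// $characters === ['a', '😅', 'b'], i.e. three characters, not six bytes
```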

Remaining TODOs

  • Implement the new Lexer
  • Rework all parsers to use the new Lexer
  • Remove the Tokenizer and all other old concepts
  • Refactor the Lexer and increase test coverage
  • Parse negative integers both in expressions and integer literals
  • TemplateLiteral lines must have indentation greater than or equal to that of their TemplateLiteralNode

@mhsdesign (Contributor) commented Aug 9, 2023

Wait a second, isn't this similar to the philosophy of the new Fusion parser? ❤️

Edit: it seems more sophisticated ^^ would love to talk about this ^^

    $source,
    TokenTypes::from(
        TokenType::TEMPLATE_LITERAL_DELIMITER,
        TokenType::SPACE,
@mhsdesign (Contributor) commented:

Interesting. Because we allow the SPACE token here, it's not a TEMPLATE_LITERAL_CONTENT (but it probably should be in the real world ^^)

@grebaldi (Member, Author) commented:

Yep, the TemplateLiteralParser will read this differently :)
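
To spell the discussed behavior out, here is a hedged illustration of the on-demand idea: the same input gets split differently depending on which token types the parser requests. Only TokenTypes::from() and the token type names are taken from the diff above; the surrounding lexer call is an assumption, not the PR's actual API.

```php
// Hypothetical call shape; only TokenTypes::from() is taken from the diff above.
$tokens = $lexer->tokenize(
    $source,
    TokenTypes::from(
        TokenType::TEMPLATE_LITERAL_DELIMITER,
        TokenType::SPACE,
    )
);
// With SPACE in the expected set, whitespace after the delimiter comes out as
// a SPACE token instead of being folded into TEMPLATE_LITERAL_CONTENT, which
// is the observation made in the comment above.
```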

use PackageFactory\ComponentEngine\Parser\Source\Position;
use PackageFactory\ComponentEngine\Parser\Source\Range;

final class Buffer
@mhsdesign (Contributor) commented:

Hmm, I'm just wondering what the pros and cons of making this thing mutable are...

On the one hand, the lexer can expose it as a public readonly member, but methods like override and reset might always be smelly. Then again, this mutable buffer might be a performance optimization, as we don't need a new object every time.
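
To illustrate the trade-off being discussed: a mutable buffer that is reused between tokens saves one allocation per token, but its reset-style methods mutate state that a public readonly property would still expose (readonly in PHP only prevents reassigning the property, not mutating the object it points to). The class below is a rough sketch under these assumptions, not the Buffer from this PR.

```php
<?php

declare(strict_types=1);

// Sketch of a reusable, mutable buffer: contents accumulate per token and are
// cleared via reset() instead of allocating a fresh object for every token.
final class MutableBuffer
{
    private string $contents = '';

    public function append(string $character): void
    {
        $this->contents .= $character;
    }

    public function reset(): void
    {
        // Reuses the same instance for the next token; saves allocations at
        // the cost of observable mutation through any exposed reference.
        $this->contents = '';
    }

    public function contents(): string
    {
        return $this->contents;
    }
}
```

The immutable alternative would return a fresh buffer per token, which is easier to reason about but allocates once per token.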

Base automatically changed from task/30/split-parsing-logic to main September 27, 2023 15:19
Labels: None yet
Projects: None yet
Development: None yet

2 participants