TASK: Implement alternative approach to lexical analysis #34

Draft
wants to merge 19 commits into base: main
Conversation

@grebaldi (Member) commented Aug 9, 2023

solves: #3, #33

This PR introduces a Lexer class that implements a very different approach to lexical analysis than the existing Tokenizer:

  • it uses generalized matchers (see the sketch after this list)
    • (most) matchers are stateless and can tell whether a given character at a given offset matches their specification
      • the only exception is Sequence, which uses state to switch between multiple subsequent matchers
  • it works on demand
    • parsers specify which token types they expect next; the lexer then differentiates only within that given set
  • it handles multi-byte characters
    • multi-byte characters are interpreted correctly and counted as one character each
    • this is achieved without the use of expensive mb_* functions
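
To make the matcher idea a bit more concrete, here is a minimal sketch of what a stateless matcher could look like; the interface and class names below are illustrative assumptions, not the actual API introduced in this PR.

```php
<?php

declare(strict_types=1);

// Sketch only: a generalized matcher is asked, character by character, whether
// a given character at a given offset still satisfies its specification.
interface Matcher
{
    public function matches(string $character, int $offset): bool;
}

// A stateless matcher: the decision depends only on the character and its
// offset, so a single instance can be reused for every lexer run.
final class NumberMatcher implements Matcher
{
    public function matches(string $character, int $offset): bool
    {
        return ctype_digit($character);
    }
}
```

Under this scheme a Sequence matcher cannot stay stateless, since it has to remember which of its sub-matchers is currently active, which is why it is called out as the exception above.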

I focused somewhat on memory efficiency and expect this implementation to use less memory than the Tokenizer.

Token types have also changed to a more rigid set. This will simplify some of the parser implementations later on.

As of right now, I'm not too sure about this approach and expect things to break once I turn to the parser implementations. I'm also not entirely sure whether I've missed something in the multi-byte character handling (it seems too easy to me 😅). It'll require more tests further down the line to be on the safe side. For now, all of this is just an experiment.
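
For context on the multi-byte handling: in UTF-8, the leading byte of every code point already encodes how many bytes the code point occupies, so characters can be counted and sliced with plain byte-oriented string functions. The snippet below is a rough sketch of that idea under the assumption of valid UTF-8 input; it is not the PR's actual code.

```php
<?php

declare(strict_types=1);

// Determine the byte length of the UTF-8 code point starting at $byteOffset
// by inspecting its leading byte (assumes well-formed UTF-8 input).
function utf8CharacterLengthAt(string $source, int $byteOffset): int
{
    $byte = ord($source[$byteOffset]);

    return match (true) {
        $byte < 0x80 => 1,            // 0xxxxxxx: ASCII
        ($byte & 0xE0) === 0xC0 => 2, // 110xxxxx: 2-byte sequence
        ($byte & 0xF0) === 0xE0 => 3, // 1110xxxx: 3-byte sequence
        default => 4,                 // 11110xxx: 4-byte sequence
    };
}

// Iterate one code point at a time without any mb_* calls; each multi-byte
// character counts as a single character.
$source = 'a😅b';
$characters = [];
for ($byteOffset = 0; $byteOffset < strlen($source);) {
    $length = utf8CharacterLengthAt($source, $byteOffset);
    $characters[] = substr($source, $byteOffset, $length);
    $byteOffset += $length;
}
// $characters === ['a', '😅', 'b'], i.e. three characters, not six bytes
```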

Remaining TODOs

  • Implement the new Lexer
  • Rework all parsers to use the new Lexer
  • Remove the Tokenizer and all other old concepts
  • Refactor the Lexer and increase test coverage
  • Parse negative integers both in expressions and integer literals
  • TemplateLiteral lines must have indentation greater than or equal to that of their TemplateLiteralNode

@mhsdesign (Contributor) commented Aug 9, 2023

Wait a second, isn't this similar to the philosophy of the new Fusion parser? ❤️

Edit: it seems more sophisticated ^^ would love to talk about this ^^

    $source,
    TokenTypes::from(
        TokenType::TEMPLATE_LITERAL_DELIMITER,
        TokenType::SPACE,
@mhsdesign (Contributor) commented:

Interesting. Because we allow the SPACE token here, it's not a TEMPLATE_LITERAL_CONTENT (but it probably should be in the real world ^^)

@grebaldi (Member, Author) commented:

Yep, the TemplateLiteralParser will read this differently :)
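
To spell the discussed behavior out, here is a hedged illustration of the on-demand idea: the same input gets split differently depending on which token types the parser requests. Only TokenTypes::from() and the token type names are taken from the diff above; the surrounding lexer call is an assumption, not the PR's actual API.

```php
// Hypothetical call shape; only TokenTypes::from() is taken from the diff above.
$tokens = $lexer->tokenize(
    $source,
    TokenTypes::from(
        TokenType::TEMPLATE_LITERAL_DELIMITER,
        TokenType::SPACE,
    )
);
// With SPACE in the expected set, whitespace after the delimiter comes out as
// a SPACE token instead of being folded into TEMPLATE_LITERAL_CONTENT, which
// is the observation made in the comment above.
```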

use PackageFactory\ComponentEngine\Parser\Source\Position;
use PackageFactory\ComponentEngine\Parser\Source\Range;

final class Buffer
@mhsdesign (Contributor) commented:

Hmm, I'm just wondering what the pros and cons of making this thing mutable are...

On the one hand, the lexer can expose it as a public readonly member, but methods like override and reset might always be smelly. Then again, this mutable buffer might be a performance optimization, as we don't need a new object every time.
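
To illustrate the trade-off being discussed: a mutable buffer that is reused between tokens saves one allocation per token, but its reset-style methods mutate state that a public readonly property would still expose (readonly in PHP only prevents reassigning the property, not mutating the object it points to). The class below is a rough sketch under these assumptions, not the Buffer from this PR.

```php
<?php

declare(strict_types=1);

// Sketch of a reusable, mutable buffer: contents accumulate per token and are
// cleared via reset() instead of allocating a fresh object for every token.
final class MutableBuffer
{
    private string $contents = '';

    public function append(string $character): void
    {
        $this->contents .= $character;
    }

    public function reset(): void
    {
        // Reuses the same instance for the next token; saves allocations at
        // the cost of observable mutation through any exposed reference.
        $this->contents = '';
    }

    public function contents(): string
    {
        return $this->contents;
    }
}
```

The immutable alternative would return a fresh buffer per token, which is easier to reason about but allocates once per token.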

Base automatically changed from task/30/split-parsing-logic to main September 27, 2023 15:19
Labels: None yet
Projects: None yet
Development: None yet

2 participants