Improve the Tokenizer performance #147
I just added another commit that inlines the parsing of tags too (it is on the critical path as well, and easy to inline without duplication). The new benchmarks using the same script are:
Before:
After:
src/HTML5/Parser/Tokenizer.php
// Inline the parsing of characters as it's the critical performance path
// Parse tag
if ($this->scanner->current() === '<') {
You could move the `$tok = $this->scanner->current();` from line 144 to before the `if` statement to remove another `$this->scanner->current()` call.
I don't think I can, because the body of the `if` changes the scanner position, so I need to fetch the current character again after it. I may be wrong.
I don't think this is a problem, as line 132 (`$tok = $this->scanner->next();`) inside the `if` overrides the previously set `$tok` anyway.
Have a look inside the "or" calls: they do change the current position of the cursor :)
Ah :)
Still, you could move the call before the `if` and add another one at the end of the `if` clause to pick up the current scanner state. Then you'd have the same number of function calls for the "begin tag" case, but one less for all other cases.
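To make the call counts in this suggestion concrete, here is a minimal sketch. `CountingScanner`, `stepBefore` and `stepAfter` are hypothetical stand-ins written for this illustration, not the library's real `Scanner` or `Tokenizer` code, and the single `next()` call only stands in for the tag parsing that moves the cursor.

```php
<?php
// Hypothetical stand-in for the scanner: it only counts current() calls.
final class CountingScanner
{
    public $currentCalls = 0;
    private $data;
    private $pos = 0;

    public function __construct(string $data)
    {
        $this->data = $data;
    }

    public function current()
    {
        ++$this->currentCalls;
        return $this->pos < strlen($this->data) ? $this->data[$this->pos] : false;
    }

    public function next()
    {
        ++$this->pos;
        return $this->pos < strlen($this->data) ? $this->data[$this->pos] : false;
    }
}

// Original shape: one current() for the check, another one after the if.
function stepBefore(CountingScanner $s)
{
    if ($s->current() === '<') {
        $s->next(); // stands in for tag parsing, which moves the cursor
    }
    return $s->current();
}

// Suggested shape: read once up front, re-read only inside the tag branch.
function stepAfter(CountingScanner $s)
{
    $tok = $s->current();
    if ($tok === '<') {
        $s->next(); // stands in for tag parsing, which moves the cursor
        $tok = $s->current(); // cursor moved, so pick up the new state
    }
    return $tok;
}
```

Under these assumptions both shapes return the same character, and the second one saves a `current()` call whenever the current character is not `<`.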
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great idea indeed! I tried it and it improves performance even more :)
I just added another improvement based on an idea from @mundschenk-at:
Given how impactful each |
After working on #146, I realized the Tokenizer was the next bottleneck of the parser. This PR improves its performance by inlining text parsing and removing some Scanner::current calls.
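As a rough illustration of what handling text directly in the main loop can look like, here is a simplified sketch. `collectTextRuns` is a hypothetical function written for this example and does not reproduce the actual `Tokenizer::consumeData` code; tag handling is reduced to skipping to the closing `>`.

```php
<?php
// Hypothetical, simplified tokenizer loop; names are illustrative only.
function collectTextRuns(string $data): array
{
    $runs = [];
    $pos = 0;
    $len = strlen($data);

    while ($pos < $len) {
        if ($data[$pos] === '<') {
            // Real tag parsing would go here; this sketch just skips the tag.
            $end = strpos($data, '>', $pos);
            $pos = $end === false ? $len : $end + 1;
            continue;
        }

        // Inlined text path: take the whole run up to the next '<' in one
        // call instead of dispatching to a helper method per character.
        $runLen = strcspn($data, '<', $pos);
        $runs[] = substr($data, $pos, $runLen);
        $pos += $runLen;
    }

    return $runs;
}
```

For example, `collectTextRuns('Hello <b>world</b>!')` returns `['Hello ', 'world', '!']`. The point is only that the common text case stays inside the loop body, which is the kind of change the description refers to.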
As you can see in https://blackfire.io/profiles/8d809c0b-1a22-476e-9800-756d7e0440ba/graph (current 2.x branch), there are two main bottlenecks: `Scanner::current` (which is fixed by removing the InputStream in #146) and `Tokenizer::consumeData`.
This PR improves the state of `Tokenizer::consumeData` by doing two things:
- inlining text parsing;
- removing some calls to `Scanner::current`.
This creates a small duplication of code (which is not even really a duplication, as the code is optimized for the context of `consumeData`), but it greatly improves performance:
Test script
Simple time measurement
Before:
After:
Blackfire analysis