Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the Tokenizer performance #147

Merged
merged 3 commits into from
Nov 8, 2018
Merged

Conversation

tgalopin
Copy link
Contributor

@tgalopin tgalopin commented Nov 5, 2018

After working on #146, I realized the Tokenizer was the next bottleneck of the parser. This PR improves its performance by inlining text parsing and removing some Scanner::current calls.

As you can see in https://blackfire.io/profiles/8d809c0b-1a22-476e-9800-756d7e0440ba/graph (current 2.x branch), there are two main bottlenecks: Scanner::current (which is fixed by removing the InputStream in #146) and Tokenizer::consumeData.

This PR improves the state of Tokenizer::consumeData by doing two things:

  • removing unnecessary calls to Scanner::current;
  • inlining the logic dedicated to characters parsing, as it is by far the most important execution path of the consumeData method;

This creates a small duplication of code (which is not even really a duplication, as the code is optimized for the context of consumeData), but it greatly improves performances:

Test script

Note: I changed the fixture file between this PR and #146 to use one which should be closer to real cases. This is why times are different.

require __DIR__.'/../vendor/autoload.php';

$html = file_get_contents(__DIR__.'/fixture.html');

$start = microtime(true);
$total = 100;

for ($i = 0; $i < $total; $i++) {
    $parser = new \Masterminds\HTML5();
    $parser->loadHTMLFragment($html);
}

$time = microtime(true) - $start;

echo 'Total time: '.round($time * 1000)."ms\n";
echo 'Time per iteration: '.round(($time * 1000) / $total, 2)."ms\n";

Simple time measurement

Before:

Total time: 2104ms
Time per iteration: 21.04ms

After:

Total time: 1784ms
Time per iteration: 17.84ms

Blackfire analysis

Note that Blackfire introduces an overhead on functions and methods calls, which enhance the difference. This is why I also did simple time measurements.

@tgalopin
Copy link
Contributor Author

tgalopin commented Nov 5, 2018

I just added an additional commit to inline the parsing of tags too (critical path as well and easy to inline without duplication). The new benchmarks using the same script are:

Before:

Total time: 2104ms
Time per iteration: 21.04ms

After:

Total time: 1626ms
Time per iteration: 16.26ms


// Inline the parsing of characters as it's the critical performance path
// Parse tag
if ($this->scanner->current() === '<') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could move the $tok = $this->scanner->current(); from line 144 to before the if-Statement to remove another $this->scanner->current() call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I can because the content of the if change the scanner position, so I need to get the current one after it. I may be wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is a problem, as line 132 ($tok = $this->scanner->next();) inside the if overrides the previously set $tok anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have a look inside the "or calls" : they do change the current position of the cursor :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah:)

Still, you could move the call before and add another one to pick up the current scanner state at the end of the if clause. Then you'd have the same number of function calls for the "begin tag" case, but one less for all other cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea indeed! I tried and it does improve even more the performance :) .

@tgalopin
Copy link
Contributor Author

tgalopin commented Nov 6, 2018

I just added another improvement from the idea of @mundschenk-at:

Total time: 1580ms
Time per iteration: 15.8ms

Given how impactful each current call removal is, I will try in another PR to reduce the number of calls even more.

@goetas goetas merged commit a48091c into Masterminds:2.x Nov 8, 2018
@tgalopin tgalopin deleted the tokenizer-perfs branch November 8, 2018 09:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants