Improve the Tokenizer performance #147

tgalopin · 2018-11-05T00:12:40Z

After working on #146, I realized the Tokenizer was the next bottleneck of the parser. This PR improves its performance by inlining text parsing and removing some Scanner::current calls.

As you can see in https://blackfire.io/profiles/8d809c0b-1a22-476e-9800-756d7e0440ba/graph (current 2.x branch), there are two main bottlenecks: Scanner::current (which is fixed by removing the InputStream in #146) and Tokenizer::consumeData.

This PR improves the state of Tokenizer::consumeData by doing two things:

removing unnecessary calls to Scanner::current;
inlining the logic dedicated to character parsing, as it is by far the most frequently executed path of the consumeData method.
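
To illustrate why these two changes matter, here is a minimal sketch, not the actual Tokenizer code. SketchScanner, isTagStart, isEof, readTextWithHelpers and readTextInlined are hypothetical names invented for this example. The sketch only counts how often current() is called when reading a run of text up to the next tag.

```php
<?php
// Hypothetical stand-in for the real Scanner: it only exposes current()/next()
// and counts current() calls so the two variants can be compared.
final class SketchScanner
{
    public $currentCalls = 0;
    private $data;
    private $pos = 0;
    private $length;

    public function __construct(string $data)
    {
        $this->data = $data;
        $this->length = strlen($data);
    }

    // Returns the character under the cursor, or false at end of input.
    public function current()
    {
        ++$this->currentCalls;

        return $this->pos < $this->length ? $this->data[$this->pos] : false;
    }

    // Advances the cursor and returns the new character, or false at end of input.
    public function next()
    {
        ++$this->pos;

        return $this->pos < $this->length ? $this->data[$this->pos] : false;
    }
}

// Helper-based variant: every helper asks the scanner for the current character again.
function isTagStart(SketchScanner $s): bool
{
    return $s->current() === '<';
}

function isEof(SketchScanner $s): bool
{
    return $s->current() === false;
}

function readTextWithHelpers(SketchScanner $s): string
{
    $text = '';
    while (!isEof($s) && !isTagStart($s)) {
        $text .= $s->current();
        $s->next();
    }

    return $text;
}

// Inlined variant: the checks live in the loop and the character is fetched
// once, then carried forward through the return value of next().
function readTextInlined(SketchScanner $s): string
{
    $text = '';
    $tok = $s->current();
    while ($tok !== false && $tok !== '<') {
        $text .= $tok;
        $tok = $s->next();
    }

    return $text;
}

$a = new SketchScanner('hello <b>');
$b = new SketchScanner('hello <b>');
$textA = readTextWithHelpers($a);
$textB = readTextInlined($b);

echo 'helpers: '.$a->currentCalls." current() calls\n";
echo 'inlined: '.$b->currentCalls." current() calls\n";
```

Both variants return the same text. The inlined one pays for a single current() call per text run instead of several per character, at the cost of repeating the end-of-input and tag checks inside the loop.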

Inlining creates a small duplication of code (which is not even really a duplication, as the code is optimized for the context of consumeData), but it greatly improves performance:

Test script

Note: I changed the fixture file between this PR and #146 to use one which should be closer to real-world input. This is why the timings differ from those reported in #146.

<?php
// Benchmark: parse the same HTML fragment $total times and report wall-clock time.
require __DIR__.'/../vendor/autoload.php';

$html = file_get_contents(__DIR__.'/fixture.html');

$start = microtime(true);
$total = 100;

for ($i = 0; $i < $total; $i++) {
    $parser = new \Masterminds\HTML5();
    $parser->loadHTMLFragment($html);
}

$time = microtime(true) - $start; // elapsed time in seconds

echo 'Total time: '.round($time * 1000)."ms\n";
echo 'Time per iteration: '.round(($time * 1000) / $total, 2)."ms\n";

Simple time measurement

Before:

Total time: 2104ms
Time per iteration: 21.04ms

After:

Total time: 1784ms
Time per iteration: 17.84ms

Blackfire analysis

Note that Blackfire adds overhead to every function and method call, which exaggerates the difference. This is why I also did simple time measurements.

tgalopin · 2018-11-05T00:37:52Z

I just added an additional commit to inline the parsing of tags too (critical path as well and easy to inline without duplication). The new benchmarks using the same script are:

Before:

Total time: 2104ms
Time per iteration: 21.04ms

After:

Total time: 1626ms
Time per iteration: 16.26ms

mundschenk-at · 2018-11-05T06:53:44Z

src/HTML5/Parser/Tokenizer.php


-        // Inline the parsing of characters as it's the critical performance path
+        // Parse tag
+        if ($this->scanner->current() === '<') {


You could move the $tok = $this->scanner->current(); from line 144 to before the if statement to remove another $this->scanner->current() call.

I don't think I can, because the content of the if changes the scanner position, so I need to get the current one after it. I may be wrong.

I don't think this is a problem, as line 132 ($tok = $this->scanner->next();) inside the if overrides the previously set $tok anyway.

Have a look inside the "or" calls: they do change the current position of the cursor :)

Ah:)

Still, you could move the call before and add another one to pick up the current scanner state at the end of the if clause. Then you'd have the same number of function calls for the "begin tag" case, but one less for all other cases.

Great idea indeed! I tried it and it improves performance even further :)
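
The suggestion from this thread can be sketched as follows. This is an illustration under stated assumptions, not the code from the commit: CountingScanner, consumeTag, stepBefore and stepAfter are hypothetical names, and consumeTag merely stands in for whatever moves the cursor inside the if.

```php
<?php
// Hypothetical scanner that counts current() calls.
final class CountingScanner
{
    public $currentCalls = 0;
    private $data;
    private $pos = 0;
    private $length;

    public function __construct(string $data)
    {
        $this->data = $data;
        $this->length = strlen($data);
    }

    // Returns the character under the cursor, or false at end of input.
    public function current()
    {
        ++$this->currentCalls;

        return $this->pos < $this->length ? $this->data[$this->pos] : false;
    }

    // Advances the cursor and returns the new character, or false at end of input.
    public function next()
    {
        ++$this->pos;

        return $this->pos < $this->length ? $this->data[$this->pos] : false;
    }
}

// Hypothetical tag handler: skips everything up to and including '>'.
// The important property is that it moves the scanner's cursor.
function consumeTag(CountingScanner $s): void
{
    $tok = $s->next(); // skip '<'
    while ($tok !== false && $tok !== '>') {
        $tok = $s->next();
    }
    if ($tok === '>') {
        $s->next();
    }
}

// Before: current() is called in the condition and again after the if.
function stepBefore(CountingScanner $s)
{
    if ($s->current() === '<') {
        consumeTag($s);
    }

    return $s->current();
}

// After: read once up front, and re-read only when the tag branch moved the cursor.
function stepAfter(CountingScanner $s)
{
    $tok = $s->current();
    if ($tok === '<') {
        consumeTag($s);
        $tok = $s->current();
    }

    return $tok;
}

// Text input: the tag branch is not taken.
$textBefore = new CountingScanner('abc');
$textAfter = new CountingScanner('abc');
$textTokBefore = stepBefore($textBefore);
$textTokAfter = stepAfter($textAfter);

// Tag input: the tag branch is taken and the cursor moves.
$tagBefore = new CountingScanner('<b>x');
$tagAfter = new CountingScanner('<b>x');
$tagTokBefore = stepBefore($tagBefore);
$tagTokAfter = stepAfter($tagAfter);
```

In this sketch the tag case costs the same number of current() calls in both variants, while every other case saves one.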

tgalopin · 2018-11-06T09:37:20Z

I just added another improvement from the idea of @mundschenk-at:

Total time: 1580ms
Time per iteration: 15.8ms

Given how impactful each current call removal is, I will try in another PR to reduce the number of calls even more.

Improve Tokenizer performance by inlining text parsing and removing some Scanner::current calls

b3ef91f

tgalopin mentioned this pull request Nov 5, 2018

Improve performance by relying on a native string instead of InputStream #146

Merged

Inline tag open in Tokenizer to further improve performances

f7a954d

mundschenk-at reviewed Nov 5, 2018

Remove another current call

7ac198d

goetas merged commit a48091c into Masterminds:2.x Nov 8, 2018

tgalopin deleted the tokenizer-perfs branch November 8, 2018 09:33
