Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance by relying on a native string instead of InputStream #146

Merged
merged 1 commit into from
Nov 8, 2018

Conversation

tgalopin
Copy link
Contributor

@tgalopin tgalopin commented Nov 3, 2018

When I used the library in the context of https://github.com/tgalopin/html-sanitizer, I stumbled upon speed issues. I launched a Blackfire analysis and saw it mainly came from this library.

I started to investigate and quickly found out that the architecture with an InputStream and a Scanner was the biggest part of the problem: calling 3 methods for each characters of the input slowed a lot the parsing process.

In addition to that, I realized that the InputStream architecture, while being very extensible, is not used in the library itself: the FileStreamInput is a plain StringInput with a file_get_contents, and the StringInput is a simple wrapper around a native string.

Finally, I think it is not likely that anyone using this library did it in a way where they can't transform their input into a string.

Thus, I would like to propose in this pull request to simplify the library critical execution path by deprecating (and removing in 3.0) the InputStream architecture and relying on a native string directly in the Scanner instead.

This implementation shows great speed improvements:

Test script

require __DIR__.'/../vendor/autoload.php';

$html = file_get_contents(__DIR__.'/fixture.html');

$start = microtime(true);
$total = 100;

for ($i = 0; $i < $total; $i++) {
    $parser = new \Masterminds\HTML5();
    $parser->loadHTMLFragment($html);
}

$time = microtime(true) - $start;

echo 'Total time: '.round($time * 1000)."ms\n";
echo 'Time per iteration: '.round(($time * 1000) / $total, 2)."ms\n";

Simple time measurement

Before:

Total time: 935ms
Time per iteration: 9.35ms

After:

Total time: 731ms
Time per iteration: 7.31ms

Blackfire analysis

Note that Blackfire introduces an overhead on functions and methods calls, which enhance the difference. This is why I also did simple time measurements.

I am fully aware this could be a bit controversial PR but I really think it would be a great improvement for this library, especially in the context of Drupal. In many cases, we can expect a 22% speed improvement, which is worth the removal of InputStream in my opinion :) !

@tgalopin
Copy link
Contributor Author

tgalopin commented Nov 4, 2018

I just pushed an additional performance gain achieved by removing some calls to the Scanner::current() method in the Tokenizer.

I updated the script to run 1000 iterations in order to get more precise results. Here are the results for the whole PR now:

Before:

Total time: 9361ms
Time per iteration: 9.36ms

After removing the InputStream system:

Total time: 7559ms
Time per iteration: 7.56ms

After (whole PR):

Total time: 7389ms
Time per iteration: 7.39ms

@PhenX
Copy link

PhenX commented Nov 4, 2018

@tgalopin while reducing parsing time is very good, did you measure memory usage too? I think streams were used in order to reduce memory footprint.

@tgalopin
Copy link
Contributor Author

tgalopin commented Nov 4, 2018

Thanks for the feedback :) !

InputStream instances were not actually streams, they were just wrappers around a plain string :) . The memory footprint is even lower in the new version (1.25 MB to 1.17MB, as you can see in the Blackfire profiles), probably because of the removal of the objects :) .

I do think though that using real streams may be a great idea for further improvements. I'll perhaps have a look if I find how to do it properly :) .

@goetas
Copy link
Member

goetas commented Nov 4, 2018

Hi!
Thanks a lot for the PR! Really like the idea.

Will take me some time to review it and to try it but is definitively a 👍

@goetas
Copy link
Member

goetas commented Nov 4, 2018

@mattfarina @technosophos Anything against the general approach in this PR?

@tgalopin
Copy link
Contributor Author

tgalopin commented Nov 4, 2018

I removed the tokenizer improvements from this PR to focus on the InputStream here. I'm still working on the tokenizer, I'll open a new PR once ready. Note that the two improvements are fully independent, no need to wait for the other if we want to merge this one :) .

@tgalopin
Copy link
Contributor Author

tgalopin commented Nov 5, 2018

Here it is: #147 :) !

@goetas
Copy link
Member

goetas commented Nov 8, 2018

thanks a lot for this!
This lib already had a benchmark in test/benchmark/run.php,

Results on my laptop:

before this PR:
Loading: 193.65500211716


after this PR:
Loading: 139.94578361511

Great work!

@goetas goetas merged commit 563687a into Masterminds:2.x Nov 8, 2018
@tgalopin tgalopin deleted the remove-input-streams branch November 8, 2018 09:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants