Improve performance by moving sequence matching #148

goetas · 2018-11-08T08:21:59Z

Improve performance by moving sequence matching to the string scanner, which has raw access to the underlying string.

Running the test/benchmark/run.php benchmark:

Before this PR:
Loading: 106.80956840515

After this PR:
Loading: 100.03929138184

goetas · 2018-11-08T08:30:04Z

Nice to see the perf improvements in the latest PRs 😃 (cc @tgalopin )

Running the test/benchmark/run.php benchmark:

v2.3.1 (latest stable tag)
Loading: 189.51292037964

v2.4-dev (current PR)
Loading: 100.03929138184

tgalopin · 2018-11-08T09:30:31Z

src/HTML5/Parser/Tokenizer.php

        }
-        return false;
+
+        $ref = $this->decodeCharacterReference();


I don't think this variable is necessary, is it?

It is used on the next line by the buffer function.

Couldn't you just use $this->buffer($this->decodeCharacterReference()); instead? (Probably mostly a style issue, though.)

Ah, not a big deal.

tgalopin · 2018-11-08T09:31:27Z

src/HTML5/Parser/Scanner.php

+     */
+    public function sequenceMatches($sequence, $caseSensitive = true)
+    {
+        $portion = substr($this->data, $this->char, strlen($sequence));


Wouldn't using mb_* functions be safer for UTF-8/16 strings?

~~that's true!~~

Not sure about it, as this is used to look up HTML tags, which are always ASCII (and the mb_* functions are slower).

If that were the case, most of the functions in the scanner and tokenizer would be broken.

As long as you realize you are working on bytes and not characters, using the plain str* functions is fine with UTF-8 encoded strings (and much faster).

In php-typography, I do a lookup whether a given string (DOMText content) contains UTF-8 characters and choose the appropriate function that way. However, that is mainly necessary for determining whether the u flag for regular expressions needs to be used.

(Lookup and replacement of ASCII sequences should be "UTF-8 safe" as no valid multibyte sequence uses ASCII characters. Be careful with preg_*, though, as I've had PCRE generate invalid sequences when operating on certain multibyte characters and not using the u modifier.)
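A small sketch of the two points above, using only core PHP functions. The sample string and variable names are made up for illustration. In UTF-8, every byte of a multibyte character has its high bit set, so a byte-wise search for an ASCII character cannot land inside one; a pattern without the u modifier, on the other hand, treats each byte as a character and can cut a multibyte character in half.

```php
<?php
// "é" is encoded as the two bytes 0xC3 0xA9 in UTF-8.
$text = "\xC3\xA9<p>";

// Byte-wise search for an ASCII character: the offset is in bytes,
// and it can only ever match the real "<".
$ltOffset = strpos($text, '<');

// Without the u modifier, "." matches a single byte (half of "é").
preg_match('/./', $text, $withoutU);

// With the u modifier, "." matches the whole two-byte character.
preg_match('/./u', $text, $withU);

var_dump($ltOffset);            // int(2)
var_dump(strlen($withoutU[0])); // int(1)
var_dump(strlen($withU[0]));    // int(2)
```

The one-byte match in the second case is not valid UTF-8 on its own, which is the kind of invalid sequence to watch out for.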

In this specific case, it does not matter as we are looking for a specific string, so substr + strlen and === comparison will work. The issue might occur in the case of case-insensitive comparisons.

As long as you are looking for an ASCII sequence, that should still be fine.
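To make the thread's conclusion concrete, here is a minimal standalone sketch of byte-wise sequence matching. The function name `sequence_matches_at` and its explicit `$data`/`$offset` parameters are hypothetical stand-ins for the scanner's internal state; the case-insensitive branch using `strcasecmp` is an assumption, since the diff excerpt above only shows the `substr` line.

```php
<?php
/**
 * Hypothetical helper: does $sequence occur in $data at byte $offset?
 * Intended for ASCII sequences such as HTML tag names or "<!DOCTYPE".
 */
function sequence_matches_at(string $data, int $offset, string $sequence, bool $caseSensitive = true): bool
{
    // substr() and strlen() both count bytes, not characters.
    $portion = substr($data, $offset, strlen($sequence));

    return $caseSensitive
        ? $portion === $sequence
        : 0 === strcasecmp($portion, $sequence); // assumed case-insensitive comparison
}

// "é" (0xC3 0xA9) takes two bytes, so "<" sits at byte offset 2.
$html = "\xC3\xA9<!DOCTYPE html>";

var_dump(sequence_matches_at($html, 2, '<!DOCTYPE'));        // bool(true)
var_dump(sequence_matches_at($html, 2, '<!doctype', false)); // bool(true)
var_dump(sequence_matches_at($html, 1, '<!DOCTYPE'));        // bool(false)
```

The offsets are byte offsets throughout, which is the "working on bytes and not characters" caveat from the discussion: the caller has to track position in bytes for this to stay correct.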

tgalopin · 2018-11-08T09:32:10Z

The performance is getting really great, that's cool :) !

goetas added 2 commits November 8, 2018 08:56

improve consume speed

9494e34

move sequenceMatches to the Scanner

5c5634a

tgalopin reviewed Nov 8, 2018

tgalopin approved these changes Nov 9, 2018

goetas merged commit ed6b64d into 2.x Nov 17, 2018

goetas deleted the perf branch November 23, 2018 06:48
