Improve performance by moving sequence matching #148

Merged
merged 2 commits into from
Nov 17, 2018

Conversation

goetas
Member

@goetas goetas commented Nov 8, 2018

Improve performance by moving sequence matching to the string scanner, which has raw access to the underlying string.

Running the test/benchmark/run.php benchmark:

Before this PR:
Loading: 106.80956840515

After this PR:
Loading: 100.03929138184

@goetas
Member Author

goetas commented Nov 8, 2018

Nice to see the perf improvements in the latest PRs 😃 (cc @tgalopin )

Running the test/benchmark/run.php benchmark:

v2.3.1 (latest stable tag)
Loading: 189.51292037964

v2.4-dev (current PR)
Loading: 100.03929138184

}
return false;

$ref = $this->decodeCharacterReference();
Contributor

I don't think this variable is necessary, is it?

Member Author

It is used on the next line by the buffer function.

Contributor

@mundschenk-at mundschenk-at Nov 8, 2018

Couldn't you just use $this->buffer($this->decodeCharacterReference()); instead? (Probably mostly a style issue, though.)

Member Author

Ah, not a big deal.

*/
public function sequenceMatches($sequence, $caseSensitive = true)
{
$portion = substr($this->data, $this->char, strlen($sequence));
Contributor

Wouldn't using mb_* functions be safer for UTF-8/16 strings?

Member Author

@goetas goetas Nov 8, 2018

That's true!

Member Author

I'm not sure about it: this is used to look up HTML tags, which are always ASCII, and the mb_* functions are slower.

If that were a problem, most of the functions in the scanner and tokenizer would be broken.

Contributor

As long as you realize you are working on bytes and not characters, using the plain str* functions is fine with UTF-8 encoded strings (and much faster).

In php-typography, I do a lookup whether a given string (DOMText content) contains UTF-8 characters and choose the appropriate function that way. However, that is mainly necessary for determining whether the u flag for regular expressions needs to be used.
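To illustrate the kind of lookup described above, here is a minimal sketch. It is not the php-typography code; `contains_non_ascii` and `char_count` are hypothetical names made up for this example, and the only assumption is that the input is valid UTF-8.

```php
<?php
// Hypothetical helper: true when the string contains any byte outside
// the ASCII range, i.e. at least one multibyte UTF-8 sequence.
function contains_non_ascii(string $s): bool
{
    // No "u" flag here on purpose: the pattern inspects raw bytes.
    return preg_match('/[^\x00-\x7F]/', $s) === 1;
}

// Hypothetical helper: count characters, adding the "u" flag only when
// the string actually needs it.
function char_count(string $s): int
{
    $flags = contains_non_ascii($s) ? 'su' : 's';
    return (int) preg_match_all('/./' . $flags, $s);
}

// "café" encoded as UTF-8 is 5 bytes but 4 characters.
var_dump(contains_non_ascii("cafe"));        // pure ASCII input
var_dump(contains_non_ascii("caf\xC3\xA9")); // contains a 2-byte sequence
var_dump(char_count("caf\xC3\xA9"));
```

Pure ASCII input takes the cheaper byte-oriented path, and the `u` modifier is only paid for when multibyte characters are present.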

Contributor

@mundschenk-at mundschenk-at Nov 8, 2018

(Lookup and replacement of ASCII sequences should be "UTF-8 safe", as no valid multibyte sequence contains ASCII bytes. Be careful with preg_*, though, as I've had PCRE generate invalid sequences when operating on certain multibyte characters without the u modifier.)
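A small sketch of why the byte-oriented functions are safe for ASCII needles, assuming the haystack is valid UTF-8. The sample string and variable names are invented for illustration.

```php
<?php
// "café <b>naïve</b>" written as explicit UTF-8 bytes.
$text = "caf\xC3\xA9 <b>na\xC3\xAFve</b>";

// strpos() works on bytes, so the result is a byte offset, not a
// character offset. The "é" before the tag occupies two bytes.
$pos = strpos($text, '<b>');

// The byte offset can be fed straight back into substr().
$tag = substr($text, $pos, 3);

// Every byte of a multibyte UTF-8 sequence is >= 0x80, so an ASCII
// needle such as "<b>" can never match in the middle of a character.
$multibyte = "\xC3\xA9\xC3\xAF";
$allHigh = true;
for ($i = 0; $i < strlen($multibyte); $i++) {
    if (ord($multibyte[$i]) < 0x80) {
        $allHigh = false;
    }
}

var_dump($pos, $tag, $allHigh);
```

The caveat is the one already mentioned: offsets and lengths here count bytes, so they must not be mixed with character offsets from mb_* functions.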

Member Author

In this specific case it does not matter: we are looking for a specific string, so substr + strlen and a === comparison will work. An issue might only occur with case-insensitive comparisons.

Contributor

As long as you are looking for an ASCII sequence, that should still be fine.
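For reference, the byte-based matching discussed in this thread can be sketched as a standalone function. This is an illustration under stated assumptions, not the merged implementation: `sequence_matches` is a hypothetical free function, `$offset` stands in for the scanner's current byte position, and the sequence is assumed to be ASCII.

```php
<?php
// Hypothetical standalone version of a sequence check. $offset is a
// byte offset into $data; $sequence is expected to be ASCII.
function sequence_matches(string $data, int $offset, string $sequence, bool $caseSensitive = true): bool
{
    // substr() and strlen() both count bytes, so they stay consistent
    // with each other even when $data contains multibyte characters.
    // The cast covers older PHP versions where substr() may return false.
    $portion = (string) substr($data, $offset, strlen($sequence));

    return $caseSensitive
        ? $portion === $sequence
        : strcasecmp($portion, $sequence) === 0;
}

// "é" takes two bytes, so the doctype starts at byte offset 2.
$html = "\xC3\xA9<!DOCTYPE html>";

var_dump(sequence_matches($html, 2, '<!DOCTYPE'));        // exact match
var_dump(sequence_matches($html, 2, '<!doctype'));        // case differs
var_dump(sequence_matches($html, 2, '<!doctype', false)); // case-insensitive
```

As far as I know, strcasecmp only folds case for ASCII letters and compares all other bytes as-is, which is why the case-insensitive branch is still fine as long as the sequence being searched for is ASCII.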

@tgalopin
Contributor

tgalopin commented Nov 8, 2018

The performance is getting really great, that's cool :) !

@goetas goetas merged commit ed6b64d into 2.x Nov 17, 2018
@goetas goetas deleted the perf branch November 23, 2018 06:48

3 participants