-
Notifications
You must be signed in to change notification settings - Fork 114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve performance by moving sequence matching #148
Conversation
Nice to see the perf improvements in the latest PRs 😃 (cc @tgalopin ) Running the
|
} | ||
return false; | ||
|
||
$ref = $this->decodeCharacterReference(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this variable is necessary, is it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it used on the next line by the buffer function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't you just use $this->buffer(this->decodeCharacterReference());
instead? (Probably mostly a style issue, though.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aaa, not a big deal
*/ | ||
public function sequenceMatches($sequence, $caseSensitive = true) | ||
{ | ||
$portion = substr($this->data, $this->char, strlen($sequence)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't using mb_*
functions be safer for UTF-8/16 strings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thats true!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about it, as this is used to lookup for html tags, that are always ascii (and mb_*) functions are slower.
If that was the case, most of the functions in the scanner and tokenizer will be broken
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as you realize you are working on bytes and not characters, using the plain str*
functions is fine with UTF-8 encoded strings (and much faster).
In php-typography
, I do a lookup whether a given string (DOMText
content) contains UTF-8 characters and choose the appropriate function that way. However, that is mainly necessary for determining whether the u
flag for regular expressions needs to be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Lookup and replacement of ASCII sequences should be "UTF-8 safe" as no valid multibyte sequence uses ASCII characters. Be careful with preg_*
, though, as I've had PCRE generate invalid sequences when operating on certain multibyte characters and not using the u
modifier.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this specific case, it does not matter as we are looking for a specific string, so substr
+ strlen
and ===
comparison will work. The issue might occur in case of case-insensitve comparisons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as you are looking for an ASCII sequence, that should still be fine.
The performance is getting really great, that's cool :) ! |
Improve performance by moving sequence matching to the string scanner (that has a raw access to the underlying string)
Running the
test/benchmark/run.php
benchmark: