Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get rid of ctype dependency #173

Closed
alecpl opened this issue Nov 20, 2019 · 9 comments · Fixed by #206
Closed

Get rid of ctype dependency #173

alecpl opened this issue Nov 20, 2019 · 9 comments · Fixed by #206

Comments

@alecpl
Copy link
Contributor

alecpl commented Nov 20, 2019

This library uses ctype_alpha() function in two places.

  1. If I'm not wrong this function is locale dependent and I don't see any setlocale() calls. So, would be better to have something that does not depend on locale.
  2. If we got rid of this function use the library would not need Ctype extension which is not always installed by default.

I think it should be easy to replace this function with something based on ord() or str[c]spn().

@alecpl
Copy link
Contributor Author

alecpl commented Nov 22, 2019

I see that the function is used with single-character input. For this use-case the best I could find is:

function is_alpha($input)
{
    $code = ord($input);
    return ($code >= 97 && $code <= 122) || ($code >= 65 && $code <= 90);
}

It is 2x slower than ctype_alpha() however for 1 million calls it's 0.08 sec. vs 0.04 sec. So, I'd go for it.

@goetas
Copy link
Member

goetas commented Nov 22, 2019

Will be happy to get rid of the dependency.

Can we make this an optional dependency, using something as function_exists, or use a static closure with the dependency already resolved to a polyfill function or to the extension?

@alecpl
Copy link
Contributor Author

alecpl commented Nov 25, 2019

Polyfill exists - https://github.com/symfony/polyfill-ctype/blob/master/Ctype.php. But it is slower than my version which is optimized for use with single byte string.

As for function_exists... because of the potential locale-related issue I would rather get rid of ctype_alpha. Using my method will be an easy change. Maybe it would be possible to get rid of that function by changing the code that uses it, e.g. using some Scanner methods instead, but at a first look it does not seem to be that simple.

@alecpl
Copy link
Contributor Author

alecpl commented Nov 25, 2019

I have two solutions, both have 2% impact on performance:

  1. Use a simple method in place of ctype_alpha:
--- a/src/HTML5/Parser/Tokenizer.php
+++ b/src/HTML5/Parser/Tokenizer.php
@@ -137,7 +137,7 @@ class Tokenizer
                 $this->endTag();
             } elseif ('?' === $tok) {
                 $this->processingInstruction();
-            } elseif (ctype_alpha($tok)) {
+            } elseif ($this->is_alpha($tok)) {
                 $this->tagName();
             } else {
                 $this->parseError('Illegal tag opening');
@@ -347,7 +347,7 @@ class Tokenizer
         // > -> parse error
         // EOF -> parse error
         // -> parse error
-        if (!ctype_alpha($tok)) {
+        if (!$this->is_alpha($tok)) {
             $this->parseError("Expected tag name, got '%s'", $tok);
             if ("\0" == $tok || false === $tok) {
                 return false;
@@ -1186,4 +1186,10 @@ class Tokenizer
 
         return '&' . $entity;
     }
+
+    protected function is_alpha($input)
+    {
+        $code = ord($input);
+        return ($code >= 97 && $code <= 122) || ($code >= 65 && $code <= 90);
+    }
 }
  1. Rewrite the code to not check if current char is alpha:
--- a/src/HTML5/Parser/Tokenizer.php
+++ b/src/HTML5/Parser/Tokenizer.php
@@ -137,9 +137,7 @@ class Tokenizer
                 $this->endTag();
             } elseif ('?' === $tok) {
                 $this->processingInstruction();
-            } elseif (ctype_alpha($tok)) {
-                $this->tagName();
-            } else {
+            } elseif (false === $this->tagName()) {
                 $this->parseError('Illegal tag opening');
                 // TODO is this necessary ?
                 $this->characterData();
@@ -343,20 +341,24 @@ class Tokenizer
         }
         $tok = $this->scanner->next();
.
-        // a-zA-Z -> tagname
-        // > -> parse error
         // EOF -> parse error
         // -> parse error
-        if (!ctype_alpha($tok)) {
+        if ("\0" == $tok || false === $tok) {
+            $this->parseError("Expected tag name, got '%s'", $tok);
+
+            return false;
+        }
+
+        // Get tag name (or just it's start)
+        $name = $this->scanner->getAsciiAlpha();
+
+        if (false === $name || '' === $name) {
             $this->parseError("Expected tag name, got '%s'", $tok);
-            if ("\0" == $tok || false === $tok) {
-                return false;
-            }
.
             return $this->bogusComment('</');
         }
.
-        $name = $this->scanner->charsUntil("\n\f \t>");
+        $name .= $this->scanner->charsUntil("\n\f \t>");
         $name = self::CONFORMANT_XML === $this->mode ? $name : strtolower($name);
         // Trash whitespace.
         $this->scanner->whitespace();
@@ -379,8 +381,15 @@ class Tokenizer
      */
     protected function tagName()
     {
-        // We know this is at least one char.
-        $name = $this->scanner->charsWhile(':_-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz');
+        // Get tag name (or just it's start)
+        $name = $this->scanner->getAsciiAlpha();
+
+        if (false === $name || '' === $name) {
+            return false;
+        }
+
+        // Get the rest of tag name
+        $name .= $this->scanner->charsWhile(':_-' . Scanner::CHARS_ALNUM);
         $name = self::CONFORMANT_XML === $this->mode ? $name : strtolower($name);
         $attributes = array();
         $selfClose = false;

The second patch might require some more love. For example I see that Scanner methods return false on EOF. I think this is not really needed (could return an empty string) and not properly documented.

@alecpl
Copy link
Contributor Author

alecpl commented Nov 25, 2019

Extending the first patch with static function_exists check for ctype_alpha does not really have an impact on performance. It's still 1-2 % slower.

Performance test done using run.php script on PHP 7.3.

@mundschenk-at
Copy link
Contributor

To have any performance benefits, you'd probably have to do that in the constructor and assign a callable to a property to avoid additional function calls. Also, the polyfill would have to support ctype_alpha's signature to avoid an additional wrapper function.

@IonBazan
Copy link

IonBazan commented Feb 3, 2020

We can also use https://packagist.org/packages/symfony/polyfill-ctype instead.

@goetas
Copy link
Member

goetas commented Feb 3, 2020

Hm, that seems nice

@stof
Copy link
Contributor

stof commented Jun 9, 2021

@alecpl can you try by using \ord() in your first solution instead ? The ord() function is one of the compiler-optimized functions that gets a special opcode when its usage does not rely on the fallback to the global function. should not change anything here. The optimized case is when the argument is a static string, not a variable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants