The DOMDocument returned by the HTML5 Parser does not have an encoding set #167

Closed
idimopoulos opened this issue Jun 14, 2019 · 1 comment

Comments

@idimopoulos
Contributor

I am not sure whether this should be fixed here or in Symfony, but given the assumption described below I opened the issue here for further discussion.
Since symfony/symfony#29306, Symfony requires this library in order to support HTML5 parsing. The problem is that Symfony now has two parsing methods. The new one:

    private function parseHtml5(string $htmlContent, string $charset = 'UTF-8'): \DOMDocument
    {
        return $this->html5Parser->parse($this->convertToHtmlEntities($htmlContent, $charset), [], $charset);
    }

And the old one:

    private function parseXhtml(string $htmlContent, string $charset = 'UTF-8'): \DOMDocument
    {
        // ...
        $dom = new \DOMDocument('1.0', $charset);
        $dom->validateOnParse = true;
        // ...
        return $dom;
    }

Here is my assumption. The HTML5 parser's ::parse method contains the following code:

        $events = new DOMTreeBuilder(false, $options);

In \Masterminds\HTML5\Parser\DOMTreeBuilder::__construct, the DOMDocument is built through an instance of \DOMImplementation. Unlike the new \DOMDocument('1.0', $charset) call shown above, this does not set the document's encoding.
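To make the difference concrete, here is a minimal sketch that uses only PHP's built-in DOM extension. It is not the library's actual code: the variable names are made up for illustration, and the last step is just one possible way to give the document an encoding after it has been created.

```php
<?php
// Path 1: direct construction, as in Symfony's parseXhtml().
// The charset is passed straight to the constructor.
$direct = new \DOMDocument('1.0', 'UTF-8');

// Path 2: construction through \DOMImplementation, which is roughly
// what I assume the tree builder does. createDocument() has no
// charset parameter, so nothing sets the encoding here.
$impl = new \DOMImplementation();
$doctype = $impl->createDocumentType('html');
$viaImpl = $impl->createDocument(null, '', $doctype);

// Compare the encoding property of the two documents.
$encodingBefore = $viaImpl->encoding;
var_dump($direct->encoding);
var_dump($encodingBefore);

// Hypothetical remedy: assign the encoding explicitly after creation.
$viaImpl->encoding = 'UTF-8';
var_dump($viaImpl->encoding);
```

If this behaves as I expect, the second document reports no encoding until one is assigned explicitly.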

Now, while Symfony's legacy path instantiates the $dom object directly from \DOMDocument, the \Masterminds\HTML5::parse method instantiates the scanner that helps parse the HTML5 document as follows:

new Scanner($input, !empty($options['encoding']) ? $options['encoding'] : 'UTF-8');

So by default, the 'UTF-8' encoding is used to instantiate the scanner.

My thought was that, since the parser uses UTF-8 by default, shouldn't the

   $events = new DOMTreeBuilder(false, $options);

also set the encoding of the \DOMDocument object to 'UTF-8' by default?

@goetas
Member

goetas commented Jun 15, 2019

fixed in #168

@goetas goetas closed this as completed Jun 15, 2019