The DOMDocument returned by the HTML5 Parser does not have an encoding set #167

idimopoulos · 2019-06-14T10:02:57Z

I am not sure whether this should be fixed here or in Symfony, but because of the assumption described below I opened the issue here for further discussion.
Since symfony/symfony#29306, Symfony requires this library in order to support HTML5 parsing. As a result, Symfony now has two parsing methods. The new one:

    private function parseHtml5(string $htmlContent, string $charset = 'UTF-8'): \DOMDocument
    {
        return $this->html5Parser->parse($this->convertToHtmlEntities($htmlContent, $charset), [], $charset);
    }

And the old one:

    private function parseXhtml(string $htmlContent, string $charset = 'UTF-8'): \DOMDocument
    {
        // ...
        $dom = new \DOMDocument('1.0', $charset);
        $dom->validateOnParse = true;
        // ...
        return $dom;
    }

So here comes my assumption. The HTML5 parser's ::parse method contains the following code:

        $events = new DOMTreeBuilder(false, $options);

The constructor, \Masterminds\HTML5\Parser\DOMTreeBuilder::__construct, builds the DOMDocument through an instance of \DOMImplementation, which does not set the encoding, unlike the new \DOMDocument('1.0', $charset) call shown above.
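To illustrate the difference, here is a minimal sketch using only PHP's built-in DOM extension. The variable names are mine, and the DOMImplementation calls only approximate what DOMTreeBuilder does; this is not the library's code. On the PHP versions I am aware of, the first document reports the charset it was given, while the second reports no encoding at all.

```php
<?php
// Document created directly with an explicit charset, as in Symfony's parseXhtml().
$direct = new \DOMDocument('1.0', 'UTF-8');

// Document created through DOMImplementation (an approximation of the
// DOMTreeBuilder constructor, assumed for illustration).
$impl = new \DOMImplementation();
$doctype = $impl->createDocumentType('html');
$viaImpl = $impl->createDocument(null, '', $doctype);

// Compare the encoding property of both documents.
var_dump($direct->encoding);
var_dump($viaImpl->encoding);
```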

Now, although the legacy path in Symfony instantiates the $dom object directly with \DOMDocument, in the \Masterminds\HTML5::parse method the scanner that helps parse the HTML5 document is instantiated like this:

new Scanner($input, !empty($options['encoding']) ? $options['encoding'] : 'UTF-8');

So by default, the 'UTF-8' encoding is used to instantiate the scanner.

My thought was that, since the parser uses UTF-8 by default, shouldn't

        $events = new DOMTreeBuilder(false, $options);

also set the encoding of the \DOMDocument object to 'UTF-8' by default?
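A rough sketch of what I have in mind follows. The function createHtml5Document() is a hypothetical helper and not part of the library; it assumes the same 'encoding' option key that the scanner already reads and falls back to 'UTF-8' in the same way.

```php
<?php
// Hypothetical helper (not library code): builds a document through
// DOMImplementation and applies a default encoding to it.
function createHtml5Document(array $options = []): \DOMDocument
{
    $impl = new \DOMImplementation();
    $doctype = $impl->createDocumentType('html');
    $doc = $impl->createDocument(null, '', $doctype);

    // Same fallback the scanner uses: the 'encoding' option, else UTF-8.
    $doc->encoding = !empty($options['encoding']) ? $options['encoding'] : 'UTF-8';

    return $doc;
}

$default = createHtml5Document();
$latin1 = createHtml5Document(['encoding' => 'ISO-8859-1']);
```

With this in place, the document's encoding would match the one the scanner was created with, whether or not the caller passes the option.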

goetas · 2019-06-15T12:50:33Z

fixed in #168

idimopoulos mentioned this issue Jun 14, 2019

Set default encoding in the DOMDocument object #168

Merged

goetas closed this as completed Jun 15, 2019
