Unescaped Output #19

ircmaxell · 2013-10-28T19:11:29Z

The serializer currently is not encoding data properly for output. This enables certain documents to be crafted which can expose XSS vulnerabilities. For example, the cdata serializer just outputs the text directly. When crafted with a malicious payload, this results in an attack vector:

$html = "<!DOCTYPE html>
<html>
 <head>
  <title>TEST</title>
 </head>
 <body id='foo'>
  <h1>Hello World</h1>
  <p>This is a test of the HTML5 parser.</p>
 </body>
</html>";

// Parse the document. $dom is a DOMDocument.
$dom = \HTML5::loadHTML($html);

$els = $dom->getElementsByTagName('h1');
$els->item(0)->appendChild(new DomCDataSection('this ]]><script>alert(hi!);</script><![CDATA[ is injected'));

var_dump(\HTML5::saveHTML($dom));

This will output:

<!DOCTYPE html>
<html><head>
  <title>TEST</title>
 </head>
 <body id="foo">
  <h1>Hello World<![CDATA[this ]]><script>alert(hi!);</script><![CDATA[ is     injected]]></h1>
  <p>This is a test of the HTML5 parser.</p>
 </body>
</html>

Which is obviously bad.

Comments and raw text fields both suffer a similar problem as well (from a quick glance).

The text was updated successfully, but these errors were encountered:

mattfarina · 2013-10-28T20:15:23Z

@ircmaxell thanks for the feedback.

I think this comes down to a philosophical issue. The HTML5 parser/serializer is meant to be similar to what PHP itself provides for parsing/serializing html but with support for HTML5. It's not meant to solve all problems. If I have end user content that always needs to be escaped properly but that is not meant to be part of this serializer. If someone needs that they need to use something like HTML Purifier or their other favorite library for doing this.

There are many use cases where the parser and serializer could be used in manner that have no user contributed material going through it.

All that being said, I want to make sure we lead people to intelligent and safe handling of user generated content. I'd like to get more feedback from some others but at the very least this should be added to the documentation.

Do you have any preferred libraries for escaping end user content?

If you are dealing

ircmaxell · 2013-10-28T20:22:50Z

Well, my preferred library is DomDocument. For example, calling ->saveXml() on that very object results in:

<?xml version="1.0"?>
<!DOCTYPE html>
<html><head>
  <title>TEST</title>
 </head>
 <body id="foo">
  <h1>Hello World<![CDATA[this ]]]]><![CDATA[><script>alert(hi!);</script><![CDATA[ is injected]]></h1>
  <p>This is a test of the HTML5 parser.</p>
 </body>
</html>

Which if you note, is correct (the <script> tag is rendered as a character data element, and not a tag).

So I guess where I'm going is that perhaps for at least CData and comment, you could render the cdata field using Dom:

return '<![CDATA[' . $dom->saveHtml($cdataBlock) . ']]>';

That will render it appropriately and fix this issue...

technosophos · 2013-10-28T20:25:40Z

I agree with @ircmaxell on this one. The serializer should correctly encode this stuff.

For CDATA, we should follow the XML serialization rules. For the rest we should follow the HTML5 rules.

mattfarina · 2013-10-28T20:26:01Z

I agree. I'll dig into this.

technosophos · 2013-10-28T20:35:15Z

For text, I think htmlencode($str, ENT_HTML5, 'UTF-8') is probably the correct encoding call, though we might have to switch to something else for PHP 5.3 and earlier.

For comments, we just need to replace --?> with --?>

For CDATA, we just replace ]]> with ]]]]><![CDATA[>.

Does that sounds right?

technosophos · 2013-10-28T20:35:55Z

Ha! GitHub converted my & gt ; to >

ircmaxell · 2013-10-28T20:36:35Z

I would be careful about doing things like string replacement. I would definitely prefer rendering out using DomDocument's render functionality, as that's spec driven...

technosophos · 2013-10-28T21:58:13Z

I'm getting some surprising results with the DOM library. CDATA escapes fine, but comments do not.

    $dom = \DOMDocument::loadHTML('<!doctype html>
    <html lang="en">
      <body>
        <div id="foo"></div>
      </body>
      </html>');
    $dom->getElementById('foo')->appendChild(new \DOMComment('<!-- --> --> Foo -->'));
    fprintf(STDOUT, $dom->saveXML());

produces

<!DOCTYPE html>
<html lang="en"><body>
        <div id="foo"><!--<!-- --> --> Foo -->--></div>
      </body></html>

(and saveHTML()) produces the same)

Finally, I can't reproduce any such thing from raw (PCDATA) text. @ircmaxell please provide an example that I can test with.

technosophos · 2013-10-28T22:16:18Z

And because now I am totally absorbed in this...

The HTML 5 spec says nothing about escaping comments. (Neither, it appears, does the XML spec):

http://www.w3.org/TR/2013/CR-html5-20130806/syntax.html#serializing-html-fragments

Compare the comments section to the section on text nodes.

It appears that to stay consistent with both DOM and with HTML5 spec, we should not do any escaping inside of HTML5 comments. While that can (clearly) lead to security issues, changing this would be more "non-standard". That said, the code I am about to check in will use saveXML() on comment nodes even though that produces exactly the same output as we were producing before.

Change to comments does not seem to produce different output under any cases. See Issue #19.

mattfarina · 2013-11-01T15:00:12Z

I believe this is done now that @technosophos has made some changes to the codebase to operate like the core DOMDocument and the addition of a wiki page.

ghost assigned mattfarina Oct 28, 2013

technosophos added a commit that referenced this issue Oct 28, 2013

Regression tests for #19.

772612d

technosophos added a commit that referenced this issue Oct 28, 2013

Changed serialization for CDATA and comments.

246ed77

Change to comments does not seem to produce different output under any cases. See Issue #19.

mattfarina closed this as completed Nov 1, 2013

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unescaped Output #19

Unescaped Output #19

ircmaxell commented Oct 28, 2013

mattfarina commented Oct 28, 2013

ircmaxell commented Oct 28, 2013

technosophos commented Oct 28, 2013

mattfarina commented Oct 28, 2013

technosophos commented Oct 28, 2013

technosophos commented Oct 28, 2013

ircmaxell commented Oct 28, 2013

technosophos commented Oct 28, 2013

technosophos commented Oct 28, 2013

mattfarina commented Nov 1, 2013