Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unescaped Output #19

Closed
ircmaxell opened this issue Oct 28, 2013 · 10 comments
Closed

Unescaped Output #19

ircmaxell opened this issue Oct 28, 2013 · 10 comments
Assignees

Comments

@ircmaxell
Copy link

The serializer currently is not encoding data properly for output. This enables certain documents to be crafted which can expose XSS vulnerabilities. For example, the cdata serializer just outputs the text directly. When crafted with a malicious payload, this results in an attack vector:

$html = "<!DOCTYPE html>
<html>
 <head>
  <title>TEST</title>
 </head>
 <body id='foo'>
  <h1>Hello World</h1>
  <p>This is a test of the HTML5 parser.</p>
 </body>
</html>";

// Parse the document. $dom is a DOMDocument.
$dom = \HTML5::loadHTML($html);

$els = $dom->getElementsByTagName('h1');
$els->item(0)->appendChild(new DomCDataSection('this ]]><script>alert(hi!);</script><![CDATA[ is injected'));

var_dump(\HTML5::saveHTML($dom));

This will output:

<!DOCTYPE html>
<html><head>
  <title>TEST</title>
 </head>
 <body id="foo">
  <h1>Hello World<![CDATA[this ]]><script>alert(hi!);</script><![CDATA[ is     injected]]></h1>
  <p>This is a test of the HTML5 parser.</p>
 </body>
</html>

Which is obviously bad.

Comments and raw text fields both suffer a similar problem as well (from a quick glance).

@mattfarina
Copy link
Member

@ircmaxell thanks for the feedback.

I think this comes down to a philosophical issue. The HTML5 parser/serializer is meant to be similar to what PHP itself provides for parsing/serializing html but with support for HTML5. It's not meant to solve all problems. If I have end user content that always needs to be escaped properly but that is not meant to be part of this serializer. If someone needs that they need to use something like HTML Purifier or their other favorite library for doing this.

There are many use cases where the parser and serializer could be used in manner that have no user contributed material going through it.

All that being said, I want to make sure we lead people to intelligent and safe handling of user generated content. I'd like to get more feedback from some others but at the very least this should be added to the documentation.

Do you have any preferred libraries for escaping end user content?

If you are dealing

@ircmaxell
Copy link
Author

Well, my preferred library is DomDocument. For example, calling ->saveXml() on that very object results in:

<?xml version="1.0"?>
<!DOCTYPE html>
<html><head>
  <title>TEST</title>
 </head>
 <body id="foo">
  <h1>Hello World<![CDATA[this ]]]]><![CDATA[><script>alert(hi!);</script><![CDATA[ is injected]]></h1>
  <p>This is a test of the HTML5 parser.</p>
 </body>
</html>

Which if you note, is correct (the <script> tag is rendered as a character data element, and not a tag).

So I guess where I'm going is that perhaps for at least CData and comment, you could render the cdata field using Dom:

return '<![CDATA[' . $dom->saveHtml($cdataBlock) . ']]>';

That will render it appropriately and fix this issue...

@technosophos
Copy link
Member

I agree with @ircmaxell on this one. The serializer should correctly encode this stuff.

For CDATA, we should follow the XML serialization rules. For the rest we should follow the HTML5 rules.

@mattfarina
Copy link
Member

I agree. I'll dig into this.

@ghost ghost assigned mattfarina Oct 28, 2013
@technosophos
Copy link
Member

For text, I think htmlencode($str, ENT_HTML5, 'UTF-8') is probably the correct encoding call, though we might have to switch to something else for PHP 5.3 and earlier.

For comments, we just need to replace --?> with --?>

For CDATA, we just replace ]]> with ]]]]><![CDATA[>.

Does that sounds right?

@technosophos
Copy link
Member

Ha! GitHub converted my & gt ; to >

@ircmaxell
Copy link
Author

I would be careful about doing things like string replacement. I would definitely prefer rendering out using DomDocument's render functionality, as that's spec driven...

@technosophos
Copy link
Member

I'm getting some surprising results with the DOM library. CDATA escapes fine, but comments do not.

    $dom = \DOMDocument::loadHTML('<!doctype html>
    <html lang="en">
      <body>
        <div id="foo"></div>
      </body>
      </html>');
    $dom->getElementById('foo')->appendChild(new \DOMComment('<!-- --> --> Foo -->'));
    fprintf(STDOUT, $dom->saveXML());

produces

<!DOCTYPE html>
<html lang="en"><body>
        <div id="foo"><!--<!-- --> --> Foo -->--></div>
      </body></html>

(and saveHTML()) produces the same)

Finally, I can't reproduce any such thing from raw (PCDATA) text. @ircmaxell please provide an example that I can test with.

@technosophos
Copy link
Member

And because now I am totally absorbed in this...

The HTML 5 spec says nothing about escaping comments. (Neither, it appears, does the XML spec):

http://www.w3.org/TR/2013/CR-html5-20130806/syntax.html#serializing-html-fragments

Compare the comments section to the section on text nodes.

It appears that to stay consistent with both DOM and with HTML5 spec, we should not do any escaping inside of HTML5 comments. While that can (clearly) lead to security issues, changing this would be more "non-standard". That said, the code I am about to check in will use saveXML() on comment nodes even though that produces exactly the same output as we were producing before.

technosophos added a commit that referenced this issue Oct 28, 2013
technosophos added a commit that referenced this issue Oct 28, 2013
Change to comments does not seem to produce different output under
any cases. See Issue #19.
@mattfarina
Copy link
Member

I believe this is done now that @technosophos has made some changes to the codebase to operate like the core DOMDocument and the addition of a wiki page.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants