-
Notifications
You must be signed in to change notification settings - Fork 113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unescaped Output #19
Comments
@ircmaxell thanks for the feedback. I think this comes down to a philosophical issue. The HTML5 parser/serializer is meant to be similar to what PHP itself provides for parsing/serializing html but with support for HTML5. It's not meant to solve all problems. If I have end user content that always needs to be escaped properly but that is not meant to be part of this serializer. If someone needs that they need to use something like HTML Purifier or their other favorite library for doing this. There are many use cases where the parser and serializer could be used in manner that have no user contributed material going through it. All that being said, I want to make sure we lead people to intelligent and safe handling of user generated content. I'd like to get more feedback from some others but at the very least this should be added to the documentation. Do you have any preferred libraries for escaping end user content? If you are dealing |
Well, my preferred library is DomDocument. For example, calling
Which if you note, is correct (the So I guess where I'm going is that perhaps for at least CData and comment, you could render the
That will render it appropriately and fix this issue... |
I agree with @ircmaxell on this one. The serializer should correctly encode this stuff. For CDATA, we should follow the XML serialization rules. For the rest we should follow the HTML5 rules. |
I agree. I'll dig into this. |
For text, I think htmlencode($str, ENT_HTML5, 'UTF-8') is probably the correct encoding call, though we might have to switch to something else for PHP 5.3 and earlier. For comments, we just need to replace --?> with --?> For CDATA, we just replace ]]> with ]]]]><![CDATA[>. Does that sounds right? |
Ha! GitHub converted my & gt ; to > |
I would be careful about doing things like string replacement. I would definitely prefer rendering out using DomDocument's render functionality, as that's spec driven... |
I'm getting some surprising results with the DOM library. CDATA escapes fine, but comments do not. $dom = \DOMDocument::loadHTML('<!doctype html>
<html lang="en">
<body>
<div id="foo"></div>
</body>
</html>');
$dom->getElementById('foo')->appendChild(new \DOMComment('<!-- --> --> Foo -->'));
fprintf(STDOUT, $dom->saveXML()); produces <!DOCTYPE html>
<html lang="en"><body>
<div id="foo"><!--<!-- --> --> Foo -->--></div>
</body></html> (and saveHTML()) produces the same) Finally, I can't reproduce any such thing from raw (PCDATA) text. @ircmaxell please provide an example that I can test with. |
And because now I am totally absorbed in this... The HTML 5 spec says nothing about escaping comments. (Neither, it appears, does the XML spec): http://www.w3.org/TR/2013/CR-html5-20130806/syntax.html#serializing-html-fragments Compare the comments section to the section on text nodes. It appears that to stay consistent with both DOM and with HTML5 spec, we should not do any escaping inside of HTML5 comments. While that can (clearly) lead to security issues, changing this would be more "non-standard". That said, the code I am about to check in will use saveXML() on comment nodes even though that produces exactly the same output as we were producing before. |
Change to comments does not seem to produce different output under any cases. See Issue #19.
I believe this is done now that @technosophos has made some changes to the codebase to operate like the core DOMDocument and the addition of a wiki page. |
The serializer currently is not encoding data properly for output. This enables certain documents to be crafted which can expose XSS vulnerabilities. For example, the cdata serializer just outputs the text directly. When crafted with a malicious payload, this results in an attack vector:
This will output:
Which is obviously bad.
Comments and raw text fields both suffer a similar problem as well (from a quick glance).
The text was updated successfully, but these errors were encountered: