Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

noscript in head should allow link, meta, style tags and in body other tags #98

Closed
sidkshatriya opened this issue Mar 14, 2016 · 10 comments
Labels

Comments

@sidkshatriya
Copy link

The <noscript> tag in <head> treats everything it contains as raw text. In fact it says in https://html.spec.whatwg.org/multipage/scripting.html#the-noscript-element that noscript can contain the link, meta and style tags.

If I load a trivial html document

$html = '<html><head>
<noscript><style>a { color: red;}</style></noscript>
</head><body></body></html>';

// Using QueryPath
$qp = qp($html);
print($qp->top()->html5());

I get an ugly PHP notice:

Notice: Undefined property: DOMElement::$data in 
Masterminds\HTML5\Serializer\OutputRules->element() (line 222 of 
/var/www/amp/docroot/vendor/masterminds/html5/src/HTML5/Serializer/OutputRules.php

This PHP notice is actually not benign. All the contents of the <noscript> tag are lost and we get the output:

<html><head>
<noscript></noscript>
</head><body></body></html>

It might be a good idea to load the html like this:

$h5 = new HTML5();
$dom_doc = $h5->loadHTML($html);
$qp = qp($dom_doc);
print($qp->top()->html5());

But this does not serve my purpose because the DOM is now loaded with only a <noscript> tag and everything underneath is a string. So you cannot query, manipulate the inner <style> tag.

There is a kludge in dealing with the above problem (and its inefficient if you have large documents):

$qp = qp($html);
// do manipulations on the DOM
...
...
// note the use of html instead of html5()
$html_output = $qp->top()->html();
$h5 = new HTML5();
$dom_doc = $h5->loadHTML($html_output);
$qp = qp($dom_doc);
print($qp->top()->html5());

Could we fix the Mastermind code so we don't have to adopt the kludge?

Is there a smarter way to fix this problem without having to fix mastermind code?

This functionality is important for manipulating AMP HTML documents which always have some tags in the <head> <noscript>

@sidkshatriya
Copy link
Author

I suppose another approach would be to create a subclass of the

\Masterminds\HTML5\Serializer\OutputRules class and override the element($ele) method. But the HTML5 library seems to hardcode the instantiation of the OutputRules class in the \Masterminds\HTML5::save method.

@mattfarina mattfarina added the bug label Mar 14, 2016
@mattfarina
Copy link
Member

For additional reference, the HTML5 spec notes this as well in https://www.w3.org/TR/html5/scripting-1.html#the-noscript-element

@sidkshatriya
Copy link
Author

Just FYI we're using the Mastermind HTML5 library (and QueryPath) in the AMP PHP Library

@sidkshatriya sidkshatriya changed the title noscript in head should allow link, meta, style noscript in head should allow link, meta, style tags and in body other tags Mar 15, 2016
@sidkshatriya
Copy link
Author

This is general issue with the <noscript> tag and not only when the tag is in <head>. In body, the <noscript> tag can contain most HTML tags. (In body, these tags will be displayed if javascript is turned off). Mastermind HTML5 treats these nested tags as raw text also so we have the same problem as we explained above.

There is another workaround:

$html = '<html><head>
<noscript><style>a { color: red;}</style></noscript>
</head><body></body></html>';

// Using QueryPath
$qp = qp($html);
// do DOM manipulation
...
...
// Now its time to change the noscript tags to contain text (just before output)
$noscripts = $qp->find('noscript');
foreach($noscripts as $noscript) {
   // If you care about CDATA not appearing then QueryPath's innerHTML() might prove 
   // problematic but you can implement your own innerHTML method by using saveHTML() 
   // call on DOMNodes.
   $innerhtml = $noscript->innerHTML();
   $noscript->removeChildren();
   // now all nested children are just text
   $noscript->text($innerhtml);
}

// All noscript nested children are text so html5() function call will not give problems.
print($qp->top()->html5());

@mattfarina
Copy link
Member

In the body when scripting is disabled (case we need to shoot for) the noscript tag is transparent to the DOM. So, yeah, this is a bug.

Pull requests are welcome and if not I'll try to look at this soon.

@sidkshatriya
Copy link
Author

@mattfarina It would probably take me some time to understand the internals of the module. So if you're able to get some time to fix this it would be very much appreciated :-)

[Off topic: If possible can we have a release for the 2.x series -- the last release was some time ago.. Also what is the difference between the 2.x branch and master branch?]

@mattfarina
Copy link
Member

My plan is to cut a release after fixing this bug.

The master branch meets all the PSRs. There were some naming things that are breaking (e.g., Html vs HTML). That will be the basis for a 3.x series when time becomes available to make a handful of changes.

@mattfarina
Copy link
Member

I'm still reading through the HTML5 spec on this. There is some different direction in different sections. In particular on tags allowed in the noscript tags.

Since this is not a validating parser I'm tempted to make noscript transparent to the model while throwing a parse error for nested noscript and some other cases. See 8.2.5.4.5 of the spec for more detail.

@sidkshatriya
Copy link
Author

I think thats fair -- just check if there are any nested noscripts (and any other cases you may care). Would it be a fatal parse error?

It would probably be OK if you didn't do anything at all (didn't check for nesting etc). Since this is not a validating parser why do you care to validate at all here? This would actually be more consistent...

@mattfarina
Copy link
Member

@sidkshatriya Can you take a look at the PR on #99.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants