-
Notifications
You must be signed in to change notification settings - Fork 113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ignore attributes with illegal characters in name #29
Ignore attributes with illegal characters in name #29
Conversation
For reference, the attribute naming part of the spec is at http://www.w3.org/TR/2011/WD-html5-20110525/syntax.html#attributes-0. I'm digging in to this. |
$this->parseError("Unexpected characters in attribute name: %s", $name); | ||
$isValidAttribute = FALSE; | ||
} | ||
else if (preg_match("/^[0-9.-]/u", $name)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are numbers not allowed in html5? In XML the starting character is different. This would apply to xhtml as well. In html5 I don't see that attribute name limitation.
It appears the DOM (provided by libxml) might have a limitation over html5.
I did some testing in chrome and firefox, to see some other implementation. They don't allow numbers as the first character in practice.
All of this is to say I did some legwork around numbers and I wanted to document it for the day we come back to this same issue. I'm ok with this for now. Though, I reserve the right to root @technosophos on as he writes a custom DOM some day in the far far future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the number regex can you provide a comment before it pointing to the specs and why it's filtering out the number as the first character. Knowing why this is here will cause some future questions, even if from me.
This mostly looks good. It seems there might be a separation between the DOM (from libxml) and the html5 spec. I'd just like some more documentation around the parser error handling code so that when we revisit this in the future we have a clue what's going on and why. |
This is neccesary because method "DOMElement::setAttribute" throws exception for wrong names so DOM elements can't contain these attributes.
Thanks for the review. I updated last commit by adding some doc. Hope it's OK now. |
The documentation is great. Thanks, @miso-belica, for not only finding the issue, but for putting in the effort to come up with a sensible workaround. I really appreciate your work. |
Ignore attributes with illegal characters in name
I'm glad I could help and thank you guys for this library. |
Hi,
I would like to fix the issue #23 ASAP, because I parse a lot of pages that throw
DOMException
related to it. I read @technosophos's opinion about the issue but I think there's nothing to think about because of currentDOMElement
's implementation in PHP. Reality is that the methodDOMElement::setAttribute
throws an exception so element with the attributes like?
,;
, ... can't be created. That's why I implemented the other possible way - parser ignores them.All I want is HTML parser that doesn't throw exceptions and believe we discuss to the proper solution :)