Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid Character Error #23

Closed
miso-belica opened this issue Jan 28, 2014 · 11 comments
Closed

Invalid Character Error #23

miso-belica opened this issue Jan 28, 2014 · 11 comments

Comments

@miso-belica
Copy link
Contributor

Hi,
when I'm trying to parse URL http://e107.funsite.cz/ I get DOMException("Invalid Character Error", 5) because of one unclosed tag in the markup. The snippet below causes the exception. It is caused by trying to set attribute with name <div in DOMTreeBuilder.php. As I understand from the doc all errors should be recorded in property $dom->errors. Can you fix this please?

<div class="wrapper"
                <div class="fleft">
@mattfarina
Copy link
Member

@miso-belica thanks for sharing this. We'll take a look at it.

@miso-belica
Copy link
Contributor Author

Another URL with the same issue http://hasicilitomysl.hys.cz and the snippet that causes the issue (unclosed <img> tag). TreeBuilder is trying to create attribute < that is part of </a> in the snippet.

<img border="1"
src="http://hasicilitomysl.hys.cz/wp-content/uploads/cesko.jpg"
title="CZ"</a>

And with little more context:

<div id="sidebar-wrap2">
     <ul id="sidelist">

        <li><div id="text-4" class="widget widget_text"><h2 class="title">Překladač</h2>          <div class="textwidget"><a href="http://hasicilitomysl.hys.cz" target="_blank"><img border="1"
src="http://hasicilitomysl.hys.cz/wp-content/uploads/cesko.jpg"
title="CZ"</a>

<a href="http://translate.google.cz/translate?sl=cs&tl=en&js=n&prev=_t&hl=cs&ie=UTF-8&layout=2&eotf=1&u=hasicilitomysl.hys.cz" target="_blank"><img border="1"
src="http://hasicilitomysl.hys.cz/wp-content/uploads/UK.jpg"
title="Anglie"</a>

@mattfarina
Copy link
Member

@miso-belica Just wanted to let you know this is still being looked at. We've not forgotten about it.

@miso-belica
Copy link
Contributor Author

Thanks, I also did some investigation and I find out that creating of attribute <div is the same behavior as in browsers and Python's html5lib. Problem here is, that the method DOMElement::setAttribute is throwing exception for character < in the $name. I tried it even with $document->strictErrorChecking = FALSE; but nothing changed.

So maybe the best solution would be just ignore attributes that contain invalid characters (throw DOMException). What do you think?

@technosophos
Copy link
Member

Okay, added a fix in 77ad931.

This will basically treat <foo <bar> as <foo><bar>. For tags like <img <bar>, the normal short tag processing rules should work as usual.

@miso-belica
Copy link
Contributor Author

Hi,
problem of this fix is that it works only for character <. But there are more characters to handle.

<img src="http://www.podnikatele.g6.cz/wp-content/uploads/2013/07/button_zadost.png" width="190 height="150"></a></div>
<meta name=”description” content=”Vaše obľúbené meme v CZ/SK a ENG jazyku.”>
<a class="rss-topnav" rel="nofollow" href="https://www.facebook.com/pages/Mapaj-Os/180837811986505?fref=ts‎" target="_blank"; ?>Najdete nás i na Facebooku</a>
<a href="http://www.l2top.co/vote/server/1148 target="_blank">

And unfortunately many more. So this fix is not the sufficient solution for me :(

@mattfarina
Copy link
Member

@miso-belica thanks for giving us more detail.

@mattfarina mattfarina reopened this Feb 12, 2014
@technosophos
Copy link
Member

We need to make a big decision about how deeply we are going to go when it comes to supporting broken HTML.

There are also some really gray areas when it comes to how to deal with divergence from the spec. It's not clear, for example, whether ; and ? in the example 3 above should be attribute names or should be silently discarded. (Either answer violates the spec, AFAIK)

I'm not terribly keen on duplicating HTMLTidy, but I also don't want to have parser errors in cases where we can cleanly work around it.

@mattfarina
Copy link
Member

Here's what I'm thinking....

  1. The parser doesn't blow up when it hits bad html. It puts the errors in the right place.
  2. We write up a quick wiki page documenting how to use HTMLTidy with it to fix all the broken things.

Thoughts?

@technosophos
Copy link
Member

Plus...

  1. If the spec does indeed specify how to handle a quirk, we do it.

@technosophos
Copy link
Member

I am okay with Pull Request #29. I agree with @miso-belica that we should trap errors before they make their way into the DOM layer. If @mattfarina agrees, I will merge the pull request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants