Invalid Character Error #23

miso-belica · 2014-01-28T09:46:52Z

Hi,
When I try to parse the page at http://e107.funsite.cz/, I get DOMException("Invalid Character Error", 5) because of one unclosed tag in the markup. The snippet below triggers the exception: DOMTreeBuilder.php tries to set an attribute named <div. As I understand from the docs, all parse errors should be recorded in the $dom->errors property rather than thrown. Could you fix this, please?

<div class="wrapper"
                <div class="fleft">


mattfarina · 2014-01-30T18:51:58Z

@miso-belica thanks for sharing this. We'll take a look at it.

miso-belica · 2014-02-07T14:32:54Z

Here is another URL with the same issue, http://hasicilitomysl.hys.cz, and the snippet that causes it (an unclosed <img> tag). TreeBuilder tries to create an attribute named <, which is part of the </a> in the snippet.

<img border="1"
src="http://hasicilitomysl.hys.cz/wp-content/uploads/cesko.jpg"
title="CZ"</a>

And with a little more context:

<div id="sidebar-wrap2">
     <ul id="sidelist">

        <li><div id="text-4" class="widget widget_text"><h2 class="title">Překladač</h2>          <div class="textwidget"><a href="http://hasicilitomysl.hys.cz" target="_blank"><img border="1"
src="http://hasicilitomysl.hys.cz/wp-content/uploads/cesko.jpg"
title="CZ"</a>

<a href="http://translate.google.cz/translate?sl=cs&tl=en&js=n&prev=_t&hl=cs&ie=UTF-8&layout=2&eotf=1&u=hasicilitomysl.hys.cz" target="_blank"><img border="1"
src="http://hasicilitomysl.hys.cz/wp-content/uploads/UK.jpg"
title="Anglie"</a>

mattfarina · 2014-02-11T01:51:32Z

@miso-belica Just wanted to let you know this is still being looked at. We've not forgotten about it.

miso-belica · 2014-02-11T09:16:03Z

Thanks. I also did some investigating and found that creating an attribute named <div is the same behavior as in browsers and Python's html5lib. The problem here is that DOMElement::setAttribute throws an exception for the < character in $name. I even tried $document->strictErrorChecking = FALSE; but nothing changed.

So maybe the best solution would be to simply ignore attributes whose names contain invalid characters (that is, names for which setAttribute throws DOMException). What do you think?
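To make the proposal concrete, here is a minimal sketch of "ignore attributes with invalid names, but record an error". It is written in Python rather than the library's PHP and is not the code from any commit in this thread. The names XML_NAME and filter_attributes are hypothetical, and the pattern follows the Name production of XML 1.0 (fifth edition), which is an assumption about what the DOM layer accepts.

```python
import re

# Character classes from the XML 1.0 (fifth edition) Name production.
# Assumption: the DOM layer rejects any attribute name outside this grammar.
_START = (
    r":A-Z_a-z"
    r"\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D\u037F-\u1FFF"
    r"\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    r"\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
_REST = _START + r"\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"
XML_NAME = re.compile(f"[{_START}][{_REST}]*")


def filter_attributes(attrs):
    """Split (name, value) pairs into kept attributes and error messages."""
    kept, errors = {}, []
    for name, value in attrs:
        if XML_NAME.fullmatch(name):
            # Keep the first occurrence of a duplicated attribute name.
            kept.setdefault(name, value)
        else:
            # Record the problem instead of letting the DOM call throw.
            errors.append(f"Dropped attribute with invalid name: {name!r}")
    return kept, errors


# Attributes a tokenizer would report for the first snippet in this thread.
kept, errors = filter_attributes(
    [("class", "wrapper"), ("<div", ""), ("class", "fleft")]
)
print(kept)    # {'class': 'wrapper'}
print(errors)
```

The point of the design is that the check runs before any DOM call, so a bad name becomes an entry in an error list rather than an exception.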

technosophos · 2014-02-11T17:00:32Z

Okay, added a fix in 77ad931.

This will basically treat <foo <bar> as <foo><bar>. For tags like <img <bar>, the normal short tag processing rules should work as usual.

miso-belica · 2014-02-12T15:44:16Z

Hi,
The problem with this fix is that it only handles the < character, and there are more characters to handle.

http://www.podnikatele.g6.cz/ - attribute with name 150" (" is problematic character)

<img src="http://www.podnikatele.g6.cz/wp-content/uploads/2013/07/button_zadost.png" width="190 height="150"></a></div>

http://www.meme.6f.sk/ - attribute name jazyku.”

<meta name=”description” content=”Vaše obľúbené meme v CZ/SK a ENG jazyku.”>

http://www.mapaj.cz - attributes with names ; and ?

<a class="rss-topnav" rel="nofollow" href="https://www.facebook.com/pages/Mapaj-Os/180837811986505?fref=ts‎" target="_blank"; ?>Najdete nás i na Facebooku</a>

http://l2represent.cz - attribute with name _blank"

<a href="http://www.l2top.co/vote/server/1148 target="_blank">

And unfortunately there are many more. So this fix is not a sufficient solution for me :(
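For anyone wondering where these odd attribute names come from, the following is a rough Python sketch of how an HTML5-style tokenizer splits the text between a tag name and the closing >. It is a simplification and not the library's PHP tokenizer: the function name tokenize_attributes is hypothetical, and the sketch skips case folding, character references, and duplicate-attribute handling.

```python
WS = " \t\n\f\r"


def tokenize_attributes(s):
    """Return (name, value) pairs from the text between a tag name and '>'."""
    attrs, i, n = [], 0, len(s)
    while i < n:
        # Before attribute name: skip whitespace and stray slashes.
        while i < n and (s[i] in WS or s[i] == "/"):
            i += 1
        if i >= n:
            break
        # Attribute name: runs until whitespace, '/', or '='.
        # Characters such as '"' and '<' are parse errors but stay in the name.
        start = i
        i += 1  # the first character always belongs to the name, even '='
        while i < n and s[i] not in WS and s[i] not in "/=":
            i += 1
        name = s[start:i]
        # After attribute name: skip whitespace, then look for '='.
        while i < n and s[i] in WS:
            i += 1
        value = ""
        if i < n and s[i] == "=":
            i += 1
            while i < n and s[i] in WS:
                i += 1
            if i < n and s[i] in "\"'":
                # Quoted value: runs to the next matching quote, wherever it is.
                quote = s[i]
                end = s.find(quote, i + 1)
                if end == -1:
                    end = n
                value = s[i + 1:end]
                i = end + 1
            else:
                # Unquoted value: runs until whitespace.
                start = i
                while i < n and s[i] not in WS:
                    i += 1
                value = s[start:i]
        attrs.append((name, value))
    return attrs


# The missing quote after 190 swallows 'height=' into the value of width,
# so the next attribute name starts at 150 and keeps the trailing quote.
print(tokenize_attributes(' width="190 height="150"'))
# [('width', '190 height='), ('150"', '')]

print(tokenize_attributes(' target="_blank"; ?'))
# [('target', '_blank'), (';', ''), ('?', '')]
```

Each name produced this way is legal as far as the tokenizer is concerned, which is why the failure only shows up later, when the name is handed to the DOM.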

mattfarina · 2014-02-12T15:49:50Z

@miso-belica thanks for giving us more detail.

technosophos · 2014-02-13T18:20:46Z

We need to make a big decision about how deeply we are going to go when it comes to supporting broken HTML.

There are also some really gray areas when it comes to how to deal with divergence from the spec. It's not clear, for example, whether ; and ? in example 3 above should be attribute names or should be silently discarded. (Either answer violates the spec, AFAIK.)

I'm not terribly keen on duplicating HTMLTidy, but I also don't want to have parser errors in cases where we can cleanly work around it.

mattfarina · 2014-02-13T18:37:27Z

Here's what I'm thinking....

The parser doesn't blow up when it hits bad HTML. It puts the errors in the right place.
We write up a quick wiki page documenting how to use HTMLTidy with it to fix all the broken things.

Thoughts?

technosophos · 2014-02-13T18:48:32Z

Plus...

If the spec does indeed specify how to handle a quirk, we do it.

technosophos · 2014-02-20T03:46:05Z

I am okay with Pull Request #29. I agree with @miso-belica that we should trap errors before they make their way into the DOM layer. If @mattfarina agrees, I will merge the pull request.

technosophos closed this as completed Feb 11, 2014

mattfarina reopened this Feb 12, 2014

miso-belica mentioned this issue Feb 19, 2014

Ignore attributes with illegal characters in name #29

Merged

technosophos closed this as completed in 8f95f4a Feb 21, 2014
