Invalid XHTML names entities reported in Character Data strings #800

JayPanoz · 2017-10-20T13:11:09Z

epubcheck seems to report invalid XHTML named entities in <![CDATA[]]>, which I used in the first place to indicate that the <script> data in between these strings includes data that could be interpreted as XML markup, but should not be.

Here comes the tricky part:

inline script is uglified/minified;
warning is reported for a super specific string: &&void 0 (there’s a space);
no issue with other strings such as &&TweenLite.to(), where there’s no space.

As far as I can remember, this wasn’t reported when the book was published last year (this is a 2nd edition). And given void 0 contains fewer characters than undefined, it’s quite used in uglyfiers/minifiers.

Fix if you encounter this issue: use void(0) instead of void 0 (they’re the same).

What’s strange though, is that it is interpreted as XML markup despite <![CDATA[]]>.

The text was updated successfully, but these errors were encountered:

tofi86 · 2017-11-21T21:26:31Z

Hey Jiminy, can you please attach a simple HTML file which can be run against the current EpubCheck build and also post the offending error message? Thanks!

JayPanoz · 2017-11-22T11:16:20Z

I’ll see what I can do (publishers’ right and stuff).

I guess I can just use a placeholder HTML file with the script if I don’t have permission to share the problematic XHTML file but won’t be able to deal with this until Friday.

tofi86 · 2017-11-22T11:28:51Z

A sample/dummy HTML file would be enough. Thanks Jiminy.

mattgarrish · 2017-11-22T14:18:20Z

On a cursory review, the problem looks like it's that EntitySearch.java goes line-by-line over an input stream looking for this pattern:

static final Pattern entityPattern = Pattern.compile("&([A-Za-z0-9]+)([;|\s])");

It doesn't care that there is an active CDATA block. So long as it finds an ampersand followed by one or more alphanumeric characters followed by a colon or space it will emit the warning. That's why '&&void ' gets a warning but '&&void(' does not.

mattgarrish · 2017-11-22T16:20:33Z

Even that pattern is a bit screwy, to be honest, as the pipe is not an OR inside a character set, and is not a delimiter of the end of an entity.

tofi86 · 2017-11-22T20:24:02Z

thanks matt for having a look.
regarding the screwy pattern, do you agree that it should be "&([A-Za-z0-9]+)(;|\s)" without the square brackets instead?

anyways, we still need a solution to not parse entitities in CDATA.

mattgarrish · 2017-11-22T23:34:37Z

Either that or just remove the pipe. The brackets establish a character set. I don't know if the capture groups are used or not, but they're also potentially superfluous. Could just be:

&[A-Za-z0-9]+[;\s]

Can't help with the harder cdata problem as I only hack at java. You probably have to pre-process out the data blocks before analyzing. Maybe also comments, but haven't tested those.

mattgarrish · 2017-11-23T00:49:14Z

Maybe also comments, but haven't tested those.

To confirm, it's dumb validation, so also affects comments.

I'm wondering why this check is even run? What is it doing that the default HTML validation doesn't catch?

I've yet to find any value to it, at any rate, other than to make mistakes. XHTML doesn't allow entities without a closing semi-colon, so that part of the check is useless, and declaring named entities beyond the base xml set generates errors, so you wind up with an error and a warning.

JayPanoz · 2017-11-24T17:40:33Z

Here we go, placeholder file in a ZIP (it contains an XHTML file but github doesn't support xhtml for attached files).

void-epubcheck.zip

kamorrissey · 2018-04-26T04:37:39Z

I'm not clear on what the resolution is. Is it to make sure that named entities are not checked for inside comments and CDATA?

kamorrissey · 2018-04-26T04:39:34Z

Oh, and to fix the regex for named entities?

kamorrissey · 2018-04-27T04:33:55Z

I reread this thread, and now I'm really not sure what the resolution is since there's talk that the entity check is redundant with a check elsewhere.

tofi86 · 2018-04-27T10:59:43Z

Since it was me adding the label "reviewed & ready for implementation", I have to wrap my head around it a second time I think...

I think I added the label because it's clearly a bug and we have demo data which are reproducable.
However, there's currently not a clear solution to it.

Looking at the EntitySearch.java class, this whole thing is a nightmare. As are other similar classes in the ctc package (see below).
This class tries to parse an X(HT)ML file line-by-line(!) which is a blocker for context sensitive information items like things INSIDE other elements/structures.

There's a similar parsing class in the ctc package HTMLTagsAnalyseHandler.java which handles XML validation like this and tries to implement context sensitive line-by-line parsing like at https://github.com/IDPF/epubcheck/blob/master/src/main/java/com/adobe/epubcheck/ctc/xml/HTMLTagsAnalyseHandler.java#L343 which is IMHO a huge pile of sh*** 😲 (and causes trouble at #446 as well).

All of this would need refactoring asap.

But I'm not sure whether that's a solution for this issue here.

I'm still on parental leave and my time is very limited at the moment, so unfortunately I cannot make up a good solution for this at the moment :-/ sorry.

kmorrissey-mersive · 2018-05-02T20:29:28Z

I'm thinking of tackling this one. What I'd like to do is to parse into individual tags (crudely, "<[^>]+>"; CDATA and comments require more sophistication than that) and non-tag (plain text) data. Then apply the character entity checking only to opening tags and non-tag data, ignoring comments, closing tags, and CDATA. I think the opening (& closed) tags can be searched naively.

Does that sound like a good approach?

Also, it seems to be a mistake to consider something to be a character entity if it doesn't end with a semicolon. Is part of the solution here to warn about instances of "&" that don't seem to begin a legal character or numeric entity? (I have to do my homework to see about my terminology.)

tofi86 · 2018-05-03T07:55:07Z

Also, it seems to be a mistake to consider something to be a character entity if it doesn't end with a semicolon.

Na, that's indeed the XML rules! So everything beginning with & needs to be treated as an entity. At least in XHTML. Out of my head, I don't know the rules for HTML5, but I would treat it the same.

Is part of the solution here to warn about instances of "&" that don't seem to begin a legal character or numeric entity?

The main goal should be to exclude entity parsing in CDATA blocks (the only place where an & is not treated as the beginning of an entity).

Parsing for legal entities is too hard I think. We could certainly check for decimal entities (Ӓ, can have 4+ digits), for hexadeciaml entities (&#202F;, can have 4+ chars) or named entities ( , there's a list of allowed entities in the (X)HTML DTD e.g. XHTML 1.1 DTD EPUB 2). But I don't know whether this is already checked somewhere else...

tofi86 · 2018-05-03T07:59:14Z

Does that sound like a good approach?

If you have a look at the linked TMLTagsAnalyseHandler.java above, you will see the marked line starting with if (inBlockQuote ... . Original Authors parsed line by line and if they found an opening tag <blockquote> they set inBlockQuote = true; and if they found the next occurance of closing </blockquote> the reset the var to false.

You could use this mechanism to parse for CDATA markers (start <![CDATA[, end ]]>) line by line and set a global var if you're inside a CDATA section.

If I got you right, that's basically what you suggested, right?

- suppress poorly implemented checks `HTM-023` and `HTM-024` (entity references are already checked at parsing time) - add more tests for entity references Fixes #800

tofi86 added the status: waiting for feedback The development team needs feedback from the issue’s creator label Nov 21, 2017

tofi86 added type: bug The issue describes a bug status: ready for implem The issue is ready to be implemented and removed status: waiting for feedback The development team needs feedback from the issue’s creator labels Nov 26, 2017

rdeltour mentioned this issue Jan 21, 2019

fix: do not check entity references in comment or CDATA #949

Merged

rdeltour added status: has PR The issue is being processed in a pull request and removed status: ready for implem The issue is ready to be implemented labels Jan 21, 2019

rdeltour added this to the 4.1.1 milestone Jan 21, 2019

rdeltour closed this as completed in #949 Jan 21, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Invalid XHTML names entities reported in Character Data strings #800

Invalid XHTML names entities reported in Character Data strings #800

JayPanoz commented Oct 20, 2017

tofi86 commented Nov 21, 2017

JayPanoz commented Nov 22, 2017

tofi86 commented Nov 22, 2017

mattgarrish commented Nov 22, 2017

mattgarrish commented Nov 22, 2017

tofi86 commented Nov 22, 2017

mattgarrish commented Nov 22, 2017

mattgarrish commented Nov 23, 2017

JayPanoz commented Nov 24, 2017

kamorrissey commented Apr 26, 2018

kamorrissey commented Apr 26, 2018

kamorrissey commented Apr 27, 2018

tofi86 commented Apr 27, 2018

kmorrissey-mersive commented May 2, 2018 •

edited

Loading

tofi86 commented May 3, 2018

tofi86 commented May 3, 2018

Invalid XHTML names entities reported in Character Data strings #800

Invalid XHTML names entities reported in Character Data strings #800

Comments

JayPanoz commented Oct 20, 2017

tofi86 commented Nov 21, 2017

JayPanoz commented Nov 22, 2017

tofi86 commented Nov 22, 2017

mattgarrish commented Nov 22, 2017

mattgarrish commented Nov 22, 2017

tofi86 commented Nov 22, 2017

mattgarrish commented Nov 22, 2017

mattgarrish commented Nov 23, 2017

JayPanoz commented Nov 24, 2017

kamorrissey commented Apr 26, 2018

kamorrissey commented Apr 26, 2018

kamorrissey commented Apr 27, 2018

tofi86 commented Apr 27, 2018

kmorrissey-mersive commented May 2, 2018 • edited Loading

tofi86 commented May 3, 2018

tofi86 commented May 3, 2018

kmorrissey-mersive commented May 2, 2018 •

edited

Loading