-
Notifications
You must be signed in to change notification settings - Fork 408
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Invalid XHTML names entities reported in Character Data strings #800
Comments
Hey Jiminy, can you please attach a simple HTML file which can be run against the current EpubCheck build and also post the offending error message? Thanks! |
I’ll see what I can do (publishers’ right and stuff). I guess I can just use a placeholder HTML file with the script if I don’t have permission to share the problematic XHTML file but won’t be able to deal with this until Friday. |
A sample/dummy HTML file would be enough. Thanks Jiminy. |
On a cursory review, the problem looks like it's that EntitySearch.java goes line-by-line over an input stream looking for this pattern: static final Pattern entityPattern = Pattern.compile("&([A-Za-z0-9]+)([;|\s])"); It doesn't care that there is an active CDATA block. So long as it finds an ampersand followed by one or more alphanumeric characters followed by a colon or space it will emit the warning. That's why '&&void ' gets a warning but '&&void(' does not. |
Even that pattern is a bit screwy, to be honest, as the pipe is not an OR inside a character set, and is not a delimiter of the end of an entity. |
thanks matt for having a look. anyways, we still need a solution to not parse entitities in CDATA. |
Either that or just remove the pipe. The brackets establish a character set. I don't know if the capture groups are used or not, but they're also potentially superfluous. Could just be:
Can't help with the harder cdata problem as I only hack at java. You probably have to pre-process out the data blocks before analyzing. Maybe also comments, but haven't tested those. |
To confirm, it's dumb validation, so also affects comments. I'm wondering why this check is even run? What is it doing that the default HTML validation doesn't catch? I've yet to find any value to it, at any rate, other than to make mistakes. XHTML doesn't allow entities without a closing semi-colon, so that part of the check is useless, and declaring named entities beyond the base xml set generates errors, so you wind up with an error and a warning. |
Here we go, placeholder file in a ZIP (it contains an XHTML file but github doesn't support xhtml for attached files). |
I'm not clear on what the resolution is. Is it to make sure that named entities are not checked for inside comments and CDATA? |
Oh, and to fix the regex for named entities? |
I reread this thread, and now I'm really not sure what the resolution is since there's talk that the entity check is redundant with a check elsewhere. |
Since it was me adding the label "reviewed & ready for implementation", I have to wrap my head around it a second time I think... I think I added the label because it's clearly a bug and we have demo data which are reproducable. Looking at the There's a similar parsing class in the ctc package All of this would need refactoring asap. But I'm not sure whether that's a solution for this issue here. I'm still on parental leave and my time is very limited at the moment, so unfortunately I cannot make up a good solution for this at the moment :-/ sorry. |
I'm thinking of tackling this one. What I'd like to do is to parse into individual tags (crudely, "<[^>]+>"; CDATA and comments require more sophistication than that) and non-tag (plain text) data. Then apply the character entity checking only to opening tags and non-tag data, ignoring comments, closing tags, and CDATA. I think the opening (& closed) tags can be searched naively. Does that sound like a good approach? Also, it seems to be a mistake to consider something to be a character entity if it doesn't end with a semicolon. Is part of the solution here to warn about instances of "&" that don't seem to begin a legal character or numeric entity? (I have to do my homework to see about my terminology.) |
Na, that's indeed the XML rules! So everything beginning with
The main goal should be to exclude entity parsing in CDATA blocks (the only place where an Parsing for legal entities is too hard I think. We could certainly check for decimal entities ( |
If you have a look at the linked You could use this mechanism to parse for CDATA markers (start If I got you right, that's basically what you suggested, right? |
- suppress poorly implemented checks `HTM-023` and `HTM-024` (entity references are already checked at parsing time) - add more tests for entity references Fixes #800
- suppress poorly implemented checks `HTM-023` and `HTM-024` (entity references are already checked at parsing time) - add more tests for entity references Fixes #800
epubcheck seems to report invalid XHTML named entities in
<![CDATA[]]>
, which I used in the first place to indicate that the<script>
data in between these strings includes data that could be interpreted as XML markup, but should not be.Here comes the tricky part:
&&void 0
(there’s a space);&&TweenLite.to()
, where there’s no space.As far as I can remember, this wasn’t reported when the book was published last year (this is a 2nd edition). And given
void 0
contains fewer characters thanundefined
, it’s quite used in uglyfiers/minifiers.Fix if you encounter this issue: use
void(0)
instead ofvoid 0
(they’re the same).What’s strange though, is that it is interpreted as XML markup despite
<![CDATA[]]>
.The text was updated successfully, but these errors were encountered: