-
-
Notifications
You must be signed in to change notification settings - Fork 903
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
replace_entities ignored for SAX parser (nokogiri 1.5.0-java / jruby 1.6.6) #614
Comments
Did some digging; it looks like At a bit of a loss here; maybe the solution is to write a custom entityResolver that just 'resolves' the entity back to its original content? Not sure if that'll throw the parser into a loop... Looks like I'm not alone; if the JRuby author is having a hard time getting Xerces to play ball, I don't feel so bad: http://groups.google.com/group/nokogiri-talk/browse_thread/thread/bbd4319811a4014e?fwc=1 |
Thanks for digging in this issue. I'll have a look. |
I don't think that's relevant anymore and I can't tell due to the lack of reproducible code. It seems to me that the JRuby implementation is passing all the sax parser tests that were referenced on the mailing list, so i'll go ahead and assume that the issue was fixed. Please reopen if you feel that the issue isn't fixed and have a code snippet that reproduces your problem. Cheers, |
Reopening this, uncovered this bug again while investigating #1926. I've got a fix, and will close this once that PR is green and merged. |
On CRuby, this fixes the fact that the parser was registering errors when encountering general (non-predefined) entities. Now these entities are resolved properly and converted into `#characters` callbacks. Fixes #1926. On JRuby, the SAX parser now respects the `#replace_entities` attribute, which was previously ignored AND defaulted incorrectly to `true`. The default now matches CRuby -- `false` -- and the parser behavior matches CRuby with respect to entities. Fixes #614. This commit also includes some granular tests of how the sax parser handles different entities under different circumstances, which should be clarifying for user reports like #1284 and #1500 that expect predefined entities and character references to be treated like parsed entities (which they aren't).
This is fixed in #3265. |
**What problem is this PR intended to solve?** #1926 described an issue wherein the SAX parser was not correctly resolving and replacing internal entities, and was instead reporting an error for each entity reference. This PR includes a fix for that problem. I've removed the unnecessary "SAX tuple" from the SAX implementation, replacing it with the `_private` struct member that libxml2 makes available. Then I set up the parser context structs so that we can use libxml2's standard SAX callbacks where they're useful (which is how I addressed the above issue). This PR also introduces a new feature, a SAX handler callback `Document#reference` which allows callers to get entity-specific name and replacement text information (rather than relying on the `Document#characters` callback). This can be used to solve the original issue in #1926 with this code: searls/eiwa#11 The behavior of the SAX parser with respect to entities is complex enough that I wrote up a short doc in the `XML::SAX::Document` docstring with a table and explanation. I've also added warnings to remind users that `#replace_entities` is not safe to set when parsing untrusted documents. In the Java implementation, I've fixed the `#replace_entities` option in the SAX parser context and set it to the proper default (`false`), fixing #614. I've also corrected the value of the URI argument to `Document#start_element_namespace` which was a blank string when it should have been `nil`. I've added quite a bit of testing around the SAX parser's handling of entities. I added and clarified quite a bit of documentation around SAX parsing generally. Exception messages have been clarified in a couple of places, and made consistent between the C and Java implementations. This should address questions asked in issues #1500 and #1284. Finally, I cleaned up some of the C code that implements SAX parsing, naming functions more explicitly (and moving towards some kind of standard naming convention). Closes #1926. Closes #614. **Have you included adequate test coverage?** Yes! **Does this change affect the behavior of either the C or the Java implementations?** Yes, but the implementations are much more consistent with each other now.
I'm attempting to use a SAX parser on a very large XML document with a custom DTD, and don't want to replace entities with their DTD-defined values.
I can't see a way with Nokogiri::XML::SAX::Document to specify ParserOptions (to give a go with Nokogiri::XML::ParseOptions::NOENT), as the initializer for SAX::Document doesn't seem to take any arguments, and when I try to set replace_entities, it just seems to be ignored:
Attempting to parse XML with this throws an error:
Which seems to indicate that entities are being processed.
Digging into the extension code, I see a
private IRubyObject replaceEntities;
inext/java/nokogiri/XmlSaxParserContext.java
, and this appears to get set by the 'replace_entities' methods, but from there I can't see howreplaceEntities
gets applied -- there's nothing about it in the Xerces docs, and it's not referenced in any other file.Am I missing something?
If not, maybe the way to go about this is to use setFeature on the Xerces parser? Looking at https://xerces.apache.org/xerces2-j/features.html, using one of these might do the trick:
I'll have a go with these later on, assuming nobody chimes in to point out where I missed something...
The text was updated successfully, but these errors were encountered: