Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

replace_entities ignored for SAX parser (nokogiri 1.5.0-java / jruby 1.6.6) #614

Closed
matadon opened this issue Feb 10, 2012 · 5 comments · Fixed by #3265
Closed

replace_entities ignored for SAX parser (nokogiri 1.5.0-java / jruby 1.6.6) #614

matadon opened this issue Feb 10, 2012 · 5 comments · Fixed by #3265

Comments

@matadon
Copy link

matadon commented Feb 10, 2012

I'm attempting to use a SAX parser on a very large XML document with a custom DTD, and don't want to replace entities with their DTD-defined values.

I can't see a way with Nokogiri::XML::SAX::Document to specify ParserOptions (to give a go with Nokogiri::XML::ParseOptions::NOENT), as the initializer for SAX::Document doesn't seem to take any arguments, and when I try to set replace_entities, it just seems to be ignored:

 file = "myfile.xml"
document = NokogiriXMLSAXDocumentSubclass.new
parser = Nokogiri::XML::SAX::Parser.new(document)
parser.parse_file(file) do |context|
    context.replace_entities = false
end

Attempting to parse XML with this throws an error:

The entity "entityname" was referenced, but not declared.

Which seems to indicate that entities are being processed.

Digging into the extension code, I see a private IRubyObject replaceEntities; in ext/java/nokogiri/XmlSaxParserContext.java, and this appears to get set by the 'replace_entities' methods, but from there I can't see how replaceEntities gets applied -- there's nothing about it in the Xerces docs, and it's not referenced in any other file.

Am I missing something?

If not, maybe the way to go about this is to use setFeature on the Xerces parser? Looking at https://xerces.apache.org/xerces2-j/features.html, using one of these might do the trick:

http://xml.org/sax/features/use-entity-resolver2
http://apache.org/xml/features/validation/unparsed-entity-checking

I'll have a go with these later on, assuming nobody chimes in to point out where I missed something...

@matadon
Copy link
Author

matadon commented Feb 10, 2012

Did some digging; it looks like replaceEntities is only used by the C++ version of Xerces, if the 'http://apache.org/xml/features/disable-default-entity-resolution feature is set. Under Java, this feature just throws an exception.

At a bit of a loss here; maybe the solution is to write a custom entityResolver that just 'resolves' the entity back to its original content? Not sure if that'll throw the parser into a loop...

Looks like I'm not alone; if the JRuby author is having a hard time getting Xerces to play ball, I don't feel so bad:

http://groups.google.com/group/nokogiri-talk/browse_thread/thread/bbd4319811a4014e?fwc=1

@yokolet
Copy link
Member

yokolet commented Feb 20, 2012

Thanks for digging in this issue. I'll have a look.

@jvshahid
Copy link
Member

I don't think that's relevant anymore and I can't tell due to the lack of reproducible code. It seems to me that the JRuby implementation is passing all the sax parser tests that were referenced on the mailing list, so i'll go ahead and assume that the issue was fixed. Please reopen if you feel that the issue isn't fixed and have a code snippet that reproduces your problem.

Cheers,

@flavorjones
Copy link
Member

Reopening this, uncovered this bug again while investigating #1926. I've got a fix, and will close this once that PR is green and merged.

@flavorjones flavorjones reopened this Jun 30, 2024
@flavorjones flavorjones added this to the v1.17.0 milestone Jun 30, 2024
flavorjones added a commit that referenced this issue Jul 1, 2024
On CRuby, this fixes the fact that the parser was registering errors
when encountering general (non-predefined) entities. Now these
entities are resolved properly and converted into `#characters`
callbacks. Fixes #1926.

On JRuby, the SAX parser now respects the `#replace_entities`
attribute, which was previously ignored AND defaulted incorrectly to
`true`. The default now matches CRuby -- `false` -- and the parser
behavior matches CRuby with respect to entities. Fixes #614.

This commit also includes some granular tests of how the sax parser
handles different entities under different circumstances, which should
be clarifying for user reports like #1284 and #1500 that expect
predefined entities and character references to be treated like parsed
entities (which they aren't).
@flavorjones
Copy link
Member

This is fixed in #3265.

flavorjones added a commit that referenced this issue Jul 2, 2024
**What problem is this PR intended to solve?**

#1926 described an issue wherein the SAX parser was not correctly
resolving and replacing internal entities, and was instead reporting an
error for each entity reference. This PR includes a fix for that
problem.

I've removed the unnecessary "SAX tuple" from the SAX implementation,
replacing it with the `_private` struct member that libxml2 makes
available. Then I set up the parser context structs so that we can use
libxml2's standard SAX callbacks where they're useful (which is how I
addressed the above issue).

This PR also introduces a new feature, a SAX handler callback
`Document#reference` which allows callers to get entity-specific name
and replacement text information (rather than relying on the
`Document#characters` callback). This can be used to solve the original
issue in #1926 with this code: searls/eiwa#11

The behavior of the SAX parser with respect to entities is complex
enough that I wrote up a short doc in the `XML::SAX::Document` docstring
with a table and explanation. I've also added warnings to remind users
that `#replace_entities` is not safe to set when parsing untrusted
documents.

In the Java implementation, I've fixed the `#replace_entities` option in
the SAX parser context and set it to the proper default (`false`),
fixing #614. I've also corrected the value of the URI argument to
`Document#start_element_namespace` which was a blank string when it
should have been `nil`.

I've added quite a bit of testing around the SAX parser's handling of
entities.

I added and clarified quite a bit of documentation around SAX parsing
generally. Exception messages have been clarified in a couple of places,
and made consistent between the C and Java implementations. This should
address questions asked in issues #1500 and #1284.

Finally, I cleaned up some of the C code that implements SAX parsing,
naming functions more explicitly (and moving towards some kind of
standard naming convention).

Closes #1926.
Closes #614.


**Have you included adequate test coverage?**

Yes!


**Does this change affect the behavior of either the C or the Java
implementations?**

Yes, but the implementations are much more consistent with each other
now.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants