Skip to content

Commit

Permalink
fix: overhaul SAX::Parser encoding handling (#3288)
Browse files Browse the repository at this point in the history
**What problem is this PR intended to solve?**

Previously, encoding overrides were not implemented for
XML::SAX::Parser#parse_memory (as reported in #918) and
XML::SAX::Parser#parse_file.

However, this commit goes further and significantly simplifies and
unifies the two SAX::ParserContext implementations and the two
SAX::Parser implementations.

This commit also allows Encoding objects and encoding names to be passed
into the SAX::ParserContext methods, and the XML memory and file methods
now accept and properly use passed encodings.

Finally, this commit also backfills a lot of test coverage for the XML
and the HTML4 sax parser encoding.

Closes #918


**Have you included adequate test coverage?**

Yes.


**Does this change affect the behavior of either the C or the Java
implementations?**

Yes, but they are more consistent with each other.
  • Loading branch information
flavorjones committed Jul 7, 2024
2 parents 3062700 + f67b294 commit 1ba1db1
Show file tree
Hide file tree
Showing 24 changed files with 1,154 additions and 545 deletions.
36 changes: 34 additions & 2 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,9 +15,39 @@ Nokogiri follows [Semantic Versioning](https://semver.org/), please see the [REA
* [CRuby] Update to rake-compiler-dock v1.5.1 for building precompiled native gems. [#3216] @flavorjones


### Notable changes

#### SAX Parsers

The XML and HTML4 SAX parsers have received a lot of attention in this release, and we've fixed multiple long-standing bugs with encoding and entity handling. In addition, libxml2 v2.13 has also made some underlying fixes and improvements to encoding and entity handling.

We're shipping these fixes in a minor release because we firmly believe the resulting behavior is correct and standards-compliant, however applications that have been depending on the buggy behavior may be impacted.

If your application relies on the SAX parsers, and in particular if you're SAX-parsing documents with parsed entities or incorrect encoding declarations, please read the changelog below carefully.


#### Fragment parsing

Document fragment parsing has been improved, particularly with respect to handling malformed fragments or fragments with implicit namespace prefixes. Namespace reconciliation still isn't where we want it to be, but it's an improvement.

HTML5 fragment parsing now allows the context node to be specified as a keyword argument to the `HTML5::DocumentFragment.parse` and `.new` methods, which in particular should allow for more flexible sanitization and support for the [draft HTML Sanitizer API](https://wicg.github.io/sanitizer-api/) in downstream libraries.


#### Error handling

In scenarios where multiple errors could be reported by the underlying parser, the errors will be aggregated into a single `Nokogiri::XML::SyntaxError` that is raised. Previously only the final error reported by libxml2 was raised which was often misleading if it was only a warning and not the fatal error.


#### Schema validation

We've resolved many long-standing bugs in the various schema classes, validation methods, and their error reporting. Behavior is now consistent across schema types and input types, as well as parser backends (Xerces and libxml2).


### Added

* Introduce support for a new SAX callback `XML::SAX::Document#reference`, which is called to report some parsed XML entities when `SAX::ParserContext#replace_entities` is set to the default value `false`. This is necessary functionality for some applications that were previously relying on incorrect entity error reporting which has been fixed (see below). For more information, read the docs for `Nokogiri::XML::SAX::Document`. [#1926] @flavorjones
* Introduce support for a new SAX callback `XML::SAX::Document#reference`, which is called to report some parsed XML entities when `XML::SAX::ParserContext#replace_entities` is set to the default value `false`. This is necessary functionality for some applications that were previously relying on incorrect entity error reporting which has been fixed (see below). For more information, read the docs for `Nokogiri::XML::SAX::Document`. [#1926] @flavorjones
* `XML::SAX::Parser#parse_memory` and `#parse_file` now accept an optional `encoding` argument. When not provided, the parser will fall back to the encoding passed to the initializer, and then fall back to autodetection. [#3288] @flavorjones
* `XML::SAX::ParserContext.memory` now accepts an optional `encoding` argument. When not provided, the encoding will be autodetected. [#3288] @flavorjones
* [CRuby] `Nokogiri::HTML5::Builder` is similar to `HTML4::Builder` but returns an `HTML5::Document`. [#3119] @flavorjones
* [CRuby] Attributes in an HTML5 document can be serialized individually, something that has always been supported by the HTML4 serializer. [#3125, #3127] @flavorjones
* [CRuby] Introduce a compile-time option, `--disable-xml2-legacy`, to remove from libxml2 its dependencies on `zlib` and `liblzma` and disable implicit `HTTP` network requests. These all remain enabled by default, and are present in the precompiled native gems. This option is a precursor for removing these libraries in a future major release, but may be interesting for the security-minded who do not need features like automatic decompression and would like to remove these dependencies. You can read more and give feedback on these plans in #3168. [#3247] @flavorjones
Expand All @@ -26,8 +56,9 @@ Nokogiri follows [Semantic Versioning](https://semver.org/), please see the [REA
### Improved

* Documentation has been improved for `CSS.xpath_for`. [#3224] @flavorjones
* Documentation for the SAX parsing classes has been greatly improved, including the complex entity-handling behavior. [#3265] @flavorjones
* Documentation for the SAX parsing classes has been greatly improved, including encoding overrides and the complex entity-handling behavior. [#3265] @flavorjones
* `XML::Schema#read_memory` and `XML::RelaxNG#read_memory` are now Ruby methods that call `#from_document`. Previously these were native functions, but they were buggy on both CRuby and JRuby (but worse on JRuby) and so this is now useful, comparable in performance, and simpler code that is easier to maintain. [#2113, #2115] @flavorjones
* `XML::SAX::ParserContext.io`'s `encoding` argument is now optional, and can now be an `Encoding` or an encoding name. When not provided will default to autodetecting the encoding. [#3288] @flavorjones
* [CRuby] When compiling packaged libraries from source, allow users' `AR` and `LD` environment variables to set the archiver and linker commands, respectively. This augments the existing `CC` environment variable to set the compiler command. [#3165] @ziggythehamster
* [CRuby] The HTML5 parse methods accept a `:parse_noscript_content_as_text` keyword argument which will emulate the parsing behavior of a browser which has scripting enabled. [#3178, #3231] @stevecheckoway
* [CRuby] `HTML5::DocumentFragment.parse` and `.new` accept a `:context` keyword argument that is the parse context node or element name. Previously this could only be passed in as a positional argument to `.new` and not at all to `.parse`. @flavorjones
Expand Down Expand Up @@ -70,6 +101,7 @@ Nokogiri follows [Semantic Versioning](https://semver.org/), please see the [REA
* The undocumented and unused method `Nokogiri::CSS.parse` is now deprecated and will generate a warning. The AST returned by this method is private and subject to change and removal in future versions of Nokogiri. This method will be removed in a future version of Nokogiri.
* Passing an options hash to `CSS.xpath_for` is now deprecated and will generate a warning. Use keyword arguments instead. This will become an error in a future version of Nokogiri.
* Passing an options hash to `HTML5::DocumentFragment.parse` is now deprecated and will generate a warning. Use keyword arguments instead. This will become an error in a future version of Nokogiri.
* Passing libxml2 encoding IDs to `SAX::ParserContext` methods is now deprecated and will generate a warning. The use of `SAX::Parser::ENCODINGS` is also deprecaed. Use `Encoding` objects or encoding names instead.


## v1.16.6 / 2024-06-13
Expand Down
215 changes: 44 additions & 171 deletions ext/java/nokogiri/Html4SaxParserContext.java
Original file line number Diff line number Diff line change
Expand Up @@ -2,16 +2,12 @@

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.xerces.parsers.AbstractSAXParser;
import net.sourceforge.htmlunit.cyberneko.parsers.SAXParser;
import org.jruby.Ruby;
import org.jruby.RubyClass;
import org.jruby.RubyEncoding;
import org.jruby.RubyFixnum;
import org.jruby.RubyString;
import org.jruby.anno.JRubyClass;
Expand All @@ -23,6 +19,8 @@
import nokogiri.internals.NokogiriHandler;
import static nokogiri.internals.NokogiriHelpers.rubyStringToString;

import static org.jruby.runtime.Helpers.invoke;

/**
* Class for Nokogiri::HTML4::SAX::ParserContext.
*
Expand Down Expand Up @@ -71,198 +69,73 @@ public class Html4SaxParserContext extends XmlSaxParserContext
}
}

@JRubyMethod(name = "memory", meta = true)
@JRubyMethod(name = "native_memory", meta = true)
public static IRubyObject
parse_memory(ThreadContext context,
IRubyObject klazz,
IRubyObject data,
IRubyObject encoding)
parse_memory(ThreadContext context, IRubyObject klazz, IRubyObject data, IRubyObject encoding)
{
Html4SaxParserContext ctx = Html4SaxParserContext.newInstance(context.runtime, (RubyClass) klazz);
String javaEncoding = findEncodingName(context, encoding);
if (javaEncoding != null) {
CharSequence input = applyEncoding(rubyStringToString(data.convertToString()), javaEncoding);
ByteArrayInputStream istream = new ByteArrayInputStream(input.toString().getBytes());
ctx.setInputSource(istream);
ctx.getInputSource().setEncoding(javaEncoding);
}
return ctx;
}

public enum EncodingType {
NONE(0, "NONE"),
UTF_8(1, "UTF-8"),
UTF16LE(2, "UTF16LE"),
UTF16BE(3, "UTF16BE"),
UCS4LE(4, "UCS4LE"),
UCS4BE(5, "UCS4BE"),
EBCDIC(6, "EBCDIC"),
UCS4_2143(7, "ICS4-2143"),
UCS4_3412(8, "UCS4-3412"),
UCS2(9, "UCS2"),
ISO_8859_1(10, "ISO-8859-1"),
ISO_8859_2(11, "ISO-8859-2"),
ISO_8859_3(12, "ISO-8859-3"),
ISO_8859_4(13, "ISO-8859-4"),
ISO_8859_5(14, "ISO-8859-5"),
ISO_8859_6(15, "ISO-8859-6"),
ISO_8859_7(16, "ISO-8859-7"),
ISO_8859_8(17, "ISO-8859-8"),
ISO_8859_9(18, "ISO-8859-9"),
ISO_2022_JP(19, "ISO-2022-JP"),
SHIFT_JIS(20, "SHIFT-JIS"),
EUC_JP(21, "EUC-JP"),
ASCII(22, "ASCII");

private final int value;
private final String name;

EncodingType(int value, String name)
{
this.value = value;
this.name = name;
}

public int getValue()
{
return value;
}

public String toString()
{
return name;
}

private static transient EncodingType[] values;

// NOTE: assuming ordinal == value
static EncodingType get(final int ordinal)
{
EncodingType[] values = EncodingType.values;
if (values == null) {
values = EncodingType.values();
EncodingType.values = values;
String java_encoding = null;
if (encoding != context.runtime.getNil()) {
if (!(encoding instanceof RubyEncoding)) {
throw context.runtime.newTypeError("encoding must be kind_of Encoding");
}
if (ordinal >= 0 && ordinal < values.length) {
return values[ordinal];
}
return null;
java_encoding = ((RubyEncoding)encoding).toString();
}

}

private static String
findEncodingName(final int value)
{
EncodingType type = EncodingType.get(value);
if (type == null) { return null; }
assert type.value == value;
return type.name;
}

private static String
findEncodingName(ThreadContext context, IRubyObject encoding)
{
String rubyEncoding = null;
if (encoding instanceof RubyString) {
rubyEncoding = rubyStringToString((RubyString) encoding);
} else if (encoding instanceof RubyFixnum) {
rubyEncoding = findEncodingName(RubyFixnum.fix2int((RubyFixnum) encoding));
}
if (rubyEncoding == null) { return null; }
try {
return Charset.forName(rubyEncoding).displayName();
} catch (UnsupportedCharsetException e) {
throw context.getRuntime().newEncodingCompatibilityError(rubyEncoding + "is not supported");
} catch (IllegalCharsetNameException e) {
throw context.getRuntime().newEncodingError(e.getMessage());
}
}

private static final Pattern CHARSET_PATTERN = Pattern.compile("charset(()|\\s)=(()|\\s)([a-z]|-|_|\\d)+",
Pattern.CASE_INSENSITIVE);
Html4SaxParserContext ctx = Html4SaxParserContext.newInstance(context.runtime, (RubyClass) klazz);
ctx.setStringInputSourceNoEnc(context, data, context.runtime.getNil());

private static CharSequence
applyEncoding(final String input, final String enc)
{
int start_pos = 0;
int end_pos = 0;
if (containsIgnoreCase(input, "charset")) {
Matcher m = CHARSET_PATTERN.matcher(input);
while (m.find()) {
start_pos = m.start();
end_pos = m.end();
}
if (java_encoding != null) {
ctx.getInputSource().setEncoding(java_encoding);
}
if (start_pos != end_pos) {
return new StringBuilder(input).replace(start_pos, end_pos, "charset=" + enc);
}
return input;
}

private static boolean
containsIgnoreCase(final String str, final String sub)
{
final int len = sub.length();
final int max = str.length() - len;

if (len == 0) { return true; }
final char c0Lower = Character.toLowerCase(sub.charAt(0));
final char c0Upper = Character.toUpperCase(sub.charAt(0));

for (int i = 0; i <= max; i++) {
final char ch = str.charAt(i);
if (ch != c0Lower && Character.toLowerCase(ch) != c0Lower && Character.toUpperCase(ch) != c0Upper) {
continue; // first char doesn't match
}

if (str.regionMatches(true, i + 1, sub, 0 + 1, len - 1)) {
return true;
}
}
return false;
return ctx;
}

@JRubyMethod(name = "file", meta = true)
@JRubyMethod(name = "native_file", meta = true)
public static IRubyObject
parse_file(ThreadContext context,
IRubyObject klass,
IRubyObject data,
IRubyObject encoding)
parse_file(ThreadContext context, IRubyObject klass, IRubyObject data, IRubyObject encoding)
{
if (!(data instanceof RubyString)) {
throw context.getRuntime().newTypeError("data must be kind_of String");
}
if (!(encoding instanceof RubyString)) {
throw context.getRuntime().newTypeError("data must be kind_of String");
String java_encoding = null;
if (encoding != context.runtime.getNil()) {
if (!(encoding instanceof RubyEncoding)) {
throw context.runtime.newTypeError("encoding must be kind_of Encoding");
}
java_encoding = ((RubyEncoding)encoding).toString();
}

Html4SaxParserContext ctx = Html4SaxParserContext.newInstance(context.runtime, (RubyClass) klass);
ctx.setInputSourceFile(context, data);
String javaEncoding = findEncodingName(context, encoding);
if (javaEncoding != null) {
ctx.getInputSource().setEncoding(javaEncoding);

if (java_encoding != null) {
ctx.getInputSource().setEncoding(java_encoding);
}

return ctx;
}

@JRubyMethod(name = "io", meta = true)
@JRubyMethod(name = "native_io", meta = true)
public static IRubyObject
parse_io(ThreadContext context,
IRubyObject klass,
IRubyObject data,
IRubyObject encoding)
parse_io(ThreadContext context, IRubyObject klazz, IRubyObject data, IRubyObject encoding)
{
if (!(encoding instanceof RubyFixnum)) {
throw context.getRuntime().newTypeError("encoding must be kind_of String");
if (!invoke(context, data, "respond_to?", context.runtime.newSymbol("read")).isTrue()) {
throw context.runtime.newTypeError("argument expected to respond to :read");
}

Html4SaxParserContext ctx = Html4SaxParserContext.newInstance(context.runtime, (RubyClass) klass);
String java_encoding = null;
if (encoding != context.runtime.getNil()) {
if (!(encoding instanceof RubyEncoding)) {
throw context.runtime.newTypeError("encoding must be kind_of Encoding");
}
java_encoding = ((RubyEncoding)encoding).toString();
}

Html4SaxParserContext ctx = Html4SaxParserContext.newInstance(context.runtime, (RubyClass) klazz);
ctx.setIOInputSource(context, data, context.nil);
String javaEncoding = findEncodingName(context, encoding);
if (javaEncoding != null) {
ctx.getInputSource().setEncoding(javaEncoding);

if (java_encoding != null) {
ctx.getInputSource().setEncoding(java_encoding);
}

return ctx;
}

Expand Down
Loading

0 comments on commit 1ba1db1

Please sign in to comment.