diff --git a/CHANGELOG.md b/CHANGELOG.md index da16b06d69..fd76546856 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -15,9 +15,39 @@ Nokogiri follows [Semantic Versioning](https://semver.org/), please see the [REA * [CRuby] Update to rake-compiler-dock v1.5.1 for building precompiled native gems. [#3216] @flavorjones +### Notable changes + +#### SAX Parsers + +The XML and HTML4 SAX parsers have received a lot of attention in this release, and we've fixed multiple long-standing bugs with encoding and entity handling. In addition, libxml2 v2.13 has also made some underlying fixes and improvements to encoding and entity handling. + +We're shipping these fixes in a minor release because we firmly believe the resulting behavior is correct and standards-compliant, however applications that have been depending on the buggy behavior may be impacted. + +If your application relies on the SAX parsers, and in particular if you're SAX-parsing documents with parsed entities or incorrect encoding declarations, please read the changelog below carefully. + + +#### Fragment parsing + +Document fragment parsing has been improved, particularly with respect to handling malformed fragments or fragments with implicit namespace prefixes. Namespace reconciliation still isn't where we want it to be, but it's an improvement. + +HTML5 fragment parsing now allows the context node to be specified as a keyword argument to the `HTML5::DocumentFragment.parse` and `.new` methods, which in particular should allow for more flexible sanitization and support for the [draft HTML Sanitizer API](https://wicg.github.io/sanitizer-api/) in downstream libraries. + + +#### Error handling + +In scenarios where multiple errors could be reported by the underlying parser, the errors will be aggregated into a single `Nokogiri::XML::SyntaxError` that is raised. Previously only the final error reported by libxml2 was raised which was often misleading if it was only a warning and not the fatal error. 
+ + +#### Schema validation + +We've resolved many long-standing bugs in the various schema classes, validation methods, and their error reporting. Behavior is now consistent across schema types and input types, as well as parser backends (Xerces and libxml2). + + ### Added -* Introduce support for a new SAX callback `XML::SAX::Document#reference`, which is called to report some parsed XML entities when `SAX::ParserContext#replace_entities` is set to the default value `false`. This is necessary functionality for some applications that were previously relying on incorrect entity error reporting which has been fixed (see below). For more information, read the docs for `Nokogiri::XML::SAX::Document`. [#1926] @flavorjones +* Introduce support for a new SAX callback `XML::SAX::Document#reference`, which is called to report some parsed XML entities when `XML::SAX::ParserContext#replace_entities` is set to the default value `false`. This is necessary functionality for some applications that were previously relying on incorrect entity error reporting which has been fixed (see below). For more information, read the docs for `Nokogiri::XML::SAX::Document`. [#1926] @flavorjones +* `XML::SAX::Parser#parse_memory` and `#parse_file` now accept an optional `encoding` argument. When not provided, the parser will fall back to the encoding passed to the initializer, and then fall back to autodetection. [#3288] @flavorjones +* `XML::SAX::ParserContext.memory` now accepts an optional `encoding` argument. When not provided, the encoding will be autodetected. [#3288] @flavorjones * [CRuby] `Nokogiri::HTML5::Builder` is similar to `HTML4::Builder` but returns an `HTML5::Document`. [#3119] @flavorjones * [CRuby] Attributes in an HTML5 document can be serialized individually, something that has always been supported by the HTML4 serializer. 
[#3125, #3127] @flavorjones * [CRuby] Introduce a compile-time option, `--disable-xml2-legacy`, to remove from libxml2 its dependencies on `zlib` and `liblzma` and disable implicit `HTTP` network requests. These all remain enabled by default, and are present in the precompiled native gems. This option is a precursor for removing these libraries in a future major release, but may be interesting for the security-minded who do not need features like automatic decompression and would like to remove these dependencies. You can read more and give feedback on these plans in #3168. [#3247] @flavorjones @@ -26,8 +56,9 @@ Nokogiri follows [Semantic Versioning](https://semver.org/), please see the [REA ### Improved * Documentation has been improved for `CSS.xpath_for`. [#3224] @flavorjones -* Documentation for the SAX parsing classes has been greatly improved, including the complex entity-handling behavior. [#3265] @flavorjones +* Documentation for the SAX parsing classes has been greatly improved, including encoding overrides and the complex entity-handling behavior. [#3265] @flavorjones * `XML::Schema#read_memory` and `XML::RelaxNG#read_memory` are now Ruby methods that call `#from_document`. Previously these were native functions, but they were buggy on both CRuby and JRuby (but worse on JRuby) and so this is now useful, comparable in performance, and simpler code that is easier to maintain. [#2113, #2115] @flavorjones +* `XML::SAX::ParserContext.io`'s `encoding` argument is now optional, and can now be an `Encoding` or an encoding name. When not provided, the encoding will be autodetected. [#3288] @flavorjones * [CRuby] When compiling packaged libraries from source, allow users' `AR` and `LD` environment variables to set the archiver and linker commands, respectively. This augments the existing `CC` environment variable to set the compiler command. 
[#3165] @ziggythehamster * [CRuby] The HTML5 parse methods accept a `:parse_noscript_content_as_text` keyword argument which will emulate the parsing behavior of a browser which has scripting enabled. [#3178, #3231] @stevecheckoway * [CRuby] `HTML5::DocumentFragment.parse` and `.new` accept a `:context` keyword argument that is the parse context node or element name. Previously this could only be passed in as a positional argument to `.new` and not at all to `.parse`. @flavorjones @@ -70,6 +101,7 @@ Nokogiri follows [Semantic Versioning](https://semver.org/), please see the [REA * The undocumented and unused method `Nokogiri::CSS.parse` is now deprecated and will generate a warning. The AST returned by this method is private and subject to change and removal in future versions of Nokogiri. This method will be removed in a future version of Nokogiri. * Passing an options hash to `CSS.xpath_for` is now deprecated and will generate a warning. Use keyword arguments instead. This will become an error in a future version of Nokogiri. * Passing an options hash to `HTML5::DocumentFragment.parse` is now deprecated and will generate a warning. Use keyword arguments instead. This will become an error in a future version of Nokogiri. +* Passing libxml2 encoding IDs to `SAX::ParserContext` methods is now deprecated and will generate a warning. The use of `SAX::Parser::ENCODINGS` is also deprecated. Use `Encoding` objects or encoding names instead. 
## v1.16.6 / 2024-06-13 diff --git a/ext/java/nokogiri/Html4SaxParserContext.java b/ext/java/nokogiri/Html4SaxParserContext.java index b6944b8ffb..e167f11a3a 100644 --- a/ext/java/nokogiri/Html4SaxParserContext.java +++ b/ext/java/nokogiri/Html4SaxParserContext.java @@ -2,16 +2,12 @@ import java.io.ByteArrayInputStream; import java.io.InputStream; -import java.nio.charset.Charset; -import java.nio.charset.IllegalCharsetNameException; -import java.nio.charset.UnsupportedCharsetException; -import java.util.regex.Matcher; -import java.util.regex.Pattern; import org.apache.xerces.parsers.AbstractSAXParser; import net.sourceforge.htmlunit.cyberneko.parsers.SAXParser; import org.jruby.Ruby; import org.jruby.RubyClass; +import org.jruby.RubyEncoding; import org.jruby.RubyFixnum; import org.jruby.RubyString; import org.jruby.anno.JRubyClass; @@ -23,6 +19,8 @@ import nokogiri.internals.NokogiriHandler; import static nokogiri.internals.NokogiriHelpers.rubyStringToString; +import static org.jruby.runtime.Helpers.invoke; + /** * Class for Nokogiri::HTML4::SAX::ParserContext. 
* @@ -71,198 +69,73 @@ public class Html4SaxParserContext extends XmlSaxParserContext } } - @JRubyMethod(name = "memory", meta = true) + @JRubyMethod(name = "native_memory", meta = true) public static IRubyObject - parse_memory(ThreadContext context, - IRubyObject klazz, - IRubyObject data, - IRubyObject encoding) + parse_memory(ThreadContext context, IRubyObject klazz, IRubyObject data, IRubyObject encoding) { - Html4SaxParserContext ctx = Html4SaxParserContext.newInstance(context.runtime, (RubyClass) klazz); - String javaEncoding = findEncodingName(context, encoding); - if (javaEncoding != null) { - CharSequence input = applyEncoding(rubyStringToString(data.convertToString()), javaEncoding); - ByteArrayInputStream istream = new ByteArrayInputStream(input.toString().getBytes()); - ctx.setInputSource(istream); - ctx.getInputSource().setEncoding(javaEncoding); - } - return ctx; - } - - public enum EncodingType { - NONE(0, "NONE"), - UTF_8(1, "UTF-8"), - UTF16LE(2, "UTF16LE"), - UTF16BE(3, "UTF16BE"), - UCS4LE(4, "UCS4LE"), - UCS4BE(5, "UCS4BE"), - EBCDIC(6, "EBCDIC"), - UCS4_2143(7, "ICS4-2143"), - UCS4_3412(8, "UCS4-3412"), - UCS2(9, "UCS2"), - ISO_8859_1(10, "ISO-8859-1"), - ISO_8859_2(11, "ISO-8859-2"), - ISO_8859_3(12, "ISO-8859-3"), - ISO_8859_4(13, "ISO-8859-4"), - ISO_8859_5(14, "ISO-8859-5"), - ISO_8859_6(15, "ISO-8859-6"), - ISO_8859_7(16, "ISO-8859-7"), - ISO_8859_8(17, "ISO-8859-8"), - ISO_8859_9(18, "ISO-8859-9"), - ISO_2022_JP(19, "ISO-2022-JP"), - SHIFT_JIS(20, "SHIFT-JIS"), - EUC_JP(21, "EUC-JP"), - ASCII(22, "ASCII"); - - private final int value; - private final String name; - - EncodingType(int value, String name) - { - this.value = value; - this.name = name; - } - - public int getValue() - { - return value; - } - - public String toString() - { - return name; - } - - private static transient EncodingType[] values; - - // NOTE: assuming ordinal == value - static EncodingType get(final int ordinal) - { - EncodingType[] values = EncodingType.values; - 
if (values == null) { - values = EncodingType.values(); - EncodingType.values = values; + String java_encoding = null; + if (encoding != context.runtime.getNil()) { + if (!(encoding instanceof RubyEncoding)) { + throw context.runtime.newTypeError("encoding must be kind_of Encoding"); } - if (ordinal >= 0 && ordinal < values.length) { - return values[ordinal]; - } - return null; + java_encoding = ((RubyEncoding)encoding).toString(); } - } - - private static String - findEncodingName(final int value) - { - EncodingType type = EncodingType.get(value); - if (type == null) { return null; } - assert type.value == value; - return type.name; - } - - private static String - findEncodingName(ThreadContext context, IRubyObject encoding) - { - String rubyEncoding = null; - if (encoding instanceof RubyString) { - rubyEncoding = rubyStringToString((RubyString) encoding); - } else if (encoding instanceof RubyFixnum) { - rubyEncoding = findEncodingName(RubyFixnum.fix2int((RubyFixnum) encoding)); - } - if (rubyEncoding == null) { return null; } - try { - return Charset.forName(rubyEncoding).displayName(); - } catch (UnsupportedCharsetException e) { - throw context.getRuntime().newEncodingCompatibilityError(rubyEncoding + "is not supported"); - } catch (IllegalCharsetNameException e) { - throw context.getRuntime().newEncodingError(e.getMessage()); - } - } - - private static final Pattern CHARSET_PATTERN = Pattern.compile("charset(()|\\s)=(()|\\s)([a-z]|-|_|\\d)+", - Pattern.CASE_INSENSITIVE); + Html4SaxParserContext ctx = Html4SaxParserContext.newInstance(context.runtime, (RubyClass) klazz); + ctx.setStringInputSourceNoEnc(context, data, context.runtime.getNil()); - private static CharSequence - applyEncoding(final String input, final String enc) - { - int start_pos = 0; - int end_pos = 0; - if (containsIgnoreCase(input, "charset")) { - Matcher m = CHARSET_PATTERN.matcher(input); - while (m.find()) { - start_pos = m.start(); - end_pos = m.end(); - } + if (java_encoding != null) { + 
ctx.getInputSource().setEncoding(java_encoding); } - if (start_pos != end_pos) { - return new StringBuilder(input).replace(start_pos, end_pos, "charset=" + enc); - } - return input; - } - private static boolean - containsIgnoreCase(final String str, final String sub) - { - final int len = sub.length(); - final int max = str.length() - len; - - if (len == 0) { return true; } - final char c0Lower = Character.toLowerCase(sub.charAt(0)); - final char c0Upper = Character.toUpperCase(sub.charAt(0)); - - for (int i = 0; i <= max; i++) { - final char ch = str.charAt(i); - if (ch != c0Lower && Character.toLowerCase(ch) != c0Lower && Character.toUpperCase(ch) != c0Upper) { - continue; // first char doesn't match - } - - if (str.regionMatches(true, i + 1, sub, 0 + 1, len - 1)) { - return true; - } - } - return false; + return ctx; } - @JRubyMethod(name = "file", meta = true) + @JRubyMethod(name = "native_file", meta = true) public static IRubyObject - parse_file(ThreadContext context, - IRubyObject klass, - IRubyObject data, - IRubyObject encoding) + parse_file(ThreadContext context, IRubyObject klass, IRubyObject data, IRubyObject encoding) { - if (!(data instanceof RubyString)) { - throw context.getRuntime().newTypeError("data must be kind_of String"); - } - if (!(encoding instanceof RubyString)) { - throw context.getRuntime().newTypeError("data must be kind_of String"); + String java_encoding = null; + if (encoding != context.runtime.getNil()) { + if (!(encoding instanceof RubyEncoding)) { + throw context.runtime.newTypeError("encoding must be kind_of Encoding"); + } + java_encoding = ((RubyEncoding)encoding).toString(); } Html4SaxParserContext ctx = Html4SaxParserContext.newInstance(context.runtime, (RubyClass) klass); ctx.setInputSourceFile(context, data); - String javaEncoding = findEncodingName(context, encoding); - if (javaEncoding != null) { - ctx.getInputSource().setEncoding(javaEncoding); + + if (java_encoding != null) { + 
ctx.getInputSource().setEncoding(java_encoding); } + return ctx; } - @JRubyMethod(name = "io", meta = true) + @JRubyMethod(name = "native_io", meta = true) public static IRubyObject - parse_io(ThreadContext context, - IRubyObject klass, - IRubyObject data, - IRubyObject encoding) + parse_io(ThreadContext context, IRubyObject klazz, IRubyObject data, IRubyObject encoding) { - if (!(encoding instanceof RubyFixnum)) { - throw context.getRuntime().newTypeError("encoding must be kind_of String"); + if (!invoke(context, data, "respond_to?", context.runtime.newSymbol("read")).isTrue()) { + throw context.runtime.newTypeError("argument expected to respond to :read"); } - Html4SaxParserContext ctx = Html4SaxParserContext.newInstance(context.runtime, (RubyClass) klass); + String java_encoding = null; + if (encoding != context.runtime.getNil()) { + if (!(encoding instanceof RubyEncoding)) { + throw context.runtime.newTypeError("encoding must be kind_of Encoding"); + } + java_encoding = ((RubyEncoding)encoding).toString(); + } + + Html4SaxParserContext ctx = Html4SaxParserContext.newInstance(context.runtime, (RubyClass) klazz); ctx.setIOInputSource(context, data, context.nil); - String javaEncoding = findEncodingName(context, encoding); - if (javaEncoding != null) { - ctx.getInputSource().setEncoding(javaEncoding); + + if (java_encoding != null) { + ctx.getInputSource().setEncoding(java_encoding); } + return ctx; } diff --git a/ext/java/nokogiri/XmlSaxParserContext.java b/ext/java/nokogiri/XmlSaxParserContext.java index 332eb39918..4c20349ea3 100644 --- a/ext/java/nokogiri/XmlSaxParserContext.java +++ b/ext/java/nokogiri/XmlSaxParserContext.java @@ -1,10 +1,14 @@ package nokogiri; import nokogiri.internals.*; +import static nokogiri.internals.NokogiriHelpers.rubyStringToString; + import org.apache.xerces.parsers.AbstractSAXParser; import org.jruby.Ruby; import org.jruby.RubyClass; +import org.jruby.RubyEncoding; import org.jruby.RubyFixnum; +import org.jruby.RubyString; import 
org.jruby.anno.JRubyClass; import org.jruby.anno.JRubyMethod; import org.jruby.exceptions.RaiseException; @@ -14,6 +18,7 @@ import org.xml.sax.SAXException; import org.xml.sax.SAXParseException; +import java.io.ByteArrayInputStream; import java.io.IOException; import java.io.InputStream; @@ -90,16 +95,26 @@ public class XmlSaxParserContext extends ParserContext * Create a new parser context that will parse the string * data. */ - @JRubyMethod(name = "memory", meta = true) + @JRubyMethod(name = "native_memory", meta = true) public static IRubyObject - parse_memory(ThreadContext context, - IRubyObject klazz, - IRubyObject data) + parse_memory(ThreadContext context, IRubyObject klazz, IRubyObject data, IRubyObject encoding) { - final Ruby runtime = context.runtime; - XmlSaxParserContext ctx = newInstance(runtime, (RubyClass) klazz); - ctx.initialize(runtime); - ctx.setStringInputSource(context, data, runtime.getNil()); + String java_encoding = null; + if (encoding != context.runtime.getNil()) { + if (!(encoding instanceof RubyEncoding)) { + throw context.runtime.newTypeError("encoding must be kind_of Encoding"); + } + java_encoding = ((RubyEncoding)encoding).toString(); + } + + XmlSaxParserContext ctx = newInstance(context.runtime, (RubyClass) klazz); + ctx.initialize(context.runtime); + ctx.setStringInputSourceNoEnc(context, data, context.runtime.getNil()); + + if (java_encoding != null) { + ctx.getInputSource().setEncoding(java_encoding); + } + return ctx; } @@ -107,16 +122,26 @@ public class XmlSaxParserContext extends ParserContext * Create a new parser context that will read from the file * data and parse. 
*/ - @JRubyMethod(name = "file", meta = true) + @JRubyMethod(name = "native_file", meta = true) public static IRubyObject - parse_file(ThreadContext context, - IRubyObject klazz, - IRubyObject data) + parse_file(ThreadContext context, IRubyObject klazz, IRubyObject data, IRubyObject encoding) { - final Ruby runtime = context.runtime; - XmlSaxParserContext ctx = newInstance(runtime, (RubyClass) klazz); - ctx.initialize(context.getRuntime()); + String java_encoding = null; + if (encoding != context.runtime.getNil()) { + if (!(encoding instanceof RubyEncoding)) { + throw context.runtime.newTypeError("encoding must be kind_of Encoding"); + } + java_encoding = ((RubyEncoding)encoding).toString(); + } + + XmlSaxParserContext ctx = newInstance(context.runtime, (RubyClass) klazz); + ctx.initialize(context.runtime); ctx.setInputSourceFile(context, data); + + if (java_encoding != null) { + ctx.getInputSource().setEncoding(java_encoding); + } + return ctx; } @@ -126,21 +151,30 @@ public class XmlSaxParserContext extends ParserContext * * TODO: Currently ignores encoding enc. 
*/ - @JRubyMethod(name = "io", meta = true) + @JRubyMethod(name = "native_io", meta = true) public static IRubyObject - parse_io(ThreadContext context, - IRubyObject klazz, - IRubyObject data, - IRubyObject encoding) + parse_io(ThreadContext context, IRubyObject klazz, IRubyObject data, IRubyObject encoding) { - // check the type of the unused encoding to match behavior of CRuby - if (!(encoding instanceof RubyFixnum)) { - throw context.getRuntime().newTypeError("encoding must be kind_of String"); + if (!invoke(context, data, "respond_to?", context.runtime.newSymbol("read")).isTrue()) { + throw context.runtime.newTypeError("argument expected to respond to :read"); } - final Ruby runtime = context.runtime; - XmlSaxParserContext ctx = newInstance(runtime, (RubyClass) klazz); - ctx.initialize(runtime); - ctx.setIOInputSource(context, data, runtime.getNil()); + + String java_encoding = null; + if (encoding != context.runtime.getNil()) { + if (!(encoding instanceof RubyEncoding)) { + throw context.runtime.newTypeError("encoding must be kind_of Encoding"); + } + java_encoding = ((RubyEncoding)encoding).toString(); + } + + XmlSaxParserContext ctx = newInstance(context.runtime, (RubyClass) klazz); + ctx.initialize(context.runtime); + ctx.setIOInputSource(context, data, context.runtime.getNil()); + + if (java_encoding != null) { + ctx.getInputSource().setEncoding(java_encoding); + } + return ctx; } diff --git a/ext/java/nokogiri/internals/ParserContext.java b/ext/java/nokogiri/internals/ParserContext.java index bb5cdfce08..7bd9d80c14 100644 --- a/ext/java/nokogiri/internals/ParserContext.java +++ b/ext/java/nokogiri/internals/ParserContext.java @@ -105,6 +105,33 @@ public abstract class ParserContext extends RubyObject source.setEncoding(java_encoding); } + public void + setStringInputSourceNoEnc(ThreadContext context, IRubyObject data, IRubyObject url) + { + source = new InputSource(); + ParserContext.setUrl(context, source, url); + + Ruby ruby = context.getRuntime(); + + 
if (data.isNil()) { + throw ruby.newTypeError("wrong argument type nil (expected String)"); + } + if (!(data instanceof RubyString)) { + throw ruby.newTypeError("wrong argument type " + data.getMetaClass() + " (expected String)"); + } + + RubyString stringData = (RubyString) data; + + ByteList bytes = stringData.getByteList(); + + stringDataSize = bytes.length() - bytes.begin(); + if (stringDataSize == 0) { + throw context.runtime.newRuntimeError("input string cannot be empty"); + } + ByteArrayInputStream stream = new ByteArrayInputStream(bytes.unsafeBytes(), bytes.begin(), bytes.length()); + source.setByteStream(stream); + } + public static void setUrl(ThreadContext context, InputSource source, IRubyObject url) { diff --git a/ext/nokogiri/extconf.rb b/ext/nokogiri/extconf.rb index f1ffcc2f7f..1012480768 100644 --- a/ext/nokogiri/extconf.rb +++ b/ext/nokogiri/extconf.rb @@ -1134,6 +1134,7 @@ def compile have_func("xmlCtxtSetOptions") # introduced in libxml2 2.13.0 have_func("xmlCtxtGetOptions") # introduced in libxml2 2.14.0 +have_func("xmlSwitchEncodingName") # introduced in libxml2 2.13.0 have_func("rb_category_warning") # introduced in Ruby 3.0 other_library_versions_string = OTHER_LIBRARY_VERSIONS.map { |k, v| [k, v].join(":") }.join(",") diff --git a/ext/nokogiri/html4_sax_parser_context.c b/ext/nokogiri/html4_sax_parser_context.c index 2c7cb82eca..6f971d1fcf 100644 --- a/ext/nokogiri/html4_sax_parser_context.c +++ b/ext/nokogiri/html4_sax_parser_context.c @@ -2,52 +2,56 @@ VALUE cNokogiriHtml4SaxParserContext ; +/* :nodoc: */ static VALUE -noko_html4_sax_parser_s_parse_memory(VALUE klass, VALUE data, VALUE encoding) +noko_html4_sax_parser_context_s_native_memory(VALUE rb_class, VALUE rb_input, VALUE rb_encoding) { - htmlParserCtxtPtr ctxt; - - Check_Type(data, T_STRING); - - if (!(int)RSTRING_LEN(data)) { + Check_Type(rb_input, T_STRING); + if (!(int)RSTRING_LEN(rb_input)) { rb_raise(rb_eRuntimeError, "input string cannot be empty"); } - ctxt = 
htmlCreateMemoryParserCtxt(StringValuePtr(data), - (int)RSTRING_LEN(data)); - if (ctxt->sax) { - xmlFree(ctxt->sax); - ctxt->sax = NULL; + if (!NIL_P(rb_encoding) && !rb_obj_is_kind_of(rb_encoding, rb_cEncoding)) { + rb_raise(rb_eTypeError, "argument must be an Encoding object"); } - if (RTEST(encoding)) { - xmlCharEncodingHandlerPtr enc = xmlFindCharEncodingHandler(StringValueCStr(encoding)); - if (enc != NULL) { - xmlSwitchToEncoding(ctxt, enc); - if (ctxt->errNo == XML_ERR_UNSUPPORTED_ENCODING) { - rb_raise(rb_eRuntimeError, "Unsupported encoding %s", - StringValueCStr(encoding)); - } - } + htmlParserCtxtPtr c_context = + htmlCreateMemoryParserCtxt(StringValuePtr(rb_input), (int)RSTRING_LEN(rb_input)); + if (!c_context) { + rb_raise(rb_eRuntimeError, "failed to create xml sax parser context"); + } + + noko_xml_sax_parser_context_set_encoding(c_context, rb_encoding); + + if (c_context->sax) { + xmlFree(c_context->sax); + c_context->sax = NULL; } - return noko_xml_sax_parser_context_wrap(klass, ctxt); + return noko_xml_sax_parser_context_wrap(rb_class, c_context); } +/* :nodoc: */ static VALUE -noko_html4_sax_parser_context_s_parse_file(VALUE klass, VALUE filename, VALUE encoding) +noko_html4_sax_parser_context_s_native_file(VALUE rb_class, VALUE rb_filename, VALUE rb_encoding) { - htmlParserCtxtPtr ctxt = htmlCreateFileParserCtxt( - StringValueCStr(filename), - StringValueCStr(encoding) - ); - - if (ctxt->sax) { - xmlFree(ctxt->sax); - ctxt->sax = NULL; + if (!NIL_P(rb_encoding) && !rb_obj_is_kind_of(rb_encoding, rb_cEncoding)) { + rb_raise(rb_eTypeError, "argument must be an Encoding object"); + } + + htmlParserCtxtPtr c_context = htmlCreateFileParserCtxt(StringValueCStr(rb_filename), NULL); + if (!c_context) { + rb_raise(rb_eRuntimeError, "failed to create xml sax parser context"); + } + + noko_xml_sax_parser_context_set_encoding(c_context, rb_encoding); + + if (c_context->sax) { + xmlFree(c_context->sax); + c_context->sax = NULL; } - return 
noko_xml_sax_parser_context_wrap(klass, ctxt); + return noko_xml_sax_parser_context_wrap(rb_class, c_context); } static VALUE @@ -84,10 +88,10 @@ noko_init_html_sax_parser_context(void) cNokogiriHtml4SaxParserContext = rb_define_class_under(mNokogiriHtml4Sax, "ParserContext", cNokogiriXmlSaxParserContext); - rb_define_singleton_method(cNokogiriHtml4SaxParserContext, "memory", - noko_html4_sax_parser_s_parse_memory, 2); - rb_define_singleton_method(cNokogiriHtml4SaxParserContext, "file", - noko_html4_sax_parser_context_s_parse_file, 2); + rb_define_singleton_method(cNokogiriHtml4SaxParserContext, "native_memory", + noko_html4_sax_parser_context_s_native_memory, 2); + rb_define_singleton_method(cNokogiriHtml4SaxParserContext, "native_file", + noko_html4_sax_parser_context_s_native_file, 2); rb_define_method(cNokogiriHtml4SaxParserContext, "parse_with", noko_html4_sax_parser_context__parse_with, 1); diff --git a/ext/nokogiri/libxml2_polyfill.c b/ext/nokogiri/libxml2_polyfill.c index 70ed189e7c..750b1b52a2 100644 --- a/ext/nokogiri/libxml2_polyfill.c +++ b/ext/nokogiri/libxml2_polyfill.c @@ -95,3 +95,20 @@ xmlCtxtGetOptions(xmlParserCtxtPtr ctxt) return (ctxt->options); } #endif + +#ifndef HAVE_XMLSWITCHENCODINGNAME +int +xmlSwitchEncodingName(xmlParserCtxtPtr ctxt, const char *encoding) +{ + if (ctxt == NULL) { + return (-1); + } + + xmlCharEncodingHandlerPtr handler = xmlFindCharEncodingHandler(encoding); + if (handler == NULL) { + return (-1); + } + + return (xmlSwitchToEncoding(ctxt, handler)); +} +#endif diff --git a/ext/nokogiri/nokogiri.h b/ext/nokogiri/nokogiri.h index dfedc88ec7..b75ebc47fa 100644 --- a/ext/nokogiri/nokogiri.h +++ b/ext/nokogiri/nokogiri.h @@ -63,6 +63,9 @@ int xmlCtxtSetOptions(xmlParserCtxtPtr ctxt, int options); #ifndef HAVE_XMLCTXTGETOPTIONS int xmlCtxtGetOptions(xmlParserCtxtPtr ctxt); #endif +#ifndef HAVE_XMLSWITCHENCODINGNAME +int xmlSwitchEncodingName(xmlParserCtxtPtr ctxt, const char *encoding); +#endif #define XMLNS_PREFIX "xmlns" 
#define XMLNS_PREFIX_LEN 6 /* including either colon or \0 */ @@ -205,6 +208,7 @@ xmlParserCtxtPtr noko_xml_sax_push_parser_unwrap(VALUE rb_parser); VALUE noko_xml_sax_parser_context_wrap(VALUE klass, xmlParserCtxtPtr c_context); xmlParserCtxtPtr noko_xml_sax_parser_context_unwrap(VALUE rb_context); +void noko_xml_sax_parser_context_set_encoding(xmlParserCtxtPtr c_context, VALUE rb_encoding); #define DOC_RUBY_OBJECT_TEST(x) ((nokogiriTuplePtr)(x->_private)) #define DOC_RUBY_OBJECT(x) (((nokogiriTuplePtr)(x->_private))->doc) diff --git a/ext/nokogiri/xml_node.c b/ext/nokogiri/xml_node.c index 69870afc9a..c1c8c6938a 100644 --- a/ext/nokogiri/xml_node.c +++ b/ext/nokogiri/xml_node.c @@ -159,7 +159,7 @@ relink_namespace(xmlNodePtr reparented) /* reparent. */ if (NULL == reparented->ns) { return; } - /* When a node gets reparented, walk it's children to make sure that */ + /* When a node gets reparented, walk its children to make sure that */ /* their namespaces are reparented as well. */ child = reparented->children; while (NULL != child) { diff --git a/ext/nokogiri/xml_sax_parser_context.c b/ext/nokogiri/xml_sax_parser_context.c index 46c6b81f0a..75fe2e4f01 100644 --- a/ext/nokogiri/xml_sax_parser_context.c +++ b/ext/nokogiri/xml_sax_parser_context.c @@ -43,39 +43,60 @@ noko_xml_sax_parser_context_wrap(VALUE klass, xmlParserCtxtPtr c_context) return TypedData_Wrap_Struct(klass, &xml_sax_parser_context_type, c_context); } +void +noko_xml_sax_parser_context_set_encoding(xmlParserCtxtPtr c_context, VALUE rb_encoding) +{ + if (!NIL_P(rb_encoding)) { + VALUE rb_encoding_name = rb_funcall(rb_encoding, rb_intern("name"), 0); -/* - * call-seq: - * io(input, encoding_id) - * - * Create a parser context for an +input+ IO which will assume +encoding+ - * - * [Parameters] - * - +io+ (IO) The readable IO object from which to read input - * - +encoding_id+ (Integer) The libxml2 encoding ID to use, see SAX::Parser::ENCODINGS - * - * [Returns] Nokogiri::XML::SAX::ParserContext - * - 
* 💡 Calling Nokogiri::XML::SAX::Parser.parse is more convenient for most use cases. - */ + char *encoding_name = StringValueCStr(rb_encoding_name); + if (encoding_name) { + libxmlStructuredErrorHandlerState handler_state; + VALUE rb_errors = rb_ary_new(); + + noko__structured_error_func_save_and_set(&handler_state, (void *)rb_errors, noko__error_array_pusher); + + int result = xmlSwitchEncodingName(c_context, encoding_name); + + noko__structured_error_func_restore(&handler_state); + + if (result != 0) { + xmlFreeParserCtxt(c_context); + + VALUE exception = rb_funcall(cNokogiriXmlSyntaxError, rb_intern("aggregate"), 1, rb_errors); + if (!NIL_P(exception)) { + rb_exc_raise(exception); + } else { + rb_raise(rb_eRuntimeError, "could not set encoding"); + } + } + } + } +} + +/* :nodoc: */ static VALUE -noko_xml_sax_parser_context_s_io(VALUE rb_class, VALUE rb_io, VALUE rb_encoding_id) +noko_xml_sax_parser_context_s_native_io(VALUE rb_class, VALUE rb_io, VALUE rb_encoding) { - xmlParserCtxtPtr c_context; - xmlCharEncoding c_encoding = (xmlCharEncoding)NUM2INT(rb_encoding_id); - if (!rb_respond_to(rb_io, id_read)) { rb_raise(rb_eTypeError, "argument expected to respond to :read"); } - c_context = xmlCreateIOParserCtxt(NULL, NULL, - (xmlInputReadCallback)noko_io_read, - (xmlInputCloseCallback)noko_io_close, - (void *)rb_io, c_encoding); + if (!NIL_P(rb_encoding) && !rb_obj_is_kind_of(rb_encoding, rb_cEncoding)) { + rb_raise(rb_eTypeError, "argument must be an Encoding object"); + } + + xmlParserCtxtPtr c_context = + xmlCreateIOParserCtxt(NULL, NULL, + (xmlInputReadCallback)noko_io_read, + (xmlInputCloseCallback)noko_io_close, + (void *)rb_io, XML_CHAR_ENCODING_NONE); if (!c_context) { rb_raise(rb_eRuntimeError, "failed to create xml sax parser context"); } + noko_xml_sax_parser_context_set_encoding(c_context, rb_encoding); + if (c_context->sax) { xmlFree(c_context->sax); c_context->sax = NULL; @@ -84,23 +105,20 @@ noko_xml_sax_parser_context_s_io(VALUE rb_class, VALUE 
rb_io, VALUE rb_encoding_ return noko_xml_sax_parser_context_wrap(rb_class, c_context); } -/* - * call-seq: - * file(path) - * - * Create a parser context for the file at +path+. - * - * [Parameters] - * - +path+ (String) The path to the input file - * - * [Returns] Nokogiri::XML::SAX::ParserContext - * - * 💡 Calling Nokogiri::XML::SAX::Parser.parse_file is more convenient for most use cases. - */ +/* :nodoc: */ static VALUE -noko_xml_sax_parser_context_s_file(VALUE rb_class, VALUE rb_path) +noko_xml_sax_parser_context_s_native_file(VALUE rb_class, VALUE rb_path, VALUE rb_encoding) { + if (!NIL_P(rb_encoding) && !rb_obj_is_kind_of(rb_encoding, rb_cEncoding)) { + rb_raise(rb_eTypeError, "argument must be an Encoding object"); + } + xmlParserCtxtPtr c_context = xmlCreateFileParserCtxt(StringValueCStr(rb_path)); + if (!c_context) { + rb_raise(rb_eRuntimeError, "failed to create xml sax parser context"); + } + + noko_xml_sax_parser_context_set_encoding(c_context, rb_encoding); if (c_context->sax) { xmlFree(c_context->sax); @@ -110,32 +128,27 @@ noko_xml_sax_parser_context_s_file(VALUE rb_class, VALUE rb_path) return noko_xml_sax_parser_context_wrap(rb_class, c_context); } -/* - * call-seq: - * memory(input) - * - * Create a parser context for the +input+ String. - * - * [Parameters] - * - +input+ (String) The input string to be parsed. - * - * [Returns] Nokogiri::XML::SAX::ParserContext - * - * 💡 Calling Nokogiri::XML::SAX::Parser.parse is more convenient for most use cases. 
- */ +/* :nodoc: */ static VALUE -noko_xml_sax_parser_context_s_memory(VALUE rb_class, VALUE rb_input) +noko_xml_sax_parser_context_s_native_memory(VALUE rb_class, VALUE rb_input, VALUE rb_encoding) { - xmlParserCtxtPtr c_context; - Check_Type(rb_input, T_STRING); - if (!(int)RSTRING_LEN(rb_input)) { rb_raise(rb_eRuntimeError, "input string cannot be empty"); } - c_context = xmlCreateMemoryParserCtxt(StringValuePtr(rb_input), - (int)RSTRING_LEN(rb_input)); + if (!NIL_P(rb_encoding) && !rb_obj_is_kind_of(rb_encoding, rb_cEncoding)) { + rb_raise(rb_eTypeError, "argument must be an Encoding object"); + } + + xmlParserCtxtPtr c_context = + xmlCreateMemoryParserCtxt(StringValuePtr(rb_input), (int)RSTRING_LEN(rb_input)); + if (!c_context) { + rb_raise(rb_eRuntimeError, "failed to create xml sax parser context"); + } + + noko_xml_sax_parser_context_set_encoding(c_context, rb_encoding); + if (c_context->sax) { xmlFree(c_context->sax); c_context->sax = NULL; @@ -149,6 +162,9 @@ noko_xml_sax_parser_context_s_memory(VALUE rb_class, VALUE rb_input) * parse_with(sax_handler) * * Use +sax_handler+ and parse the current document + * + * 💡 Calling this method directly is discouraged. Use Nokogiri::XML::SAX::Parser methods which are + * more convenient for most use cases. 
*/ static VALUE noko_xml_sax_parser_context__parse_with(VALUE rb_context, VALUE rb_sax_parser) @@ -353,9 +369,12 @@ noko_init_xml_sax_parser_context(void) rb_undef_alloc_func(cNokogiriXmlSaxParserContext); - rb_define_singleton_method(cNokogiriXmlSaxParserContext, "io", noko_xml_sax_parser_context_s_io, 2); - rb_define_singleton_method(cNokogiriXmlSaxParserContext, "memory", noko_xml_sax_parser_context_s_memory, 1); - rb_define_singleton_method(cNokogiriXmlSaxParserContext, "file", noko_xml_sax_parser_context_s_file, 1); + rb_define_singleton_method(cNokogiriXmlSaxParserContext, "native_io", + noko_xml_sax_parser_context_s_native_io, 2); + rb_define_singleton_method(cNokogiriXmlSaxParserContext, "native_memory", + noko_xml_sax_parser_context_s_native_memory, 2); + rb_define_singleton_method(cNokogiriXmlSaxParserContext, "native_file", + noko_xml_sax_parser_context_s_native_file, 2); rb_define_method(cNokogiriXmlSaxParserContext, "parse_with", noko_xml_sax_parser_context__parse_with, 1); rb_define_method(cNokogiriXmlSaxParserContext, "replace_entities=", diff --git a/lib/nokogiri/class_resolver.rb b/lib/nokogiri/class_resolver.rb index e2b21f6a6c..bab871996b 100644 --- a/lib/nokogiri/class_resolver.rb +++ b/lib/nokogiri/class_resolver.rb @@ -18,7 +18,7 @@ module Nokogiri # module ClassResolver # #related_class restricts matching namespaces to those matching this set. - VALID_NAMESPACES = Set.new(["HTML", "HTML4", "HTML5", "XML"]) + VALID_NAMESPACES = Set.new(["HTML", "HTML4", "HTML5", "XML", "SAX"]) # :call-seq: # related_class(class_name) → Class diff --git a/lib/nokogiri/html4/sax/parser.rb b/lib/nokogiri/html4/sax/parser.rb index 1f8c03af6c..77063ec261 100644 --- a/lib/nokogiri/html4/sax/parser.rb +++ b/lib/nokogiri/html4/sax/parser.rb @@ -3,60 +3,45 @@ module Nokogiri module HTML4 ### - # Nokogiri lets you write a SAX parser to process HTML but get HTML correction features. 
+ # Nokogiri provides a SAX parser to process HTML4 which will provide HTML recovery + # ("autocorrection") features. # # See Nokogiri::HTML4::SAX::Parser for a basic example of using a SAX parser with HTML. # # For more information on SAX parsers, see Nokogiri::XML::SAX + # module SAX ### - # This class lets you perform SAX style parsing on HTML with HTML error correction. + # This parser is a SAX style parser that reads its input as it deems necessary. The parser + # takes a Nokogiri::XML::SAX::Document, an optional encoding, then given an HTML input, sends + # messages to the Nokogiri::XML::SAX::Document. + # + # ⚠ This is an HTML4 parser and so may not support some HTML5 features and behaviors. # # Here is a basic usage example: # - # class MyDoc < Nokogiri::XML::SAX::Document + # class MyHandler < Nokogiri::XML::SAX::Document # def start_element name, attributes = [] # puts "found a #{name}" # end # end # - # parser = Nokogiri::HTML4::SAX::Parser.new(MyDoc.new) - # parser.parse(File.read(ARGV[0], mode: 'rb')) + # parser = Nokogiri::HTML4::SAX::Parser.new(MyHandler.new) + # + # # Hand an IO object to the parser, which will read the HTML from the IO. + # File.open(path_to_html) do |f| + # parser.parse(f) + # end + # + # For more information on \SAX parsers, see Nokogiri::XML::SAX or the parent class + # Nokogiri::XML::SAX::Parser. + # + # Also see Nokogiri::XML::SAX::Document for the available events. # - # For more information on SAX parsers, see Nokogiri::XML::SAX class Parser < Nokogiri::XML::SAX::Parser - ### - # Parse html stored in +data+ using +encoding+ - def parse_memory(data, encoding = "UTF-8") - raise TypeError unless String === data - return if data.empty? - - ctx = ParserContext.memory(data, encoding) - yield ctx if block_given? 
- ctx.parse_with(self) - end - - ### - # Parse given +io+ - def parse_io(io, encoding = "UTF-8") - check_encoding(encoding) - @encoding = encoding - ctx = ParserContext.io(io, ENCODINGS[encoding]) - yield ctx if block_given? - ctx.parse_with(self) - end - - ### - # Parse a file with +filename+ - def parse_file(filename, encoding = "UTF-8") - raise ArgumentError unless filename - raise Errno::ENOENT unless File.exist?(filename) - raise Errno::EISDIR if File.directory?(filename) - - ctx = ParserContext.file(filename, encoding) - yield ctx if block_given? - ctx.parse_with(self) - end + # this class inherits its behavior from Nokogiri::XML::SAX::Parser, but note that superclass + # uses Nokogiri::ClassResolver to use HTML4::SAX::ParserContext as the context class for + # this class, which is where the real behavioral differences are implemented. end end end diff --git a/lib/nokogiri/html4/sax/parser_context.rb b/lib/nokogiri/html4/sax/parser_context.rb index 00240f2489..26d7142874 100644 --- a/lib/nokogiri/html4/sax/parser_context.rb +++ b/lib/nokogiri/html4/sax/parser_context.rb @@ -4,16 +4,11 @@ module Nokogiri module HTML4 module SAX ### - # Context for HTML SAX parsers. This class is usually not instantiated by the user. Instead, - # you should be looking at Nokogiri::HTML4::SAX::Parser + # Context object to invoke the HTML4 SAX parser on the SAX::Document handler. + # + # 💡 This class is usually not instantiated by the user. Use Nokogiri::HTML4::SAX::Parser + # instead. class ParserContext < Nokogiri::XML::SAX::ParserContext - def self.new(thing, encoding = "UTF-8") - if [:read, :close].all? { |x| thing.respond_to?(x) } - super - else - memory(thing, encoding) - end - end end end end diff --git a/lib/nokogiri/xml/document.rb b/lib/nokogiri/xml/document.rb index d4a4e705b2..8d3c2dfb8a 100644 --- a/lib/nokogiri/xml/document.rb +++ b/lib/nokogiri/xml/document.rb @@ -362,7 +362,7 @@ def decorators(key) end ## - # Validate this Document against it's DTD. 
Returns a list of errors on + # Validate this Document against its DTD. Returns a list of errors on # the document or +nil+ when there is no DTD. def validate return unless internal_subset diff --git a/lib/nokogiri/xml/sax.rb b/lib/nokogiri/xml/sax.rb index 9415a3887b..211dc14d98 100644 --- a/lib/nokogiri/xml/sax.rb +++ b/lib/nokogiri/xml/sax.rb @@ -6,23 +6,23 @@ module XML # SAX Parsers are event-driven parsers. # # Two SAX parsers for XML are available, a parser that reads from a string or IO object as it - # feels necessary, and a parser that lets you spoon feed it XML. If you want to let Nokogiri - # deal with reading your XML, use the Nokogiri::XML::SAX::Parser. If you want to have fine grain - # control over the XML input, use the Nokogiri::XML::SAX::PushParser. + # feels necessary, and a parser that you explicitly feed XML in chunks. If you want to let + # Nokogiri deal with reading your XML, use the Nokogiri::XML::SAX::Parser. If you want to have + # fine grain control over the XML input, use the Nokogiri::XML::SAX::PushParser. # - # If you want to do SAX style parsing using HTML, check out Nokogiri::HTML4::SAX. + # If you want to do SAX style parsing of HTML, check out Nokogiri::HTML4::SAX. # # The basic way a SAX style parser works is by creating a parser, telling the parser about the # events we're interested in, then giving the parser some XML to process. The parser will notify # you when it encounters events you said you would like to know about. # - # To register for events, you simply subclass Nokogiri::XML::SAX::Document, and implement the - # methods for which you would like notification. + # To register for events, subclass Nokogiri::XML::SAX::Document and implement the methods for + # which you would like notification. 
# # For example, if I want to be notified when a document ends, and when an element starts, I # would write a class like this: # - # class MyDocument < Nokogiri::XML::SAX::Document + # class MyHandler < Nokogiri::XML::SAX::Document # def end_document # puts "the document has ended" # end @@ -35,7 +35,7 @@ module XML # Then I would instantiate a SAX parser with this document, and feed the parser some XML # # # Create a new parser - # parser = Nokogiri::XML::SAX::Parser.new(MyDocument.new) + # parser = Nokogiri::XML::SAX::Parser.new(MyHandler.new) # # # Feed the parser some XML # parser.parse(File.open(ARGV[0])) diff --git a/lib/nokogiri/xml/sax/document.rb b/lib/nokogiri/xml/sax/document.rb index 0560bb635b..a6acb6802a 100644 --- a/lib/nokogiri/xml/sax/document.rb +++ b/lib/nokogiri/xml/sax/document.rb @@ -12,7 +12,7 @@ module SAX # # To only be notified about start and end element events, write a class like this: # - # class MyDocument < Nokogiri::XML::SAX::Document + # class MyHandler < Nokogiri::XML::SAX::Document # def start_element name, attrs = [] # puts "#{name} started!" # end diff --git a/lib/nokogiri/xml/sax/parser.rb b/lib/nokogiri/xml/sax/parser.rb index 7e035529f3..2eeba84e62 100644 --- a/lib/nokogiri/xml/sax/parser.rb +++ b/lib/nokogiri/xml/sax/parser.rb @@ -4,16 +4,15 @@ module Nokogiri module XML module SAX ### - # This parser is a SAX style parser that reads it's input as it - # deems necessary. The parser takes a Nokogiri::XML::SAX::Document, - # an optional encoding, then given an XML input, sends messages to - # the Nokogiri::XML::SAX::Document. + # This parser is a SAX style parser that reads its input as it deems necessary. The parser + # takes a Nokogiri::XML::SAX::Document, an optional encoding, then given an XML input, sends + # messages to the Nokogiri::XML::SAX::Document. 
# # Here is an example of using this parser: # # # Create a subclass of Nokogiri::XML::SAX::Document and implement # # the events we care about: - # class MyDoc < Nokogiri::XML::SAX::Document + # class MyHandler < Nokogiri::XML::SAX::Document # def start_element name, attrs = [] # puts "starting: #{name}" # end @@ -23,20 +22,28 @@ module SAX # end # end # - # # Create our parser - # parser = Nokogiri::XML::SAX::Parser.new(MyDoc.new) + # parser = Nokogiri::XML::SAX::Parser.new(MyHandler.new) # - # # Send some XML to the parser - # parser.parse(File.open(ARGV[0])) + # # Hand an IO object to the parser, which will read the XML from the IO. + # File.open(path_to_xml) do |f| + # parser.parse(f) + # end + # + # For more information about \SAX parsers, see Nokogiri::XML::SAX. + # + # Also see Nokogiri::XML::SAX::Document for the available events. + # + # For \HTML documents, use the subclass Nokogiri::HTML4::SAX::Parser. # - # For more information about SAX parsers, see Nokogiri::XML::SAX. Also - # see Nokogiri::XML::SAX::Document for the available events. class Parser + # to dynamically resolve ParserContext in inherited methods + include Nokogiri::ClassResolver + + # Structure used for marshalling attributes for some callbacks in XML::SAX::Document. class Attribute < Struct.new(:localname, :prefix, :uri, :value) end - # Encodings this parser supports - ENCODINGS = { + ENCODINGS = { # :nodoc: "NONE" => 0, # No char encoding detected "UTF-8" => 1, # UTF-8 "UTF16LE" => 2, # UTF-16 little endian @@ -61,6 +68,8 @@ class Attribute < Struct.new(:localname, :prefix, :uri, :value) "EUC-JP" => 21, # EUC-JP "ASCII" => 22, # pure ASCII } + REVERSE_ENCODINGS = ENCODINGS.invert # :nodoc: + deprecate_constant :ENCODINGS # The Nokogiri::XML::SAX::Document where events will be sent. attr_accessor :document @@ -68,9 +77,23 @@ class Attribute < Struct.new(:localname, :prefix, :uri, :value) # The encoding beings used for this document. 
attr_accessor :encoding - # Create a new Parser with +doc+ and +encoding+ - def initialize(doc = Nokogiri::XML::SAX::Document.new, encoding = "UTF-8") - @encoding = check_encoding(encoding) + ### + # :call-seq: + # new ⇒ SAX::Parser + # new(handler) ⇒ SAX::Parser + # new(handler, encoding) ⇒ SAX::Parser + # + # Create a new Parser. + # + # [Parameters] + # - +handler+ (optional Nokogiri::XML::SAX::Document) The document that will receive + # events. Will create a new Nokogiri::XML::SAX::Document if not given, which is accessible + # through the #document attribute. + # - +encoding+ (optional Encoding, String, nil) An Encoding or encoding name to use when + # parsing the input. (default +nil+ for auto-detection) + # + def initialize(doc = Nokogiri::XML::SAX::Document.new, encoding = nil) + @encoding = encoding @document = doc @warned = false @@ -78,49 +101,98 @@ def initialize(doc = Nokogiri::XML::SAX::Document.new, encoding = "UTF-8") end ### - # Parse given +thing+ which may be a string containing xml, or an - # IO object. - def parse(thing, &block) - if thing.respond_to?(:read) && thing.respond_to?(:close) - parse_io(thing, &block) + # :call-seq: + # parse(input) { |parser_context| ... } + # + # Parse the input, sending events to the SAX::Document at #document. + # + # [Parameters] + # - +input+ (String, IO) The input to parse. + # + # If +input+ quacks like a readable IO object, this method forwards to Parser.parse_io, + # otherwise it forwards to Parser.parse_memory. + # + # [Yields] + # If a block is given, the underlying ParserContext object will be yielded. This can be used + # to set options on the parser context before parsing begins. + # + def parse(input, &block) + if input.respond_to?(:read) && input.respond_to?(:close) + parse_io(input, &block) else - parse_memory(thing, &block) + parse_memory(input, &block) end end ### - # Parse given +io+ + # :call-seq: + # parse_io(io) { |parser_context| ... } + # parse_io(io, encoding) { |parser_context| ... 
} + # + # Parse an input stream. + # + # [Parameters] + # - +io+ (IO) The readable IO object from which to read input + # - +encoding+ (optional Encoding, String, nil) An Encoding or encoding name to use when + # parsing the input, or +nil+ for auto-detection. (default #encoding) + # + # [Yields] + # If a block is given, the underlying ParserContext object will be yielded. This can be used + # to set options on the parser context before parsing begins. + # def parse_io(io, encoding = @encoding) - ctx = ParserContext.io(io, ENCODINGS[check_encoding(encoding)]) + ctx = related_class("ParserContext").io(io, encoding) yield ctx if block_given? ctx.parse_with(self) end ### - # Parse a file with +filename+ - def parse_file(filename) - raise ArgumentError unless filename - raise Errno::ENOENT unless File.exist?(filename) - raise Errno::EISDIR if File.directory?(filename) - - ctx = ParserContext.file(filename) + # :call-seq: + # parse_memory(input) { |parser_context| ... } + # parse_memory(input, encoding) { |parser_context| ... } + # + # Parse an input string. + # + # [Parameters] + # - +input+ (String) The input string to be parsed. + # - +encoding+ (optional Encoding, String, nil) An Encoding or encoding name to use when + # parsing the input, or +nil+ for auto-detection. (default #encoding) + # + # [Yields] + # If a block is given, the underlying ParserContext object will be yielded. This can be used + # to set options on the parser context before parsing begins. + # + def parse_memory(input, encoding = @encoding) + ctx = related_class("ParserContext").memory(input, encoding) yield ctx if block_given? ctx.parse_with(self) end - def parse_memory(data) - ctx = ParserContext.memory(data) + ### + # :call-seq: + # parse_file(filename) { |parser_context| ... } + # parse_file(filename, encoding) { |parser_context| ... } + # + # Parse a file. + # + # [Parameters] + # - +filename+ (String) The path to the file to be parsed. 
+ # - +encoding+ (optional Encoding, String, nil) An Encoding or encoding name to use when + # parsing the input, or +nil+ for auto-detection. (default #encoding) + # + # [Yields] + # If a block is given, the underlying ParserContext object will be yielded. This can be used + # to set options on the parser context before parsing begins. + # + def parse_file(filename, encoding = @encoding) + raise ArgumentError, "no filename provided" unless filename + raise Errno::ENOENT unless File.exist?(filename) + raise Errno::EISDIR if File.directory?(filename) + + ctx = related_class("ParserContext").file(filename, encoding) yield ctx if block_given? ctx.parse_with(self) end - - private - - def check_encoding(encoding) - encoding.upcase.tap do |enc| - raise ArgumentError, "'#{enc}' is not a valid encoding" unless ENCODINGS[enc] - end - end end end end diff --git a/lib/nokogiri/xml/sax/parser_context.rb b/lib/nokogiri/xml/sax/parser_context.rb index 48981775ff..20f335fe62 100644 --- a/lib/nokogiri/xml/sax/parser_context.rb +++ b/lib/nokogiri/xml/sax/parser_context.rb @@ -4,15 +4,123 @@ module Nokogiri module XML module SAX ### - # Context for XML SAX parsers. This class is usually not instantiated - # by the user. Instead, you should be looking at - # Nokogiri::XML::SAX::Parser + # Context object to invoke the XML SAX parser on the SAX::Document handler. + # + # 💡 This class is usually not instantiated by the user. Use Nokogiri::XML::SAX::Parser + # instead. class ParserContext - def self.new(thing, encoding = "UTF-8") - if [:read, :close].all? { |x| thing.respond_to?(x) } - io(thing, Parser::ENCODINGS[encoding]) - else - memory(thing) + class << self + ### + # :call-seq: + # new(input) + # new(input, encoding) + # + # Create a parser context for an IO or a String. This is a shorthand method for + # ParserContext.io and ParserContext.memory. 
+ # + # [Parameters] + # - +input+ (IO, String) A String or a readable IO object + # - +encoding+ (optional) (Encoding) The +Encoding+ to use, or the name of an + # encoding to use (default +nil+, encoding will be autodetected) + # + # If +input+ quacks like a readable IO object, this method forwards to ParserContext.io, + # otherwise it forwards to ParserContext.memory. + # + # [Returns] Nokogiri::XML::SAX::ParserContext + # + def new(input, encoding = nil) + if [:read, :close].all? { |x| input.respond_to?(x) } + io(input, encoding) + else + memory(input, encoding) + end + end + + ### + # :call-seq: + # io(input) + # io(input, encoding) + # + # Create a parser context for an +input+ IO which will assume +encoding+ + # + # [Parameters] + # - +io+ (IO) The readable IO object from which to read input + # - +encoding+ (optional) (Encoding) The +Encoding+ to use, or the name of an + # encoding to use (default +nil+, encoding will be autodetected) + # + # [Returns] Nokogiri::XML::SAX::ParserContext + # + # 💡 Calling this method directly is discouraged. Use Nokogiri::XML::SAX::Parser parse + # methods which are more convenient for most use cases. + # + def io(input, encoding = nil) + native_io(input, resolve_encoding(encoding)) + end + + ### + # :call-seq: + # memory(input) + # memory(input, encoding) + # + # Create a parser context for the +input+ String. + # + # [Parameters] + # - +input+ (String) The input string to be parsed. + # - +encoding+ (optional) (Encoding, String) The +Encoding+ to use, or the name of an encoding to + # use (default +nil+, encoding will be autodetected) + # + # [Returns] Nokogiri::XML::SAX::ParserContext + # + # 💡 Calling this method directly is discouraged. Use Nokogiri::XML::SAX::Parser parse methods + # which are more convenient for most use cases. 
+ # + def memory(input, encoding = nil) + native_memory(input, resolve_encoding(encoding)) + end + + ### + # :call-seq: + # file(path) + # file(path, encoding) + # + # Create a parser context for the file at +path+. + # + # [Parameters] + # - +path+ (String) The path to the input file + # - +encoding+ (optional) (Encoding, String) The +Encoding+ to use, or the name of an encoding to + # use (default +nil+, encoding will be autodetected) + # + # [Returns] Nokogiri::XML::SAX::ParserContext + # + # 💡 Calling this method directly is discouraged. Use Nokogiri::XML::SAX::Parser.parse_file which + # is more convenient for most use cases. + def file(input, encoding = nil) + native_file(input, resolve_encoding(encoding)) + end + + private def resolve_encoding(encoding) + case encoding + when Encoding + encoding + + when nil + nil # totally fine, parser will guess encoding + + when Integer + warn("Passing an integer to Nokogiri::XML::SAX::ParserContext.io is deprecated. Use an Encoding object instead. This will become an error in a future release.", uplevel: 2, category: :deprecated) + + return nil if encoding == Parser::ENCODINGS["NONE"] + + encoding = Parser::REVERSE_ENCODINGS[encoding] + raise ArgumentError, "Invalid libxml2 encoding id #{encoding}" if encoding.nil? + Encoding.find(encoding) + + when String + Encoding.find(encoding) + + else + raise ArgumentError, "Cannot resolve #{encoding.inspect} to an Encoding" + end end end end diff --git a/lib/xsd/xmlparser/nokogiri.rb b/lib/xsd/xmlparser/nokogiri.rb index 4b0d55cfe5..dfa33f08b7 100644 --- a/lib/xsd/xmlparser/nokogiri.rb +++ b/lib/xsd/xmlparser/nokogiri.rb @@ -7,10 +7,9 @@ module XMLParser ### # Nokogiri XML parser for soap4r. # - # Nokogiri may be used as the XML parser in soap4r. Simply require - # 'xsd/xmlparser/nokogiri' in your soap4r applications, and soap4r - # will use Nokogiri as it's XML parser. No other changes should be - # required to use Nokogiri as the XML parser. 
+ # Nokogiri may be used as the XML parser in soap4r. Require 'xsd/xmlparser/nokogiri' in your + # soap4r applications, and soap4r will use Nokogiri as its XML parser. No other changes should + # be required to use Nokogiri as the XML parser. # # Example (using UW ITS Web Services): # diff --git a/test/html4/sax/test_parser.rb b/test/html4/sax/test_parser.rb index 8ec9bbf861..9f549c4b3f 100644 --- a/test/html4/sax/test_parser.rb +++ b/test/html4/sax/test_parser.rb @@ -9,9 +9,9 @@ class TestCase describe Nokogiri::HTML4::SAX::Parser do let(:parser) { Nokogiri::HTML4::SAX::Parser.new(Doc.new) } - it "parse_empty_document" do - # This caused a segfault in libxml 2.6.x - assert_nil(parser.parse("")) + it "raises an error on empty content" do + e = assert_raises(RuntimeError) { parser.parse("") } + assert_equal("input string cannot be empty", e.message) end it "parse_empty_file" do @@ -59,15 +59,107 @@ class TestCase end end - it "parse_force_encoding" do - parser.parse_memory(<<-HTML, "UTF-8") - - Информация + describe "encoding" do + let(:html_encoding_iso8859) { <<~HTML } + + B\xF6hnhardt HTML - assert_equal( - "Информация", - parser.document.data.join.strip, - ) + + # this input string is really UTF-8 but is marked as ISO-8859-1 + let(:html_encoding_broken) { <<~HTML } + + Böhnhardt + HTML + + # this input string is really ISO-8859-1 but is marked as UTF-8 + let(:html_encoding_broken2) { <<~HTML } + + B\xF6hnhardt + HTML + + it "is nil by default to indicate encoding should be autodetected" do + parser = Nokogiri::HTML4::SAX::Parser.new(Doc.new) + assert_nil(parser.encoding) + end + + it "can be set in the initializer" do + assert_equal("UTF-8", Nokogiri::HTML4::SAX::Parser.new(Doc.new, "UTF-8").encoding) + assert_equal("ISO-2022-JP", Nokogiri::HTML4::SAX::Parser.new(Doc.new, "ISO-2022-JP").encoding) + end + + it "raises when given an invalid encoding name" do + assert_raises(ArgumentError) do + Nokogiri::HTML4::SAX::Parser.new(Doc.new, "not an 
encoding").parse_io(StringIO.new("")) + end + assert_raises(ArgumentError) do + Nokogiri::HTML4::SAX::Parser.new(Doc.new, "not an encoding").parse_memory("") + end + assert_raises(ArgumentError) { parser.parse_io(StringIO.new(""), "not an encoding") } + assert_raises(ArgumentError) { parser.parse_memory("", "not an encoding") } + end + + it "autodetects the encoding if not overridden" do + parser = Nokogiri::HTML4::SAX::Parser.new(Doc.new) + parser.parse(html_encoding_iso8859) + + # correctly converted the input ISO-8859-1 to UTF-8 for the callback + assert_equal("Böhnhardt", parser.document.data.join.strip) + end + + it "overrides the ISO-8859-1 document's encoding when set via initializer" do + parser = Nokogiri::HTML4::SAX::Parser.new(Doc.new) + parser.parse_memory(html_encoding_broken) + + assert_equal("Böhnhardt", parser.document.data.join.strip) + + parser = Nokogiri::HTML4::SAX::Parser.new(Doc.new, "UTF-8") + parser.parse_memory(html_encoding_broken) + + assert_equal("Böhnhardt", parser.document.data.join.strip) + end + + it "overrides the UTF-8 document's encoding when set via initializer" do + if Nokogiri.uses_libxml?(">= 2.13.0") # nekohtml is a better guesser than libxml2 + parser = Nokogiri::HTML4::SAX::Parser.new(Doc.new) + parser.parse_memory(html_encoding_broken2) + + assert(parser.document.errors.any? { |e| e.match(/Invalid byte/) }) + end + + parser = Nokogiri::HTML4::SAX::Parser.new(Doc.new) + parser.parse_memory(html_encoding_broken2, "ISO-8859-1") + + assert_equal("Böhnhardt", parser.document.data.join.strip) + refute(parser.document.errors.any? 
{ |e| e.match(/Invalid byte/) }) + end + + it "can be set via parse_io" do + if Nokogiri.uses_libxml?("< 2.13.0") + skip("older libxml2 encoding detection is sus") + end + + parser = Nokogiri::HTML4::SAX::Parser.new(Doc.new) + parser.parse_io(StringIO.new(html_encoding_broken), "UTF-8") + + assert_equal("Böhnhardt", parser.document.data.join.strip) + + parser = Nokogiri::HTML4::SAX::Parser.new(Doc.new) + parser.parse_io(StringIO.new(html_encoding_broken2), "ISO-8859-1") + + assert_equal("Böhnhardt", parser.document.data.join.strip) + end + + it "can be set via parse_memory" do + parser = Nokogiri::HTML4::SAX::Parser.new(Doc.new) + parser.parse_memory(html_encoding_broken, "UTF-8") + + assert_equal("Böhnhardt", parser.document.data.join.strip) + + parser = Nokogiri::HTML4::SAX::Parser.new(Doc.new) + parser.parse_memory(html_encoding_broken2, "ISO-8859-1") + + assert_equal("Böhnhardt", parser.document.data.join.strip) + end end it "parse_document" do diff --git a/test/html4/sax/test_parser_context.rb b/test/html4/sax/test_parser_context.rb index 72c0f63fc7..1ee7454319 100644 --- a/test/html4/sax/test_parser_context.rb +++ b/test/html4/sax/test_parser_context.rb @@ -3,53 +3,152 @@ require "helper" -module Nokogiri - module HTML - module SAX - class TestParserContext < Nokogiri::SAX::TestCase - def test_from_io +module Nokogiri::HTML4::SAX + describe Nokogiri::HTML4::SAX::ParserContext do + describe "constructor" do + describe ".new" do + it "handles IO" do ctx = ParserContext.new(StringIO.new("fo"), "UTF-8") assert(ctx) end - def test_from_string + it "handles String" do ctx = ParserContext.new("blah blah") assert(ctx) end + end + + it ".file" do + ctx = ParserContext.file(Nokogiri::TestCase::HTML_FILE, "UTF-8") + parser = Parser.new(Nokogiri::SAX::TestCase::Doc.new) + ctx.parse_with(parser) + + assert(parser.document.start_document_called) + assert(parser.document.end_document_called) + end + + it "gracefully handles invalid types" do + assert_raises(TypeError) { 
ParserContext.new(0xcafecafe) } + assert_raises(TypeError) { ParserContext.memory(0xcafecafe) } + assert_raises(TypeError) { ParserContext.io(0xcafecafe) } + assert_raises(TypeError) { ParserContext.file(0xcafecafe) } + end + + describe "encoding" do + # this input string is really ISO-8859-1 but is marked as UTF-8 + let(:html_encoding_broken2) { <<~HTML } + + B\xF6hnhardt + HTML - def test_parse_with - ctx = ParserContext.new("blah") + it "gracefully handles nonsense encodings" do assert_raises(ArgumentError) do - ctx.parse_with(nil) + ParserContext.io(StringIO.new("asdf"), "not-an-encoding") + end + assert_raises(ArgumentError) do + ParserContext.memory("asdf", "not-an-encoding") + end + assert_raises(ArgumentError) do + ParserContext.file(Nokogiri::TestCase::XML_FILE, "not-an-encoding") + end + end + + describe ".io" do + it "supports passing encoding name" do + pc = ParserContext.io(StringIO.new(html_encoding_broken2), "ISO-8859-1") + parser = Parser.new(Nokogiri::SAX::TestCase::Doc.new) + pc.parse_with(parser) + + assert_equal("Böhnhardt", parser.document.data.join.strip) + end + + it "supports passing Encoding" do + pc = ParserContext.io(StringIO.new(html_encoding_broken2), Encoding::ISO_8859_1) + parser = Parser.new(Nokogiri::SAX::TestCase::Doc.new) + pc.parse_with(parser) + + assert_equal("Böhnhardt", parser.document.data.join.strip) + end + + it "supports passing libxml2 encoding id" do + enc = nil + assert_output(nil, /deprecated/) do + enc = Parser::ENCODINGS["ISO-8859-1"] + end + + pc = nil + assert_output(nil, /deprecated/) do + pc = ParserContext.io(StringIO.new(html_encoding_broken2), enc) + end + + parser = Parser.new(Nokogiri::SAX::TestCase::Doc.new) + pc.parse_with(parser) + + assert_equal("Böhnhardt", parser.document.data.join.strip) end end - def test_parse_with_sax_parser - refute_raises do - xml = "" - ctx = ParserContext.new(xml) - parser = Parser.new(Doc.new) - ctx.parse_with(parser) + describe ".memory" do + it "supports passing encoding 
name" do
+            pc = ParserContext.memory(html_encoding_broken2, "ISO-8859-1")
+            parser = Parser.new(Nokogiri::SAX::TestCase::Doc.new)
+            pc.parse_with(parser)
+
+            assert_equal("Böhnhardt", parser.document.data.join.strip)
+          end
+
+          it "supports passing Encoding" do
+            pc = ParserContext.memory(html_encoding_broken2, Encoding::ISO_8859_1)
+            parser = Parser.new(Nokogiri::SAX::TestCase::Doc.new)
+            pc.parse_with(parser)
+
+            assert_equal("Böhnhardt", parser.document.data.join.strip)
           end
         end
-        def test_from_file
-          refute_raises do
-            ctx = ParserContext.file(HTML_FILE, "UTF-8")
-            parser = Parser.new(Doc.new)
-            ctx.parse_with(parser)
+        describe ".file" do
+          let(:file) do
+            Tempfile.new.tap do |f|
+              f.write html_encoding_broken2
+              f.close
+            end
+          end
+
+          it "supports passing encoding name" do
+            pc = ParserContext.file(file.path, "ISO-8859-1")
+            parser = Parser.new(Nokogiri::SAX::TestCase::Doc.new)
+            pc.parse_with(parser)
+
+            assert_equal("Böhnhardt", parser.document.data.join.strip)
+          end
+
+          it "supports passing Encoding" do
+            pc = ParserContext.file(file.path, Encoding::ISO_8859_1)
+            parser = Parser.new(Nokogiri::SAX::TestCase::Doc.new)
+            pc.parse_with(parser)
+
+            assert_equal("Böhnhardt", parser.document.data.join.strip)
           end
         end
+      end
+    end
-        def test_graceful_handling_of_invalid_types
-          assert_raises(TypeError) { ParserContext.new(0xcafecafe) }
-          assert_raises(TypeError) { ParserContext.memory(0xcafecafe, "UTF-8") }
-          assert_raises(TypeError) { ParserContext.io(0xcafecafe, 1) }
-          assert_raises(TypeError) { ParserContext.io(StringIO.new("asdf"), "should be an index into ENCODINGS") }
-          assert_raises(TypeError) { ParserContext.file(0xcafecafe, "UTF-8") }
-          assert_raises(TypeError) { ParserContext.file("path/to/file", 0xcafecafe) }
+    describe "#parse_with" do
+      it "raises when passed nil" do
+        ctx = ParserContext.new("blah")
+        assert_raises(ArgumentError) do
+          ctx.parse_with(nil)
         end
       end
+
+      it "parses when passed a sax parser" do
+        ctx = ParserContext.new("")
+        parser = Parser.new(Nokogiri::SAX::TestCase::Doc.new)
+
+        assert_nil(ctx.parse_with(parser))
+        assert(parser.document.start_document_called)
+        assert(parser.document.end_document_called)
+      end
     end
   end
 end
diff --git a/test/test_class_resolver.rb b/test/test_class_resolver.rb
new file mode 100644
index 0000000000..f8bbadba3f
--- /dev/null
+++ b/test/test_class_resolver.rb
@@ -0,0 +1,56 @@
+# frozen_string_literal: true
+
+require "helper"
+
+describe Nokogiri::ClassResolver do
+  describe Nokogiri::XML::Node do
+    it "finds the right things" do
+      assert_equal(
+        Nokogiri::XML::DocumentFragment,
+        Nokogiri::XML::Document.new.related_class("DocumentFragment"),
+      )
+      assert_equal(
+        Nokogiri::HTML4::DocumentFragment,
+        Nokogiri::HTML4::Document.new.related_class("DocumentFragment"),
+      )
+      if defined?(Nokogiri::HTML5)
+        assert_equal(
+          Nokogiri::HTML5::DocumentFragment,
+          Nokogiri::HTML5::Document.new.related_class("DocumentFragment"),
+        )
+      end
+    end
+  end
+
+  describe Nokogiri::XML::Builder do
+    it "finds the right things" do
+      assert_equal(
+        Nokogiri::XML::Document,
+        Nokogiri::XML::Builder.new.related_class("Document"),
+      )
+      assert_equal(
+        Nokogiri::HTML4::Document,
+        Nokogiri::HTML4::Builder.new.related_class("Document"),
+      )
+      if defined?(Nokogiri::HTML5)
+        assert_equal(
+          Nokogiri::HTML5::Document,
+          Nokogiri::HTML5::Builder.new.related_class("Document"),
+        )
+      end
+    end
+  end
+
+  describe Nokogiri::XML::SAX::Parser do
+    it "finds the right things" do
+      assert_equal(
+        Nokogiri::XML::SAX::ParserContext,
+        Nokogiri::XML::SAX::Parser.new.related_class("ParserContext"),
+      )
+      assert_equal(
+        Nokogiri::HTML4::SAX::ParserContext,
+        Nokogiri::HTML4::SAX::Parser.new.related_class("ParserContext"),
+      )
+    end
+  end
+end
diff --git a/test/xml/sax/test_parser.rb b/test/xml/sax/test_parser.rb
index 98640dea2e..b70d7b0f5c 100644
--- a/test/xml/sax/test_parser.rb
+++ b/test/xml/sax/test_parser.rb
@@ -215,9 +215,109 @@ class TestCase
           end
         end
-        it "has correct encoding" do
-          parser = Nokogiri::XML::SAX::Parser.new(Doc.new, "UTF-8")
-          assert_equal("UTF-8", parser.encoding)
+        describe "encoding" do
+          # proper ISO-8859-1 encoding
+          let(:xml_encoding_iso8859) { "\nB\xF6hnhardt" }
+          # this input string is really UTF-8 but is marked as ISO-8859-1
+          let(:xml_encoding_broken) { "\nBöhnhardt" }
+          # this input string is really ISO-8859-1 but is marked as UTF-8
+          let(:xml_encoding_broken2) { "\nB\xF6hnhardt" }
+
+          it "is nil by default to indicate encoding should be autodetected" do
+            parser = Nokogiri::XML::SAX::Parser.new(Doc.new)
+            assert_nil(parser.encoding)
+          end
+
+          it "can be set in the initializer" do
+            assert_equal("UTF-8", Nokogiri::XML::SAX::Parser.new(Doc.new, "UTF-8").encoding)
+            assert_equal("ISO-2022-JP", Nokogiri::XML::SAX::Parser.new(Doc.new, "ISO-2022-JP").encoding)
+          end
+
+          it "raises when given an invalid encoding name" do
+            assert_raises(ArgumentError) do
+              Nokogiri::XML::SAX::Parser.new(Doc.new, "not an encoding").parse_io(StringIO.new(""))
+            end
+            assert_raises(ArgumentError) do
+              Nokogiri::XML::SAX::Parser.new(Doc.new, "not an encoding").parse_memory("")
+            end
+            assert_raises(ArgumentError) { parser.parse_io(StringIO.new(""), "not an encoding") }
+            assert_raises(ArgumentError) { parser.parse_memory("", "not an encoding") }
+          end
+
+          it "autodetects the encoding if not overridden" do
+            parser = Nokogiri::XML::SAX::Parser.new(Doc.new)
+            parser.parse(xml_encoding_iso8859)
+
+            # correctly converted the input ISO-8859-1 to UTF-8 for the callback
+            assert_equal("Böhnhardt", parser.document.data.join)
+          end
+
+          it "overrides the ISO-8859-1 document's encoding when set via initializer" do
+            if Nokogiri.uses_libxml?("< 2.12.0") # gnome/libxml2@ec7be506
+              skip("older libxml2 encoding detection is sus")
+            end
+
+            # broken encoding!
+            parser = Nokogiri::XML::SAX::Parser.new(Doc.new)
+            parser.parse(xml_encoding_broken)
+
+            assert_equal("Böhnhardt", parser.document.data.join)
+
+            # override the encoding
+            parser = Nokogiri::XML::SAX::Parser.new(Doc.new, "UTF-8")
+            parser.parse(xml_encoding_broken)
+
+            assert_equal("Böhnhardt", parser.document.data.join)
+          end
+
+          it "overrides the UTF-8 document's encoding when set via initializer" do
+            if Nokogiri.uses_libxml?(">= 2.13.0")
+              # broken encoding!
+              parser = Nokogiri::XML::SAX::Parser.new(Doc.new)
+              parser.parse(xml_encoding_broken2)
+
+              assert(parser.document.errors.any? { |e| e.match(/Invalid byte/) })
+            end
+
+            # override the encoding
+            parser = Nokogiri::XML::SAX::Parser.new(Doc.new, "ISO-8859-1")
+            parser.parse(xml_encoding_broken2)
+
+            assert_equal("Böhnhardt", parser.document.data.join)
+            refute(parser.document.errors.any? { |e| e.match(/Invalid byte/) })
+          end
+
+          it "can be set via parse_io" do
+            if Nokogiri.uses_libxml?("< 2.13.0")
+              skip("older libxml2 encoding detection is sus")
+            end
+
+            parser = Nokogiri::XML::SAX::Parser.new(Doc.new)
+            parser.parse_io(StringIO.new(xml_encoding_broken), "UTF-8")
+
+            assert_equal("Böhnhardt", parser.document.data.join)
+
+            parser = Nokogiri::XML::SAX::Parser.new(Doc.new)
+            parser.parse_io(StringIO.new(xml_encoding_broken2), "ISO-8859-1")
+
+            assert_equal("Böhnhardt", parser.document.data.join)
+          end
+
+          it "can be set via parse_memory" do
+            if Nokogiri.uses_libxml?("< 2.12.0") # gnome/libxml2@ec7be506
+              skip("older libxml2 encoding detection is sus")
+            end
+
+            parser = Nokogiri::XML::SAX::Parser.new(Doc.new)
+            parser.parse_memory(xml_encoding_broken, "UTF-8")
+
+            assert_equal("Böhnhardt", parser.document.data.join) # here
+
+            parser = Nokogiri::XML::SAX::Parser.new(Doc.new)
+            parser.parse_memory(xml_encoding_broken2, "ISO-8859-1")
+
+            assert_equal("Böhnhardt", parser.document.data.join)
+          end
         end
         it "error strings are UTF-8" do
@@ -294,11 +394,6 @@ class TestCase
           end
         end
-        it "raises when given an invalid encoding name" do
-          assert_raises(ArgumentError) { Nokogiri::XML::SAX::Parser.new(Doc.new, "not an encoding") }
-          assert_raises(ArgumentError) { parser.parse_io(StringIO.new(""), "not an encoding") }
-        end
-
         it "cdata_block is called when CDATA is parsed" do
           parser.parse_memory(<<~XML)

diff --git a/test/xml/sax/test_parser_context.rb b/test/xml/sax/test_parser_context.rb
index 8a9a40935c..aa1ee2821b 100644
--- a/test/xml/sax/test_parser_context.rb
+++ b/test/xml/sax/test_parser_context.rb
@@ -3,130 +3,227 @@
 require "helper"
-module Nokogiri
-  module XML
-    module SAX
-      class TestParserContext < Nokogiri::SAX::TestCase
-        def setup
-          super
-          @xml = <<~EOF
-
-
-            world
-
-
-
-
-
-
-          EOF
+module Nokogiri::XML::SAX
+  class TestCounter < Nokogiri::XML::SAX::Document
+    attr_accessor :context, :lines, :columns
+
+    def initialize
+      super
+      @context = nil
+      @lines = []
+      @columns = []
+    end
+
+    def start_element(name, attrs = [])
+      @lines << [name, context.line]
+      @columns << [name, context.column]
+    end
+  end
+
+  describe Nokogiri::XML::SAX::ParserContext do
+    let(:xml) { <<~XML }
+
+
+      world
+
+
+
+
+
+
+    XML
+
+    describe "constructors" do
+      describe ".new" do
+        it "handles IO" do
+          ctx = ParserContext.new(StringIO.new("fo"), "UTF-8")
+          assert(ctx)
         end
-        class Counter < Nokogiri::XML::SAX::Document
-          attr_accessor :context, :lines, :columns
+        it "handles String" do
+          assert(ParserContext.new("blah blah"))
+        end
+      end
-          def initialize
-            super
-            @context = nil
-            @lines = []
-            @columns = []
-          end
+      it ".file" do
+        ctx = ParserContext.file(Nokogiri::TestCase::XML_FILE)
+        parser = Parser.new(Nokogiri::SAX::TestCase::Doc.new)
+        assert_nil(ctx.parse_with(parser))
+      end
+
+      it "graceful_handling_of_invalid_types" do
+        assert_raises(TypeError) { ParserContext.new(0xcafecafe) }
+        assert_raises(TypeError) { ParserContext.memory(0xcafecafe) }
+        assert_raises(TypeError) { ParserContext.io(0xcafecafe) }
+        assert_raises(TypeError) { ParserContext.io(0xcafecafe) }
+      end
+
+      describe "encoding" do
+        # this input string is really ISO-8859-1 but is marked as UTF-8
+        let(:xml_encoding_broken2) { "\nB\xF6hnhardt" }
-          def start_element(name, attrs = [])
-            @lines << [name, context.line]
-            @columns << [name, context.column]
+        it "gracefully handles nonsense encodings" do
+          assert_raises(ArgumentError) do
+            ParserContext.io(StringIO.new("asdf"), "not-an-encoding")
+          end
+          assert_raises(ArgumentError) do
+            ParserContext.memory("asdf", "not-an-encoding")
+          end
+          assert_raises(ArgumentError) do
+            ParserContext.file(Nokogiri::TestCase::XML_FILE, "not-an-encoding")
           end
         end
-        def test_line_numbers
-          sax_handler = Counter.new
+        describe ".io" do
+          it "supports passing encoding name" do
+            pc = ParserContext.io(StringIO.new(xml_encoding_broken2), "ISO-8859-1")
+            parser = Parser.new(Nokogiri::SAX::TestCase::Doc.new)
+            pc.parse_with(parser)
-          parser = Nokogiri::XML::SAX::Parser.new(sax_handler)
-          parser.parse(@xml) do |ctx|
-            sax_handler.context = ctx
+            assert_equal("Böhnhardt", parser.document.data.join)
           end
-          assert_equal(
-            [["hello", 1], ["inter", 4], ["net", 5]],
-            sax_handler.lines,
-          )
-        end
-
-        def test_column_numbers
-          sax_handler = Counter.new
+          it "supports passing Encoding" do
+            pc = ParserContext.io(StringIO.new(xml_encoding_broken2), Encoding::ISO_8859_1)
+            parser = Parser.new(Nokogiri::SAX::TestCase::Doc.new)
+            pc.parse_with(parser)
-          parser = Nokogiri::XML::SAX::Parser.new(sax_handler)
-          parser.parse(@xml) do |ctx|
-            sax_handler.context = ctx
+            assert_equal("Böhnhardt", parser.document.data.join)
           end
-          assert_equal(
-            [["hello", 7], ["inter", 7], ["net", 9]],
-            sax_handler.columns,
-          )
-        end
+          it "supports passing libxml2 encoding id" do
+            enc = nil
+            assert_output(nil, /deprecated/) do
+              enc = Parser::ENCODINGS["ISO-8859-1"]
+            end
+
+            pc = nil
+            assert_output(nil, /deprecated/) do
+              pc = ParserContext.io(StringIO.new(xml_encoding_broken2), enc)
+            end
-        def test_replace_entities
-          pc = ParserContext.new(StringIO.new(""), "UTF-8")
-          pc.replace_entities = false
-          refute(pc.replace_entities)
+            parser = Parser.new(Nokogiri::SAX::TestCase::Doc.new)
+            pc.parse_with(parser)
-          pc.replace_entities = true
-          assert(pc.replace_entities)
+            assert_equal("Böhnhardt", parser.document.data.join)
+          end
         end
-        def test_recovery
-          pc = ParserContext.new(StringIO.new(""), "UTF-8")
-          pc.recovery = false
-          refute(pc.recovery)
+        describe ".memory" do
+          it "supports passing encoding name" do
+            pc = ParserContext.memory(xml_encoding_broken2, "ISO-8859-1")
+            parser = Parser.new(Nokogiri::SAX::TestCase::Doc.new)
+            pc.parse_with(parser)
-          pc.recovery = true
-          assert(pc.recovery)
-        end
+            assert_equal("Böhnhardt", parser.document.data.join)
+          end
-        def test_graceful_handling_of_invalid_types
-          assert_raises(TypeError) { ParserContext.new(0xcafecafe) }
-          assert_raises(TypeError) { ParserContext.memory(0xcafecafe) }
-          assert_raises(TypeError) { ParserContext.io(0xcafecafe, 1) }
-          assert_raises(TypeError) { ParserContext.io(StringIO.new("asdf"), "should be an index into ENCODINGS") }
-        end
+          it "supports passing Encoding" do
+            pc = ParserContext.memory(xml_encoding_broken2, Encoding::ISO_8859_1)
+            parser = Parser.new(Nokogiri::SAX::TestCase::Doc.new)
+            pc.parse_with(parser)
-        def test_from_io
-          ctx = ParserContext.new(StringIO.new("fo"), "UTF-8")
-          assert(ctx)
+            assert_equal("Böhnhardt", parser.document.data.join)
+          end
         end
-        def test_from_string
-          assert(ParserContext.new("blah blah"))
-        end
+        describe ".file" do
+          let(:file) do
+            Tempfile.new.tap do |f|
+              f.write xml_encoding_broken2
+              f.close
+            end
+          end
-        def test_parse_with
-          ctx = ParserContext.new("blah")
-          assert_raises(ArgumentError) do
-            ctx.parse_with(nil)
+          it "supports passing encoding name" do
+            pc = ParserContext.file(file.path, "ISO-8859-1")
+            parser = Parser.new(Nokogiri::SAX::TestCase::Doc.new)
+            pc.parse_with(parser)
+
+            assert_equal("Böhnhardt", parser.document.data.join)
           end
-        end
-        def test_parse_with_sax_parser
-          xml = ""
-          ctx = ParserContext.new(xml)
-          parser = Parser.new(Doc.new)
-          assert_nil(ctx.parse_with(parser))
-        end
+          it "supports passing Encoding" do
+            pc = ParserContext.file(file.path, Encoding::ISO_8859_1)
+            parser = Parser.new(Nokogiri::SAX::TestCase::Doc.new)
+            pc.parse_with(parser)
-        def test_from_file
-          ctx = ParserContext.file(XML_FILE)
-          parser = Parser.new(Doc.new)
-          assert_nil(ctx.parse_with(parser))
+            assert_equal("Böhnhardt", parser.document.data.join)
+          end
         end
+      end
+    end
-        def test_parse_with_returns_nil
-          xml = ""
-          ctx = ParserContext.new(xml)
-          parser = Parser.new(Doc.new)
-          assert_nil(ctx.parse_with(parser))
+    describe "#parse_with" do
+      it "raises when passed nil" do
+        ctx = ParserContext.new("blah")
+
+        assert_raises(ArgumentError) do
+          ctx.parse_with(nil)
         end
       end
+
+      it "parses when passed a sax parser" do
+        xml = ""
+        ctx = ParserContext.new(xml)
+        parser = Parser.new(Nokogiri::SAX::TestCase::Doc.new)
+
+        assert_nil(ctx.parse_with(parser))
+        assert(parser.document.start_document_called)
+        assert(parser.document.end_document_called)
+      end
+    end
+
+    it "line_numbers" do
+      sax_handler = TestCounter.new
+
+      parser = Nokogiri::XML::SAX::Parser.new(sax_handler)
+      parser.parse(xml) do |ctx|
+        sax_handler.context = ctx
+      end
+
+      assert_equal(
+        [["hello", 1], ["inter", 4], ["net", 5]],
+        sax_handler.lines,
+      )
+    end
+
+    it "column_numbers" do
+      sax_handler = TestCounter.new
+
+      parser = Nokogiri::XML::SAX::Parser.new(sax_handler)
+      parser.parse(xml) do |ctx|
+        sax_handler.context = ctx
+      end
+
+      assert_equal(
+        [["hello", 7], ["inter", 7], ["net", 9]],
+        sax_handler.columns,
+      )
+    end
+
+    describe "attributes" do
+      it "#replace_entities" do
+        pc = ParserContext.new(StringIO.new(""), "UTF-8")
+        pc.replace_entities = false
+
+        refute(pc.replace_entities)
+
+        pc.replace_entities = true
+
+        assert(pc.replace_entities)
+      end
+
+      it "#recovery" do
+        pc = ParserContext.new(StringIO.new(""), "UTF-8")
+        pc.recovery = false
+
+        refute(pc.recovery)
+
+        pc.recovery = true
+
+        assert(pc.recovery)
+      end
     end
   end
 end