Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SAX Parser ignores explicitly set 'UTF-8' encoding and proceeds to reencode the document resulting in double-encoding artifacts #918

Closed
vladgurovich opened this issue Jun 7, 2013 · 7 comments · Fixed by #3288

Comments

@vladgurovich
Copy link

Unlike regular Nokogiri::XML parser which allows you to explicitly set the encoding, Nokogiri::XML::SAX::Parser ignores the explicitly set encoding and uses the one specified in the provided xml(which has already been encoded in UTF-8, but the contents havent changed):

  • the sax handling code
class MyDocument < Nokogiri::XML::SAX::Document
  def characters(string)
    puts "ENCODING: #{string.encoding} VALUE: #{string}"
  end
  def xmldecl(version, encoding, standalone)
    puts "XMLDECL VERSION: #{version} ENCODING: #{encoding} STANDALONE: #{standalone}"
  end
  def warning(string)
    puts "WARNING: #{string}"
  end
end
  • console output when encoding IS NOT specified to the parser:
1.9.3p125 :003 > Nokogiri::VERSION
 => "1.5.9" 
1.9.3p125 :004 > parser = Nokogiri::XML::SAX::Parser.new(MyDocument.new)
 => #<Nokogiri::XML::SAX::Parser:0x007f9c631898e8> 
1.9.3p125 :005 > parser.encoding
 => "UTF-8" 
1.9.3p125 :006 > parser.parse("<?xml version='1.0' encoding='ISO-8859-1' standalone='yes'?>\n<content>Böhnhardt</content>")
XMLDECL VERSION: 1.0 ENCODING:  STANDALONE: yes
ENCODING: UTF-8 VALUE: B
ENCODING: UTF-8 VALUE: öhnhardt   #<<< borked characters 
  • console output when encoding IS specified to the parser:
1.9.3p125 :029 > parser = Nokogiri::XML::SAX::Parser.new(MyDocument.new, 'UTF-8')
 => #<Nokogiri::XML::SAX::Parser:0x007f9c631fd388> 
1.9.3p125 :030 > parser.parse("<?xml version='1.0' encoding='ISO-8859-1' standalone='yes'?>\n<content>Böhnhardt</content>")
XMLDECL VERSION: 1.0 ENCODING:  STANDALONE: yes
ENCODING: UTF-8 VALUE: B
ENCODING: UTF-8 VALUE: öhnhardt  #<< still borked characters
 => nil 

With a regular parser however, when encoding is explicitly specified (per instructions), its respected and double conversion does not occur:

1.9.3p125 :026 > Nokogiri.XML("<?xml version='1.0' encoding='ISO-8859-1' standalone='yes'?>\n<content>Böhnhardt</content>").xpath('//content').text
 => "Böhnhardt" 
 1.9.3p125 :027 > Nokogiri.XML("<?xml version='1.0' encoding='ISO-8859-1' standalone='yes'?>\n<content>Böhnhardt</content>", nil, 'UTF-8').xpath('//content').text
 => "Böhnhardt" 
@vladgurovich
Copy link
Author

related issue: #844

@kml
Copy link

kml commented Aug 16, 2013

It's also problematic on JRuby. XmlSaxParserContext#parse_io ignores encoding parameter
https://github.com/sparklemotion/nokogiri/blob/master/ext/java/nokogiri/XmlSaxParserContext.java#L146

It would be usefull to have possibility to force encoding.

@skatkov
Copy link

skatkov commented Sep 22, 2013

I have reported similar problem on StackOverflow -> http://stackoverflow.com/questions/18947249/using-ruby-sax-parsers-for-gb2312-encoded-xml

@flavorjones
Copy link
Member

This bug is similar to #671 which seems to be an issue with libxml2. What's different is that this report covers parsing in-memory strings, and #671 covers parsing IO, which are two different libxml2 functions. Regardless, they seem related.

@flavorjones
Copy link
Member

Seems like we can use xmlSwitchEncoding to tell libxml2 what the expected encoding will be. I need to update Nokogiri to make sure that argument is treated as an override to the encoding in XML declaration.

flavorjones added a commit that referenced this issue Jul 5, 2024
Previously, this functionality worked fine for `#parse_io` but didn't
work for `#parse_memory`.

This change introduces a new optional encoding parameter to
`SAX::Parser#parse_memory` and `SAX::ParserContext.memory`, and makes
sure to use that encoding or the one passed to the Parser's initializer.

This change also makes optional the encoding_id parameter to
`SAX::ParserContext.io`, which was previously required.

Closes #918
@flavorjones
Copy link
Member

Proposed fix is at #3282

And I have another PR in progress to clean up all of this code and use real encoding names (and not these libxml2-specific encoding IDs).

flavorjones added a commit that referenced this issue Jul 7, 2024
Previously, this functionality worked fine for `#parse_io` but didn't
work for `#parse_memory`.

This change introduces a new optional encoding parameter to
`SAX::Parser#parse_memory` and `SAX::ParserContext.memory`, and makes
sure to use that encoding or the one passed to the Parser's initializer.

This change also makes optional the encoding_id parameter to
`SAX::ParserContext.io`, which was previously required.

Finally, this commit also backfills similar test coverage for the
HTML4 sax parser encoding, which should help with an upcoming big
refactor.

Closes #918
flavorjones added a commit that referenced this issue Jul 7, 2024
Previously, encoding overrides were not implemented for
XML::SAX::Parser#parse_memory (as reported in #918) and
XML::SAX::Parser#parse_file.

However, this commit goes further and significantly simplifies and
unifies the two SAX::ParserContext implementations and the two
SAX::Parser implementations.

This commit also allows Encoding objects and encoding names to be
passed into the SAX::ParserContext methods, and the XML memory and
file methods now accept and properly use passed encodings.

Finally, this commit also backfills a lot of test coverage for the XML
and the HTML4 sax parser encoding.

Closes #918
@flavorjones
Copy link
Member

New proposed fix is at #3288

flavorjones added a commit that referenced this issue Jul 7, 2024
**What problem is this PR intended to solve?**

Previously, encoding overrides were not implemented for
XML::SAX::Parser#parse_memory (as reported in #918) and
XML::SAX::Parser#parse_file.

However, this commit goes further and significantly simplifies and
unifies the two SAX::ParserContext implementations and the two
SAX::Parser implementations.

This commit also allows Encoding objects and encoding names to be passed
into the SAX::ParserContext methods, and the XML memory and file methods
now accept and properly use passed encodings.

Finally, this commit also backfills a lot of test coverage for the XML
and the HTML4 sax parser encoding.

Closes #918


**Have you included adequate test coverage?**

Yes.


**Does this change affect the behavior of either the C or the Java
implementations?**

Yes, but they are more consistent with each other.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants