Misc. codes are no longer being read correctly #10
Comments
Okay, digging in here a little bit more, I can see some specific evidence of behavior differences between 1.15.2 and 1.16.0. For this XML entry from the repo's abridged fixture XML file:
<entry>
<ent_seq>1014660</ent_seq>
<r_ele>
<reb>アウタルキー</reb>
</r_ele>
<sense>
<pos>&n;</pos>
<lsource xml:lang="ger">Autarkie</lsource>
<gloss>autarchy</gloss>
</sense>
</entry>
When the
def start_element(name, attrs)
  parent = @current
  @current = (TAGS[name] || Tag::Other).new
  @current.start(name, attrs, parent)
end
The
In nokogiri@1.15.5, attrs is:
> attrs
=> [["xml:lang", "ger"]]
> attrs.to_h
=> {"xml:lang"=>"ger"}
In nokogiri@1.16.0 and later:
> attrs
=> [["xml:lang", "ger"], ["xml:lang", "eng"]]
> attrs.to_h
=> {"xml:lang"=>"eng"}
Since
|
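The `attrs.to_h` result above follows from how Ruby's `Array#to_h` treats duplicate keys: the last pair wins, so the extra `["xml:lang", "eng"]` pair overwrites the explicit `"ger"`. A minimal sketch of a defensive conversion that keeps the first value instead is below. `first_wins_attrs` is a hypothetical helper, not part of eiwa or Nokogiri, and it assumes the explicit attribute always precedes the extra pair, as in the 1.16.0 output shown in this comment.

```ruby
# Hypothetical helper (not from eiwa or Nokogiri): collapse SAX attribute
# pairs into a Hash, keeping the FIRST value seen for each attribute name.
# Assumption: the explicit attribute is listed before any duplicate pair,
# which matches the attrs array observed under nokogiri 1.16.0 in this thread.
def first_wins_attrs(attrs)
  attrs.each_with_object({}) do |(name, value), result|
    result[name] = value unless result.key?(name)
  end
end

# The attrs array as reported for nokogiri 1.16.0 and later.
attrs = [["xml:lang", "ger"], ["xml:lang", "eng"]]

# Array#to_h keeps the last value for a duplicated key.
last_wins = attrs.to_h

# The helper keeps the first value for a duplicated key.
first_wins = first_wins_attrs(attrs)

puts last_wins.inspect
puts first_wins.inspect
```

This only masks the symptom on the consuming side; whether the duplicate pair should be emitted at all is a separate question for the parser.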
hey @flavorjones, I'm sorry to bug you by asking you if you might take a look at this for me, but it's been so many years since I did XML work very seriously that at a glance it's hard for me to tell if this is a me-issue or a nokogiri+SAX+DTD issue in parsing this kinda goofy document. As it stands, it seems like a whole bunch of these tags are being parsed with incorrect attributes under 1.16.0 and up |
Sure! I'll take a look in the morning. |
Thanks Mike. As I dig into this this morning, I worry that the topic of the issue (my more immediate problem, having to do with entity parsing, maybe) is probably unrelated to my potential nokogiri bug in this comment, so feel free to just focus on the comment for now so as not to get confused. I'll keep trying to isolate the entity issue |
@flavorjones as for the original issue, I believe this is the cause. Filed on nokogiri sparklemotion/nokogiri#3147 |
grr you beat me by two minutes. here's my repro:
#! /usr/bin/env ruby

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri", "~>1.15.0"
end

class Document < Nokogiri::XML::SAX::Document
  def start_element(name, attrs)
    puts "#{__FILE__}:#{__LINE__}:#{__method__}: name=#{name.inspect}, attrs=#{attrs.inspect}"
  end
end

fixture = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE root [
  <!ATTLIST foo xml:lang CDATA "eng">
  ]>
  <root>
    <foo xml:lang="ger">Ja</foo>
  </root>
XML

parser = Nokogiri::XML::SAX::Parser.new(Document.new)
parser.parse(fixture)

# with nokogiri < 1.16.0:
# ./10-sax-issue.rb:12:start_element: name="root", attrs=[]
# ./10-sax-issue.rb:12:start_element: name="foo", attrs=[["xml:lang", "ger"]]
#
# with nokogiri >= 1.16.0:
# ./10-sax-issue.rb:12:start_element: name="root", attrs=[]
# ./10-sax-issue.rb:12:start_element: name="foo", attrs=[["xml:lang", "ger"], ["xml:lang", "eng"]] |
@searls that bug report seems valid, but isn't the problem you're seeing here with I've opened an upstream issue in Nokogiri at sparklemotion/nokogiri#3148 |
Yep, it's two separate issues! You repro'd the one I asked you to, and I was digging into the other one (the original reason I opened this issue, but overnight realized my first comment where i tagged you was a red herring / separate problem) |
Nokogiri v1.16.3 is out which fixes the attributes issue in the |
Awesome! I'll check this out soon |
Can confirm! The attribute issue is indeed fixed! |
@searls I've described a possible fix to Nokogiri for the entity errors at sparklemotion/nokogiri#1926 that would require some changes in how eiwa works (also described there). |
@flavorjones Thanks! left a comment |
It appears that under the latest nokogiri (v1.16.2), misc. tags of meanings in JMDict are no longer being parsed correctly.
(Last known to work under nokogiri@1.13.3)
Take this meaning from JMDict item #1224880, よろしい. It should have the tag "uk", which indicates "usually kana".
But instead, as the printout of its sole meaning above shows, the code of both misc tags is now coming back as nil: