Misc. codes are no longer being read correctly #10

Closed
searls opened this issue Mar 12, 2024 · 13 comments · Fixed by #11

Comments

@searls
Owner

searls commented Mar 12, 2024

It appears that under the latest nokogiri (v1.16.2), misc. tags of meanings in JMDict are no longer being parsed correctly.

(Last known to work under nokogiri@1.13.3)

Take this meaning from JMDict item #1224880, よろしい. It should have the tag "uk" which indicates "usually kana"

#<Eiwa::Tag::Meaning:0x0000000126f3b908
 @antonyms=[],
 @attrs={},
 @characters="",
 @comments=[],
 @cross_references=[],
 @definitions=
  [#<Eiwa::Tag::Definition:0x00000001239d4760
    @attrs={"xml:lang"=>"eng"},
    @characters="good",
    @gender=nil,
    @language="eng",
    @parent=#<Eiwa::Tag::Meaning:0x0000000126f3b908 ...>,
    @tag_name="gloss",
    @text="good",
    @type=nil>,
   #<Eiwa::Tag::Definition:0x00000001239d4440
    @attrs={"xml:lang"=>"eng"},
    @characters="OK",
    @gender=nil,
    @language="eng",
    @parent=#<Eiwa::Tag::Meaning:0x0000000126f3b908 ...>,
    @tag_name="gloss",
    @text="OK",
    @type=nil>,
   #<Eiwa::Tag::Definition:0x00000001239d2f50
    @attrs={"xml:lang"=>"eng"},
    @characters="all right",
    @gender=nil,
    @language="eng",
    @parent=#<Eiwa::Tag::Meaning:0x0000000126f3b908 ...>,
    @tag_name="gloss",
    @text="all right",
    @type=nil>,
   #<Eiwa::Tag::Definition:0x00000001239d2b40
    @attrs={"xml:lang"=>"eng"},
    @characters="fine",
    @gender=nil,
    @language="eng",
    @parent=#<Eiwa::Tag::Meaning:0x0000000126f3b908 ...>,
    @tag_name="gloss",
    @text="fine",
    @type=nil>,
   #<Eiwa::Tag::Definition:0x00000001239d2870
    @attrs={"xml:lang"=>"eng"},
    @characters="very well",
    @gender=nil,
    @language="eng",
    @parent=#<Eiwa::Tag::Meaning:0x0000000126f3b908 ...>,
    @tag_name="gloss",
    @text="very well",
    @type=nil>,
   #<Eiwa::Tag::Definition:0x00000001239d1a10
    @attrs={"xml:lang"=>"eng"},
    @characters="will do",
    @gender=nil,
    @language="eng",
    @parent=#<Eiwa::Tag::Meaning:0x0000000126f3b908 ...>,
    @tag_name="gloss",
    @text="will do",
    @type=nil>,
   #<Eiwa::Tag::Definition:0x00000001239d0610
    @attrs={"xml:lang"=>"eng"},
    @characters="may",
    @gender=nil,
    @language="eng",
    @parent=#<Eiwa::Tag::Meaning:0x0000000126f3b908 ...>,
    @tag_name="gloss",
    @text="may",
    @type=nil>,
   #<Eiwa::Tag::Definition:0x00000001239d04d0
    @attrs={"xml:lang"=>"eng"},
    @characters="can",
    @gender=nil,
    @language="eng",
    @parent=#<Eiwa::Tag::Meaning:0x0000000126f3b908 ...>,
    @tag_name="gloss",
    @text="can",
    @type=nil>],
 @dialects=[],
 @fields=[],
 @misc_tags=
  [#<Eiwa::Tag::Entity:0x00000001239d49e0
    @attrs={},
    @code=nil,
    @parent=#<Eiwa::Tag::Meaning:0x0000000126f3b908 ...>,
    @tag_name="misc",
    @text=nil>,
   #<Eiwa::Tag::Entity:0x00000001239d48f0
    @attrs={},
    @code=nil,
    @parent=#<Eiwa::Tag::Meaning:0x0000000126f3b908 ...>,
    @tag_name="misc",
    @text=nil>],
 @parent=
  #<Eiwa::Tag::Entry:0x00000001239d4ee0
   @attrs={},
   @characters="",
   @id=1224880,
   @meanings=[#<Eiwa::Tag::Meaning:0x0000000126f3b908 ...>],
   @parent=#<Eiwa::Tag::Other:0x000000012331aaf0 @attrs={}, @characters="", @parent=nil, @tag_name="JMdict">,
   @readings=
    [#<Eiwa::Tag::Reading:0x00000001239d4cb0
      @attrs={},
      @characters="",
      @frequency_tags=[:ichi1],
      @imprecise_reading=false,
      @info_tags=[],
      @parent=#<Eiwa::Tag::Entry:0x00000001239d4ee0 ...>,
      @tag_name="r_ele",
      @text="よろしい">],
   @spellings=
    [#<Eiwa::Tag::Spelling:0x00000001239d4df0
      @attrs={},
      @characters="",
      @frequency_tags=[:ichi1],
      @info_tags=[],
      @parent=#<Eiwa::Tag::Entry:0x00000001239d4ee0 ...>,
      @tag_name="k_ele",
      @text="宜しい">],
   @tag_name="entry">,
 @parts_of_speech=
  [#<Eiwa::Tag::Entity:0x00000001239d4b20
    @attrs={},
    @code=nil,
    @parent=#<Eiwa::Tag::Meaning:0x0000000126f3b908 ...>,
    @tag_name="pos",
    @text=nil>],
 @restricted_to_readings=[],
 @restricted_to_spellings=[],
 @source_languages=[],
 @tag_name="sense">

But instead, as the printout of its sole meaning above shows, the code of both misc tags is now coming back as nil:

> entry.meanings.first.misc_tags.map(&:code)
=> [nil, nil]
@searls
Owner Author

searls commented Mar 12, 2024

Okay, digging in here a little bit more, I can see some specific evidence of behavior differences between 1.15.2 and 1.16.0

For this XML entry from the repo's abridged fixture XML file:

<entry>
<ent_seq>1014660</ent_seq>
<r_ele>
<reb>アウタルキー</reb>
</r_ele>
<sense>
<pos>&n;</pos>
<lsource xml:lang="ger">Autarkie</lsource>
<gloss>autarchy</gloss>
</sense>
</entry>

When the <lsource> tag is processed in the doc (which is an instance of Nokogiri::XML::SAX::Document) by the start_element callback:

      def start_element(name, attrs)
        parent = @current
        @current = (TAGS[name] || Tag::Other).new
        @current.start(name, attrs, parent)
      end

The attrs value differs.

In nokogiri@1.15.5, attrs is:

> attrs
=> [["xml:lang", "ger"]]
> attrs.to_h
=> {"xml:lang"=>"ger"}

In nokogiri@1.16.0 and later:

> attrs
=> [["xml:lang", "ger"], ["xml:lang", "eng"]]
> attrs.to_h
=> {"xml:lang"=>"eng"}

Since xml:lang="eng" is not an attribute on the lsource element, I'm tempted to think this is a nokogiri bug, unless there's an XML namespaces rule that I'm overlooking (which is extremely possible). The DTD at the top does define "eng" as the default value, so perhaps the default is being copied even when the attribute is explicitly set?

<!ATTLIST lsource xml:lang CDATA "eng">
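For illustration, here is a minimal sketch of why the duplicated pair matters to a caller that does attrs.to_h: Ruby's Array#to_h keeps the last value for a repeated key, so the DTD default overwrites the explicit attribute. The helper first_wins_attrs is hypothetical (it is not part of eiwa or nokogiri), and it assumes the explicit attribute is always reported before the defaulted one, which is only what the output above shows for this one element.

```ruby
# Hypothetical helper (not part of eiwa or nokogiri): collapse SAX attribute
# pairs into a Hash, keeping the FIRST value seen for each attribute name.
# Assumption: the explicitly set attribute is reported before the DTD default.
def first_wins_attrs(attrs)
  attrs.each_with_object({}) do |(name, value), result|
    result[name] = value unless result.key?(name)
  end
end

# The pairs reported for <lsource xml:lang="ger"> under nokogiri 1.16.0
attrs = [["xml:lang", "ger"], ["xml:lang", "eng"]]

last_wins = attrs.to_h               # Array#to_h: the later "eng" pair overwrites "ger"
first_wins = first_wins_attrs(attrs) # the explicit "ger" value is kept

puts last_wins["xml:lang"]
puts first_wins["xml:lang"]
```

This would only be a stopgap on the consumer side; it does not address why the default is reported alongside the explicit value in the first place.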

@searls
Owner Author

searls commented Mar 12, 2024

hey @flavorjones, I'm sorry to bug you by asking you if you might take a look at this for me, but it's been so many years since I did XML work very seriously that at a glance it's hard for me to tell if this is a me-issue or a nokogiri+SAX+DTD issue in parsing this kinda goofy document.

As it stands, it seems like a whole bunch of these tags are being parsed with incorrect attributes under 1.16.0 and up

@flavorjones
Contributor

Sure! I'll take a look in the morning.

@searls
Owner Author

searls commented Mar 12, 2024

Thanks Mike. As I dig into this this morning, I worry that the topic of the issue (my more immediate problem, having to do with entity parsing, maybe) is probably unrelated to my potential nokogiri bug in this comment, so feel free to just focus on the comment for now so as not to get confused. I'll keep trying to isolate the entity issue

@searls
Owner Author

searls commented Mar 12, 2024

@flavorjones as for the original issue, I believe this is the cause. Filed on nokogiri sparklemotion/nokogiri#3147

@flavorjones
Contributor

grr you beat me by two minutes. here's my repro:

#! /usr/bin/env ruby

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri", "~>1.15.0"
end

class Document < Nokogiri::XML::SAX::Document
  def start_element(name, attrs)
    puts "#{__FILE__}:#{__LINE__}:#{__method__}: name=#{name.inspect}, attrs=#{attrs.inspect}"
  end
end

fixture = <<~XML
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root [
  <!ATTLIST foo xml:lang CDATA "eng">
]>
<root>
  <foo xml:lang="ger">Ja</foo>
</root>
XML

parser = Nokogiri::XML::SAX::Parser.new(Document.new)
parser.parse(fixture)

# with nokogiri < 1.16.0:
# ./10-sax-issue.rb:12:start_element: name="root", attrs=[]
# ./10-sax-issue.rb:12:start_element: name="foo", attrs=[["xml:lang", "ger"]]
# 
# with nokogiri >= 1.16.0:
# ./10-sax-issue.rb:12:start_element: name="root", attrs=[]
# ./10-sax-issue.rb:12:start_element: name="foo", attrs=[["xml:lang", "ger"], ["xml:lang", "eng"]]

@flavorjones
Contributor

@searls that bug report seems valid, but isn't the problem you're seeing here with xml:lang.

I've opened an upstream issue in Nokogiri at sparklemotion/nokogiri#3148

@searls
Owner Author

searls commented Mar 12, 2024

Yep, it's two separate issues! You repro'd the one I asked you to, and I was digging into the other one (the original reason I opened this issue; overnight I realized that my first comment, where I tagged you, was a red herring / separate problem).

@flavorjones
Contributor

Nokogiri v1.16.3 is out which fixes the attributes issue in the start_element callback. https://github.com/sparklemotion/nokogiri/releases/tag/v1.16.3

@searls
Owner Author

searls commented Mar 16, 2024

Awesome! I'll check this out soon

@searls
Owner Author

searls commented Mar 16, 2024

Can confirm! The attribute issue is indeed fixed!

@flavorjones
Contributor

@searls I've described a possible fix to Nokogiri for the entity errors at sparklemotion/nokogiri#1926 that would require some changes in how eiwa works (also described there).

@searls
Owner Author

searls commented Mar 19, 2024

@flavorjones Thanks! Left a comment.
