version > 1.4.4 produces duplicate elements when using Nokogiri::HTML with an invalid HTML doc #478

Closed
scottkf opened this issue Jun 23, 2011 · 9 comments

@scottkf

scottkf commented Jun 23, 2011

When using version 1.4.4, the following produces the correct results:

Nokogiri::HTML(open(URI.encode("http://store.steampowered.com/search/results?sort_by=Name&sort_order=ASC&category1=998&cc=us&v6=1&page=1")))

Versions 1.4.5 and 1.4.6 both yield duplicate elements. I did not try 1.4.4.1 or 1.4.4.2.

I suspect it has to do with validity: the page is not valid HTML because it lacks <html> and <body> tags.

Source is here: http://pastie.org/2111018

@flavorjones
Member

Hello!

Thanks for asking this question! However, without more information,
Team Nokogiri cannot reproduce your issue, and so we cannot offer much
help.

Please provide us with:

  • The output of nokogiri -v, which will provide details about your
    platform and versions of ruby, libxml2 and nokogiri.
  • More information on how to find the duplicate elements. Cursory
    examination on my platform did not reveal any duplicates.

For more information about requesting help or reporting bugs, please
take a look at http://bit.ly/nokohelp

Thank you so much!

@scottkf
Author

scottkf commented Jun 23, 2011

Sorry about that. Locally, nokogiri -v produces:

--- 
warnings: []
nokogiri: 1.4.6
ruby: 
  version: 1.9.2
  platform: x86_64-darwin10.7.3
  description: ruby 1.9.2p180 (2011-02-18 revision 30909) [x86_64-darwin10.7.3]
  engine: ruby
libxml: 
  binding: extension
  compiled: 2.7.8
  loaded: 2.7.8

https://gist.github.com/1042949 is a simple test describing the behavior. It failed locally, on Heroku, and on another host. Note that visual inspection of the HTML shows that <a href="http://store.steampowered.com/app/15540/?snr=1_7_7_230_150_1">...</a> appears only once. The source is saved in @src in the gist.
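
The check described above (a link that occurs once in the raw source but more than once after parsing) can be sketched roughly as follows. This is not the code from the gist; the helper names `duplicate_hrefs` and `raw_href_count` are hypothetical, and the list of hrefs is assumed to have been collected from the parsed document beforehand.

```ruby
# Hypothetical helper: given the hrefs collected from a parsed document,
# return a hash of href => count for every href seen more than once.
def duplicate_hrefs(hrefs)
  hrefs.group_by { |href| href }
       .map { |href, hits| [href, hits.size] }
       .select { |_href, count| count > 1 }
       .to_h
end

# Hypothetical helper: count literal occurrences of an href attribute
# in the raw, unparsed source text.
def raw_href_count(source, href)
  source.scan(%(href="#{href}")).size
end
```

With Nokogiri, the list of hrefs could be gathered with something like `doc.css("a").map { |a| a["href"] }`. A link is suspicious when `raw_href_count` reports 1 for it but `duplicate_hrefs` still lists it.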

Heroku's Nokogiri::VERSION_INFO is:

>> Nokogiri::VERSION_INFO
=> {"warnings"=>[], "nokogiri"=>"1.4.5", "ruby"=>{"version"=>"1.9.2", "platform"=>"x86_64-linux", "description"=>"ruby 1.9.2p180 (2011-02-18 revision 30909) [x86_64-linux]", "engine"=>"ruby"}, "libxml"=>{"binding"=>"extension", "compiled"=>"2.7.6", "loaded"=>"2.7.6"}}

The other host:

ruby-1.9.2-p0 > Nokogiri::VERSION_INFO
 => {"warnings"=>[], "nokogiri"=>"1.4.6", "ruby"=>{"version"=>"1.9.2", "platform"=>"x86_64-linux", "description"=>"ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-linux]", "engine"=>"ruby"}, "libxml"=>{"binding"=>"extension", "compiled"=>"2.7.7", "loaded"=>"2.7.7"}}

@ender672
Member

I wrote a test that reproduces this issue. Test is here: https://gist.github.com/1049877

@ender672
Member

I bisected the issue using the test above. Here is my bisection result:

$ git bisect bad
c39eb4e is the first bad commit
commit c39eb4e
Author: Akinori MUSHA <knu@idaemons.org>
Date: Wed Dec 29 15:17:30 2010 +0900

Add further encoding detection to HTML parser that libxml2 does not do.

An encoding option specified in an XML declaration is honored (which
should be significant in XHTML), and a charset option specified in a
<meta http-equiv="Content-Type"> tag properly works even if it appears
after an occurrence of non-ASCII characters.

:100644 100644 b521ce3a023941680e78ea2ed9426862d2bc7803 e4b30b3efed08406e18267acd997b2db092fd338 M CHANGELOG.rdoc
:040000 040000 fa568350eab143d5950c2e50dccf84c11f0403c8 bc8d583609896cb2ade81e44cd2be6ec436ea284 M lib
:040000 040000 6c0e496445ffe99ea597a2ffb57117fea8b11507 6fc3af16eaa94ff499c27d92d1480829805033b9 M test
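
To make the commit message above concrete, here is a minimal sketch of the kind of detection it describes: honoring an encoding in an XML declaration, and finding a charset in a <meta http-equiv="Content-Type"> tag even when non-ASCII bytes precede it. This is a simplified illustration, not the code from c39eb4e; the method name `sniff_declared_encoding` and the regular expressions are assumptions.

```ruby
# Simplified illustration only: return the encoding name declared in a
# chunk of HTML/XHTML bytes, or nil if no declaration is found.
def sniff_declared_encoding(bytes)
  # Treat the input as raw bytes so that non-ASCII content appearing
  # before the declaration cannot cause invalid-encoding errors.
  head = bytes.dup.force_encoding(Encoding::BINARY)

  # 1. XML declaration, e.g. <?xml version="1.0" encoding="Shift_JIS"?>
  decl = head.match(/\A\s*<\?xml[^>]*?\bencoding\s*=\s*["']([^"']+)["']/i)
  return decl[1] if decl

  # 2. <meta http-equiv="Content-Type" content="text/html; charset=...">
  head.scan(/<meta\b[^>]*>/i).each do |tag|
    next unless tag =~ /http-equiv\s*=\s*["']?content-type/i
    charset = tag.match(/charset\s*=\s*["']?([A-Za-z0-9._:-]+)/i)
    return charset[1] if charset
  end

  nil
end
```

A detector like this has to read part of the input before the real parse starts, so whatever it consumed must be handed to the parser exactly once afterwards.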

@ender672
Member

@knu -- please take a look at 984a554 -- Nokogiri::XML::Document#read_io silently discards IO errors in order to avoid a memory leak.

It looks like c39eb4e needs IO exceptions when reading HTML files.

One option is to rewrite c39eb4e so that it doesn't require IO exceptions.

Another option is to rewrite 984a554 so that exceptions can occur without leaking memory.
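
As a rough illustration of the trade-off between those two options, the sketch below contrasts a reader that swallows IO errors with one that lets them propagate. It is plain Ruby, not Nokogiri's read_io; `FlakyIO`, `read_all_swallowing` and `read_all_raising` are hypothetical names made up for this example.

```ruby
require "stringio"

# Hypothetical IO that fails after a given number of successful reads.
class FlakyIO
  def initialize(data, fail_after:)
    @io = StringIO.new(data)
    @fail_after = fail_after
    @reads = 0
  end

  def read(len)
    @reads += 1
    raise IOError, "simulated read failure" if @reads > @fail_after
    @io.read(len)
  end
end

# Swallows errors: a failed read looks exactly like end of input,
# so the caller silently receives truncated data.
def read_all_swallowing(io, chunk = 4)
  out = +""
  loop do
    piece = begin
      io.read(chunk)
    rescue IOError
      nil # treated as EOF; the failure is never reported
    end
    break if piece.nil?
    out << piece
  end
  out
end

# Lets errors propagate, so callers that rely on exceptions still see them.
def read_all_raising(io, chunk = 4)
  out = +""
  while (piece = io.read(chunk))
    out << piece
  end
  out
end
```

Any code layered on top of the swallowing reader cannot tell a short document from a failed read, which is the concern raised about code that expects IO exceptions.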

@ender672
Member

I reverted 984a554 and my test above still fails. Guess it isn't caused by the swallowing of exceptions after all (but do keep that in mind!).

@knu
Member

knu commented Jun 29, 2011

OK, I'll look into this later today.

@knu closed this as completed Jun 29, 2011
@knu reopened this Jun 29, 2011

@ender672
Member

I submitted a pull request that fixes the issue here: https://github.com/tenderlove/nokogiri/pull/481

@flavorjones
Member

@ender672 thanks for the followup.
