Feature Request: Peeking at the beginning of the body (without getting the whole thing) #215

Open
epitron opened this issue Mar 16, 2012 · 5 comments

@epitron

epitron commented Mar 16, 2012

I've found myself recently needing the ability to read just the beginning of an HTTP response (just the first 1k), and it doesn't seem like Mechanize is currently outfitted to do this.

Pluggable Parsers let me get a body_io, so I tried to read just the first 400 bytes and then close it. No dice, unfortunately! Mechanize ends up downloading the whole thing!

Is there an easy way to do this? Or would this be a nasty hack?

@drbrain
Member

drbrain commented May 2, 2012

Depending upon server support, the Range header can do this.

This retrieves the first 10 bytes:

require 'mechanize'

agent = Mechanize.new
page = agent.get "http://localhost", [], nil, 'Range' => 'bytes=0-9'
p page.body

Apache and WEBrick support range requests; Google does not.

For one of the servers I tested, gzip encoding caused an indecipherable response body. I'm not sure if it's a bug of the server or not, though.
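As a sketch of the same Range approach outside Mechanize, the snippet below uses Ruby's standard Net::HTTP. The helper name `fetch_prefix` is hypothetical and is not part of Mechanize or Net::HTTP. It assumes that asking for `Accept-Encoding: identity` avoids the compressed-range problem described above, which may not hold for every server, and it trims the body locally in case the server ignores the Range header and replies 200 with the full content.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: fetch at most `count` leading bytes of `uri`.
# Returns [bytes, honored]; `honored` is true when the server replied
# 206 Partial Content, i.e. it actually applied the Range header.
def fetch_prefix(uri, count)
  uri = URI(uri)
  request = Net::HTTP::Get.new(uri)
  request['Range'] = "bytes=0-#{count - 1}"
  # Assumption: an uncompressed body keeps the byte range meaningful.
  request['Accept-Encoding'] = 'identity'

  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end

  honored = response.is_a?(Net::HTTPPartialContent)
  # A server that ignores Range sends the whole body; trim it locally.
  [response.body.to_s.byteslice(0, count), honored]
end
```

When `honored` comes back false, the whole body was still transferred, so the Range header only saves bandwidth on servers that support it.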

Mechanize does not copy headers across redirects at this time; I will fix that.

Fetching just the first 400 bytes is possible, but I'm unsure how to create a good API for it.

@epitron
Author

epitron commented May 3, 2012

Cool! Thanks, Dr. B! The range query is a good stop-gap measure. (Good for people who want to resume downloads, as well.)

I think the API for reading the first n bytes could be really simple -- just yield the body a chunk at a time as the data streams in. If the block terminates, then the transfer could also terminate. Something like:

agent.get(whatever).stream_body(blocksize=4096) do |chunk|
  puts "YAY I GOT A CHUNK!! (#{chunk.size} bytes)"
  break
end

Would that be possible, given Mechanize's current structure?

@drbrain
Member

drbrain commented May 7, 2012

The greater problem is designing a good API for this. agent.get(uri).stream_body would not work since by the time stream_body is called the body has been downloaded. Changing Mechanize#get to return a promise is too radical.

Changing Mechanize#get to stream if a block is given seems too constricting. Yielding a response object which the user can use to stream like Net::HTTP streaming seems too complicated for Mechanize:

connection.request req do |res|
  res.read_body do |chunk|
    # …
  end
end

So I think a different API entirely would be needed.

The technical problems are minor: I foresee having to deal with content-encoding compression (streaming decompression must be added) and proper shutdown of persistent connections (using this feature may reduce performance).
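For comparison, here is a minimal sketch of early termination with plain Net::HTTP rather than Mechanize. `peek_body` is a hypothetical helper name. The sketch sidesteps decompression by requesting an identity encoding, and it assumes that a non-local `return` is needed to abandon the transfer: Net::HTTP appears to drain the rest of the body when the request block finishes normally, so a plain `break` out of `read_body` would still download everything.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: read about the first `limit` bytes of the body,
# then abandon the rest of the transfer.
def peek_body(uri, limit)
  uri = URI(uri)
  request = Net::HTTP::Get.new(uri)
  # Assumption: skipping compression avoids streaming decompression.
  request['Accept-Encoding'] = 'identity'
  buffer = String.new

  Net::HTTP.start(uri.host, uri.port,
                  use_ssl: uri.scheme == 'https') do |http|
    http.request(request) do |response|
      response.read_body do |chunk|
        buffer << chunk
        # `return` unwinds both blocks; Net::HTTP.start then closes
        # the socket, so no further body data is read.
        return buffer.byteslice(0, limit) if buffer.bytesize >= limit
      end
    end
  end

  # The body was shorter than `limit`; return what arrived.
  buffer.byteslice(0, limit)
end
```

Closing the socket mid-response means the connection cannot be reused, which is the persistent-connection cost mentioned above.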

What is your use case for only retrieving the first kilobyte of data?

@drbig

drbig commented Mar 19, 2013

This may be related to what I'm missing: can I use it to do a GET request without ever touching even a bit of the body? And no, a HEAD request won't do. Also, I want Mechanize to handle all the redirects/authentication/cookies/whatnot. In other words, return me the response as it is at the last point before Mechanize would start leeching the body.

Why:
Some sites require authentication and then serve static files via Akamai/AWS-type backends, which use more or less meaningless IDs and only at the last step of the redirects/auths etc. give you the file name via Content-Disposition. I want to get to that point, check whether I already have the file, and only if I don't, read the body.

@drbig

drbig commented Mar 19, 2013

Plus, I may check size via content-length, or maybe some hash if the backend server provides it.
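The header-first flow described in these two comments can be sketched with plain Net::HTTP, which exposes the response headers before the body is read. `get_if` is a hypothetical helper name, the filename regex is a simplification that ignores the `filename*=` form, and the sketch does not follow redirects or handle authentication and cookies, which is the part the comments want Mechanize to do.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: issue a GET, pass header info to the caller's
# block, and read the body only if the block returns true.
# Returns [info, body_or_nil].
def get_if(uri)
  uri = URI(uri)
  Net::HTTP.start(uri.host, uri.port,
                  use_ssl: uri.scheme == 'https') do |http|
    http.request(Net::HTTP::Get.new(uri)) do |response|
      disposition = response['content-disposition'].to_s
      info = {
        # Simplified parse of: attachment; filename="name.ext"
        filename: disposition[/filename="?([^";]+)"?/i, 1],
        length:   response.content_length
      }
      # Returning here skips the body; the socket is closed on the way out.
      return [info, nil] unless yield(info)
      return [info, response.read_body]
    end
  end
end
```

A call such as `get_if(url) { |info| !File.exist?(info[:filename].to_s) }` would then read the body only when no local file of that name exists.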
