-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Buffer encoding #19
Buffer encoding #19
Conversation
…ou'll need binary for payload
…force binary here
@@ -29,14 +29,16 @@ def initialize(ios) | |||
# | |||
# @return [String, nil] | |||
def read(length = nil, outbuf = nil) | |||
outbuf = outbuf.to_s.replace("") | |||
outbuf = outbuf.to_s.clear |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though this is likely a bug in the previous version as well, #to_s
may make a new object, at which point the purpose of passing outbuf
is defeated.
Perhaps give outbuf a default value (e.g. outbuf = "".b
instead of outbuf = nil
as the argument definition), and raise if it's not a String
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#to_s will not create a new object if the object is a string. But the replace call might change its encoding. That was the reasoning behind it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hence "may make a new object". We never want to make a new object for the buffer... that defeats the point.
How about raising if the supplied buffer is not a String
type with Encoding::BINARY
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We never want to make a new object for the buffer... that defeats the point.
ahh, I see what you mean.
How about raising if the supplied buffer is ...
I'm still not sure. The encoding requirement is usually not mandatory, and the buffer passed may be a specialized buffer. Besides "fixing" jruby, I'm not even sure that forcing binary encoding there is the right thing to do.
By the way, what's the reasoning behind allowing length = nil
? The length argument is usually required for IO#read
and File#read
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the buffer passed may be a specialized buffer
In what case would it make sense for it to be anything but a String
? Ultimately it needs to thunk to a Ruby I/O function which also accepts a preallocated buffer, and those all need to be a String
unless JRuby has some fancy behavior I don't know about where it lets you pass a Java ByteBuffer or something.
Besides "fixing" jruby, I'm not even sure that forcing binary encoding there is the right thing to do.
For these types of buffers, I think mandating Encoding::BINARY
is the only right option. They should really be a ByteArray
/ByteBuffer
type, but Ruby doesn't have those, so we need the next best thing, which is a String
with Encoding::BINARY
.
For other encodings there are all sorts of pathological performance implications, and the contract at this point in the code is to handle arbitrary binary data that may not be whatever-encoding-was-passed-in.
Given this is a rarely used power-user feature, I think it should have some pretty exacting semantics, which are:
- Avoid all allocations. The whole point of this API is to avoid allocations. This means no
#to_s
or anything else that might potentially allocate memory String
is probably the only argument type that makes senseEncoding::BINARY
(i.e.Encoding::ASCII_8BIT
) is probably the only sensible encoding for the bufferString
. If you think other encodings should be supported, do you really think the answer to the question "Willoutbuf
always be a valid string under encoding X?" is yes?
By the way, what's the reasoning behind allowing length = nil?
No idea, like the outbuf
being nil
I think that probably doesn't make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with this change, I'll try to answer the questions below.
@janko-m Please, chip in. |
I'll try to make some clarifications on the implementation and behaviour.
The reason we're creating a new object is because when we're reading the chunks of internal current_io.read(length, @buffer) Since Ruby's
Ruby's
Good call, [1] pry(main)> "".b.replace("").encoding
=> #<Encoding:UTF-8>
Both |
@janko-m thank you for the clarification. I had the impression that this was to follow Besides the buffer issue, anything else you'd like to add to the PR? I'd consider it ready to go otherwise. |
Gemfile
Outdated
@@ -18,7 +18,7 @@ group :test do | |||
end | |||
|
|||
group :doc do | |||
gem "redcarpet" | |||
gem "redcarpet", platform: :mri |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Travis CI is blowing up on this line, because Rubocop for the project is requiring hash rockets. I am guessing that's a personal preference for the project, as I prefer Ruby 1.9+ colon syntax...
Can we get this CI issue resolved? I think my usage of the gem may depend on this change as well. |
@nightsurge there's an open upstream issue re: keyword arguments httprb/http#447 |
@tarcieri as far as I thought kwargs issue was about changing options hashes to kwargs. But not about changing hash-rocket usage to new hash style. :)) I'm still hard follower of hash rockets stile as it keeps code much more consistent :D |
Will review/merge this pr asap |
@HoneyryderChuck I have some oddities in my code: 10: def initialize(ios)
11: @index = 0
12: @buffer = String.new
=> 13: binding.pry
14: @ios = ios.map do |io|
15: if io.is_a?(String)
16: StringIO.new(io)
17: elsif io.respond_to?(:read)
18: io
19: else
20: raise ArgumentError,
21: "#{io.inspect} is neither a String nor an IO object"
22: end
23: end
24: end
[1] pry(#<HTTP::FormData::CompositeIO>)> @buffer.encoding
=> #<Encoding:ASCII-8BIT>
[2] pry(#<HTTP::FormData::CompositeIO>)> String.new.encoding
=> #<Encoding:ASCII-8BIT> @buffer/String.new appears to be ASCII-8BIT, which is correct. Also, when it gets to the io handling, I have to add a force_encoding(Encoding::BINARY) to both the outbuf and @buffer. I'm just doing standard image uploads of various types (png/jpg). def read(length = nil, outbuf = nil)
outbuf = outbuf.to_s.replace("")
outbuf.force_encoding(Encoding::BINARY)
while current_io
current_io.read(length, @buffer)
outbuf << @buffer.force_encoding(Encoding::BINARY) I'm using JRuby 9.1.15.0 |
I feel that this might be a jruby bug. You shouldn't have to force-encode between two binary string, or I don't see the edge case where you should. |
@HoneyryderChuck agreed. I wonder if I can setup a test script using io.read and a sample file to reproduce the issue. Is it detrimental to performance to use both force_encodings? I hate to have to monkeypatch the gem to keep it working for our enterprise app. But then again it is silly to ask a gem to sacrifice potential performance for a JRuby bug. Both scenarios are potentially just temporary. |
@nightsurge I'd say that you could easily turn it into a reproducible script (in theory, at least), as you're just reading chunks into a buffer. I'd suggest you to present this in the jruby issue tracker, and they're usually quite responsive folks. |
Ruby strings can sometimes change encodings depending on what you append to them. This is not JRuby-specific, it's the same on MRI. require "securerandom"
string = ""
string.encoding # => #<Encoding:UTF-8>
string << SecureRandom.random_bytes
string # => ",\xCA\xBA\xEE<E\xB0\x12\xB0+\xD8\x1Ebh}\e"
string.encoding # => #<Encoding:ASCII-8BIT> We can see that when we appended a binary string to an UTF-8 encoded string, its encoding automatically changed to binary as well.
I would be surprised if it is, because If the goal is to have all string returned in binary encoding, which if you ask me is definitely a safe option, we should add |
That sounds safest. I was definitely seeing the same behavior in the script
I wrote for both MRI/C and Jruby.
…On Jan 21, 2018 10:33 AM, "Janko Marohnić" ***@***.***> wrote:
I have to add a force_encoding(Encoding::BINARY) to both the outbuf and
@buffer <https://github.com/buffer>
I feel that this might be a jruby bug. You shouldn't have to force-encode
between two binary string, or I don't see the edge case where you should.
Ruby strings can sometimes change encodings depending on what you append
to them. This is not JRuby-specific, it's the same on MRI.
require "securerandom"
string = ""
string.encoding # => #<Encoding:UTF-8>
string << SecureRandom.random_bytes
string # => ",\xCA\xBA\xEE<E\xB0\x12\xB0+\xD8\x1Ebh}\e"
string.encoding # => #<Encoding:ASCII-8BIT>
We can see that when we appended a binary string to an UTF-8 encoded
string, its encoding automatically changed to binary as well.
Is it detrimental to performance to use both force_encodings?
I would be surprised if it is, because force_encoding doesn't change the
bytes, it only changes the instructions how those bytes should be
interpreted (the encoding). If the goal is to have all string returned in
binary encoding, which if you ask me is definitely a safe option, we should
add .force_encoding(Encoding::BINARY) calls whenever we're appending to
outbuf.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#19 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AC7IGi1XQIqJqtaSVPZwEhP8ag_NFS9Gks5tM1jcgaJpZM4RFWXI>
.
|
@janko-m the discussed case (if I remember correctly) was where the binary chunk was being appended to the binary I could also just easily employ the fix and add the |
Oh, that definitely seems like a bug then. But we'd need the
We know that I would like to see this merged, because that fact that currently |
@HoneyryderChuck can you please add specs (if it's feasible)? |
@ixti I'll try to come up with smth, + the force_encoding call. But can we get a reproducible script to present the bug case to the jruby guys? @nightsurge any inputs? |
@HoneyryderChuck @ixti @janko-m I already posted to JRuby and they confirmed that since it was also a bug with MRI/C-Ruby, there was nothing they planned to do at this point and closed the issue: (script is in that issue post) |
@ixti decided not to write any test, as composite_io module tests are following a pattern I find hard to add other ios into. Also, I wouldn't know how to test the specific case of the bytesize call. If you think this is important, I'd suggest you to change one of the ios strings to a different encoding, although that's up to you. @nightsurge did you file a bug report in ruby tracker? I'd like to follow that. |
@HoneyryderChuck That's ok, since I wrote that bit I can add the tests after the PR is merged. |
@HoneyryderChuck Thank you! |
@janko-m Oh, right!. Seems like so. Will handle this today! Thanks for the reminder. |
@janko-m v2.1.0 is here :D |
@ixti Awesome, thank you! |
## 2.1.0 (2018-03-05) * [#21](httprb/form_data#21) Rewind content at the end of `Readable#to_s`. [@janko-m][] * [#19](httprb/form_data#19) Fix buffer encoding. [@HoneyryderChuck][] ## 2.0.0 (2017-10-01) * [#17](httprb/form_data#17) Add CRLF character to end of multipart body. [@mhickman][] ## 2.0.0.pre2 (2017-05-11) * [#14](httprb/form_data#14) Enable streaming for urlencoded form data. [@janko-m][] ## 2.0.0.pre1 (2017-05-10) * [#12](httprb/form_data#12) Enable form data streaming. [@janko-m][]
This should fix #18