StringIO#ungetbyte stores incorrectly #2436

Closed
Nakilon opened this issue Sep 2, 2021 · 3 comments

Nakilon commented Sep 2, 2021

$ rbenv shell 2.3.8
$ ruby -e "s = StringIO.new; s.ungetbyte(255); p [s.string, s.string.encoding, s.string.bytes, s.getc]"
["\xFF", #<Encoding:UTF-8>, [255], "\xFF"]

$ rbenv shell 3.0.1
$ ruby -rstringio \
       -e "s = StringIO.new; s.ungetbyte(255); p [s.string, s.string.encoding, s.string.bytes, s.getc]"
["\xFF", #<Encoding:UTF-8>, [255], "\xFF"]

$ rbenv shell truffleruby-21.1.0
$ ruby -e "s = StringIO.new; s.ungetbyte(255); p [s.string, s.string.encoding, s.string.bytes, s.getc]"
["ÿ", #<Encoding:UTF-8>, [195, 191], "ÿ"]

This results in several tests failing in my project.


Nakilon commented Sep 2, 2021

And this one:

$ ruby -rstringio -e "Encoding::default_external = 'ASCII-8BIT'; s = StringIO.new; 1.times{ s.ungetbyte 255 }; puts :OK"
OK
$ ruby -rstringio -e "Encoding::default_external = 'ASCII-8BIT'; s = StringIO.new; 2.times{ s.ungetbyte 255 }; puts :OK"
.../truffleruby-21.1.0/lib/truffle/stringio.rb:615:in `ungetbyte': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)
	from -e:1:in `block in <main>'
	from <internal:core> core/integer.rb:148:in `times'
	from -e:1:in `<main>'

aardvark179 self-assigned this Sep 2, 2021
bjfish added the bug label Sep 2, 2021
aardvark179 commented

I see the problem. StringIO#ungetbyte is treating the byte as a character rather than as a raw byte. Since we're commonly working in UTF-8, this single byte is converted into a multibyte UTF-8 sequence and prepended to the string.

To illustrate this, consider the string "\u01A9". This is encoded as the byte sequence 0xC6 0xA9. When a byte is read we get 0xC6, but if we try to unget that byte we prepend 0xC3 0x86 to the string, because that's the UTF-8 encoding of \u00C6.
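
The byte values above can be checked in plain Ruby. This is only a small sketch of the two interpretations (byte as code point versus byte as raw data); the variable names are illustrative and it does not use StringIO or any TruffleRuby internals.

```ruby
# U+01A9 is a two-byte sequence in UTF-8.
str = "\u01A9"
first_byte = str.bytes.first                  # the byte a reader would get first

# Buggy interpretation: treat the byte value as a Unicode code point
# and encode that character in UTF-8.
as_character = first_byte.chr(Encoding::UTF_8)
as_character_bytes = as_character.bytes

# Intended interpretation: keep the value as one raw byte.
as_raw_byte = [first_byte].pack("C")          # binary (ASCII-8BIT) string
as_raw_bytes = as_raw_byte.bytes

puts format("first byte:      0x%02X", first_byte)
puts "as a character:  " + as_character_bytes.map { |b| format("0x%02X", b) }.join(" ")
puts "as a raw byte:   " + as_raw_bytes.map { |b| format("0x%02X", b) }.join(" ")
```

The same arithmetic explains the original report: byte 255 read as code point U+00FF becomes the two bytes 195, 191.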

This particular piece of our library looks like it can be simplified considerably, as ungetbyte will only accept a single number and will mask it to a single byte.
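
As a rough sketch of what "mask it to a single byte and store it raw" could look like, assuming a plain mutable String buffer and an integer read position: `unget_raw_byte` is a hypothetical helper written for illustration, not the code in TruffleRuby's stringio.rb and not the actual fix.

```ruby
# Hypothetical helper (an assumption for illustration only).
# Pushes one raw byte back in front of the read position and
# returns the new position.
def unget_raw_byte(buffer, pos, value)
  byte = value & 0xFF                      # mask the number to a single byte

  if pos.zero?
    # Nothing to overwrite: prepend the raw byte. Work in binary so no
    # character conversion or encoding compatibility check is involved.
    original_encoding = buffer.encoding
    buffer.force_encoding(Encoding::BINARY)
    buffer.prepend([byte].pack("C"))
    buffer.force_encoding(original_encoding)
    0
  else
    # Overwrite the byte just before the read position.
    buffer.setbyte(pos - 1, byte)
    pos - 1
  end
end

buffer = String.new(encoding: Encoding::UTF_8)
pos = 0
2.times { pos = unget_raw_byte(buffer, pos, 255) }
p buffer.bytes      # [255, 255]
p buffer.encoding   # #<Encoding:UTF-8>
```

Working on bytes throughout keeps the string's declared encoding untouched while avoiding both the two-byte expansion and the CompatibilityError from the second report.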

eregon added this to the 21.3.0 milestone Sep 6, 2021

eregon commented Oct 13, 2021

This was fixed by @aardvark179 in a52bd0f

eregon closed this as completed Oct 13, 2021