Tracking issue for Read::chars #27802
It would be nice if |
If we want to make let mut chars = reader.chars();
loop {
for c in chars {
// ...
}
if chars.was_valid_utf8() { break; }
println!("Encountered invalid byte sequence.");
} Or it could provide a more informative message, similar to the current CharsError. On the other hand, it is not so difficult to treat the |
Nominating for 1.6 discussion |
🔔 This issue is now entering its cycle-long final comment period for deprecation 🔔 These functions seem like excellent candidates to move out-of-tree into an |
Perhaps a |
I’m in favor of stabilizing It’s unstable because
(The same would apply to This behavior should be per Unicode Standard §5.22 "Best Practice for U+FFFD Substitution" http://www.unicode.org/versions/Unicode8.0.0/ch05.pdf#G40630 Roughly, that means stopping at the first unexpected byte. This is not the behavior currently implemented, which reads as many bytes as indicated by the first byte and then checks them. This is a problem as, with only Here are some failing tests. let mut buf = Cursor::new(&b"\xf0\x9fabc"[..]);
let mut chars = buf.chars();
assert!(match chars.next() { Some(Err(CharsError::NotUtf8)) => true, _ => false });
assert_eq!(chars.next().unwrap().unwrap(), 'a');
assert_eq!(chars.next().unwrap().unwrap(), 'b');
assert_eq!(chars.next().unwrap().unwrap(), 'c');
assert!(chars.next().is_none());
let mut buf = Cursor::new(&b"\xed\xa0\x80a"[..]);
let mut chars = buf.chars();
assert!(match chars.next() { Some(Err(CharsError::NotUtf8)) => true, _ => false });
assert_eq!(chars.next().unwrap().unwrap(), 'a');
assert!(chars.next().is_none());
let mut buf = Cursor::new(&b"\xed\xa0a"[..]);
let mut chars = buf.chars();
assert!(match chars.next() { Some(Err(CharsError::NotUtf8)) => true, _ => false });
assert_eq!(chars.next().unwrap().unwrap(), 'a');
assert!(chars.next().is_none()); I’ve looked at fixing this, but it basically involves duplicating all of the UTF-8 decoding logic from |
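For comparison, the standard library's `str::from_utf8` reports errors in the "stop at the first unexpected byte" style described above: `Utf8Error::valid_up_to` and `Utf8Error::error_len` delimit the invalid prefix, and `String::from_utf8_lossy` substitutes a single U+FFFD for it. A minimal sketch on the first input from the tests; the helper name `first_error` is made up for illustration and is not part of any API discussed here:

```rust
use std::str;

/// Returns `(valid_up_to, error_len)` for the first decoding error, if any.
/// `error_len` is `None` when the input ends in the middle of a sequence.
fn first_error(input: &[u8]) -> Option<(usize, Option<usize>)> {
    match str::from_utf8(input) {
        Ok(_) => None,
        Err(e) => Some((e.valid_up_to(), e.error_len())),
    }
}

fn main() {
    let input = b"\xf0\x9fabc";
    // 0xF0 0x9F begins a four-byte sequence, but `a` is not a continuation
    // byte, so the error covers exactly two bytes and decoding resumes at `a`.
    assert_eq!(first_error(input), Some((0, Some(2))));
    // The lossy conversion replaces those two bytes with one U+FFFD.
    assert_eq!(String::from_utf8_lossy(input), "\u{FFFD}abc");
}
```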
Moving |
While I agree that |
How does |
@SimonSapin there were some perf numbers reading from a file with |
I think |
I may be doing something hugely inefficient here, but I really needed read.chars for a crate I was working on that attempted to produce an iterator of rustc serialize json items from a read stream. I had to fall back on vendoring most of the source around read.chars as a workaround to get it working on stable a while back. It was almost a requirement, as the rustc builder for json requires an iterable of chars. It would be really nice to have this stabilized, even if in some form of method name that implied lossiness. |
I would expect it to perform much better than on a raw @cybergeek94 Yeah I suspect the performance isn't abysmal, but when compared to Yeah that's actually a case where I believe |
@alexcrichton |
In this case it's literally a stream of JSON I'm working with. My use case was interfacing with Docker's streaming JSON endpoints, where JSON objects are pushed through a streamed response. I'm not sure how I'd accomplish that with a string. |
@alexcrichton Handling the boundary is tricky. It’s handled in https://github.com/SimonSapin/rust-utf8/blob/master/lib.rs |
@gyscos hm yes, I guess it would! That would definitely mean that @softprops in theory you can transform an iterator of |
So a solution is to move |
Yeah I have a feeling that would be sufficient, there's more fine-grained error information we can give (such as the bytes that were read, if any), but iterators like It does raise a question though that if |
I think the reasoning to move |
The performance issue being alleviated is only a benevolent side-effect. |
Hm I don't think that this is a correctness problem that can be solved by "just moving to |
@SimonSapin was saying that |
Does this sound correct? The issue could be summarised as: how to handle reading chars over a datastream, specifically how to handle both incomplete utf8 byte sequences and incorrect bytes (w.r.t. utf8). It's hard because when you encounter a sequence of bytes that is invalid so far but may become valid with more data, it could either be an error or just mean that you need to read more bytes (it's ambiguous). If you know it's incomplete, you may want to call Read and try again with the incomplete part prepended, but if it's incomplete because something has errored, you want to return the error with the offending bytes. It seems the consensus is that for both erroneous and incomplete bytes, you return a struct that contains an enum variant saying whether it is an error or a possibly incomplete set of bytes, along with the bytes. It's then the responsibility of a higher-level iterator to decide how to handle these cases (as not all use cases will want to handle them the same way). |
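The "invalid versus possibly incomplete" distinction summarised above can be illustrated with the standard library: `Utf8Error::error_len` returns `None` when the input ends in the middle of a sequence that more data might still complete. A rough sketch; the `Chunk` enum and the `classify` function are names invented here, not an API proposed in this thread:

```rust
use std::str;

/// Hypothetical result type: what a single buffer of bytes decodes to.
#[derive(Debug, PartialEq)]
enum Chunk<'a> {
    /// The whole input is valid UTF-8.
    Valid(&'a str),
    /// A valid prefix, the offending bytes, and the input left after them.
    Invalid { valid: &'a str, bad: &'a [u8], rest: &'a [u8] },
    /// A valid prefix followed by the start of a sequence that needs more bytes.
    Incomplete { valid: &'a str, pending: &'a [u8] },
}

fn classify<'a>(input: &'a [u8]) -> Chunk<'a> {
    match str::from_utf8(input) {
        Ok(s) => Chunk::Valid(s),
        Err(e) => {
            let (valid, after) = input.split_at(e.valid_up_to());
            // `from_utf8` has already validated this prefix.
            let valid = str::from_utf8(valid).expect("prefix was validated");
            match e.error_len() {
                Some(n) => Chunk::Invalid { valid, bad: &after[..n], rest: &after[n..] },
                None => Chunk::Incomplete { valid, pending: after },
            }
        }
    }
}

fn main() {
    // 0xE2 0x82 is the start of a three-byte sequence: more data may fix it.
    println!("{:?}", classify(b"abc\xe2\x82"));
    // 0xFF can never appear in UTF-8: no amount of extra data helps.
    println!("{:?}", classify(b"abc\xffdef"));
}
```

A caller reading from a stream would keep the `pending` bytes and prepend them to the next read, and would treat them as an error only at end of input.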
Previously, skim relied on nightly Rust for `io::chars`. Now use the utf8parse crate instead. Check rust-lang/rust#27802 (comment)
Use private char iterator as done in kkawakam/rustyline#38 while waiting for stabilisation of the chars method per rust-lang/rust#27802. This removes the need for `#[feature(io)]`, letting skim compile on stable Rust.
TL;DR: I think it is very hard to come up with an abstraction that: is zero-cost, covers all use cases, and is not terrible to use. I’m in favor of deprecating and eventually removing this with no in- I think that anything that looks at one I spent some time thinking of a low-level API that would make no assumptions about how one would want to use it ("pushing" vs "pulling" bytes and string slices, buffer allocation strategy, error handling, etc.) I came up with this: pub fn decode_utf8(input: &[u8]) -> DecodeResult { /* ... */ }
pub enum DecodeResult<'a> {
Ok(&'a str),
/// These three slices cover all of the original input.
/// `decode_utf8` should be called again with the third one as the new input.
Error(&'a str, InvalidSequence<'a>, &'a [u8]),
Incomplete(&'a str, IncompleteChar),
}
pub struct InvalidSequence<'a>(pub &'a [u8]);
pub struct IncompleteChar {
// Fields are private. They include a [u8; 4] buffer.
}
impl IncompleteChar {
pub fn try_complete<'char, 'input>(&'char mut self, mut input: &'input [u8])
-> TryCompleteResult<'char, 'input> { /* ... */ }
}
pub enum TryCompleteResult<'char, 'input> {
Ok(&'char str, &'input [u8]), // str.chars().count() == 1
Error(InvalidSequence<'char>, &'input [u8]),
StillIncomplete,
} It’s complicated. It requires the user to think about a lot of corner cases, especially around We can hide some of the details with a stateful decoder: pub struct Decoder { /* Private. Also with a [u8; 4] buffer. */ }
impl Decoder {
pub fn new() -> Self;
pub fn decode<'decoder, 'input>(&'decoder mut self, input: &'input [u8])
-> DecoderResult<'decoder, 'input>;
/// Signal that there is no more input.
/// The decoder might contain a partial `char` which becomes an error.
pub fn end<'decoder>(&self) -> Result<(), InvalidSequence<'decoder>>;
}
/// Order of fields indicates order in the input
pub struct DecoderResult<'decoder, 'input> {
/// str in the `Ok` case is either empty or one `char` (up to 4 bytes)
pub partially_from_previous_input_chunk: Result<&'decoder str, InvalidSequence<'decoder>>,
/// Up to the first error, if any
pub decoded: &'input str,
/// Whether we did find an error
pub error: Result<(), InvalidSequence<'input>>,
/// Call `decoder.decode()` again with this, if non-empty
pub remaining_input_after_error: &'input [u8],
}
/// Never more than 3 bytes.
pub struct InvalidSequence<'a>(pub &'a [u8]); Even so, it’s very easy to misuse, for example by ignoring part of Either of these is complicated enough that I don’t think it belongs in |
I would support deprecating |
Another attempt turned out almost nice: pub struct Decoder {
buffer: [u8; 4],
/* ... */
}
impl Decoder {
pub fn new() -> Self { /* ... */ }
pub fn next_chunk<'a>(&'a mut self, input_chunk: &'a [u8]) -> DecoderIter<'a> { /* ... */ }
pub fn last_chunk<'a>(&'a mut self, input_chunk: &'a [u8]) -> DecoderIter<'a> { /* ... */ }
}
pub struct DecoderIter<'a> {
decoder: &'a mut Decoder,
/* ... */
}
impl<'a> Iterator for DecoderIter<'a> {
type Item = Result<&'a str, &'a [u8]>;
} Except it doesn’t work. impl<'a> DecoderIter<'a> {
pub fn next(&mut self) -> Option<Result<&str, &[u8]>> { /* ... */ }
} let mut iter = decoder.next_chunk(input);
while let Some(result) = iter.next() {
// ...
} This compiles, but something like We can work around that by adding enough lifetime parameters and one weird enum… but yeah, no. pub struct Decoder { /* ... */ }
impl Decoder {
pub fn new() -> Self { /* ... */ }
pub fn next_chunk<'decoder, 'input>(&'decoder mut self, input_chunk: &'input [u8])
-> DecoderIter<'decoder, 'input> { /* ... */ }
pub fn last_chunk<'decoder, 'input>(&'decoder mut self, input_chunk: &'input [u8])
-> DecoderIter<'decoder, 'input> { /* ... */ }
}
pub struct DecoderIter<'decoder, 'input> { /* ... */ }
impl<'decoder, 'input> DecoderIter<'decoder, 'input> {
pub fn next<'buffer>(&'buffer mut self)
-> Option<Result<EitherLifetime<'buffer, 'input, str>,
EitherLifetime<'buffer, 'input, [u8]>>> { /* ... */ }
}
pub enum EitherLifetime<'buffer, 'input, T: ?Sized + 'static> {
Buffer(&'buffer T),
Input(&'input T),
}
impl<'buffer, 'input, T: ?Sized> EitherLifetime<'buffer, 'input, T> {
pub fn get<'a>(&self) -> &'a T where 'buffer: 'a, 'input: 'a {
match *self {
EitherLifetime::Input(x) => x,
EitherLifetime::Buffer(x) => x,
}
}
} |
Can you elaborate? I don't follow here. |
Perhaps it’s clearer with code. This does not compile: https://gist.github.com/anonymous/0587b4484ec9a15f5c5ce6908b3807c1, unless you change |
I tend to agree that this should be removed from |
hsivonen/encoding_rs#8 has some discussion of Unicode stream and decoders for not-only-UTF-8 encodings. |
The libs team discussed this and consensus was to deprecate the @rfcbot fcp close Code that does not care about processing data incrementally can use Of course the above is for data that is always UTF-8. If other character encodings need to be supported, consider using the |
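A minimal sketch of the non-incremental approach, assuming the whole input fits in memory and using `Read::read_to_string` as one standard-library way to do it. The function name `count_chars` is hypothetical:

```rust
use std::io::{self, Read};

/// Reads everything from `reader`, then iterates over the decoded chars.
/// Invalid UTF-8 surfaces as an `io::Error` of kind `InvalidData`.
fn count_chars<R: Read>(mut reader: R) -> io::Result<usize> {
    let mut text = String::new();
    reader.read_to_string(&mut text)?;
    Ok(text.chars().count())
}

fn main() -> io::Result<()> {
    // `&[u8]` implements `Read`, so a byte slice stands in for a file or socket.
    let n = count_chars("héllo".as_bytes())?;
    println!("{} chars", n);
    Ok(())
}
```

This trades memory for simplicity: nothing is available until the reader reaches end of input, so it does not suit an unbounded stream.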
Team member @SimonSapin has proposed to close this. The next step is review by the rest of the tagged teams: No concerns currently listed. Once a majority of reviewers approve (and none object), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up! See this document for info about what commands tagged team members can give me. |
🔔 This is now entering its final comment period, as per the review above. 🔔 |
The final comment period is now complete. |
Deprecate Read::chars and char::decode_utf8
Per FCP:
* rust-lang#27802 (comment)
* rust-lang#33906 (comment)
Deprecated in #49970 |
This is a tracking issue for the deprecated std::io::Read::chars API.