diff --git a/text/0000-read-all.md b/text/0000-read-all.md index e3686756335..c242535a275 100644 --- a/text/0000-read-all.md +++ b/text/0000-read-all.md @@ -1,58 +1,74 @@ -- Feature Name: read_exact and read_full +- Feature Name: read_exact and ErrorKind::UnexpectedEOF - Start Date: 2015-03-15 - RFC PR: (leave this empty) - Rust Issue: (leave this empty) # Summary -Rust's `Write` trait has `write_all`, which is a convenience method that calls -`write` repeatedly to write an entire buffer. This proposal adds two similar -convenience methods to the `Read` trait: `read_full` and `read_exact`. -`read_full` calls `read` repeatedly until the buffer has been filled, EOF has -been reached, or an error other than `Interrupted` occurs. `read_exact` is -similar to `read_full`, except that reaching EOF before filling the buffer is -considered an error. +Rust's `Write` trait has the `write_all` method, which is a convenience +method that writes a whole buffer, failing with `ErrorKind::WriteZero` +if the buffer cannot be written in full. + +This RFC proposes adding its `Read` counterpart: a method (here called +`read_exact`) that reads a whole buffer, failing with an error (here +called `ErrorKind::UnexpectedEOF`) if the buffer cannot be read in full. # Motivation -The `read` method may return fewer bytes than requested, and may fail with an -`Interrupted` error if a signal is received during the call. This requires -programs wishing to fill a buffer to call `read` repeatedly in a loop. This is -a very common need, and it would be nice if this functionality were provided in -the standard library. Many C and Rust programs have the same need, and solve it -in the same way. For example, Git has [`read_in_full`][git], which behaves like -the proposed `read_full`, and the Rust byteorder crate has -[`read_full`][byteorder], which behaves like the proposed `read_exact`. -[git]: https://github.com/git/git/blob/16da57c7c6c1fe92b32645202dd19657a89dd67d/wrapper.c#L246 -[byteorder]: https://github.com/BurntSushi/byteorder/blob/2358ace61332e59f596c9006e1344c97295fdf72/src/new.rs#L184 +When dealing with serialization formats with fixed-length fields, +reading or writing less than the field's size is an error. For the +`Write` side, the `write_all` method does the job; for the `Read` side, +however, one has to call `read` in a loop until the buffer is completely +filled, or until a premature EOF is reached. + +This leads to a profusion of similar helper functions. For instance, the +`byteorder` crate has a `read_full` function, and the `postgres` crate +has a `read_all` function. However, their handling of the premature EOF +condition differs: the `byteorder` crate has its own `Error` enum, with +`UnexpectedEOF` and `Io` variants, while the `postgres` crate uses an +`io::Error` with an `io::ErrorKind::Other`. + +That can make it unnecessarily hard to mix uses of these helper +functions; for instance, if one wants to read a 20-byte tag (using one's +own helper function) followed by a big-endian integer, either the helper +function has to be written to use `byteorder::Error`, or the calling +code has to deal with two different ways to represent a premature EOF, +depending on which field encountered the EOF condition. + +Additionally, when reading from an in-memory buffer, looping is not +necessary; it can be replaced by a size comparison followed by a +`copy_memory` (similar to `write_all` for `&mut [u8]`). If this +non-looping implementation is `#[inline]`, and the buffer size is known +(for instance, it's a fixed-size buffer in the stack, or there was an +earlier check of the buffer size against a larger value), the compiler +could potentially turn a read from the buffer followed by an endianness +conversion into the native endianness (as can happen when using the +`byteorder` crate) into a single-instruction direct load from the buffer +into a register. # Detailed design -The following methods will be added to the `Read` trait: +First, a new variant `UnexpectedEOF` is added to the `io::ErrorKind` enum. + +The following method is added to the `Read` trait: ``` rust -fn read_full(&mut self, buf: &mut [u8]) -> Result; fn read_exact(&mut self, buf: &mut [u8]) -> Result<()>; ``` -Additionally, default implementations of these methods will be provided: +Aditionally, a default implementation of this method is provided: ``` rust -fn read_full(&mut self, mut buf: &mut [u8]) -> Result { - let mut read = 0; - while buf.len() > 0 { +fn read_exact(&mut self, mut buf: &mut [u8]) -> Result<()> { + while !buf.is_empty() { match self.read(buf) { Ok(0) => break, - Ok(n) => { read += n; let tmp = buf; buf = &mut tmp[n..]; } + Ok(n) => { let tmp = buf; buf = &mut tmp[n..]; } Err(ref e) if e.kind() == ErrorKind::Interrupted => {} Err(e) => return Err(e), } } - Ok(read) -} - -fn read_exact(&mut self, buf: &mut [u8]) -> Result<()> { - if try!(self.read_full(buf)) != buf.len() { + if !buf.is_empty() { Err(Error::new(ErrorKind::UnexpectedEOF, "failed to fill whole buffer")) } else { Ok(()) @@ -60,40 +76,186 @@ fn read_exact(&mut self, buf: &mut [u8]) -> Result<()> { } ``` -Finally, a new `ErrorKind::UnexpectedEOF` will be introduced, which will be -returned by `read_exact` in the event of a premature EOF. +And an optimized implementation of this method for `&[u8]` is provided: + +```rust +#[inline] +fn read_exact(&mut self, buf: &mut [u8]) -> Result<()> { + if (buf.len() > self.len()) { + return Err(Error::new(ErrorKind::UnexpectedEOF, "failed to fill whole buffer")); + } + let (a, b) = self.split_at(buf.len()); + slice::bytes::copy_memory(a, buf); + *self = b; + Ok(()) +} +``` + +The detailed semantics of `read_exact` are as follows: `read_exact` +reads exactly the number of bytes needed to completely fill its `buf` +parameter. If that's not possible due to an "end of file" condition +(that is, the `read` method would return 0 even when passed a buffer +with at least one byte), it returns an `ErrorKind::UnexpectedEOF` error. + +On success, the read pointer is advanced by the number of bytes read, as +if the `read` method had been called repeatedly to fill the buffer. On +any failure (including an `ErrorKind::UnexpectedEOF`), the read pointer +might have been advanced by any number between zero and the number of +bytes requested (inclusive), and the contents of its `buf` parameter +should be treated as garbage (any part of it might or might not have +been overwritten by unspecified data). + +Even if the failure was an `ErrorKind::UnexpectedEOF`, the read pointer +might have been advanced by a number of bytes less than the number of +bytes which could be read before reaching an "end of file" condition. + +The `read_exact` method will never return an `ErrorKind::Interrupted` +error, similar to the `read_to_end` method. + +Similar to the `read` method, no guarantees are provided about the +contents of `buf` when this function is called; implementations cannot +rely on any property of the contents of `buf` being true. It is +recommended that implementations only write data to `buf` instead of +reading its contents. + +# About ErrorKind::Interrupted + +Whether or not `read_exact` can return an `ErrorKind::Interrupted` error +is orthogonal to its semantics. One could imagine an alternative design +where `read_exact` could return an `ErrorKind::Interrupted` error. + +The reason `read_exact` should deal with `ErrorKind::Interrupted` itself +is its non-idempotence. On failure, it might have already partially +advanced its read pointer an unknown number of bytes, which means it +can't be easily retried after an `ErrorKind::Interrupted` error. + +One could argue that it could return an `ErrorKind::Interrupted` error +if it's interrupted before the read pointer is advanced. But that +introduces a non-orthogonality in the design, where it might either +return or retry depending on whether it was interrupted at the beginning +or in the middle. Therefore, the cleanest semantics is to always retry. + +There's precedent for this choice in the `read_to_end` method. Users who +need finer control should use the `read` method directly. + +# About the read pointer + +This RFC proposes a `read_exact` function where the read pointer +(conceptually, what would be returned by `Seek::seek` if the stream was +seekable) is unspecified on failure: it might not have advanced at all, +have advanced in full, or advanced partially. + +Two possible alternatives could be considered: never advance the read +pointer on failure, or always advance the read pointer to the "point of +error" (in the case of `ErrorKind::UnexpectedEOF`, to the end of the +stream). + +Never advancing the read pointer on failure would make it impossible to +have a default implementation (which calls `read` in a loop), unless the +stream was seekable. It would also impose extra costs (like creating a +temporary buffer) to allow "seeking back" for non-seekable streams. + +Always advancing the read pointer to the end on failure is possible; it +happens without any extra code in the default implementation. However, +it can introduce extra costs in optimized implementations. For instance, +the implementation given above for `&[u8]` would need a few more +instructions in the error case. Some implementations (for instance, +reading from a compressed stream) might have a larger extra cost. + +The utility of always advancing the read pointer to the end is +questionable; for non-seekable streams, there's not much that can be +done on an "end of file" condition, so most users would discard the +stream in both an "end of file" and an `ErrorKind::UnexpectedEOF` +situation. For seekable streams, it's easy to seek back, but most users +would treat an `ErrorKind::UnexpectedEOF` as a "corrupted file" and +discard the stream anyways. + +Users who need finer control should use the `read` method directly, or +when available use the `Seek` trait. + +# Naming + +It's unfortunate that `write_all` used `WriteZero` for its `ErrorKind`; +were it named `UnexpectedEOF` (which is a much more intuitive name), the +same `ErrorKind` could be used for both functions. + +The initial proposal for this `read_exact` method called it `read_all`, +for symmetry with `write_all`. However, that name could also be +interpreted as "read as many bytes as you can that fit on this buffer, +and return what you could read" instead of "read enough bytes to fill +this buffer, and fail if you couldn't read them all". The previous +discussion led to `read_exact` for the later meaning, and `read_full` +for the former meaning. # Drawbacks -Like `write_all`, these APIs are lossy: in the event of an error, there is no -way to determine the number of bytes that were successfully read before the -error. However, doing so would complicate the methods, and the caller will want -to simply fail if an error occurs the vast majority of the time. Situations -that require lower level control can still use `read` directly. +If this method fails, the buffer contents are undefined; the +`read_exact' method might have partially overwritten it. If the caller +requires "all-or-nothing" semantics, it must clone the buffer. In most +use cases, this is not a problem; the caller will discard or overwrite +the buffer in case of failure. -# Unanswered Questions +In the same way, if this method fails, there is no way to determine how +many bytes were read before it determined it couldn't completely fill +the buffer. -Naming. Is `read_full` the best name? Should `UnexpectedEOF` instead be -`ShortRead` or `ReadZero`? +Situations that require lower level control can still use `read` +directly. # Alternatives -Use a more complicated return type to allow callers to retrieve the number of -bytes successfully read before an error occurred. As explained above, this -would complicate the use of these methods for very little gain. It's worth -noting that git's `read_in_full` is similarly lossy, and just returns an error -even if some bytes have been read. - -Only provide `read_exact`, but parameterize the `UnexpectedEOF` or `ShortRead` -error kind with the number of bytes read to allow it to be used in place of -`read_full`. This would be less convenient to use in cases where EOF is not an -error. - -Only provide `read_full`. This would cover most of the convenience (callers -could avoid the read loop), but callers requiring a filled buffer would have to -manually check if all of the desired bytes were read. - -Finally, we could leave this out, and let every Rust user needing this -functionality continue to write their own `read_full` or `read_exact` function, -or have to track down an external crate just for one straightforward and -commonly used convenience method. +The first alternative is to do nothing. Every Rust user needing this +functionality continues to write their own read_full or read_exact +function, or have to track down an external crate just for one +straightforward and commonly used convenience method. Additionally, +unless everybody uses the same external crate, every reimplementation of +this method will have slightly different error handling, complicating +mixing users of multiple copies of this convenience method. + +The second alternative is to just add the `ErrorKind::UnexpectedEOF` or +similar. This would lead in the long run to everybody using the same +error handling for their version of this convenience method, simplifying +mixing their uses. However, it's questionable to add an `ErrorKind` +variant which is never used by the standard library. + +Another alternative is to return the number of bytes read in the error +case. That makes the buffer contents defined also in the error case, at +the cost of increasing the size of the frequently-used `io::Error` +struct, for a rarely used return value. My objections to this +alternative are: + +* If the caller has an use for the partially written buffer contents, + then it's treating the "buffer partially filled" case as an + alternative success case, not as a failure case. This is not a good + match for the semantics of an `Err` return. +* Determining that the buffer cannot be completely filled can in some + cases be much faster than doing a partial copy. Many callers are not + going to be interested in an incomplete read, meaning that all the + work of filling the buffer is wasted. +* As mentioned, it increases the size of a commonly used type in all + cases, even when the code has no mention of `read_exact`. + +The final alternative is `read_full`, which returns the number of bytes +read (`Result`) instead of failing. This means that every caller +has to check the return value against the size of the passed buffer, and +some are going to forget (or misimplement) the check. It also prevents +some optimizations (like the early return in case there will never be +enough data). There are, however, valid use cases for this alternative; +for instance, reading a file in fixed-size chunks, where the last chunk +(and only the last chunk) can be shorter. I believe this should be +discussed as a separate proposal; its pros and cons are distinct enough +from this proposal to merit its own arguments. + +I believe that the case for `read_full` is weaker than `read_exact`, for +the following reasons: + +* While `read_exact` needs an extra variant in `ErrorKind`, `read_full` + has no new error cases. This means that implementing it yourself is + easy, and multiple implementations have no drawbacks other than code + duplication. +* While `read_exact` can be optimized with an early return in cases + where the reader knows its total size (for instance, reading from a + compressed file where the uncompressed size was given in a header), + `read_full` has to always write to the output buffer, so there's not + much to gain over a generic looping implementation calling `read`. +