Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

str::not_utf8 condition is a sign that the from_bytes API needs massaging #8968

Closed
pnkfelix opened this issue Sep 4, 2013 · 12 comments · Fixed by #12039
Closed

str::not_utf8 condition is a sign that the from_bytes API needs massaging #8968

pnkfelix opened this issue Sep 4, 2013 · 12 comments · Fixed by #12039

Comments

@pnkfelix
Copy link
Member

pnkfelix commented Sep 4, 2013

The current code for str::not_utf8 condition does this:

        cond.raise(fmt!("from_bytes: input is not UTF-8; first bad byte is %u",
                        first_bad_byte as uint))

It might be more reasonable for condition handlers to recover in this situation if they had a reference to the original byte vector, rather than just a string describing what was wrong.

(In one case, the byte vector being passed is a ~[u8], but in another it is &[u8]; in the latter, case, attempting to do this would run into #5370, unless we were willing to allocate a copy when raising the condition, which is probably not a good idea.)

(Update: blake2-ppc makes a reasonable suggestion in his comment below, though it involves a more substantial revision to the code. I've generalized the title of this ticket accordingly.)

@bluss
Copy link
Member

bluss commented Sep 4, 2013

std::str should provide a decoder function that scans the byte stream and raises conditions to handle invalid data, to offer the normal Skip | Truncate | Replace | ReplaceDefault | Fail resolutions.

@catamorphism
Copy link
Contributor

I'm responsible for this code. I'm fine with changing the condition to carry a byte vector instead of a message.

@bluss
Copy link
Member

bluss commented Sep 9, 2013

Here is some test code to check the performance difference of conditions vs. statically specifying the recovery strategy (inserting replacement char). Conditions have a small cost when setting up to catch them, and a comparatively big cost if they fire, but otherwise it's fine.

https://gist.github.com/anonymous/904d9e4c90a1a199f452

by the way pnkfelix, the from_bytes_slice(&[u8]) -> &str API can not do any modifications, so while it could report the invalid sequences correctly, it can't recover.

@huonw
Copy link
Member

huonw commented Sep 9, 2013

Adding to that last comment, even &mut [u8] can't recover in general, and splicing replacement bytes into a ~[u8] is O(n), since the replacement character is multiple bytes.

@bluss
Copy link
Member

bluss commented Sep 9, 2013

The general &[u8] -> ~str decoder with recovery would allocate a new str from the start anyway (as in the benchmark), so any replacement can be spliced in.

For an iterator interface, arbitrary-length replacements sound much more complicated.

@huonw
Copy link
Member

huonw commented Sep 9, 2013

For an iterator interface, arbitrary-length replacements sound much more complicated.

I don't think it's too hard, i.e. the following would allow splicing an arbitrary number of bytes into a stream (clearly one can generalise to streams of any type, and one could easily restrict to Option or another enum if it was only ever necessary to splice a small number of items at a time (and this number is bounded)):

struct FooDecoder {
    splice: ~[u8]
    // ... other fields ...
}

impl Iterator<u8> for FooDecoder {
    fn next(&mut self) -> Option<u8> {
        if !self.splice.is_empty() {
            // left-over splices from the previous `next` call.
            Some(self.splice.pop())
        } else {
            // ... normal activities ...
            if need_to_splice { 
                // splicing [x,y,z], need to go on in reverse order to match the `pop`
                self.splice.push(z);
                self.splice.push(y);
                return Some(x);
            }
            // ... more normal activities ...
        }
    }
}

@pnkfelix
Copy link
Member Author

Hmm. I think this is actually a duplicate of #4837; closing in favor of that.

@huonw
Copy link
Member

huonw commented Sep 10, 2013

Reopening because #4837 is two-bugs-in-one and this is more specific.

@pnkfelix
Copy link
Member Author

see also #6164 for a more open-ended discussion of problems here.

@emberian
Copy link
Member

emberian commented Feb 6, 2014

Conditions are gone, but I think this is still relevant.

@pnkfelix
Copy link
Member Author

pnkfelix commented Feb 6, 2014

@cmr well maybe at this point we're better off closing this ticket and letting the remaining API discussion continue in #9516

@emberian
Copy link
Member

emberian commented Feb 6, 2014

Yeah, closing.

@emberian emberian closed this as completed Feb 6, 2014
alexcrichton added a commit to alexcrichton/rust that referenced this issue Feb 6, 2014
This has been a long time coming. Conditions in rust were initially envisioned
as being a good alternative to error code return pattern. The idea is that all
errors are fatal-by-default, and you can opt-in to handling the error by
registering an error handler.

While sounding nice, conditions ended up having some unforseen shortcomings:

* Actually handling an error has some very awkward syntax:

    let mut result = None;
    let mut answer = None;
    io::io_error::cond.trap(|e| { result = Some(e) }).inside(|| {
        answer = Some(some_io_operation());
    });
    match result {
        Some(err) => { /* hit an I/O error */ }
        None => {
            let answer = answer.unwrap();
            /* deal with the result of I/O */
        }
    }

  This pattern can certainly use functions like io::result, but at its core
  actually handling conditions is fairly difficult

* The "zero value" of a function is often confusing. One of the main ideas
  behind using conditions was to change the signature of I/O functions. Instead
  of read_be_u32() returning a result, it returned a u32. Errors were notified
  via a condition, and if you caught the condition you understood that the "zero
  value" returned is actually a garbage value. These zero values are often
  difficult to understand, however.

  One case of this is the read_bytes() function. The function takes an integer
  length of the amount of bytes to read, and returns an array of that size. The
  array may actually be shorter, however, if an error occurred.

  Another case is fs::stat(). The theoretical "zero value" is a blank stat
  struct, but it's a little awkward to create and return a zero'd out stat
  struct on a call to stat().

  In general, the return value of functions that can raise error are much more
  natural when using a Result as opposed to an always-usable zero-value.

* Conditions impose a necessary runtime requirement on *all* I/O. In theory I/O
  is as simple as calling read() and write(), but using conditions imposed the
  restriction that a rust local task was required if you wanted to catch errors
  with I/O. While certainly an surmountable difficulty, this was always a bit of
  a thorn in the side of conditions.

* Functions raising conditions are not always clear that they are raising
  conditions. This suffers a similar problem to exceptions where you don't
  actually know whether a function raises a condition or not. The documentation
  likely explains, but if someone retroactively adds a condition to a function
  there's nothing forcing upstream users to acknowledge a new point of task
  failure.

* Libaries using I/O are not guaranteed to correctly raise on conditions when an
  error occurs. In developing various I/O libraries, it's much easier to just
  return `None` from a read rather than raising an error. The silent contract of
  "don't raise on EOF" was a little difficult to understand and threw a wrench
  into the answer of the question "when do I raise a condition?"

Many of these difficulties can be overcome through documentation, examples, and
general practice. In the end, all of these difficulties added together ended up
being too overwhelming and improving various aspects didn't end up helping that
much.

A result-based I/O error handling strategy also has shortcomings, but the
cognitive burden is much smaller. The tooling necessary to make this strategy as
usable as conditions were is much smaller than the tooling necessary for
conditions.

Perhaps conditions may manifest themselves as a future entity, but for now
we're going to remove them from the standard library.

Closes rust-lang#9795
Closes rust-lang#8968
bors added a commit that referenced this issue Feb 7, 2014
This has been a long time coming. Conditions in rust were initially envisioned
as being a good alternative to error code return pattern. The idea is that all
errors are fatal-by-default, and you can opt-in to handling the error by
registering an error handler.

While sounding nice, conditions ended up having some unforseen shortcomings:

* Actually handling an error has some very awkward syntax:

        let mut result = None;                                        
        let mut answer = None;                                        
        io::io_error::cond.trap(|e| { result = Some(e) }).inside(|| { 
            answer = Some(some_io_operation());                       
        });                                                           
        match result {                                                
            Some(err) => { /* hit an I/O error */ }                   
            None => {                                                 
                let answer = answer.unwrap();                         
                /* deal with the result of I/O */                     
            }                                                         
        }                                                             

  This pattern can certainly use functions like io::result, but at its core
  actually handling conditions is fairly difficult

* The "zero value" of a function is often confusing. One of the main ideas
  behind using conditions was to change the signature of I/O functions. Instead
  of read_be_u32() returning a result, it returned a u32. Errors were notified
  via a condition, and if you caught the condition you understood that the "zero
  value" returned is actually a garbage value. These zero values are often
  difficult to understand, however.

  One case of this is the read_bytes() function. The function takes an integer
  length of the amount of bytes to read, and returns an array of that size. The
  array may actually be shorter, however, if an error occurred.

  Another case is fs::stat(). The theoretical "zero value" is a blank stat
  struct, but it's a little awkward to create and return a zero'd out stat
  struct on a call to stat().

  In general, the return value of functions that can raise error are much more
  natural when using a Result as opposed to an always-usable zero-value.

* Conditions impose a necessary runtime requirement on *all* I/O. In theory I/O
  is as simple as calling read() and write(), but using conditions imposed the
  restriction that a rust local task was required if you wanted to catch errors
  with I/O. While certainly an surmountable difficulty, this was always a bit of
  a thorn in the side of conditions.

* Functions raising conditions are not always clear that they are raising
  conditions. This suffers a similar problem to exceptions where you don't
  actually know whether a function raises a condition or not. The documentation
  likely explains, but if someone retroactively adds a condition to a function
  there's nothing forcing upstream users to acknowledge a new point of task
  failure.

* Libaries using I/O are not guaranteed to correctly raise on conditions when an
  error occurs. In developing various I/O libraries, it's much easier to just
  return `None` from a read rather than raising an error. The silent contract of
  "don't raise on EOF" was a little difficult to understand and threw a wrench
  into the answer of the question "when do I raise a condition?"

Many of these difficulties can be overcome through documentation, examples, and
general practice. In the end, all of these difficulties added together ended up
being too overwhelming and improving various aspects didn't end up helping that
much.

A result-based I/O error handling strategy also has shortcomings, but the
cognitive burden is much smaller. The tooling necessary to make this strategy as
usable as conditions were is much smaller than the tooling necessary for
conditions.

Perhaps conditions may manifest themselves as a future entity, but for now
we're going to remove them from the standard library.

Closes #9795
Closes #8968
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants