
Add new function str::from_utf8_lossy() #12062

Merged
bors merged 2 commits into rust-lang:master from lilyball:from_utf8_lossy on Feb 7, 2014

Conversation

@lilyball
Contributor

lilyball commented Feb 6, 2014

from_utf8_lossy() takes a byte vector and produces a ~str, converting
any invalid UTF-8 sequence into the U+FFFD REPLACEMENT CHARACTER.

The replacement follows the guidelines in §5.22 Best Practice for U+FFFD
Substitution from the Unicode Standard (Version 6.2)[1], which also
matches the WHATWG rules for utf-8 decoding[2].

[1]: http://www.unicode.org/versions/Unicode6.2.0/ch05.pdf
[2]: http://encoding.spec.whatwg.org/#utf-8

Closes #9516.
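
To make the behavior concrete, here is an illustrative check (the byte sequence is borrowed from this PR's own tests, and the expected output reflects the final, WHATWG-matching behavior):

let xs = bytes!("Hello", 0xC0, 0x80, " There", 0xE6, 0x83, " Goodbye");
assert_eq!(from_utf8_lossy(xs), ~"Hello\uFFFD\uFFFD There\uFFFD Goodbye");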

@lifthrasiir
Contributor

Care is taken to try to convert invalid UTF-8 sequences into as few replacement characters as are needed.

This is different from the behavior of the WHATWG Encoding standard, which converts the sequence up to the first bad byte into one replacement character and continues processing the following bytes from the initial state. While there is no particular reason to prefer this algorithm (the number of replacement characters is arbitrary anyway), there is also no reason to diverge from the WHATWG standard.
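
For an illustration of the WHATWG rule (example bytes are mine, not from the thread): the truncated three-byte sequence E6 83 followed by a space collapses to a single replacement character, and decoding resumes at the space:

let v = bytes!(0xE6, 0x83, 0x20);
assert_eq!(from_utf8_lossy(v), ~"\uFFFD ");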

@huonw
Member

huonw commented Feb 6, 2014

@lifthrasiir is it worth just pulling your encoding library into the main tree now that we have nice crates? (Assuming it handles cases like this.)

@lifthrasiir
Contributor

@huonw Other than the library not being quite mature enough yet, I don't object. Maybe we can just mark it #[unstable] for now.

@huonw
Member

huonw commented Feb 6, 2014

I think having a full-featured library that is kept up to date and (hopefully) incrementally improved would be better than patching over holes with functions like this. (It would also allow the new URL module to be pulled in since, IIRC, it depends on rust-encoding.)

Others may disagree...

@lilyball
Contributor Author

lilyball commented Feb 6, 2014

@lifthrasiir I was trying to find that link and couldn't find it. Thanks.

@huonw Having encodings in the main tree would be nice, although I'm slightly worried about the size of the encoding tables (I'm not familiar with @lifthrasiir's library in particular, but I assume it uses encoding tables like everything else). I do think that supporting UTF-8 like this can be done without having a full encoding library.

@lilyball
Contributor Author

lilyball commented Feb 6, 2014

@lifthrasiir After reading that algorithm, it's pretty similar to what I'm doing. The two big differences:

  1. My algorithm supports overlong encodings and re-encodes them at the canonical length.
  2. I think that algorithm will emit three U+FFFDs for a surrogate character, which seems a bit absurd to me; my algorithm emits only one. Still, three does follow the rule defined in §5.22 of the Unicode standard.

I can modify this implementation to conform to that algorithm. However, I did intentionally decide to support overlong encodings. Is that a decision I should reverse?
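
To make the surrogate case concrete (illustrative bytes, not from the thread): CESU-8 encodes a surrogate as ED A0 80, and under the §5.22/WHATWG rules a lead byte of ED only permits continuation bytes in the range 80..9F, so A0 terminates the sequence and each of the three bytes becomes its own replacement character:

let v = bytes!(0xED, 0xA0, 0x80);
// WHATWG (and the final version of this PR): three U+FFFDs
assert_eq!(from_utf8_lossy(v), ~"\uFFFD\uFFFD\uFFFD");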

@lilyball
Contributor Author

lilyball commented Feb 6, 2014

Implementing the WHATWG algorithm is pretty straightforward, but unfortunately a straightforward implementation is almost 30% slower on multibyte characters than what I already have 😠. But if we think that behaving the same as the WHATWG algorithm is desired, this can surely be optimized.
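
For reference, a direct transcription of the WHATWG utf-8 decoder into the Rust of this era might look like the sketch below. The names and structure are illustrative, not this PR's code, and the standard-library calls are my reconstruction of the period's APIs:

use std::char;

fn decode_lossy_whatwg(v: &[u8]) -> ~str {
    let mut out = ~"";
    let mut needed = 0u;     // continuation bytes still expected
    let mut cp = 0u32;       // scalar value being assembled
    let mut lower = 0x80u8;  // valid range for the next continuation byte
    let mut upper = 0xBFu8;
    let mut i = 0u;
    while i < v.len() {
        let b = v[i];
        if needed == 0 {
            match b {
                0x00 .. 0x7F => out.push_char(b as char),
                0xC2 .. 0xDF => { needed = 1; cp = (b & 0x1F) as u32; }
                0xE0 .. 0xEF => {
                    if b == 0xE0 { lower = 0xA0; }  // excludes overlongs
                    if b == 0xED { upper = 0x9F; }  // excludes surrogates
                    needed = 2; cp = (b & 0x0F) as u32;
                }
                0xF0 .. 0xF4 => {
                    if b == 0xF0 { lower = 0x90; }  // excludes overlongs
                    if b == 0xF4 { upper = 0x8F; }  // excludes > U+10FFFF
                    needed = 3; cp = (b & 0x07) as u32;
                }
                _ => out.push_char('\uFFFD')  // C0, C1, F5..FF, stray continuations
            }
            i += 1;
        } else if lower <= b && b <= upper {
            cp = (cp << 6) | (b & 0x3F) as u32;
            lower = 0x80; upper = 0xBF;
            needed -= 1;
            if needed == 0 { out.push_char(char::from_u32(cp).unwrap()); }
            i += 1;
        } else {
            // Bad continuation: emit one U+FFFD, then reprocess this byte
            // from the initial state (note: i is not advanced).
            needed = 0; lower = 0x80; upper = 0xBF;
            out.push_char('\uFFFD');
        }
    }
    if needed != 0 { out.push_char('\uFFFD'); }  // input ended mid-sequence
    out
}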

@ghost

ghost commented Feb 6, 2014

RFC 3629 says:

Implementations of the decoding algorithm above MUST protect against
decoding invalid sequences. For instance, a naive implementation may
decode the overlong UTF-8 sequence C0 80 into the character U+0000,
or the surrogate pair ED A1 8C ED BE B4 into U+233B4. Decoding
invalid sequences may have security consequences or cause other
problems.

and

Another example might be a parser which
prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the
illegal octet sequence 2F C0 AE 2E 2F. This last exploit has
actually been used in a widespread virus attacking Web servers in
2001; thus, the security threat is very real.
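
A tiny illustration (mine, not from the RFC) of why naive overlong decoding is dangerous: a decoder that only masks and concatenates bits maps the overlong pair C0 AE to '.', letting 2F C0 AE 2E 2F slip past a "/../" filter that runs before decoding:

let naive = ((0xC0 & 0x1F) << 6) | (0xAE & 0x3F);
assert_eq!(naive, 0x2E); // '.', which the pre-decode filter never saw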

@lilyball
Contributor Author

lilyball commented Feb 6, 2014

@Jurily Yes, but I'm not decoding into codepoints, I'm decoding back into another UTF-8 stream. I originally decided to support overlong encodings because that way someone could decide that, for whatever reason, they need to be able to use the C0 80 encoding of U+0000, and then could handle it with this function. The idea is that any security filtering of the stream can happen on the re-encoded utf-8 stream.

However, I suspect this is an edge case that's better left unsupported. So I'm comfortable saying that we definitely should not support overlong encodings at this point.

@lilyball
Contributor Author

lilyball commented Feb 6, 2014

Rewriting the algorithm to be WHATWG-equivalent without actually using the WHATWG algorithm ends up being up to 25% faster in the multibyte case than my original proposed algorithm (which makes it significantly better than the WHATWG version).

@lilyball
Contributor Author

lilyball commented Feb 6, 2014

I've pushed the new version. It now matches the WHATWG algorithm's output, and no longer allows for overlong encodings.

@lilyball
Contributor Author

lilyball commented Feb 6, 2014

@huonw To expand on my previous comment, I think that providing a full encodings library in the core distribution is a good idea, but it should remain in its own crate. On the other hand, UTF-8 decoding is so important to Rust (because its native str type requires UTF-8) that a version of the decoder that performs U+FFFD replacement should be provided in libstd.

@lilyball
Contributor Author

lilyball commented Feb 7, 2014

This PR will close #9516.


static TAG_CONT_U8: u8 = 128u8;

/// Converts a vector of bytes to a new utf-8 string.
/// Any invalid utf-8 sequences are replaced with U+FFFD REPLACEMENT CHARACTER.

@huonw
Member

Can we have an example or two?

@lilyball
Contributor Author

lilyball commented Feb 7, 2014

@huonw Example added.
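
(For the record, the added documentation example looks roughly like this:)

/// # Example
///
/// ```rust
/// let input = bytes!("Hello ", 0xF0, 0x90, 0x80, "World");
/// let output = std::str::from_utf8_lossy(input);
/// assert_eq!(output, ~"Hello \uFFFDWorld");
/// ```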

let mut res = with_capacity(total);

loop {
    if i >= total {

@huonw
Member

while i < total?

@lilyball
Contributor Author

@huonw There was a reason for this: at first I had a label on the original loop, and then the new version also did some work when i >= total beyond just breaking. But you're right, in the current incarnation it can go back to the original while loop.

@huonw
Member

Also, it's probably faster to use

unsafe {
    let mut ptr = v.as_ptr();
    let end = ptr.offset(v.len() as int); // offset takes an int

    while ptr < end {
        // ...
        let byte = *ptr;
        ptr = ptr.offset(1);
        // ...
    }
}

although safe_get would require some modification.

@lilyball
Contributor Author

The current implementation is patterned on how is_utf8() works, which was itself modified to its current incarnation after performance testing. I'm curious what the performance impact of your suggestion would be.

@huonw
Member

I guess the bounds checking part of safe_get would be replaced by if !have_enough_space(p, end, w) { /* use REPLACEMENT */ } called once, before the match so the other indexes can be unchecked.

Where have_enough_space would be w <= end as uint - p as uint, I think.

@lilyball
Contributor Author

It explicitly doesn't work that way, because if w is 4 but I only have 3 bytes left, those bytes could be something like F0 49 50, which should evaluate to "\uFFFDIP". In other words, even if there's not enough room, I still have to validate each byte.
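
(As a concrete check, illustrative and matching the merged behavior: the bad lead byte becomes one replacement character and the trailing ASCII bytes survive.)

let v = bytes!(0xF0, 0x49, 0x50);
assert_eq!(from_utf8_lossy(v), ~"\uFFFDIP");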

@huonw
Member

Ah, I see... maybe have a slow path if there isn't enough space.

@lilyball
Contributor Author

Perhaps. I'll have to do some testing.

@lilyball
Contributor Author

Using pointer arithmetic did not help.

Before (as submitted):

test str::bench::from_utf8_lossy_100_ascii                      ... bench:       261 ns/iter (+/- 32)
test str::bench::from_utf8_lossy_100_multibyte                  ... bench:       440 ns/iter (+/- 48)
test str::bench::from_utf8_lossy_invalid                        ... bench:        89 ns/iter (+/- 5)

After:

test str::bench::from_utf8_lossy_100_ascii                      ... bench:       272 ns/iter (+/- 32)
test str::bench::from_utf8_lossy_100_multibyte                  ... bench:       447 ns/iter (+/- 48)
test str::bench::from_utf8_lossy_invalid                        ... bench:        95 ns/iter (+/- 9)

@huonw
Member

Hm, that surprises me slightly, but the numbers don't lie.

@lilyball
Contributor Author

lilyball commented Feb 7, 2014

I changed loop to while, and removed the unnecessary use of return at the end, but otherwise left it alone. In my tests I couldn't get it to run faster, even with changes I thought would obviously work. Using ~[u8] directly instead of ~str (since the raw methods I'm using just end up calling as_owned_vec() first) sped up the normal scenario of valid strings slightly, but significantly slowed down the invalid case. Pointer arithmetic made no difference. And counting the number of replacements first (to allocate the new string at the correct size) and then removing all the checks on the pushes was, quite surprisingly, slower across the board.

Given that, I'm leaving the algorithm alone. It's good enough.

let xs = bytes!("ศไทย中华Việt Nam");
assert_eq!(from_utf8_lossy(xs), ~"ศไทย中华Việt Nam");

let xs = bytes!("Hello", 0xC0, 0x80, " There", 0xE6, 0x83, " Goodbye");

@huonw
Member

There doesn't seem to be a test for a two-byte character with only the first byte?

Also, how about one for completely invalid bytes (e.g. 0xFF)?

@lilyball
Contributor Author

Good point about a two-byte char test, but I already test 0xF5, which is an invalid byte.

@huonw
Member

huonw commented Feb 7, 2014

Possible extension (which I don't think should block an r+): return enum MaybeOwned<'a> { Slice(&'a str), Owned(~str) }, which only allocates when absolutely required. That is, basically run the is_utf8 validation loop; if it finds something invalid, break, copy the (known good) string up until that point, and fall back to the replacement-character loop you've written here. If the validation loop passes, it can return Slice(transmute(bytes)); otherwise it allocates.

A possible (internal) optimisation generalising this: record the last known good character and only copy when necessary (i.e. when you see an invalid byte, copy up until that point, write the replacement character, and record the current position).
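
A minimal sketch of that shape (era-style Rust; from_utf8_lossy_owned is a hypothetical name standing in for the replacement-character loop in this PR):

use std::cast;
use std::str::is_utf8;

pub enum MaybeOwned<'a> { Slice(&'a str), Owned(~str) }

pub fn from_utf8_lossy<'a>(v: &'a [u8]) -> MaybeOwned<'a> {
    if is_utf8(v) {
        // Known-valid UTF-8: reinterpret the bytes in place, no allocation.
        Slice(unsafe { cast::transmute(v) })
    } else {
        // Hypothetical helper: the replacement-character loop from this PR.
        Owned(from_utf8_lossy_owned(v))
    }
}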

@huonw
Member

huonw commented Feb 7, 2014

(r=me with the few extra tests.)

@huonw
Member

huonw commented Feb 7, 2014

Thinking about it (warning: stream of thought): this really wants a rope-style data structure that stores segments of &'a str with counts of replacement characters between them, preferably as an iterator.

(I guess we could actually have a "long" "\uFFFD\uFFFD...\uFFFD" static string somewhere and just use slices out of this for the sequences of replacement characters, so the return value could literally be just an Iterator<&'a str>.)

The extra complications will probably offset the zero-allocations in terms of performance.

@lilyball
Contributor Author

lilyball commented Feb 7, 2014

@huonw MaybeOwned is an interesting suggestion, I've thought about something similar. The problem is it's just so unwieldy to use. Implementing Str on it would make things easier, though. Still, if memory usage is an issue, callers can always try from_utf8() first, which is pretty fast.

I also considered doing what you suggested of noting the last known good character. It's probably worth testing, but given the results of my previous performance experiments, I wouldn't be surprised if this turns out to be no good as well. Still, it's probably worth a go.

A rope-like data structure would be a nice approach for this, but of course that would be a separate implementation entirely as str is not a rope.
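
(The Str impl mentioned above might look like the following sketch, assuming the MaybeOwned shape sketched earlier and the era's Str trait:)

impl<'a> Str for MaybeOwned<'a> {
    fn as_slice<'b>(&'b self) -> &'b str {
        match *self {
            Slice(s) => s,
            Owned(ref s) => s.as_slice(),
        }
    }
    // Assuming Str also required into_owned at the time.
    fn into_owned(self) -> ~str {
        match self {
            Slice(s) => s.to_owned(),
            Owned(s) => s,
        }
    }
}

With that, a MaybeOwned result could be passed anywhere a &str is accepted via .as_slice().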

@huonw
Member

huonw commented Feb 7, 2014

High-level sketch:

struct LossyUTF8Iter<'a> { bytes: &'a [u8] }
fn lossy_utf8<'a>(bytes: &'a [u8]) -> LossyUTF8Iter<'a> { /* obvious */ }

impl<'a> Iterator<&'a str> for LossyUTF8Iter<'a> {
    fn next(&mut self) -> Option<&'a str> {
        static REP: &'static str = "\uFFFD\uFFFD\uFFFD\uFFFD"; // some reasonable length

        let good = /* validate first character */;

        if good {
            /* find first invalid byte... */

            return Some(self.bytes.slice_to(index));
        } else {
            /* find first valid character (or
               REP.char_len() invalid characters,
               whichever comes first)... */

            // U+FFFD is 3 bytes in UTF-8, hence the factor of 3
            return Some(REP.slice_to(3 * replacement_count));
        }
    }
}

// helper, not necessary (and won't work yet)
fn slice_if_possible<'a, It: Iterator<&'a str>>(it: It) -> MaybeOwned<'a> {
    // if it has 1 element, return that; otherwise concatenate them all in a new alloc.
}

This is probably fast for long runs of valid or invalid bytes, but maybe not for many isolated invalid bytes. (I suppose it could also just immediately return "\uFFFD" as soon as it finds a single invalid codepoint rather than try to batch them.)

@lilyball
Contributor Author

lilyball commented Feb 7, 2014

With the current algorithm:

test str::bench::from_utf8_lossy_100_ascii                      ... bench:       246 ns/iter (+/- 12)
test str::bench::from_utf8_lossy_100_invalid                    ... bench:       299 ns/iter (+/- 10)
test str::bench::from_utf8_lossy_100_multibyte                  ... bench:       415 ns/iter (+/- 11)
test str::bench::from_utf8_lossy_invalid                        ... bench:        87 ns/iter (+/- 6)

If I implement the "last good char" suggestion:

test str::bench::from_utf8_lossy_100_ascii                      ... bench:       185 ns/iter (+/- 3)
test str::bench::from_utf8_lossy_100_invalid                    ... bench:       348 ns/iter (+/- 20)
test str::bench::from_utf8_lossy_100_multibyte                  ... bench:       235 ns/iter (+/- 32)
test str::bench::from_utf8_lossy_invalid                        ... bench:        98 ns/iter (+/- 1)

The full invalid case is about 16% slower, but the valid cases are faster (in the case of the multibyte test, significantly faster) so I think it's a win.
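
To illustrate the "last known good" strategy in isolation, here is a toy version (mine, not the patch) that replaces every non-ASCII byte, flushing whole runs of good bytes instead of pushing byte-by-byte; the real patch applies the same pattern to full UTF-8 validation, and the library calls are my reconstruction of the era's APIs:

fn ascii_lossy(v: &[u8]) -> ~str {
    let mut res = std::str::with_capacity(v.len());
    let mut last_good = 0;
    let mut i = 0;
    while i < v.len() {
        if v[i] >= 128 {
            // Flush the pending run of good bytes, then one replacement.
            unsafe { std::str::raw::push_bytes(&mut res, v.slice(last_good, i)); }
            res.push_str("\uFFFD");
            last_good = i + 1;
        }
        i += 1;
    }
    // Trailing run of good bytes.
    unsafe { std::str::raw::push_bytes(&mut res, v.slice(last_good, v.len())); }
    res
}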

@lilyball
Contributor Author

lilyball commented Feb 7, 2014

Just pushed the "last good character" version, along with a new test for 2-byte characters.

@lilyball
Contributor Author

lilyball commented Feb 7, 2014

@huonw I like the idea of the iterator, but it just doesn't seem terribly practical. I typically need to work with a full string, not an iterator of adjacent substrings. It's possibly worth exploring separately, but it would be a parallel implementation rather than a replacement for from_utf8_lossy().

@huonw
Member

huonw commented Feb 7, 2014

With some sort of -> ~str (or equivalently, -> MaybeOwned) method it might not be so bad (e.g. if FromIterator was implemented). But yeah, it would require some finesse to be practical.

bors added a commit that referenced this pull request Feb 7, 2014
`from_utf8_lossy()` takes a byte vector and produces a `~str`, converting
any invalid UTF-8 sequence into the U+FFFD REPLACEMENT CHARACTER.

The replacement follows the guidelines in §5.22 Best Practice for U+FFFD
Substitution from the Unicode Standard (Version 6.2)[1], which also
matches the WHATWG rules for utf-8 decoding[2].

[1]: http://www.unicode.org/versions/Unicode6.2.0/ch05.pdf
[2]: http://encoding.spec.whatwg.org/#utf-8

Closes #9516.
@bors bors closed this Feb 7, 2014
@bors bors merged commit 544cb42 into rust-lang:master Feb 7, 2014
@huonw
Member

huonw commented Feb 7, 2014

I wrote an iterator version; it's much faster when not allocating, and anywhere from slightly faster to a lot slower when allocating.

                               iter   iter_collect   str
from_utf8_lossy_100_ascii       126            197   188
from_utf8_lossy_100_invalid     300            371   384
from_utf8_lossy_100_multibyte   162            232   227
from_utf8_lossy_invalid          68            211   121

Also, I have a feeling that Show for Display should be doing something like

if /* no padding */ {
    for s in str::from_utf8_lossy(self.as_bytes()) {
        f.buf.write_str(s)
    }
} else {
    let mut collected = str::with_capacity(...);
    // could micro-optimise the case when self is valid utf8 to 
    // (equivalent of) just f.pad(self)
    for s in str::from_utf8_lossy(self.as_bytes()) { collected.push_str(s) }
    f.pad(collected);
}

and, if necessary, ToStr can delegate to Show via format!("{}", *self).

@lilyball
Contributor Author

lilyball commented Feb 7, 2014

@huonw The iterator version is interesting, I'm just concerned that nearly all callers are going to need to collect it, at which point it's no better than the str version. Pretty much the only time you can use the iterator version as-is is if you're immediately writing it to some sort of Writer, as in your sample Show impl.

@lilyball lilyball deleted the from_utf8_lossy branch February 7, 2014 18:04
bors added a commit to rust-lang-ci/rust that referenced this pull request Nov 30, 2021
Eliminate bunch of copies of error codepath from Utf8LossyChunksIter

Using a macro to stamp out 7 identical copies of the nontrivial slicing logic to exit this loop didn't seem like a necessary use of a macro. The early return case can be handled by `break` without practically any changes to the logic inside the loop.

All this code is from early 2014 (rust-lang#12062, nearly 8 years ago; pre-1.0), so it's possible there were compiler limitations that forced the macro way at the time.

Confirmed that `x.py bench library/alloc --stage 0 --test-args from_utf8_lossy` is unaffected on my machine.
flip1995 pushed a commit to flip1995/rust that referenced this pull request Jan 11, 2024
…onditional_recursion, r=xFrednet

Fix false positive `unconditional_recursion`

Fixes rust-lang#12052.

Only checking whether both variables are `local` was not enough; we also need to confirm they have the same type as `Self`.

changelog: Fix false positive for `unconditional_recursion` lint