
Add new function str::from_utf8_lossy() #12062

Merged
bors merged 2 commits into rust-lang:master from lilyball:from_utf8_lossy on Feb 7, 2014

Conversation

@lilyball
Contributor

lilyball commented Feb 6, 2014

from_utf8_lossy() takes a byte vector and produces a ~str, converting
any invalid UTF-8 sequence into the U+FFFD REPLACEMENT CHARACTER.

The replacement follows the guidelines in §5.22 Best Practice for U+FFFD
Substitution from the Unicode Standard (Version 6.2)[1], which also
matches the WHATWG rules for utf-8 decoding[2].

[1]: http://www.unicode.org/versions/Unicode6.2.0/ch05.pdf
[2]: http://encoding.spec.whatwg.org/#utf-8

Closes #9516.
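
To make the behavior concrete, here is an illustrative check (the byte sequence is borrowed from this PR's own tests, and the expected output reflects the final, WHATWG-matching behavior):

let xs = bytes!("Hello", 0xC0, 0x80, " There", 0xE6, 0x83, " Goodbye");
assert_eq!(from_utf8_lossy(xs), ~"Hello\uFFFD\uFFFD There\uFFFD Goodbye");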

@lifthrasiir
Contributor

Care is taken to try to convert invalid UTF-8 sequences into as few replacement characters as are needed.

This is different from the behavior of the WHATWG Encoding standard, which converts the sequence up to the first bad byte into one replacement character and continues processing the following bytes from the initial state. While there is no particular reason to prefer this algorithm (the number of replacement characters is arbitrary anyway), there is also no reason to diverge from the WHATWG standard.
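
For an illustration of the WHATWG rule (example bytes are mine, not from the thread): the truncated three-byte sequence E6 83 followed by a space collapses to a single replacement character, and decoding resumes at the space:

let v = bytes!(0xE6, 0x83, 0x20);
assert_eq!(from_utf8_lossy(v), ~"\uFFFD ");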

@huonw
Member

huonw commented Feb 6, 2014

@lifthrasiir is it worth just pulling your encoding library into the main tree now that we have nice crates? (Assuming it handles cases like this.)

@lifthrasiir
Contributor

@huonw Other than the library not being quite mature enough yet, I don't object. Maybe we can just mark it #[unstable] for now.

@huonw
Member

huonw commented Feb 6, 2014

I think having a full-featured library that is kept up to date and (hopefully) incrementally improved would be better than patching over holes with functions like this. (It would also allow the new URL module to be pulled in since, IIRC, it depends on rust-encoding.)

Others may disagree...

@lilyball
Contributor Author

lilyball commented Feb 6, 2014

@lifthrasiir I was trying to find that link and couldn't find it. Thanks.

@huonw Having encodings in the main tree would be nice, although I'm slightly worried about the size of the encoding tables (I'm not familiar with @lifthrasiir's library in particular, but I assume it uses encoding tables like everything else). I do think that supporting UTF-8 like this can be done without having a full encoding library.

@lilyball
Contributor Author

lilyball commented Feb 6, 2014

@lifthrasiir After reading that algorithm, it's pretty similar to what I'm doing. The two big differences:

  1. My algorithm supports overlong encodings and re-encodes them at the canonical length.
  2. I think that algorithm will emit three U+FFFDs for a surrogate character, which seems a bit absurd to me; my algorithm emits only one. Still, three does follow the rule defined in §5.22 of the Unicode standard.

I can modify this implementation to conform to that algorithm. However, I did intentionally decide to support overlong encodings. Is that a decision I should reverse?
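
To make the surrogate case concrete (illustrative bytes, not from the thread): CESU-8 encodes a surrogate as ED A0 80, and under the §5.22/WHATWG rules a lead byte of ED only permits continuation bytes in the range 80..9F, so A0 terminates the sequence and each of the three bytes becomes its own replacement character:

let v = bytes!(0xED, 0xA0, 0x80);
// WHATWG (and the final version of this PR): three U+FFFDs
assert_eq!(from_utf8_lossy(v), ~"\uFFFD\uFFFD\uFFFD");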

@lilyball
Contributor Author

lilyball commented Feb 6, 2014

Implementing the WHATWG algorithm is pretty straightforward, but unfortunately a straightforward implementation is almost 30% slower on multibyte characters than what I already have 😠. But if we think that behaving the same as the WHATWG algorithm is desired, this can surely be optimized.
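
For reference, a direct transcription of the WHATWG utf-8 decoder into the Rust of this era might look like the sketch below. The names and structure are illustrative, not this PR's code, and the standard-library calls are my reconstruction of the period's APIs:

use std::char;

fn decode_lossy_whatwg(v: &[u8]) -> ~str {
    let mut out = ~"";
    let mut needed = 0u;     // continuation bytes still expected
    let mut cp = 0u32;       // scalar value being assembled
    let mut lower = 0x80u8;  // valid range for the next continuation byte
    let mut upper = 0xBFu8;
    let mut i = 0u;
    while i < v.len() {
        let b = v[i];
        if needed == 0 {
            match b {
                0x00 .. 0x7F => out.push_char(b as char),
                0xC2 .. 0xDF => { needed = 1; cp = (b & 0x1F) as u32; }
                0xE0 .. 0xEF => {
                    if b == 0xE0 { lower = 0xA0; }  // excludes overlongs
                    if b == 0xED { upper = 0x9F; }  // excludes surrogates
                    needed = 2; cp = (b & 0x0F) as u32;
                }
                0xF0 .. 0xF4 => {
                    if b == 0xF0 { lower = 0x90; }  // excludes overlongs
                    if b == 0xF4 { upper = 0x8F; }  // excludes > U+10FFFF
                    needed = 3; cp = (b & 0x07) as u32;
                }
                _ => out.push_char('\uFFFD')  // C0, C1, F5..FF, stray continuations
            }
            i += 1;
        } else if lower <= b && b <= upper {
            cp = (cp << 6) | (b & 0x3F) as u32;
            lower = 0x80; upper = 0xBF;
            needed -= 1;
            if needed == 0 { out.push_char(char::from_u32(cp).unwrap()); }
            i += 1;
        } else {
            // Bad continuation: emit one U+FFFD, then reprocess this byte
            // from the initial state (note: i is not advanced).
            needed = 0; lower = 0x80; upper = 0xBF;
            out.push_char('\uFFFD');
        }
    }
    if needed != 0 { out.push_char('\uFFFD'); }  // input ended mid-sequence
    out
}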

@ghost

ghost commented Feb 6, 2014

RFC 3629 says:

Implementations of the decoding algorithm above MUST protect against
decoding invalid sequences. For instance, a naive implementation may
decode the overlong UTF-8 sequence C0 80 into the character U+0000,
or the surrogate pair ED A1 8C ED BE B4 into U+233B4. Decoding
invalid sequences may have security consequences or cause other
problems.

and

Another example might be a parser which
prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the
illegal octet sequence 2F C0 AE 2E 2F. This last exploit has
actually been used in a widespread virus attacking Web servers in
2001; thus, the security threat is very real.
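
A tiny illustration (mine, not from the RFC) of why naive overlong decoding is dangerous: a decoder that only masks and concatenates bits maps the overlong pair C0 AE to '.', letting 2F C0 AE 2E 2F slip past a "/../" filter that runs before decoding:

let naive = ((0xC0 & 0x1F) << 6) | (0xAE & 0x3F);
assert_eq!(naive, 0x2E); // '.', which the pre-decode filter never saw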

@lilyball
Contributor Author

lilyball commented Feb 6, 2014

@Jurily Yes, but I'm not decoding into codepoints, I'm decoding back into another UTF-8 stream. I originally decided to support overlong encodings because that way someone could decide that, for whatever reason, they need to be able to use the C0 80 encoding of U+0000, and then could handle it with this function. The idea is that any security filtering of the stream can happen on the re-encoded utf-8 stream.

However, I suspect this is an edge case that's better left unsupported. So I'm comfortable saying that we definitely should not support overlong encodings at this point.

@lilyball
Contributor Author

lilyball commented Feb 6, 2014

Rewriting the algorithm to be WHATWG-equivalent without actually using the WHATWG algorithm ends up being up to 25% faster in the multibyte case than my original proposed algorithm (which makes it significantly better than the WHATWG version).

@lilyball
Contributor Author

lilyball commented Feb 6, 2014

I've pushed the new version. It now matches the WHATWG algorithm's output, and no longer allows for overlong encodings.

@lilyball
Contributor Author

lilyball commented Feb 6, 2014

@huonw To expand on my previous comment, I think that providing a full encodings library in the core distribution is a good idea, but it should remain in its own crate. On the other hand, UTF-8 decoding is so important to Rust (because its native str type requires UTF-8) that a version of the decoder that performs U+FFFD replacement should be provided in libstd.

@lilyball
Contributor Author

lilyball commented Feb 7, 2014

This PR will close #9516.


static TAG_CONT_U8: u8 = 128u8;

/// Converts a vector of bytes to a new utf-8 string.
/// Any invalid utf-8 sequences are replaced with U+FFFD REPLACEMENT CHARACTER.

@huonw
Member

Can we have an example or two?

@lilyball
Contributor Author

lilyball commented Feb 7, 2014

@huonw Example added.
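
(For the record, the added documentation example looks roughly like this:)

/// # Example
///
/// ```rust
/// let input = bytes!("Hello ", 0xF0, 0x90, 0x80, "World");
/// let output = std::str::from_utf8_lossy(input);
/// assert_eq!(output, ~"Hello \uFFFDWorld");
/// ```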

let mut res = with_capacity(total);

loop {
    if i >= total {

@huonw
Member

while i < total?

@lilyball
Contributor Author

@huonw There was a reason for this: at first I had a label on the original loop, and then the new version also did some work when i >= total beyond just breaking. But you're right, in the current incarnation it can go back to the original while loop.

@huonw
Member

Also, it's probably faster to use

unsafe {
    let mut ptr = v.as_ptr();
    let end = ptr.offset(v.len() as int); // offset takes an int

    while ptr < end {
        // ...
        let byte = *ptr;
        ptr = ptr.offset(1);
        // ...
    }
}

although safe_get would require some modification.

@lilyball
Contributor Author

The current implementation is patterned on how is_utf8() works, which was itself modified to its current incarnation after performance testing. I'm curious what the performance impact of your suggestion would be.

@huonw
Member

I guess the bounds checking part of safe_get would be replaced by if !have_enough_space(p, end, w) { /* use REPLACEMENT */ } called once, before the match so the other indexes can be unchecked.

Where have_enough_space would be w <= end as uint - p as uint, I think.

@lilyball
Contributor Author

It explicitly doesn't work that way, because if w is 4 but I only have 3 bytes left, those bytes could be something like F0 49 50, which should evaluate to "\uFFFDIP". In other words, even if there's not enough room, I still have to validate each byte.
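
(As a concrete check, illustrative and matching the merged behavior: the bad lead byte becomes one replacement character and the trailing ASCII bytes survive.)

let v = bytes!(0xF0, 0x49, 0x50);
assert_eq!(from_utf8_lossy(v), ~"\uFFFDIP");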

@huonw
Member

Ah, I see... maybe have a slow path if there isn't enough space.

@lilyball
Contributor Author

Perhaps. I'll have to do some testing.

@lilyball
Contributor Author

Using pointer arithmetic did not help.

Before (as submitted):

test str::bench::from_utf8_lossy_100_ascii                      ... bench:       261 ns/iter (+/- 32)
test str::bench::from_utf8_lossy_100_multibyte                  ... bench:       440 ns/iter (+/- 48)
test str::bench::from_utf8_lossy_invalid                        ... bench:        89 ns/iter (+/- 5)

After:

test str::bench::from_utf8_lossy_100_ascii                      ... bench:       272 ns/iter (+/- 32)
test str::bench::from_utf8_lossy_100_multibyte                  ... bench:       447 ns/iter (+/- 48)
test str::bench::from_utf8_lossy_invalid                        ... bench:        95 ns/iter (+/- 9)

@huonw
Member

Hm, that surprises me slightly, but the numbers don't lie.

@lilyball
Contributor Author

lilyball commented Feb 7, 2014

I changed loop to while, and removed the unnecessary use of return at the end, but otherwise left it alone. In my tests I couldn't get it to run faster, even with changes I thought would obviously work. Using ~[u8] directly instead of ~str (since the raw methods I'm using just end up calling as_owned_vec() first) sped up the normal scenario of valid strings slightly, but significantly slowed down the invalid case. Pointer arithmetic made no difference. And counting the number of replacements first (to allocate the new string at the correct size) and then removing all the checks on the pushes was, quite surprisingly, slower across the board.

Given that, I'm leaving the algorithm alone. It's good enough.

let xs = bytes!("ศไทย中华Việt Nam");
assert_eq!(from_utf8_lossy(xs), ~"ศไทย中华Việt Nam");

let xs = bytes!("Hello", 0xC0, 0x80, " There", 0xE6, 0x83, " Goodbye");

@huonw
Member

There doesn't seem to be a test for a two-byte character with only the first byte?

Also, how about one for completely invalid bytes (e.g. 0xFF)?

@lilyball
Contributor Author

Good point about a two-byte char test, but I already test 0xF5, which is an invalid byte.

@huonw
Member

huonw commented Feb 7, 2014

Possible extension (which I don't think should block an r+): return enum MaybeOwned<'a> { Slice(&'a str), Owned(~str) }, which only allocates when absolutely required. That is, basically run the is_utf8 validation loop; if it finds something invalid, break, copy the (known good) string up until that point, and fall back to the replacement-character loop you've written here. If the validation loop passes, it can return Slice(transmute(bytes)); otherwise it allocates.

A possible (internal) optimisation generalising this: record the last known good character and only copy when necessary (i.e. when you see an invalid byte, copy up until that point, write the replacement character, and record the current position).
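
A minimal sketch of that shape (era-style Rust; from_utf8_lossy_owned is a hypothetical name standing in for the replacement-character loop in this PR):

use std::cast;
use std::str::is_utf8;

pub enum MaybeOwned<'a> { Slice(&'a str), Owned(~str) }

pub fn from_utf8_lossy<'a>(v: &'a [u8]) -> MaybeOwned<'a> {
    if is_utf8(v) {
        // Known-valid UTF-8: reinterpret the bytes in place, no allocation.
        Slice(unsafe { cast::transmute(v) })
    } else {
        // Hypothetical helper: the replacement-character loop from this PR.
        Owned(from_utf8_lossy_owned(v))
    }
}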

@huonw
Member

huonw commented Feb 7, 2014

(r=me with the few extra tests.)

@huonw
Member

huonw commented Feb 7, 2014

Thinking about it (warning: stream of thought): this really wants a rope-style data structure that stores segments of &'a str with counts of replacement characters between them, preferably as an iterator.

(I guess we could actually have a "long" "\uFFFD\uFFFD...\uFFFD" static string somewhere and just use slices out of this for the sequences of replacement characters, so the return value could literally be just an Iterator<&'a str>.)

The extra complications will probably offset the zero-allocations in terms of performance.

@lilyball
Contributor Author

lilyball commented Feb 7, 2014

@huonw MaybeOwned is an interesting suggestion, I've thought about something similar. The problem is it's just so unwieldy to use. Implementing Str on it would make things easier, though. Still, if memory usage is an issue, callers can always try from_utf8() first, which is pretty fast.

I also considered doing what you suggested of noting the last known good character. It's probably worth testing, but given the results of my previous performance experiments, I wouldn't be surprised if this turns out to be no good as well. Still, it's probably worth a go.

A rope-like data structure would be a nice approach for this, but of course that would be a separate implementation entirely as str is not a rope.
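
(The Str impl mentioned above might look like the following sketch, assuming the MaybeOwned shape sketched earlier and the era's Str trait:)

impl<'a> Str for MaybeOwned<'a> {
    fn as_slice<'b>(&'b self) -> &'b str {
        match *self {
            Slice(s) => s,
            Owned(ref s) => s.as_slice(),
        }
    }
    // Assuming Str also required into_owned at the time.
    fn into_owned(self) -> ~str {
        match self {
            Slice(s) => s.to_owned(),
            Owned(s) => s,
        }
    }
}

With that, a MaybeOwned result could be passed anywhere a &str is accepted via .as_slice().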

@huonw
Member

huonw commented Feb 7, 2014

High-level sketch:

struct LossyUTF8Iter<'a> { bytes: &'a [u8] }
fn lossy_utf8<'a>(bytes: &'a [u8]) -> LossyUTF8Iter<'a> { /* obvious */ }

impl<'a> Iterator<&'a str> for LossyUTF8Iter<'a> {
    fn next(&mut self) -> Option<&'a str> {
        static REP: &'static str = "\uFFFD\uFFFD\uFFFD\uFFFD"; // some reasonable length

        let good = /* validate first character */;

        if good {
            /* find first invalid byte... */

            return Some(self.bytes.slice_to(index));
        } else {
            /* find first valid character (or
               REP.char_len() invalid characters,
               whichever comes first)... */

            // U+FFFD is 3 bytes in UTF-8, hence the factor of 3
            return Some(REP.slice_to(3 * replacement_count));
        }
    }
}

// helper, not necessary (and won't work yet)
fn slice_if_possible<'a, It: Iterator<&'a str>>(it: It) -> MaybeOwned<'a> {
    // if it has 1 element, return that; otherwise concatenate them all in a new alloc.
}

This is probably fast for long runs of valid or invalid bytes, but maybe not for many isolated invalid bytes. (I suppose it could also just immediately return "\uFFFD" as soon as it finds a single invalid codepoint rather than try to batch them.)

@lilyball
Contributor Author

lilyball commented Feb 7, 2014

With the current algorithm:

test str::bench::from_utf8_lossy_100_ascii                      ... bench:       246 ns/iter (+/- 12)
test str::bench::from_utf8_lossy_100_invalid                    ... bench:       299 ns/iter (+/- 10)
test str::bench::from_utf8_lossy_100_multibyte                  ... bench:       415 ns/iter (+/- 11)
test str::bench::from_utf8_lossy_invalid                        ... bench:        87 ns/iter (+/- 6)

If I implement the "last good char" suggestion:

test str::bench::from_utf8_lossy_100_ascii                      ... bench:       185 ns/iter (+/- 3)
test str::bench::from_utf8_lossy_100_invalid                    ... bench:       348 ns/iter (+/- 20)
test str::bench::from_utf8_lossy_100_multibyte                  ... bench:       235 ns/iter (+/- 32)
test str::bench::from_utf8_lossy_invalid                        ... bench:        98 ns/iter (+/- 1)

The full invalid case is about 16% slower, but the valid cases are faster (in the case of the multibyte test, significantly faster) so I think it's a win.
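
To illustrate the "last known good" strategy in isolation, here is a toy version (mine, not the patch) that replaces every non-ASCII byte, flushing whole runs of good bytes instead of pushing byte-by-byte; the real patch applies the same pattern to full UTF-8 validation, and the library calls are my reconstruction of the era's APIs:

fn ascii_lossy(v: &[u8]) -> ~str {
    let mut res = std::str::with_capacity(v.len());
    let mut last_good = 0;
    let mut i = 0;
    while i < v.len() {
        if v[i] >= 128 {
            // Flush the pending run of good bytes, then one replacement.
            unsafe { std::str::raw::push_bytes(&mut res, v.slice(last_good, i)); }
            res.push_str("\uFFFD");
            last_good = i + 1;
        }
        i += 1;
    }
    // Trailing run of good bytes.
    unsafe { std::str::raw::push_bytes(&mut res, v.slice(last_good, v.len())); }
    res
}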

@lilyball
Contributor Author

lilyball commented Feb 7, 2014

Just pushed the "last good character" version, along with a new test for 2-byte characters.

@lilyball
Contributor Author

lilyball commented Feb 7, 2014

@huonw I like the idea of the iterator, but it just doesn't seem terribly practical. I typically need to work with a full string, not an iterator of adjacent substrings. It's possibly worth exploring separately, but it would be a parallel implementation rather than a replacement for from_utf8_lossy().

@huonw
Member

huonw commented Feb 7, 2014

With some sort of -> ~str (or equivalently, -> MaybeOwned) method it might not be so bad (e.g. if FromIterator was implemented). But yeah, it would require some finesse to be practical.

bors added a commit that referenced this pull request Feb 7, 2014
`from_utf8_lossy()` takes a byte vector and produces a `~str`, converting
any invalid UTF-8 sequence into the U+FFFD REPLACEMENT CHARACTER.

The replacement follows the guidelines in §5.22 Best Practice for U+FFFD
Substitution from the Unicode Standard (Version 6.2)[1], which also
matches the WHATWG rules for utf-8 decoding[2].

[1]: http://www.unicode.org/versions/Unicode6.2.0/ch05.pdf
[2]: http://encoding.spec.whatwg.org/#utf-8

Closes #9516.
@bors bors closed this Feb 7, 2014
@bors bors merged commit 544cb42 into rust-lang:master Feb 7, 2014
@huonw
Member

huonw commented Feb 7, 2014

I wrote an iterator version; it's much faster when not allocating, and anywhere from slightly faster to a lot slower when allocating.

                               iter   iter_collect   str
from_utf8_lossy_100_ascii       126            197   188
from_utf8_lossy_100_invalid     300            371   384
from_utf8_lossy_100_multibyte   162            232   227
from_utf8_lossy_invalid          68            211   121

Also, I have a feeling that Show for Display should be doing something like

if /* no padding */ {
    for s in str::from_utf8_lossy(self.as_bytes()) {
        f.buf.write_str(s)
    }
} else {
    let mut collected = str::with_capacity(...);
    // could micro-optimise the case when self is valid utf8 to 
    // (equivalent of) just f.pad(self)
    for s in str::from_utf8_lossy(self.as_bytes()) { collected.push_str(s) }
    f.pad(collected);
}

and, if necessary, ToStr can delegate to Show via format!("{}", *self).

@lilyball
Contributor Author

lilyball commented Feb 7, 2014

@huonw The iterator version is interesting, I'm just concerned that nearly all callers are going to need to collect it, at which point it's no better than the str version. Pretty much the only time you can use the iterator version as-is is if you're immediately writing it to some sort of Writer, as in your sample Show impl.

@lilyball lilyball deleted the from_utf8_lossy branch February 7, 2014 18:04
bors added a commit to rust-lang-ci/rust that referenced this pull request Nov 30, 2021
Eliminate bunch of copies of error codepath from Utf8LossyChunksIter

Using a macro to stamp out 7 identical copies of the nontrivial slicing logic to exit this loop didn't seem like a necessary use of a macro. The early return case can be handled by `break` without practically any changes to the logic inside the loop.

All this code is from early 2014 (rust-lang#12062, nearly 8 years ago; pre-1.0), so it's possible there were compiler limitations that forced the macro way at the time.

Confirmed that `x.py bench library/alloc --stage 0 --test-args from_utf8_lossy` is unaffected on my machine.
flip1995 pushed a commit to flip1995/rust that referenced this pull request Jan 11, 2024
…onditional_recursion, r=xFrednet

Fix false positive `unconditional_recursion`

Fixes rust-lang#12052.

Only checking whether both variables are `local` was not enough; we also need to confirm they have the same type as `Self`.

changelog: Fix false positive for `unconditional_recursion` lint