Add new function str::from_utf8_lossy() #12062
Conversation
This is different from the behavior of the WHATWG Encoding standard, which converts the sequence up to the first bad byte into one replacement character and continues to process the following bytes at the initial state. While there is no particular reason to use this algorithm (the number of replacement characters is arbitrary anyway), there is also no reason to diverge from the WHATWG standard.
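To make that concrete, here is a hedged illustration using today's String::from_utf8_lossy (the PR predates Rust 1.0), which follows the same WHATWG-style maximal-subpart policy the PR eventually adopted: a valid multibyte prefix cut off by a non-continuation byte collapses into a single replacement character.

    fn main() {
        // 0xE6 0x83 is a valid prefix of a three-byte sequence; 0x41 ('A')
        // is not a continuation byte, so the whole prefix becomes a single
        // U+FFFD and decoding resumes at 'A'.
        let bytes = [0xE6, 0x83, 0x41];
        assert_eq!(String::from_utf8_lossy(&bytes), "\u{FFFD}A");
    }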
@lifthrasiir is it worth just pulling your encoding library into the main tree now that we have nice crates? (Assuming it handles cases like this.)
@huonw Except that the library is not quite mature enough, I do not object. Maybe we can just put …
I think having a full-featured lib that gets kept up to date and (hopefully) incrementally improved would be better than patching over holes with functions like this. (And it also allows the new URL module to be pulled in, since, iirc, it depends on rust-encoding?) Others may disagree...
@lifthrasiir I was trying to find that link and couldn't find it. Thanks. @huonw Having encodings in the main tree would be nice, although I am slightly worried about the size of the encoding tables (I'm not familiar with @lifthrasiir's library in particular but I assume it uses encoding tables like everything else). I do think that supporting UTF-8 like this can be done without having a full encoding library.
@lifthrasiir After reading that algorithm, that's pretty similar to what I'm doing. The two big differences: …

I can modify this implementation to conform to that algorithm. However, I did intentionally decide to support overlong encodings. Is that a decision I should reverse?
Implementing the WHATWG algorithm is pretty straightforward, but unfortunately a straightforward implementation is almost 30% slower on multibyte characters than what I already have 😠 But if we think that behaving the same as the WHATWG algorithm is desired, this can surely be optimized.
RFC 3629 says: … and …
@Jurily Yes, but I'm not decoding into codepoints, I'm decoding back into another UTF-8 stream. I originally decided to support overlong encodings because that way someone could decide that, for whatever reason, they need to be able to use the … However, I suspect that this is an edge case that it's probably better not to support. So I'm comfortable saying that we definitely should not support overlong encodings at this point.
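As a hedged aside, this is what rejecting overlong forms looks like in practice, again expressed with today's String::from_utf8_lossy rather than the PR's code: C0 80 is the overlong two-byte encoding of U+0000, and 0xC0 is never a valid lead byte, so each byte becomes its own replacement character.

    fn main() {
        // C0 80 is the overlong encoding of U+0000; a strict decoder rejects
        // 0xC0 outright and then sees 0x80 as a stray continuation byte.
        let bytes = [0xC0, 0x80];
        assert_eq!(String::from_utf8_lossy(&bytes), "\u{FFFD}\u{FFFD}");
    }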
Rewriting the algorithm to be WHATWG-equivalent without actually using the WHATWG algorithm ends up being up to 25% faster in the multibyte case than my original proposed algorithm (which makes it significantly better than the WHATWG version).
I've pushed the new version. It now matches the WHATWG algorithm's output, and no longer allows for overlong encodings.
@huonw To expand on my previous comment, I think that providing a full encodings library in the core distribution is a good idea, but it should remain in its own crate. On the other hand, UTF-8 decoding is so important to Rust (because its native str type is UTF-8) that it belongs in the standard library itself.
This PR will close #9516.
    static TAG_CONT_U8: u8 = 128u8;

    /// Converts a vector of bytes to a new utf-8 string.
    /// Any invalid utf-8 sequences are replaced with U+FFFD REPLACEMENT CHARACTER.
Can we have an example or two?
@huonw Example added.
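As background on the TAG_CONT_U8 constant in the diff context above, here is a hypothetical helper (not from the PR, written in modern Rust) showing how such a constant is typically used: UTF-8 continuation bytes all have the bit pattern 10xxxxxx.

    const TAG_CONT_U8: u8 = 0b1000_0000; // 128, the 10xxxxxx tag

    // Hypothetical helper: test whether a byte is a UTF-8 continuation byte
    // by masking off everything but the top two bits.
    fn is_continuation(b: u8) -> bool {
        b & 0b1100_0000 == TAG_CONT_U8
    }

    fn main() {
        assert!(is_continuation(0x83));  // a trailing byte of a multibyte char
        assert!(!is_continuation(0x41)); // ASCII 'A'
        assert!(!is_continuation(0xE6)); // a lead byte, not a continuation
    }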
    let mut res = with_capacity(total);

    loop {
        if i >= total {
while i <= total?
@huonw There was a reason for this, which was first that I had a label on the original loop, and then the new version also did some work if i >= total beyond breaking, but you're right, in the current incarnation it can go back to the original while loop.
Also, it's probably faster to use

    let mut ptr = v.as_ptr();
    let end = ptr.offset(v.len() as int);
    while ptr < end {
        // ...
        let byte = *ptr;
        ptr = ptr.offset(1);
        // ...
    }

although safe_get would require some modification.
The current implementation is patterned off of how is_utf8() works, which was modified to its current incarnation after performance testing. I am curious as to what the performance impact of using your suggestion is.
I guess the bounds-checking part of safe_get would be replaced by if !have_enough_space(p, end, w) { /* use REPLACEMENT */ } called once, before the match, so the other indexes can be unchecked. Where have_enough_space would be w <= end as uint - p as uint, I think.
It explicitly doesn't work that way, because if w is 4 but I only have 3 bytes left, those bytes could be something like F0 49 50, which should evaluate to "\uFFFDIP". In other words, even if there's not enough room, I still have to validate each byte.
Ah, I see... maybe have a slow path if there isn't enough space.
Perhaps. I'll have to do some testing.
Using pointer arithmetic did not help.

Before (as submitted):

    test str::bench::from_utf8_lossy_100_ascii     ... bench: 261 ns/iter (+/- 32)
    test str::bench::from_utf8_lossy_100_multibyte ... bench: 440 ns/iter (+/- 48)
    test str::bench::from_utf8_lossy_invalid       ... bench: 89 ns/iter (+/- 5)

After:

    test str::bench::from_utf8_lossy_100_ascii     ... bench: 272 ns/iter (+/- 32)
    test str::bench::from_utf8_lossy_100_multibyte ... bench: 447 ns/iter (+/- 48)
    test str::bench::from_utf8_lossy_invalid       ... bench: 95 ns/iter (+/- 9)
Hm, that surprises me slightly, but the numbers don't lie.
I changed … Given that, I'm leaving the algorithm alone. It's good enough.
let xs = bytes!("ศไทย中华Việt Nam"); | ||
assert_eq!(from_utf8_lossy(xs), ~"ศไทย中华Việt Nam"); | ||
|
||
let xs = bytes!("Hello", 0xC0, 0x80, " There", 0xE6, 0x83, " Goodbye"); |
There doesn't seem to be a test for a two-byte character with only the first byte? Also, how about one for completely invalid bytes (e.g. 0xFF)?
Good point about a two-byte char test, but I already test 0xF5, which is an invalid byte.
Possible extension (which I don't think should block an r+): return … A possible (internal) optimisation generalising this: record the last known good character and only copy when necessary (i.e. when you see an invalid byte, copy up until that point, write the replacement character, and record the current position).
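A minimal sketch of that "last known good character" strategy in modern Rust (the PR itself predates Rust 1.0; Cow plays the role the later comments give to MaybeOwned, and from_utf8_lossy_sketch is a hypothetical name, not the PR's implementation):

    use std::borrow::Cow;

    // Walk the input, copying runs of valid bytes only when an invalid
    // sequence forces a replacement character to be inserted.
    fn from_utf8_lossy_sketch(v: &[u8]) -> Cow<'_, str> {
        if let Ok(s) = std::str::from_utf8(v) {
            return Cow::Borrowed(s); // wholly valid: no allocation at all
        }
        let mut res = String::with_capacity(v.len());
        let mut rest = v;
        loop {
            match std::str::from_utf8(rest) {
                Ok(s) => {
                    res.push_str(s);
                    return Cow::Owned(res);
                }
                Err(e) => {
                    let (valid, after) = rest.split_at(e.valid_up_to());
                    // `valid` was just checked, so the unchecked cast is sound
                    res.push_str(unsafe { std::str::from_utf8_unchecked(valid) });
                    res.push('\u{FFFD}');
                    // skip the invalid sequence (or the truncated tail)
                    let skip = e.error_len().unwrap_or(after.len());
                    rest = &after[skip..];
                }
            }
        }
    }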
(r=me with the few extra tests.) |
Thinking about it (warning: stream of thought): this really wants a rope-style data structure that stores segments of … (I guess we could actually have a "long" …) The extra complications will probably offset the zero-allocations in terms of performance.
@huonw I also considered doing what you suggested of noting the last known good character. It's probably worth testing, but given the results of my previous performance experiments, I wouldn't be surprised if this turns out to be no good as well. Still, it's probably worth a go. A rope-like data structure would be a nice approach for this, but of course that would be a separate implementation entirely, as …
High-level sketch:

    struct LossyUTF8Iter<'a> { bytes: &'a [u8] }

    fn lossy_utf8<'a>(v: &'a [u8]) -> LossyUTF8Iter<'a> { /* obvious */ }

    impl<'a> Iterator<&'a str> for LossyUTF8Iter<'a> {
        fn next(&mut self) -> Option<&'a str> {
            static REP: &'static str = "\uFFFD\uFFFD\uFFFD\uFFFD"; // some reasonable length
            let good = /* validate first character */;
            if good {
                /* find first invalid byte... */
                return Some(self.bytes.slice_to(index));
            } else {
                /* find first valid character (or REP.char_len() invalid
                   characters, whichever comes first)... */
                return Some(REP.slice_to(3 * replacement_count));
            }
        }
    }

    // helper, not necessary (and won't work yet)
    fn slice_if_possible<'a, It: Iterator<&'a str>>(it: It) -> MaybeOwned<'a> {
        // if it has 1 element, return that; otherwise concatenate them all in a new alloc.
    }

This is probably fast for long runs of valid or invalid bytes, but maybe not for many isolated invalid bytes. (I suppose it could also just immediately return …)
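An aside grounding this sketch: modern std ended up exposing essentially this iterator as utf8_chunks() on byte slices (stabilized around Rust 1.79), where each chunk carries a run of valid str plus the invalid bytes that followed it. A small demonstration in today's Rust:

    fn main() {
        let bytes = b"Hello\xF0\x90\x80World";
        let mut out = String::new();
        for chunk in bytes.utf8_chunks() {
            out.push_str(chunk.valid());
            if !chunk.invalid().is_empty() {
                out.push('\u{FFFD}'); // one replacement per invalid sequence
            }
        }
        assert_eq!(out, "Hello\u{FFFD}World");
    }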
With the current algorithm: …

If I implement the "last good char" suggestion: …

The full invalid case is about 16% slower, but the valid cases are faster (in the case of the multibyte test, significantly faster), so I think it's a win.
Just pushed the "last good character" version, along with a new test for 2-byte characters.
@huonw I like the idea of the iterator, but it just doesn't seem terribly practical. I typically need to work with a full string, not an iterator of adjacent substrings. It's possibly worth exploring separately, but it would be a parallel implementation rather than a replacement for from_utf8_lossy().
With some sort of …
`from_utf8_lossy()` takes a byte vector and produces a `~str`, converting any invalid UTF-8 sequence into the U+FFFD REPLACEMENT CHARACTER. The replacement follows the guidelines in §5.22 Best Practice for U+FFFD Substitution from the Unicode Standard (Version 6.2)[1], which also matches the WHATWG rules for UTF-8 decoding[2].

[1]: http://www.unicode.org/versions/Unicode6.2.0/ch05.pdf
[2]: http://encoding.spec.whatwg.org/#utf-8

Closes #9516.
I wrote an iterator version and it's much faster when not allocating, and anywhere from slightly faster to a lot slower when allocating.
Also, I have a feeling that …

    if /* no padding */ {
        for s in str::from_utf8_lossy(self.as_bytes()) {
            f.buf.write_str(s)
        }
    } else {
        let mut collected = str::with_capacity(...);
        // could micro-optimise the case when self is valid utf8 to
        // (equivalent of) just f.pad(self)
        for s in str::from_utf8_lossy(self.as_bytes()) { collected.push_str(s) }
        f.pad(collected);
    }

and, if necessary, …
@huonw The iterator version is interesting, I'm just concerned that nearly all callers are going to need to collect it, at which point it's no better than the str version. Pretty much the only time you can use the iterator version as-is is if you're immediately writing it to some sort of …
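A hedged sketch of that "immediately writing it" case in modern Rust, streaming lossy chunks into any io::Write without building an intermediate String (write_lossy is a hypothetical name; utf8_chunks is the std API mentioned above):

    use std::io::{self, Write};

    // Stream lossily-decoded text straight to a writer, no String needed.
    fn write_lossy<W: Write>(mut w: W, bytes: &[u8]) -> io::Result<()> {
        for chunk in bytes.utf8_chunks() {
            w.write_all(chunk.valid().as_bytes())?;
            if !chunk.invalid().is_empty() {
                w.write_all("\u{FFFD}".as_bytes())?;
            }
        }
        Ok(())
    }

    fn main() -> io::Result<()> {
        write_lossy(io::stdout().lock(), b"bad \xFF byte")
    }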
Eliminate bunch of copies of error codepath from Utf8LossyChunksIter Using a macro to stamp out 7 identical copies of the nontrivial slicing logic to exit this loop didn't seem like a necessary use of a macro. The early return case can be handled by `break` without practically any changes to the logic inside the loop. All this code is from early 2014 (rust-lang#12062—nearly 8 years ago; pre-1.0) so it's possible there were compiler limitations that forced the macro way at the time. Confirmed that `x.py bench library/alloc --stage 0 --test-args from_utf8_lossy` is unaffected on my machine.