Tracking issue for char encoding methods #27784

alexcrichton · 2015-08-13T01:02:37Z

This is a tracking issue for the unstable unicode feature and the char::encode_utf{8,16} methods.

The interfaces here are a little wonky but are done for performance. It's not clear whether these need to be exported or not or if there's a better method to do so through iterators.

The text was updated successfully, but these errors were encountered:

SimonSapin · 2015-08-13T15:19:53Z

How about returning enums like enum OneOrTwo { One(u16), Two(u16, u16) } or enum Utf16Encoding { SingleCodeUnit(u16), SurrogatePair(u16, u16) }?

alexcrichton · 2015-08-13T16:24:01Z

Certainly possible, but there's also the question of ergonomics here in terms of what to do with that after you've got the information.

SimonSapin · 2015-08-13T16:54:11Z

I think that one form or another of this functionality that doesn’t require allocation should be exposed. Returning an iterator is nicer than taking &mut [_], but I don’t know about performance.

nagisa · 2015-08-13T17:12:36Z

I’ve suggested taking &mut [u8; 4]/&mut [u16; 2] and returning usize once. The major downside is inability to convert from slice to array.

Taking anything else than slice also makes following use case not as elegant as it is now:

let mut buffer = Vec::with_capacity(alot);
let mut idx = 0;
loop {
     idx += some_char().encode_utf8(&mut buffer[idx..]).unwrap();
}

@aturon

* Rename `Utf16Items` to `Utf16Decoder`. "Items" is meaningless. * Generalize it to any `u16` iterator, not just `[u16].iter()` * Make it yield `Result` instead of a custom `Utf16Item` enum that was isomorphic to `Result`. This enable using the `FromIterator for Result` impl. * Replace `Utf16Item::to_char_lossy` with a `Utf16Decoder::lossy` iterator adaptor. This is a [breaking change], but only for users of the unstable `rustc_unicode` crate. I’d like this functionality to be stabilized and re-exported in `std` eventually, as the "low-level equivalent" of `String::from_utf16` and `String::from_utf16_lossy` like #27784 is the low-level equivalent of #27714. CC @aturon, @alexcrichton

SimonSapin · 2015-10-28T00:49:34Z

How about returning something that both is an iterator and dereferences to a slice?

struct Utf8Char {
    bytes: [u8; 4],
    position: usize,
}

impl Deref for Utf8Char {
    type Target = [u8];
    fn deref(&self) -> &[u8] { &self.bytes[self.position..] }
}

impl Iterator for Utf8Char {
    type Item = u8;
    fn next(&mut self) -> Option<u8> {
        if self.position < self.bytes.len() {
            let byte = self.bytes[self.position];
            self.position += 1;
            Some(byte)
        } else {
            None
        }
    }
}

(“Short” code points have zeros as padding at the start of the array.)

… and similarly for UTF-16, but with [u16; 2] instead of [u8; 4].

BurntSushi · 2015-11-08T15:39:13Z

@SimonSapin That looks really sweet to me!

BurntSushi · 2016-01-20T22:03:13Z

@SimonSapin In your deref method, I think that should be &self.bytes[..self.position], right?

BurntSushi · 2016-01-20T22:07:51Z

@SimonSapin What do you think about also exposing decode_{utf8,utf16} methods? Basically, if you have some bytes and want the next encoded char out of it, today I think you need to decode into a string and then call chars, which is a bit roundabout (and does extra work I believe).

SimonSapin · 2016-01-21T08:36:51Z

No, deref returns the slice that hasn’t been consumed by the iterator yet. For code points that have less than 4 bytes to begin with, padding is at the start of the array, not the end.

We already have char::decode_utf16 that takes and returns iterators.

For UTF-8 I do want to expose a decoder that’s more low-level than what we currently have, but I’m not sure what it should look like. I have some experiments at https://github.com/SimonSapin/rust-utf8

BurntSushi · 2016-01-21T11:47:53Z

padding is at the start of the array, not the end.

Ah! That was what I missed. Thanks for the clarification.

I have some experiments at https://github.com/SimonSapin/rust-utf8

Interesting. That is much more complex than I had thought it would be! (I hadn't considered returning additional info about incomplete sequences.)

SimonSapin · 2016-01-21T16:27:03Z

Most of the complexity comes from self-imposed constraints:

Support “chunked” decoding so you can start processing, say, an HTML document before it’s finished downloading from the network. The bytes for a single char can be split across chunks.
Make it possible to emit &str slices that borrow &[u8] input bytes whenever possible, to avoid copying too many bytes.

I don’t know how much of that should be in the standard library.

But when the standard library gets performance improvement like #30740 (and perhaps more in the future with SIMD or something?), ideally they’d be in a low-level algorithm that everything else builds on top of.

alexcrichton · 2016-03-11T17:59:35Z

🔔 This issue is now entering its cycle-long final comment period for stabilization 🔔

The API proposed by @SimonSapin seems reasonable, perhaps in the form of:

impl char {
    fn encode_utf8(&self) -> EncodeUtf8;
}

struct EncodeUtf8 {
    // ...
}

impl Iterator for EncodeUtf8 {
    type Item = u8;
    // ...
}

impl EncodeUtf8 {
    #[unstable(...)]
    pub fn as_slice(&self) -> &[u8] { /* ... */ }
}

Currently these have non-traditional APIs which take a buffer and report how much was filled in, but they're not necessarily ergonomic to use. Returning an iterator which *also* exposes an underlying slice shouldn't result in any performance loss as it's just a lazy version of the same implementation, and it's also much more ergonomic! cc rust-lang#27784

…turon std: Change `encode_utf{8,16}` to return iterators Currently these have non-traditional APIs which take a buffer and report how much was filled in, but they're not necessarily ergonomic to use. Returning an iterator which *also* exposes an underlying slice shouldn't result in any performance loss as it's just a lazy version of the same implementation, and it's also much more ergonomic! cc #27784

Stebalien · 2016-03-23T23:31:30Z

Why add an as_slice method instead of implementing Deref and Index? I was under the impression that &thing[..] should always be prefered to thing.as_slice().

daschl · 2016-10-10T11:25:51Z

@alexcrichton not sure if this is in scope, but I have the use case where I want to write a "char" into a u8 array at a given offset. Right now I've copied the main part from stdlib and do it like so:

#[inline]
fn write_char_into_array(&self, offset: usize, ch: &char, array: &mut [u8]) -> bool {
    if (ch.len_utf8() + offset) > array.len() {
        return false;
    }

    let code = *ch as u32;
    if code < MAX_ONE_B {
        array[offset] = code as u8;
    } else if code < MAX_TWO_B {
        array[offset] = (code >> 6 & 0x1F) as u8 | TAG_TWO_B;
        array[offset + 1] = (code & 0x3F) as u8 | TAG_CONT;
    } else if code < MAX_THREE_B {
        array[offset] = (code >> 12 & 0x0F) as u8 | TAG_THREE_B;
        array[offset + 1] = (code >> 6 & 0x3F) as u8 | TAG_CONT;
        array[offset + 2] = (code & 0x3F) as u8 | TAG_CONT;
    } else {
        array[offset] = (code >> 18 & 0x07) as u8 | TAG_FOUR_B;
        array[offset + 1] = (code >> 12 & 0x3F) as u8 | TAG_CONT;
        array[offset + 2] = (code >> 6 & 0x3F) as u8 | TAG_CONT;
        array[offset + 3] = (code & 0x3F) as u8 | TAG_CONT;
    }
    true
}

Again not sure if this is in scope, but I wanted to bring it up in case such a use case makes sense to integrate.

SimonSapin · 2016-10-10T12:53:27Z

@daschl you can deal with the offset by slicing:

fn write_char_into_array(&self, offset: usize, ch: &char, array: &mut [u8]) -> bool {
    let slice = &mut array[offset..];
    if ch.len_utf8() > slice.len() {
        return false
    }
    ch.encode_utf8(slice);
    true
}

bluss · 2016-10-10T13:04:57Z

@alexcrichton All of your suggestions are good. &'a mut str seems like the most powerful one, and it allows recovering all the other ones.

I don't know if this data point is interesting, but ArrayString ended up copying this code too. When I shaped it after how it's used it ends up looking like

pub fn encode_utf8(ch: char, buf: &mut [u8]) -> Result<usize, EncodeError>

impl: https://github.com/bluss/arrayvec/blob/d43c959fa8afa912c497104b60a64892e1178f9d/src/char_ext.rs#L28-L51

use: https://github.com/bluss/arrayvec/blob/d43c959fa8afa912c497104b60a64892e1178f9d/src/array_string.rs#L114-L119

daschl · 2016-10-10T13:05:29Z

@SimonSapin ah good to know, thanks! lets hope it gets stable soon then

bluss · 2016-10-10T13:10:28Z

I would vote for no panic. That also simplifies the bounds checking (the impl linked above has all bounds checks elided, since buf.len() is being tested manually).

alexcrichton · 2016-11-01T23:12:25Z

@rfcbot fcp merge

These methods have been around for awhile, I propose we merge!

alexcrichton · 2016-11-01T23:12:42Z

er, stabilize*

rfcbot · 2016-11-01T23:48:53Z

Team member @alexcrichton has proposed to merge this. The next step is review by the rest of the tagged teams:

Concerns:

~~panic-vs-not-panic~~ resolved by Tracking issue for char encoding methods #27784 (comment)

Once these reviewers reach consensus, this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

brson · 2016-11-04T00:32:28Z

It's not obvious to me that the current API is the one we want since there has been ongoing discussion. @bluss recently suggested it should not panic, and the current implementation does panic.

brson · 2016-11-04T00:32:48Z

@rfcbot concern panic-vs-not-panic

alexcrichton · 2016-11-04T02:04:39Z

I'm personally convinced by @SimonSapin's points above, which is that there is a statically known size (4 and 2) to encode all utf8 and utf16 characters. Callers basically always need to pass in that size buffer, so it seems more like a programmer error if you pass in a small buffer than a runtime error that should be handled.

SimonSapin · 2016-11-04T16:39:06Z

To reiterate: you can use 2 or 4 to keep things simple or to use a statically-sized array, but you can also use char::len_utf8 or char::len_utf16 to find the precise length that is needed.

bluss · 2016-11-04T18:33:48Z

I don't want to cause a gridlock, it was just what my perspective was in that moment. The fixed capacity use case is more uncommon, and the motivation for panic is following the usual conventions, so it certainly makes sense.

brson · 2016-11-10T23:16:28Z

OK, it sounds like there's no major opposition to the api as is.

brson · 2016-11-10T23:16:57Z

@rfcbot resolved panic-vs-not-panic

Kimundi · 2016-11-11T13:30:52Z

Apart from the panicking. I'm a bit confused right now about what the actual API/signature is going to be. The one that returns &mut str ?

alexcrichton · 2016-11-11T15:29:44Z

@Kimundi

Yeah encode_utf8 looks like:

fn encode_utf8(self, dst: &mut [u8]) -> &mut str

and encode_utf16 looks like:

fn encode_utf16(self, dst: &mut [u16]) -> &mut [u16]

Kimundi · 2016-11-11T15:46:07Z

Alright!

rfcbot · 2016-11-12T01:11:24Z

🔔 This is now entering its final comment period, as per the review above. 🔔

psst @alexcrichton, I wasn't able to add the final-comment-period label, please do so.

rfcbot · 2016-11-22T01:38:49Z

The final comment period is now complete.

Library stabilizations/deprecations for 1.15 release Stabilized: - `std::iter::Iterator::{min_by, max_by}` - `std::os::*::fs::FileExt` - `std::sync::atomic::Atomic*::{get_mut, into_inner}` - `std::vec::IntoIter::{as_slice, as_mut_slice}` - `std::sync::mpsc::Receiver::try_iter` - `std::os::unix::process::CommandExt::before_exec` - `std::rc::Rc::{strong_count, weak_count}` - `std::sync::Arc::{strong_count, weak_count}` - `std::char::{encode_utf8, encode_utf16}` - `std::cell::Ref::clone` - `std::io::Take::into_inner` Deprecated: - `std::rc::Rc::{would_unwrap, is_unique}` - `std::cell::RefCell::borrow_state` Closes #23755 Closes #27733 Closes #27746 Closes #27784 Closes #28356 Closes #31398 Closes #34931 Closes #35601 Closes #35603 Closes #35918 Closes #36105

rust-lang/rust#27784

alexcrichton added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. B-unstable Blocker: Implemented in the nightly compiler and unstable. labels Aug 13, 2015

This was referenced Aug 13, 2015

Tracking issue for str_utf16 stabilization #27714

Closed

Refactor low-level UTF-16 decoding and move Borrow(Mut) to libcore. #27808

Merged

Ms2ger mentioned this issue Aug 16, 2015

Tracking: Unstable Rust feature gates used by Servo servo/servo#5286

Closed

99 tasks

kkawakam mentioned this issue Jan 13, 2016

Compile against Rust Stable kkawakam/rustyline#17

Closed

5 tasks

alexcrichton added the I-nominated label Mar 9, 2016

alexcrichton added final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. and removed I-nominated labels Mar 11, 2016

alexcrichton mentioned this issue Mar 11, 2016

std: Change encode_utf{8,16} to return iterators #32204

Merged

alexcrichton added the final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. label Nov 12, 2016

aturon mentioned this issue Dec 14, 2016

Library stabilizations/deprecations for 1.15 release #38369

Merged

bors closed this as completed in #38369 Dec 18, 2016

gwenn added a commit to gwenn/rustyline that referenced this issue Dec 26, 2016

Unicode feature

0e23b95

rust-lang/rust#27784

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tracking issue for char encoding methods #27784

Tracking issue for char encoding methods #27784

alexcrichton commented Aug 13, 2015

SimonSapin commented Aug 13, 2015

alexcrichton commented Aug 13, 2015

SimonSapin commented Aug 13, 2015

nagisa commented Aug 13, 2015

SimonSapin commented Oct 28, 2015

BurntSushi commented Nov 8, 2015

BurntSushi commented Jan 20, 2016

BurntSushi commented Jan 20, 2016

SimonSapin commented Jan 21, 2016

BurntSushi commented Jan 21, 2016

SimonSapin commented Jan 21, 2016

alexcrichton commented Mar 11, 2016

Stebalien commented Mar 23, 2016

daschl commented Oct 10, 2016

SimonSapin commented Oct 10, 2016

bluss commented Oct 10, 2016

daschl commented Oct 10, 2016

bluss commented Oct 10, 2016

alexcrichton commented Nov 1, 2016

alexcrichton commented Nov 1, 2016

rfcbot commented Nov 1, 2016 •

edited

Loading

brson commented Nov 4, 2016

brson commented Nov 4, 2016

alexcrichton commented Nov 4, 2016

SimonSapin commented Nov 4, 2016

bluss commented Nov 4, 2016

brson commented Nov 10, 2016 •

edited

Loading

brson commented Nov 10, 2016

Kimundi commented Nov 11, 2016

alexcrichton commented Nov 11, 2016

Kimundi commented Nov 11, 2016

rfcbot commented Nov 12, 2016

rfcbot commented Nov 22, 2016

Tracking issue for char encoding methods #27784

Tracking issue for char encoding methods #27784

Comments

alexcrichton commented Aug 13, 2015

SimonSapin commented Aug 13, 2015

alexcrichton commented Aug 13, 2015

SimonSapin commented Aug 13, 2015

nagisa commented Aug 13, 2015

SimonSapin commented Oct 28, 2015

BurntSushi commented Nov 8, 2015

BurntSushi commented Jan 20, 2016

BurntSushi commented Jan 20, 2016

SimonSapin commented Jan 21, 2016

BurntSushi commented Jan 21, 2016

SimonSapin commented Jan 21, 2016

alexcrichton commented Mar 11, 2016

Stebalien commented Mar 23, 2016

daschl commented Oct 10, 2016

SimonSapin commented Oct 10, 2016

bluss commented Oct 10, 2016

daschl commented Oct 10, 2016

bluss commented Oct 10, 2016

alexcrichton commented Nov 1, 2016

alexcrichton commented Nov 1, 2016

rfcbot commented Nov 1, 2016 • edited Loading

brson commented Nov 4, 2016

brson commented Nov 4, 2016

alexcrichton commented Nov 4, 2016

SimonSapin commented Nov 4, 2016

bluss commented Nov 4, 2016

brson commented Nov 10, 2016 • edited Loading

brson commented Nov 10, 2016

Kimundi commented Nov 11, 2016

alexcrichton commented Nov 11, 2016

Kimundi commented Nov 11, 2016

rfcbot commented Nov 12, 2016

rfcbot commented Nov 22, 2016

rfcbot commented Nov 1, 2016 •

edited

Loading

brson commented Nov 10, 2016 •

edited

Loading