Native UTF-16 support #108

jfkthame · 2023-10-13T15:13:06Z

To make it easier & more efficient to use unicode-bidi in an environment (such as Gecko) where text is handled as UTF-16, I would like to extend the API here to provide a UTF-16 interface, and do the processing directly on UTF-16 code units as an alternative to UTF-8 code units (bytes).

This would not change the existing API in any way, or affect existing users.

Proposal:

Introduce versions of the BidiInfo and InitialInfo structs where the text field is &[u16] instead of &str. I'm suggesting these could be named BidiInfoU16 and InitialInfoU16. Except for the type of their text, these will be identical to the existing UTF-8-based versions.

We'll also need ParagraphU16, because its info will be a &BidiInfoU16.

To allow the actual implementation of the bidi algorithm to be shared between the 8- and 16-bit versions of these structs, I propose a TextSource trait that abstracts access to and iteration over the text, with implementations for str and for [u16]. Only minor adaptation of the InitialInfo, BidiInfo, and Paragraph methods is needed to work with this.
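For illustration, here is a rough sketch of what such a trait could look like. The method names (`len`, `char_indices`), the boxed iterator, and the `count_chars` helper are assumptions made for this example only, not necessarily the shape of the prototype. The [u16] implementation here happens to map unpaired surrogates to the replacement character, which is just one of the possible policies.

```rust
// Hypothetical sketch: trait and method names are illustrative assumptions,
// not the crate's actual API.
trait TextSource {
    /// Length of the text in code units (bytes for str, u16 units for [u16]).
    fn len(&self) -> usize;
    /// Iterate over (code unit offset, char) pairs.
    fn char_indices(&self) -> Box<dyn Iterator<Item = (usize, char)> + '_>;
}

impl TextSource for str {
    fn len(&self) -> usize {
        str::len(self)
    }
    fn char_indices(&self) -> Box<dyn Iterator<Item = (usize, char)> + '_> {
        Box::new(str::char_indices(self))
    }
}

impl TextSource for [u16] {
    fn len(&self) -> usize {
        <[u16]>::len(self)
    }
    fn char_indices(&self) -> Box<dyn Iterator<Item = (usize, char)> + '_> {
        let mut offset = 0usize;
        Box::new(char::decode_utf16(self.iter().copied()).map(move |r| {
            // An unpaired surrogate occupies one code unit, the same as
            // U+FFFD in UTF-16, so the offsets stay consistent.
            let c = r.unwrap_or(char::REPLACEMENT_CHARACTER);
            let start = offset;
            offset += c.len_utf16();
            (start, c)
        }))
    }
}

/// Example of code shared between both encodings (hypothetical helper).
fn count_chars<T: TextSource + ?Sized>(text: &T) -> usize {
    text.char_indices().count()
}
```

Code written against the trait, like `count_chars` here, works unchanged for both `&str` and `&[u16]`; only the offsets it sees differ (bytes versus 16-bit code units).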

@Manishearth Does this sound like a reasonable way forward? I have a prototype implementation working locally, which I can put up as a PR for review if you think the overall idea is acceptable.

One factor to consider is that while we know, when using the str-based API, that the text must be well-formed Unicode, this will not be the case for a [u16]-based API; there could be unpaired surrogate code units present. There are a few ways we could handle this:

(a) Require the text to be valid UTF-16; panic!() if unpaired surrogates are encountered
(b) Have the 16-bit methods return a Result everywhere, so that invalid text can return an error
(c) Treat any unpaired surrogate as REPLACEMENT_CHARACTER for all bidi processing

I'm currently leaning toward (c), but happy to listen to arguments for other options.
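To make the difference between (b) and (c) concrete, here is a minimal sketch using the standard library's `char::decode_utf16`. The function names `decode_strict` and `decode_lossy` are made up for this example and are not part of any proposed API.

```rust
use std::char::DecodeUtf16Error;

/// Option (b): surface an error on the first unpaired surrogate.
/// (Hypothetical helper, for illustration only.)
fn decode_strict(text: &[u16]) -> Result<Vec<char>, DecodeUtf16Error> {
    char::decode_utf16(text.iter().copied()).collect()
}

/// Option (c): treat each unpaired surrogate as U+FFFD.
/// (Hypothetical helper, for illustration only.)
fn decode_lossy(text: &[u16]) -> Vec<char> {
    char::decode_utf16(text.iter().copied())
        .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect()
}
```

With (c), callers never have to handle an error, and since an unpaired surrogate and U+FFFD both occupy a single 16-bit code unit, the replacement does not shift any offsets into the original text.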


Manishearth · 2023-10-16T15:45:00Z

Yeah, I'm fine with this, though I may not have time to review it soon.

In general I would like this crate to be encoding-agnostic (and also be able to support e.g. ill-formed UTF-8).

A thing I would like to see solved here is #86: whatever we do to implement this should abstract over indexing well enough that we no longer need to care about it.

jfkthame · 2023-10-16T17:23:08Z

I think we could easily adapt this to handle ill-formed UTF-8. We'd need to create an alternative API using [u8] instead of str; then we provide a suitable implementation of TextSource for [u8], and it should "just work".

We could then make the existing str API into a trivial shim on top of the [u8] API, provided the additional validity-checking is cheap enough to ignore.
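As a rough sketch of that idea: decoding possibly ill-formed UTF-8 from a byte slice can be built on `std::str::from_utf8` and the error information it returns, with the str version delegating to it. The function names below are hypothetical, and substituting U+FFFD for each invalid sequence is an assumed policy for this example.

```rust
/// Decode possibly ill-formed UTF-8 into (byte offset, char) pairs,
/// substituting U+FFFD for each invalid sequence.
/// (Hypothetical helper, for illustration only.)
fn char_indices_lossy(text: &[u8]) -> Vec<(usize, char)> {
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < text.len() {
        match std::str::from_utf8(&text[pos..]) {
            Ok(valid) => {
                out.extend(valid.char_indices().map(|(i, c)| (pos + i, c)));
                break;
            }
            Err(e) => {
                // Everything before valid_up_to() is known to be well-formed.
                let good = e.valid_up_to();
                let valid = std::str::from_utf8(&text[pos..pos + good]).unwrap();
                out.extend(valid.char_indices().map(|(i, c)| (pos + i, c)));
                pos += good;
                out.push((pos, char::REPLACEMENT_CHARACTER));
                // error_len() is None when the input ends mid-sequence;
                // in that case the rest of the input is consumed.
                pos += e.error_len().unwrap_or(text.len() - pos);
            }
        }
    }
    out
}

/// The str version as a shim over the byte-based one.
fn char_indices_str(text: &str) -> Vec<(usize, char)> {
    char_indices_lossy(text.as_bytes())
}
```

Unlike the UTF-16 case, an invalid byte sequence and U+FFFD (three bytes in UTF-8) generally differ in length, so offsets have to be taken from the original bytes rather than from the decoded characters.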

jfkthame mentioned this issue Oct 16, 2023

UTF-16 support #109

Merged

Manishearth closed this as completed in #109 Oct 30, 2023
