Native UTF-16 support #108

Closed
jfkthame opened this issue Oct 13, 2023 · 2 comments · Fixed by #109

Comments

@jfkthame
Contributor

To make it easier & more efficient to use unicode-bidi in an environment (such as Gecko) where text is handled as UTF-16, I would like to extend the API here to provide a UTF-16 interface, and do the processing directly on UTF-16 code units as an alternative to UTF-8 code units (bytes).

This would not change the existing API in any way, or affect existing users.

Proposal:

Introduce versions of the BidiInfo and InitialInfo structs where the text field is &[u16] instead of &str. I'm suggesting these could be named BidiInfoU16 and InitialInfoU16. Except for the type of their text, these will be identical to the existing UTF-8-based versions.

We'll also need ParagraphU16, because its info will be a &BidiInfoU16.

To allow the actual implementation of the bidi algorithm to be shared between the 8- and 16-bit versions of these structs, I propose a TextSource trait that abstracts access to and iteration over the text, with implementations for str and for [u16]. Only minor adaptation of the InitialInfo, BidiInfo, and Paragraph methods is needed to work with this.
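As a rough illustration of the shape such a trait could take, here is a minimal sketch. The issue names `TextSource` but does not spell out its methods, so `len_units`, `indexed_chars`, and the helper `char_offsets` are hypothetical names chosen for this example, and the boxed iterator is a simplification (a real implementation would more likely use associated iterator types).

```rust
use std::char::{decode_utf16, REPLACEMENT_CHARACTER};

/// Hypothetical abstraction over the text encoding. Offsets are measured
/// in code units of the underlying storage (bytes for `str`, `u16` units
/// for `[u16]`).
pub trait TextSource {
    /// Length of the text in code units.
    fn len_units(&self) -> usize;
    /// Iterate over (code-unit offset, char) pairs.
    fn indexed_chars(&self) -> Box<dyn Iterator<Item = (usize, char)> + '_>;
}

impl TextSource for str {
    fn len_units(&self) -> usize {
        self.len()
    }
    fn indexed_chars(&self) -> Box<dyn Iterator<Item = (usize, char)> + '_> {
        Box::new(self.char_indices())
    }
}

impl TextSource for [u16] {
    fn len_units(&self) -> usize {
        self.len()
    }
    fn indexed_chars(&self) -> Box<dyn Iterator<Item = (usize, char)> + '_> {
        let mut offset = 0;
        Box::new(decode_utf16(self.iter().copied()).map(move |r| {
            // Assumption for this sketch: an unpaired surrogate is mapped
            // to U+FFFD, which occupies one UTF-16 unit just like the
            // surrogate it replaces, so offsets stay correct.
            let c = r.unwrap_or(REPLACEMENT_CHARACTER);
            let start = offset;
            offset += c.len_utf16();
            (start, c)
        }))
    }
}

/// Encoding-agnostic helper: shared code only sees the trait.
pub fn char_offsets<T: TextSource + ?Sized>(text: &T) -> Vec<(usize, char)> {
    text.indexed_chars().collect()
}
```

With this shape, the same generic code yields byte offsets for a `&str` and `u16` offsets for a `&[u16]`, which is the property the shared algorithm would rely on.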

@Manishearth Does this sound like a reasonable way forward? I have a prototype implementation working locally, which I can put up as a PR for review if you think the overall idea is acceptable.

One factor to consider is that while we know, when using the str-based API, that the text must be well-formed Unicode, this will not be the case for a [u16]-based API; there could be unpaired surrogate code units present. There are a few ways we could handle this:

(a) Require the text to be valid UTF-16; panic!() if unpaired surrogates are encountered
(b) Have the 16-bit methods return a Result everywhere, so that invalid text produces an error
(c) Treat any unpaired surrogate as REPLACEMENT_CHARACTER for all bidi processing

I'm currently leaning toward (c), but happy to listen to arguments for other options.
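To make the difference between (b) and (c) concrete, here is a small sketch using only the standard library's `std::char::decode_utf16`. The function names `decode_lossy` and `decode_strict` are hypothetical and exist only for this example; they are not part of the proposed API.

```rust
use std::char::{decode_utf16, DecodeUtf16Error, REPLACEMENT_CHARACTER};

/// Option (c): every unpaired surrogate becomes U+FFFD, so decoding
/// never fails and callers need no error handling.
pub fn decode_lossy(units: &[u16]) -> Vec<char> {
    decode_utf16(units.iter().copied())
        .map(|r| r.unwrap_or(REPLACEMENT_CHARACTER))
        .collect()
}

/// Option (b): the first unpaired surrogate aborts decoding and is
/// reported to the caller as an error.
pub fn decode_strict(units: &[u16]) -> Result<Vec<char>, DecodeUtf16Error> {
    decode_utf16(units.iter().copied()).collect()
}
```

Option (a) would correspond to calling `unwrap()` on the strict result. One point in favour of (c) is that the replacement character occupies a single UTF-16 code unit, the same as the lone surrogate it stands in for, so code-unit offsets into the original text remain valid.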

@Manishearth
Member

Yeah, I'm fine with this, though I may not have time to review it soon.

In general I would like this crate to be encoding agnostic (and also be able to support e.g. ill-formed UTF-8).

A thing I would like to see solved here is #86: whatever we do to implement this should abstract over indexing well enough that we no longer need to care about it.

@jfkthame
Contributor Author

I think we could easily adapt this to handle ill-formed UTF-8. We'd need to create an alternative API using [u8] instead of str; then we provide a suitable implementation of TextSource for [u8], and it should "just work".

We could then make the existing str API into a trivial shim on top of the [u8] API, provided the additional validity-checking is cheap enough to ignore.
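For illustration, a `[u8]`-based source could decode possibly ill-formed UTF-8 along the following lines. This is only a sketch under stated assumptions: `indexed_chars_lossy` is a hypothetical name, each invalid sequence is replaced by a single U+FFFD reported at the byte offset where the sequence starts, and the loop re-validates the remaining input after each error, which a production version would want to avoid.

```rust
use std::char::REPLACEMENT_CHARACTER;
use std::str::from_utf8;

/// Decode possibly ill-formed UTF-8 into (byte offset, char) pairs.
pub fn indexed_chars_lossy(bytes: &[u8]) -> Vec<(usize, char)> {
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < bytes.len() {
        match from_utf8(&bytes[pos..]) {
            Ok(valid) => {
                // The rest of the input is well-formed: emit it and stop.
                out.extend(valid.char_indices().map(|(i, c)| (pos + i, c)));
                break;
            }
            Err(e) => {
                let good = e.valid_up_to();
                // The prefix up to `good` is known to be valid, so this
                // second validation cannot fail.
                let valid = from_utf8(&bytes[pos..pos + good]).unwrap();
                out.extend(valid.char_indices().map(|(i, c)| (pos + i, c)));
                out.push((pos + good, REPLACEMENT_CHARACTER));
                // `error_len()` is None when the input ends in the middle
                // of a sequence; in that case skip everything that is left.
                let bad = e.error_len().unwrap_or(bytes.len() - pos - good);
                pos += good + bad;
            }
        }
    }
    out
}
```

Under this sketch, the `str` shim would amount to calling `indexed_chars_lossy(s.as_bytes())`; for input that is already valid, the only extra cost is the single validation pass in the `Ok` branch.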
