Why is TinyStr ASCII-only? #16

Closed
sffc opened this issue Jun 3, 2020 · 5 comments

Comments

@sffc
Collaborator

sffc commented Jun 3, 2020

It would be useful (unicode-org/icu4x#61 (comment)) to be able to use TinyStr for UTF-8 data, not only ASCII data. What was the design decision to make it support ASCII-only? How hard would it be to extend it to support UTF-8?

@zbraniecki
Owner

@raphlinus designed it this way in projectfluent/fluent-langneg-rs#8, and I adopted it as-is because the performance and memory wins were outstanding and it fits all the needs of Locale.

I'm wondering whether TinyStr should be refactored to handle UTF-8, with the current types renamed to TinyASCIIStr or TinyASCIIString (since each really is an owned value).

@zbraniecki
Owner

My concern is that all the methods on TinyStr are very ASCII-specific, and their performance is great because they are simple bitmasks. If we had a TinyUTF8String, it wouldn't have those methods. I'm not sure whether that's important.
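
To illustrate what "simple bitmasks" can look like, here is a minimal sketch of lowercasing four ASCII bytes packed into a `u32` in one pass. The names `pack_ascii4` and `to_ascii_lowercase4` are hypothetical and written for this example; this is not presented as TinyStr's actual code, only as one way the technique can work when every byte is known to be ASCII.

```rust
/// Packs 1 to 4 ASCII bytes into a u32 (little-endian, zero-padded).
/// Hypothetical helper: returns None for empty, too-long, or non-ASCII input.
fn pack_ascii4(s: &str) -> Option<u32> {
    let bytes = s.as_bytes();
    if bytes.is_empty() || bytes.len() > 4 || !s.is_ascii() {
        return None;
    }
    let mut buf = [0u8; 4];
    buf[..bytes.len()].copy_from_slice(bytes);
    Some(u32::from_le_bytes(buf))
}

/// Lowercases all four byte lanes at once, with no per-byte branch.
/// Assumes every byte of `word` is ASCII (high bit clear), so the
/// additions below never carry from one lane into the next.
fn to_ascii_lowercase4(word: u32) -> u32 {
    // Adding 0x3f sets a lane's high bit iff the byte is >= b'A' (0x41).
    let ge_a = word.wrapping_add(0x3f3f_3f3f);
    // Adding 0x25 sets a lane's high bit iff the byte is > b'Z' (0x5a).
    let gt_z = word.wrapping_add(0x2525_2525);
    // High bit set only in lanes holding b'A'..=b'Z'.
    let is_upper = ge_a & !gt_z & 0x8080_8080;
    // 0x80 >> 2 == 0x20, the ASCII case bit; set it in uppercase lanes.
    word | (is_upper >> 2)
}
```

For example, `to_ascii_lowercase4(pack_ascii4("EN-U").unwrap())` equals `pack_ascii4("en-u").unwrap()`. The carry-free additions rely on each byte being below 0x80, which is exactly the guarantee a UTF-8 variant would give up.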

@sffc
Collaborator Author

sffc commented Jun 3, 2020

The main use case I see would be to store grapheme clusters, which tend to be short but could be quite long. So, a TinyStrUTF8 could be like TinyStrAuto, where it spills onto the heap if it's too long. I don't know if it's necessary to have the stack-only version.
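
A rough sketch of the spill-to-heap idea, assuming a hypothetical `SmallUtf8` type and an arbitrary 16-byte inline capacity (neither comes from TinyStr): short UTF-8 strings are stored inline, and longer ones fall back to a heap-allocated `String`.

```rust
/// Inline capacity in bytes; 16 is an arbitrary choice for this sketch.
const INLINE_CAP: usize = 16;

/// Hypothetical UTF-8 small string that spills to the heap when too long.
enum SmallUtf8 {
    Inline { len: u8, buf: [u8; INLINE_CAP] },
    Heap(String),
}

impl SmallUtf8 {
    fn new(s: &str) -> Self {
        if s.len() <= INLINE_CAP {
            let mut buf = [0u8; INLINE_CAP];
            buf[..s.len()].copy_from_slice(s.as_bytes());
            SmallUtf8::Inline { len: s.len() as u8, buf }
        } else {
            SmallUtf8::Heap(s.to_owned())
        }
    }

    fn as_str(&self) -> &str {
        match self {
            // The bytes were copied whole from a valid &str, so they are valid UTF-8.
            SmallUtf8::Inline { len, buf } => std::str::from_utf8(&buf[..*len as usize])
                .expect("inline bytes are valid UTF-8"),
            SmallUtf8::Heap(s) => s.as_str(),
        }
    }

    fn is_inline(&self) -> bool {
        matches!(self, SmallUtf8::Inline { .. })
    }
}
```

With these assumptions, a cluster such as `"e\u{301}"` (3 bytes) stays inline, while a 25-byte emoji ZWJ family sequence goes to the heap. Every access has to branch on the variant, which is a cost the fixed-size ASCII types do not pay.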

@zbraniecki
Owner

Hmm, maybe it warrants a rewrite of TinyStr or a new lib that TinyStr will get folded into?

Or maybe one of the other small-string crates would be a good match: SmolStr, ArrayString, smallstr, or istring?
Performance numbers are here: https://github.com/zbraniecki/tinystr/wiki/Performance

@raphlinus
Collaborator

The good performance of TinyStr is precisely because the length is bounded, so there are no branches. I think you're looking for one of the other small string variants as Zibi mentioned above.

It would in theory be possible to extend it to bounded-length UTF-8, but some of the logic (case conversions, which are important for locale tags) depends pretty strongly on the ASCII-ness (bit 7, the high bit of each byte, is clear).
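
As a small sketch of both points, assume a hypothetical `pack8` helper that stores up to 8 bytes of any UTF-8 string in a `u64`. With a bounded length, equality is a single integer comparison, and ASCII-ness is a single mask test on the high bit of each byte. Neither function is taken from TinyStr.

```rust
/// Packs up to 8 bytes of any UTF-8 string into a u64 (little-endian, zero-padded).
/// Hypothetical helper: returns None if the string is longer than 8 bytes.
fn pack8(s: &str) -> Option<u64> {
    let bytes = s.as_bytes();
    if bytes.len() > 8 {
        return None;
    }
    let mut buf = [0u8; 8];
    buf[..bytes.len()].copy_from_slice(bytes);
    Some(u64::from_le_bytes(buf))
}

/// True iff every byte has its high bit (0x80) clear, i.e. the word is pure ASCII.
fn is_ascii_word(word: u64) -> bool {
    (word & 0x8080_8080_8080_8080) == 0
}
```

Here `is_ascii_word(pack8("zh-Hant").unwrap())` is true, while `is_ascii_word(pack8("é").unwrap())` is false, because both bytes of the UTF-8 encoding of `é` (0xC3 0xA9) have the high bit set. Once that bit can be set, mask-based case conversion can no longer assume it is free to use as a per-byte flag.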

@sffc sffc closed this as completed Jul 27, 2020

3 participants