-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ideas for improving performance #7
Comments
Your tinystr sounds very similar to tendril. |
Similar goals, but a lot smaller (4 bytes vs 16) and optimized for some operations, for example clone and eq are <1ns in tinystr and (1, 5)ns each in tendril, after quick benchmarking. |
Tendril is refcounted but tinystring wouldn't I suppose? If so it makes sense to have a separate crate for it... I've seen some SmallString (SmolString?) somewhere akin to SmallVec, but I don't recall where... Maybe in rustc? May be worth looking into it as well |
Thank you for this proposal! I'd love to see the PR and am open to such change. I was thinking recently about further specifying this crate as operating on Unicode BCP 47 Locale Identifier rather than BCP 47 Language Tag, which is what I've been going for already, but didn't specify it. The other thing I wanted to do was to switch to Cow to let us use string slices out of the original string, when possible. So for example:
but
Would that work with your tinystr? Would it make sense? |
@zbraniecki Thanks, sounds like it makes sense to at least start working on this. The motivation for tinystr here rather than something else is that it doesn't need a fallback for longer strings, because the spec mercifully has length limits. This has several advantages over a Cow, or strings borrowed from a serialization (as in rust-language-tags):
Another fun feature of a tinystr is that I will be able to do capitalization using SIMD-like math. If it's known to be alpha only, it's
As you can see, I love doing this bit-level optimization :) |
We also have https://github.com/servo/string-cache where |
I have a patch that adds the types but doesn't use them apart from tests. Should I upload that, or combine with changes to the |
This commit adds a compact string representation, but doesn't wire them up. Part of the plan for projectfluent#7
hmm, what's your reason against releasing tinystr as a crate and using it in fluent-locale? |
I'm not opposed, it's just that I think its use case is extremely specialized to bcp-47. I asked on #servo and #rust and people weren't convinced it would be more generally useful. If we do find a use case, or if it's your preference, I'm more than happy to split it out. |
I'm going to close this issue since we now track it in |
I'm interested in a very high performance representation of locales for skribo. I think what fluent-locale has is a good base, but have some ideas how to make it more performant, both in speed and in object size.
The main cost is likely the allocation of the many small
String
objects in a locale. There are existing tiny string implementations (tendril, inlinable_string, iString), but I think it's possible to do better by specializing to the needs of bcp47. Most of these strings are in the ballpark of 16 bytes each, and much of the cost is the need to spill to allocation when the strings get big. In bcp47, most of the subtags have a small, fixed maximum size.I've prototyped a "tinystr" that uses a NonZeroU32 as its backing store, and thus takes 4 bytes, even when used as an option. It also uses SIMD-like math to verify ASCII and no NUL bytes. I'm happy to PR that into this repo, or make a separate crate (there are a number of file formats that use 4 byte tags, and this would be good for those). Use of this string type would probably not be a huge code change, as it doesn't fundamentally change the architecture, just the representation. There is unsafe code, but I think it should be possible to review it to get good confidence.
A more aggressive optimization is to use an enum between a fast-path and a general-case representation. The fast path would be optional 4 byte tiny strings for language, script, and region. The general case would be a boxed struct similar to the current one, but with an 8 byte tiny string for language and variant, and 4 byte tiny strings for the other subtags. This enum is 16 bytes on both 32 and 64 bit platforms.
I'm posting an issue to get a sense of how welcome these changes are, and also whether tinystr should be its own crate or just a source file in fluent-locale.
The text was updated successfully, but these errors were encountered: