Discussion about Grapheme API #11610
Comments
I think what's needed here is why one uses graphemes. What's the use case? Maybe it's for showing them in a text editor. I guess it must be something related to visual stuff. Maybe counting how many graphemes there are? If those are the use cases, then adding more methods to Grapheme, like So maybe keeping |
Yes, graphemes are about presentation. But that's really in a very similar manner as characters. You only need them for visual stuff. Everything else really works well with the raw bytes (assuming an agreed encoding). The thing is just that many (most?) things we do with strings are related to visuals. Even if they don't directly deal with rendering stuff. I actually have a use case for |
Yes! I think this would be ideal. Ideally:
But I think we are at a point where we can't make these changes without breaking backwards compatibility. Or maybe there's a way to do it? |
(by the way, this is how Elixir and Swift work, I think, except that Swift has a |
I think it's doable—and desirable—for 2.0, but not for 1.X. |
I think we might be stepping away from the main topic into a different discussion. Of course, it's related, and the future of grapheme vs. code point in the String API should influence the architecture of the Grapheme API. And honestly I'm not entirely sure if it's really doable, or even desirable, to essentially replace […] Graphemes are technically more correct and probably never wrong. But there's a lot of additional complexity involved. One is memory representation, as described in this issue, and the resulting performance considerations. The other is that the rules for grapheme clusters are not even static. They depend on the Unicode version. Programs using different Unicode versions may disagree on whether a sequence of code points forms a single grapheme cluster or several. That's different from encoding code points in UTF-8, for example. The format is independent of Unicode versions, and even implementations based on older versions can work with it. They may not understand the meaning of the code points, but there's no ambiguity about their byte representation. |
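To illustrate the last point of the comment above: encoding a code point as UTF-8 is pure arithmetic on its numeric value and needs no Unicode property tables, which is why it cannot drift between Unicode versions. A minimal sketch in Python (not Crystal; the function name `utf8_encode` is made up for this example):

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 using only arithmetic on its value."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

Nothing in this function asks whether the code point is assigned or what it means. Grapheme segmentation, by contrast, has to look up per-code-point properties that are revised with each Unicode release.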
Some Unicode characters displayed in a terminal take up more than one column, and therefore can break the alignment of tables. So, it would be useful to have a method giving the "visual" length of a Unicode string. |
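For illustration, a rough approximation of such a "visual" length can be sketched with Python's standard library (not Crystal; `display_width` is a made-up name, and this is not a proposal for the actual API). It assumes that wide and fullwidth East Asian characters take two columns and that combining marks and format characters take none. Real terminals differ on emoji sequences, ambiguous-width characters and control characters, none of which are handled here.

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate the number of terminal columns needed to show `text`."""
    width = 0
    for ch in text:
        # Nonspacing marks, enclosing marks and format characters
        # are assumed to occupy no column of their own.
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        # Wide (W) and fullwidth (F) characters are assumed to take two columns.
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2
        else:
            width += 1
    return width
```

With these assumptions, `display_width("abc")` is 3, `display_width("日本語")` is 6, and a decomposed `"e\u0301"` counts as a single column.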
@Blacksmoke16, Thanks |
With #13335 it's easy to add additional fields to |
The Grapheme API for strings was introduced in #11472.

However, there is still a debate about the actual data format for graphemes. They are a sequence of code points, which is typically represented as `String`. But we chose to have a dedicated type, `String::Grapheme`, for this. It's a wrapper around `String` and provides an optimization for grapheme clusters that consist of only a single code point. They are stored as `Char` to avoid lots of tiny string allocations for the very common single-code-point graphemes.

This was already discussed in the introducing PR (#11472) and its precursor (#10720).
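The single-code-point optimization described above can be sketched roughly as follows. This is Python, not Crystal, and it makes two loud simplifications: an `int` code point stands in for `Char`, and the segmentation rule (a code point plus any following combining marks) is a stand-in for the real Unicode grapheme cluster rules, not an implementation of them. The name `simple_clusters` is made up for this example.

```python
import unicodedata
from typing import Iterator, Union


def simple_clusters(text: str) -> Iterator[Union[int, str]]:
    """Yield clusters: an int for a single code point, a str otherwise.

    Simplified rule (an assumption, not the Unicode algorithm): a cluster
    is one code point plus all directly following combining marks.
    """
    start = 0
    for i in range(1, len(text) + 1):
        at_end = i == len(text)
        # A cluster ends at the end of the text or before a non-mark.
        if at_end or not unicodedata.category(text[i]).startswith("M"):
            cluster = text[start:i]
            # Single code point: yield the cheap scalar representation.
            # Multiple code points: fall back to a string.
            yield ord(cluster) if len(cluster) == 1 else cluster
            start = i
```

For `"e\u0301a"` this yields the two-code-point string `"e\u0301"` followed by the integer `97`, so only the multi-code-point cluster needs a string of its own.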
To summarize, I'm copying my original comment about representation options from #11472 (comment) with some additions:

The `Char | String` union type is a huge improvement for grapheme clusters consisting of a single character. That should be the most common case in typical use, so it saves a great deal of allocations compared to `String` (or a collection of `Char` as in the initial proposal).

However, in some contexts, multi-character grapheme clusters are quite ubiquitous. Whether you're dealing with lots of composed emojis or scripts such as Thai, Hebrew or Arabic, or even just many decomposed diacritics, there might be a lot of string allocations for all these graphemes.

1. Expose `String#each_grapheme_boundary` as part of the public API. That's trivial and basically cost-free. However, I'd consider it more of a low-level API. It can be useful for custom grapheme-based implementations.
2. Use a `StringPool` to avoid repeated allocations of identical grapheme clusters. There is no way to release values from a pool, so it would be a bad idea to use a global pool. But individual pools could be useful for single iterations. This could optionally be configurable, and the API could allow specifying a custom pool to be used. This would be pretty easy to implement later. We just need to add `StringPool` parameters to `Grapheme.new` and the iteration methods.

   A pool means we still need to allocate strings for every grapheme, but we can save on duplicate allocations of the same graphemes. It is likely that the same graphemes show up repeatedly in longer texts of the same context, so that should be expected to be beneficial.
3. Make `Grapheme` essentially a pointer to a substring. String allocation would only happen in `Grapheme#to_s`. The big downside of this approach is that it keeps the original string in memory. That's a serious implication when working with large amounts of text. If the `Grapheme` instances are only used for iteration, this is not a problem. But if you start storing `Grapheme`s somewhere, they keep pointers to the original strings alive. An example would be collecting frequencies of graphemes over a collection of texts. Assuming every evaluated text contains a multi-character grapheme that was not encountered previously, all these strings will be kept in memory. The user would need to be aware of this behaviour and take action such as converting grapheme instances to self-owned cluster strings.
4. Use a union such as `Char | {Char, Char} | String` or `{size: Int32, data: Char[4]} | String`. This would cover more cases, but there is technically no limit to the length of a single cluster, so `String` is always needed for those that exceed that length.
5. ([…] `Grapheme#each_char` and `#each_byte`, #11605 (comment)) The simplest API would just use `String` instead of a dedicated `Grapheme` type. Elixir's string implementation takes this approach. The downside is that every grapheme cluster is heap allocated. This could be combined with 1. and/or 2. for some optimization. Another improvement would be a global pool for common grapheme clusters.
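The pooling idea from the list above can be sketched in a few lines. This is a Python illustration of the concept, not Crystal's `StringPool`; the class `ClusterPool` and its methods are made up for this example. Like the pool described in the issue, it never releases entries, so it is meant to live only as long as a single iteration.

```python
class ClusterPool:
    """Hands out one shared string object per distinct cluster value."""

    def __init__(self) -> None:
        # Maps a cluster's value to the single retained string object.
        self._items = {}

    def get(self, cluster: str) -> str:
        # Return the retained object if this value was seen before;
        # otherwise retain and return the given one.
        return self._items.setdefault(cluster, cluster)

    def __len__(self) -> int:
        return len(self._items)


pool = ClusterPool()
# Build two equal clusters at runtime, as an iterator over a text would.
parts = ["e", "\u0301"]
first = pool.get("".join(parts))
second = pool.get("".join(parts))
```

Both lookups return the same object, so a text with many repetitions of one multi-code-point cluster retains only a single string for it. The trade-off named in the issue still applies: everything that enters the pool stays referenced until the pool itself is dropped.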