Discussion about Grapheme API #11610

straight-shoota · 2021-12-17T12:34:34Z

The Grapheme API for strings was introduced in #11472.

However, there is still a debate about the actual data format for graphemes. They are a sequence of codepoints, which is typically represented as String. But we chose to have a dedicated type, String::Grapheme for this. It's a wrapper around String and provides an optimization for grapheme clusters that consist of only a single code point. They are stored as Char to avoid lots of tiny string allocations for the very common single-codepoints graphemes.

This was already discussed in the introducing PR (#11472) and its precursor (#10720).

To summarize, I'm copying my original comment about representation options from #11472 (comment) with some additions:

The Char | String union type is a huge improvement for grapheme clusters consisting of a single character. That should be most common in typical use cases, so it saves a big deal of allocations compared to String (or a collection of Char as in the initial propsal).
However, in some contexts, multi-character grapheme clusters are quite ubiquitous. Whether you're dealing with lots of composed emojis or scripts such as Thai, Hebrew or Arabic, or even just many decomposed diacritics, there might be a lot of string allocations for all these graphemes.
We can expose the method String#each_grapheme_boundary as part of the public API. That's trivial and basically cost-free. However, I'd consider it more a low-level API. It can be useful for custom grapheme-based implementations.
Employing a StringPool to avoid repeated allocations of identical grapheme clusters. There is no way to release values from a pool, so it would be a bad idea to use a global pool. But individual pools could be useful for single iterations. This could optionally be configurable and the API could allow specifying a custom pool to be used. This would be pretty easy to implement later. We just need to add StringPool parameters to Grapheme.new and the iteration methods.
A pool means we still need to allocate strings for every grapheme, but we can safe on duplicate allocations of the same graphemes. It is likely that the same graphemes show up repeatedly in longer texts of the same context, so that should be expected to be benefitial.
We could avoid allocating a string entirely by keeping a reference of the original string and making Grapheme essentially a pointer to a substring. String allocation would only happen in Grapheme#to_s. The big downside of this approach is that it keeps the original string in memory. That's a serious implication when working with large amounts of texts. If the Grapheme instaces are only used for iteration, this is not a problem. But if you start storing Graphemes somewhere, they keep pointers to the original strings alive. An example would be collecting frequencies of graphemes over a collection of texts. Assuming every evaluated texts contains a multi-character grapheme that was not encountered previously, all these strings will be kept in memory. The user would need to be aware of this behaviour and take action such as converting grapheme instances to self-owned cluster strings.
(Implement Unicode grapheme clusters #11472 (comment)) Clusters consisting of multiple codepoints, but still in a small number could also be stored on the stack, with something like Char | {Char, Char} | String or {size: Int32, data: Char[4]} | String. This would cover more cases, but there is technically no limit to the length of a single cluster, so String is always needed for those that exceed the length.
(Add Grapheme#each_char and #each_byte #11605 (comment)) The simplest API would just use String instead of a dedicated Grapheme type. Elixir's string implementation takes this approach. The downside is that every grapheme cluster is heap allocated. This could be combined with 1. and/or 2. for some optimization. Another improvement would be a global pool for common grapheme clusters.

The text was updated successfully, but these errors were encountered:

asterite · 2021-12-17T14:47:53Z

I think what's needed here is why one uses graphemes. What's the use case? Maybe it's for showing them in a text editor. I guess it must be something related to visual stuff. Maybe counting how many graphemes there are?

If those are the use cases, then adding more methods to Grapheme, like each_char and each_byte might not be needed. If one wants that, they can turn a Grapheme into a String. That allocates memory if the Grapheme stores a Char, but if these uses cases are rare then it should be fine.

So maybe keeping Grapheme like it's right now is fine.

straight-shoota · 2021-12-17T15:04:22Z

Yes, graphemes are about presentation. But that's really in a very similar manner as characters. You only need them for visual stuff. Everything else really works well with the raw bytes (assuming an agreed encoding). The thing is just that many (most?) things we do with strings are related to visuals. Even if they don't directly deal with rendering stuff.
In fact, I'm pretty sure that when you do string operations based on characters, it very often should really be about graphemes. Using characters is a simplification when grapheme clusters are excluded in the domain or they just don't matter to the implementation.

I actually have a use case for each_char/each_byte: Fixing up #11520 (see #11478 (comment)).
They are very useful and if we don't add these methods, we might just use String instead of a dedicated Grapheme type because it defeats the purpose if I have to allocate a string again for every single-codepoint grapheme.

asterite · 2021-12-17T15:08:45Z

In fact, I'm pretty sure that when you do string operations based on characters, it very often should really be about graphemes

Yes! I think this would be ideal.

Ideally:

Char should be renamed to Codepoint, or similar
Whatever is returning Char now, should return Grapheme, or even String if we consider these to be "the same thing"
Indexing a String would give you another string/grapheme
Same with taking a subrange: it would always break in grapheme boundaries, unless you don't want that
The size of a string should be the graphemes count
and so on...

But I think we are in a point where we can't make these changes without breaking backwards compatibility. Or maybe there's a way to do it?

asterite · 2021-12-17T15:09:17Z

(by the way, this is how Elixir and Swift work, I think, except that Swift has a Character type that's essentially a grapheme)

beta-ziliani · 2021-12-17T18:08:49Z

I think it's doable—and desirable—for 2.0, but not for 1.X.

straight-shoota · 2021-12-20T11:21:01Z

I think we might be stepping away from the main topic into a different discussion. Of course, it's related and the future of grapheme vs. code point in the String API should influence the architecture of Grapheme API.

And honestly I'm not entirely sure if it's really doable - or even desirable - to essentially replace Char by Grapheme as the main constituent of strings.

Graphemes are technically more correct and probably never wrong. But there's a lot of additional complexity involved. One is memory representation - as described in this issue - and resulting performance considerations. The other is that the rules for grapheme clusters are not even static. They depend on the Unicode version. Programs using different Unicode versions may have different opinions whether a sequence of code points describes a single grapheme cluster or multiple graphemes. That's different from encoding codepoints in UTF-8, for example. The format is independent of Unicode versions and even implementations based on older versions can work with that. They may not understand the meaning of the code points, but there's no ambiguity about the byte representation of code points.

hutou · 2022-10-27T20:49:46Z

Some Unicode characters displayed in a terminal take up more than one column, and therefore can break the alignment of tables. So, it would be useful to have a method giving the "visual" length of a Unicode string.
For example, ruby has the following gem : ruby-unicode-display-width
Would it be appropriate to include this use case in this discussion ?

Blacksmoke16 · 2022-10-27T20:51:32Z

@hutou #11038

hutou · 2022-10-27T20:58:30Z

@Blacksmoke16, Thanks

straight-shoota · 2023-04-20T10:41:00Z

With #13335 it's easy to add additional fields to String (ref. #12020 (comment)). This could open an opportunity to cache grapheme_size in the string header for repeated use.

straight-shoota added status:discussion topic:stdlib:text labels Dec 17, 2021

straight-shoota mentioned this issue Dec 17, 2021

Add Experimental annotation to Grapheme API #11611

Merged

straight-shoota mentioned this issue Apr 20, 2023

Cache code range in String #12020

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Discussion about Grapheme API #11610

Discussion about Grapheme API #11610

straight-shoota commented Dec 17, 2021

asterite commented Dec 17, 2021

straight-shoota commented Dec 17, 2021 •

edited

Loading

asterite commented Dec 17, 2021

asterite commented Dec 17, 2021

beta-ziliani commented Dec 17, 2021

straight-shoota commented Dec 20, 2021 •

edited

Loading

hutou commented Oct 27, 2022

Blacksmoke16 commented Oct 27, 2022

hutou commented Oct 27, 2022

straight-shoota commented Apr 20, 2023

Discussion about Grapheme API #11610

Discussion about Grapheme API #11610

Comments

straight-shoota commented Dec 17, 2021

asterite commented Dec 17, 2021

straight-shoota commented Dec 17, 2021 • edited Loading

asterite commented Dec 17, 2021

asterite commented Dec 17, 2021

beta-ziliani commented Dec 17, 2021

straight-shoota commented Dec 20, 2021 • edited Loading

hutou commented Oct 27, 2022

Blacksmoke16 commented Oct 27, 2022

hutou commented Oct 27, 2022

straight-shoota commented Apr 20, 2023

straight-shoota commented Dec 17, 2021 •

edited

Loading

straight-shoota commented Dec 20, 2021 •

edited

Loading