Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion about Grapheme API #11610

Open
straight-shoota opened this issue Dec 17, 2021 · 10 comments
Open

Discussion about Grapheme API #11610

straight-shoota opened this issue Dec 17, 2021 · 10 comments

Comments

@straight-shoota
Copy link
Member

The Grapheme API for strings was introduced in #11472.

However, there is still a debate about the actual data format for graphemes. They are a sequence of codepoints, which is typically represented as String. But we chose to have a dedicated type, String::Grapheme for this. It's a wrapper around String and provides an optimization for grapheme clusters that consist of only a single code point. They are stored as Char to avoid lots of tiny string allocations for the very common single-codepoints graphemes.

This was already discussed in the introducing PR (#11472) and its precursor (#10720).

To summarize, I'm copying my original comment about representation options from #11472 (comment) with some additions:


  1. The Char | String union type is a huge improvement for grapheme clusters consisting of a single character. That should be most common in typical use cases, so it saves a big deal of allocations compared to String (or a collection of Char as in the initial propsal).
    However, in some contexts, multi-character grapheme clusters are quite ubiquitous. Whether you're dealing with lots of composed emojis or scripts such as Thai, Hebrew or Arabic, or even just many decomposed diacritics, there might be a lot of string allocations for all these graphemes.
  2. We can expose the method String#each_grapheme_boundary as part of the public API. That's trivial and basically cost-free. However, I'd consider it more a low-level API. It can be useful for custom grapheme-based implementations.
  3. Employing a StringPool to avoid repeated allocations of identical grapheme clusters. There is no way to release values from a pool, so it would be a bad idea to use a global pool. But individual pools could be useful for single iterations. This could optionally be configurable and the API could allow specifying a custom pool to be used. This would be pretty easy to implement later. We just need to add StringPool parameters to Grapheme.new and the iteration methods.
    A pool means we still need to allocate strings for every grapheme, but we can safe on duplicate allocations of the same graphemes. It is likely that the same graphemes show up repeatedly in longer texts of the same context, so that should be expected to be benefitial.
  4. We could avoid allocating a string entirely by keeping a reference of the original string and making Grapheme essentially a pointer to a substring. String allocation would only happen in Grapheme#to_s. The big downside of this approach is that it keeps the original string in memory. That's a serious implication when working with large amounts of texts. If the Grapheme instaces are only used for iteration, this is not a problem. But if you start storing Graphemes somewhere, they keep pointers to the original strings alive. An example would be collecting frequencies of graphemes over a collection of texts. Assuming every evaluated texts contains a multi-character grapheme that was not encountered previously, all these strings will be kept in memory. The user would need to be aware of this behaviour and take action such as converting grapheme instances to self-owned cluster strings.
  5. (Implement Unicode grapheme clusters #11472 (comment)) Clusters consisting of multiple codepoints, but still in a small number could also be stored on the stack, with something like Char | {Char, Char} | String or {size: Int32, data: Char[4]} | String. This would cover more cases, but there is technically no limit to the length of a single cluster, so String is always needed for those that exceed the length.
  6. (Add Grapheme#each_char and #each_byte #11605 (comment)) The simplest API would just use String instead of a dedicated Grapheme type. Elixir's string implementation takes this approach. The downside is that every grapheme cluster is heap allocated. This could be combined with 1. and/or 2. for some optimization. Another improvement would be a global pool for common grapheme clusters.
@asterite
Copy link
Member

I think what's needed here is why one uses graphemes. What's the use case? Maybe it's for showing them in a text editor. I guess it must be something related to visual stuff. Maybe counting how many graphemes there are?

If those are the use cases, then adding more methods to Grapheme, like each_char and each_byte might not be needed. If one wants that, they can turn a Grapheme into a String. That allocates memory if the Grapheme stores a Char, but if these uses cases are rare then it should be fine.

So maybe keeping Grapheme like it's right now is fine.

@straight-shoota
Copy link
Member Author

straight-shoota commented Dec 17, 2021

Yes, graphemes are about presentation. But that's really in a very similar manner as characters. You only need them for visual stuff. Everything else really works well with the raw bytes (assuming an agreed encoding). The thing is just that many (most?) things we do with strings are related to visuals. Even if they don't directly deal with rendering stuff.
In fact, I'm pretty sure that when you do string operations based on characters, it very often should really be about graphemes. Using characters is a simplification when grapheme clusters are excluded in the domain or they just don't matter to the implementation.

I actually have a use case for each_char/each_byte: Fixing up #11520 (see #11478 (comment)).
They are very useful and if we don't add these methods, we might just use String instead of a dedicated Grapheme type because it defeats the purpose if I have to allocate a string again for every single-codepoint grapheme.

@asterite
Copy link
Member

In fact, I'm pretty sure that when you do string operations based on characters, it very often should really be about graphemes

Yes! I think this would be ideal.

Ideally:

  • Char should be renamed to Codepoint, or similar
  • Whatever is returning Char now, should return Grapheme, or even String if we consider these to be "the same thing"
  • Indexing a String would give you another string/grapheme
  • Same with taking a subrange: it would always break in grapheme boundaries, unless you don't want that
  • The size of a string should be the graphemes count
  • and so on...

But I think we are in a point where we can't make these changes without breaking backwards compatibility. Or maybe there's a way to do it?

@asterite
Copy link
Member

(by the way, this is how Elixir and Swift work, I think, except that Swift has a Character type that's essentially a grapheme)

@beta-ziliani
Copy link
Member

I think it's doable—and desirable—for 2.0, but not for 1.X.

@straight-shoota
Copy link
Member Author

straight-shoota commented Dec 20, 2021

I think we might be stepping away from the main topic into a different discussion. Of course, it's related and the future of grapheme vs. code point in the String API should influence the architecture of Grapheme API.

And honestly I'm not entirely sure if it's really doable - or even desirable - to essentially replace Char by Grapheme as the main constituent of strings.

Graphemes are technically more correct and probably never wrong. But there's a lot of additional complexity involved. One is memory representation - as described in this issue - and resulting performance considerations. The other is that the rules for grapheme clusters are not even static. They depend on the Unicode version. Programs using different Unicode versions may have different opinions whether a sequence of code points describes a single grapheme cluster or multiple graphemes. That's different from encoding codepoints in UTF-8, for example. The format is independent of Unicode versions and even implementations based on older versions can work with that. They may not understand the meaning of the code points, but there's no ambiguity about the byte representation of code points.

@hutou
Copy link
Contributor

hutou commented Oct 27, 2022

Some Unicode characters displayed in a terminal take up more than one column, and therefore can break the alignment of tables. So, it would be useful to have a method giving the "visual" length of a Unicode string.
For example, ruby has the following gem : ruby-unicode-display-width
Would it be appropriate to include this use case in this discussion ?

@Blacksmoke16
Copy link
Member

@hutou #11038

@hutou
Copy link
Contributor

hutou commented Oct 27, 2022

@Blacksmoke16, Thanks

@straight-shoota
Copy link
Member Author

With #13335 it's easy to add additional fields to String (ref. #12020 (comment)). This could open an opportunity to cache grapheme_size in the string header for repeated use.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants