Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking issue: refactor string packages handling grapheme clusters in terms of "base" packages #1062

Open
5 of 11 tasks
kgryte opened this issue Jul 20, 2023 · 1 comment
Open
5 of 11 tasks
Assignees

Comments

@kgryte
Copy link
Member

kgryte commented Jul 20, 2023

The purpose of this issue is to track tasks related to the effort to refactor string packages handling grapheme clusters to use "base" packages which handle more specialized use cases.

Overview

String packages, such as @stdlib/string/first, have several possible "modes" of operation. When getting the first character, a straightforward approach would use indexing. E.g.,

var str = 'Hello, World!';

var ch = str[ 0 ];
// returns 'H'

This works according to user expectation so long as a character is a relatively common character which can be stored in a single UTF-16 code unit. However, this inevitably does not live up to user intuition when the first visual character is comprised of multiple code units.

As such, one has three options for resolving the first character:

  • code units
  • code points (one or more code units)
  • grapheme clusters (one or more code points)

The most robust approach for matching user intuition is to resolve grapheme clusters (i.e., user-perceived visual characters), especially for text which may include emojis with skin tones and modified characteristics. However, resolving grapheme clusters is comparatively slow and may lead to unacceptable performance issues, especially when working with simple text.

Solution

Rather than provide a single API which only processes text as a sequence of grapheme clusters, the proposed solution is to refactor top-level @stdlib/string/* packages which handle grapheme clusters to support different "modes" of operation, whereby a user can choose which type of processing is most appropriate for given input strings.

Internally, packages supporting different modes should rely on separate, specialized "base" packages (@stdlib/string/base/*) which implement appropriate algorithms for resolving code units, code points, and grapheme clusters, respectively.

Prior Art

For examples of refactorings, see

  • @stdlib/string/first
    • @stdlib/string/base/first
    • @stdlib/string/base/first-code-point
    • @stdlib/string/base/first-grapheme-cluster
  • @stdlib/string/for-each
    • @stdlib/string/base/for-each
    • @stdlib/string/base/for-each-code-point
    • @stdlib/string/base/for-each-grapheme-cluster

Tasks

The following packages should be refactored to use the proposed solution:

The following package implementation needs to be rewritten:

  • @stdlib/string/base/prev-grapheme-cluster

Notes

In general, refactoring should happen in the following order:

  1. Create the base package processing grapheme clusters (package name should have a -grapheme-cluster or -grapheme-clusters suffix). This is often similar to the top-level package, but stripped of input argument validation and optional arguments.
  2. Create the base package for processing Unicode code units (package name should have a -code-point or -code-points suffix).
  3. Create the base package for processing UTF-16 code units (if necessary, package name should have a -code-unit or -code-units suffix).
  4. Refactor the top-level package to depend on the base packages and add support for specifying a mode option.
@steff456 steff456 self-assigned this Aug 1, 2023
kgryte added a commit that referenced this issue Aug 4, 2023
This commit adds the following new packages:

- `@stdlib/string/base/remove-first`
- `@stdlib/string/base/remove-first-code-point`
- `@stdlib/string/base/remove-first-grapheme-cluster`

The top-level `@stdlib/string/remove-first` is refactored to depend on those "base" packages, with the default behavior remaining the same (i.e., removing the first grapheme cluster) and a new "mode" option for specifying what type of character to remove.

PR-URL: 	#1073
Co-authored-by: Athan Reines <kgryte@gmail.com>
Reviewed-by: Athan Reines <kgryte@gmail.com> 
Ref: #1062
@steff456
Copy link
Contributor

steff456 commented Sep 4, 2023

We also need to fix the implementation of prev-grapheme-cluster because it is not having the same results as next-grapheme-cluster. A fix for this bug is needed and I believe that we can create a more performant implementation for this package as well by using the package that returns the complete number of grapheme clusters and then traverse the string from right to left by aggregating from the given index and checking that the result is still a grapheme cluster as a whole.

kgryte added a commit that referenced this issue Sep 21, 2023
This commit adds 3 new base string packages for removing code units, code points, and grapheme clusters, respectively. This commit subsequently refactors `string/remove-last` to depend on those base packages. As a consequence, a new option has been added to `string/remove-last` to select which processing "mode" is desired in order to balance performance considerations.

Additionally, this commit fixes a bug in `string/remove-first` due to an off-by-one indexing error.

Lastly, this commit fixes the `name` field in `string/base/remove-first*` `package.json` files.

PR-URL: 	#1079
Co-authored-by: Athan Reines <kgryte@gmail.com>
Reviewed-by: Athan Reines <kgryte@gmail.com> 
Ref: #1062
kgryte added a commit that referenced this issue Oct 31, 2023
This commit also adds the following "base" packages:

- `string/base/reverse`
- `string/base/reverse-code-points`
- `string/base/reverse-grapheme-clusters`

PR-URL: 	#1082
Co-authored-by: Athan Reines <kgryte@gmail.com>
Reviewed-by: Athan Reines <kgryte@gmail.com> 
Ref: #1062
kgryte added a commit that referenced this issue Nov 10, 2023
PR-URL: 	#1117
Co-authored-by: Athan Reines <kgryte@gmail.com>
Reviewed-by: Athan Reines <kgryte@gmail.com>
Ref: #1062
kgryte added a commit that referenced this issue Nov 10, 2023
PR-URL: 	#1118
Co-authored-by: Athan Reines <kgryte@gmail.com>
Reviewed-by: Athan Reines <kgryte@gmail.com>
Ref: #1062
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants