Tracking issue: refactor string packages handling grapheme clusters in terms of "base" packages #1062
This commit adds the following new packages:
- `@stdlib/string/base/remove-first`
- `@stdlib/string/base/remove-first-code-point`
- `@stdlib/string/base/remove-first-grapheme-cluster`
The top-level `@stdlib/string/remove-first` is refactored to depend on those "base" packages, with the default behavior remaining the same (i.e., removing the first grapheme cluster) and a new "mode" option for specifying what type of character to remove.
PR-URL: #1073
Co-authored-by: Athan Reines <kgryte@gmail.com>
Reviewed-by: Athan Reines <kgryte@gmail.com>
Ref: #1062
We also need to fix the implementation of |
This commit adds 3 new base string packages for removing code units, code points, and grapheme clusters, respectively. This commit subsequently refactors `string/remove-last` to depend on those base packages. As a consequence, a new option has been added to `string/remove-last` to select which processing "mode" is desired in order to balance performance considerations.
Additionally, this commit fixes a bug in `string/remove-first` due to an off-by-one indexing error. Lastly, this commit fixes the `name` field in `string/base/remove-first*` `package.json` files.
PR-URL: #1079
Co-authored-by: Athan Reines <kgryte@gmail.com>
Reviewed-by: Athan Reines <kgryte@gmail.com>
Ref: #1062
The purpose of this issue is to track tasks related to the effort to refactor string packages handling grapheme clusters to use "base" packages which handle more specialized use cases.
Overview
String packages, such as `@stdlib/string/first`, have several possible "modes" of operation. When getting the first character, a straightforward approach would use indexing (e.g., `str[0]`). This matches user expectation so long as the first character is a relatively common one which can be stored in a single UTF-16 code unit. However, it does not match user intuition when the first visual character is composed of multiple code units.
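A minimal sketch of that failure mode, using only built-in string indexing. The sample strings are illustrative and do not come from the issue itself:

```javascript
// Plain ASCII: every character occupies exactly one UTF-16 code unit.
var ascii = 'foo';
var firstAscii = ascii[ 0 ];

// U+1F337 (tulip emoji) is stored as the surrogate pair \uD83C \uDF37.
var tulip = '\uD83C\uDF37 tulip';
var firstTulip = tulip[ 0 ];

console.log( firstAscii );              // 'f'
console.log( firstTulip === '\uD83C' ); // true: a lone high surrogate, not the emoji
console.log( tulip.length );            // 8: the emoji alone counts as two code units
```

Indexing returns half of the surrogate pair, which renders as a replacement glyph rather than as the character the user sees.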
As such, one has three options for resolving the first character:
- code units
- code points
- grapheme clusters
The most robust approach for matching user intuition is to resolve grapheme clusters (i.e., user-perceived visual characters), especially for text which may include emojis with skin tones and modified characteristics. However, resolving grapheme clusters is comparatively slow and may lead to unacceptable performance issues, especially when working with simple text.
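The difference between the three kinds of "character" can be sketched with built-ins alone. This is an illustration only: the function names are hypothetical, and `Intl.Segmenter` is used as a stand-in for a grapheme cluster algorithm (it is not what the stdlib packages use and requires a runtime that ships it).

```javascript
// U+1F44D (thumbs up) followed by U+1F3FD (medium skin tone modifier), then '!'.
var str = '\uD83D\uDC4D\uD83C\uDFFD!';

// Code unit: cheapest, but may split a surrogate pair.
function firstCodeUnit( s ) {
    return s.charAt( 0 );
}

// Code point: keeps surrogate pairs intact, but drops trailing modifiers.
function firstCodePoint( s ) {
    if ( s.length === 0 ) {
        return '';
    }
    return String.fromCodePoint( s.codePointAt( 0 ) );
}

// Grapheme cluster: the user-perceived character (assumes Intl.Segmenter is available).
function firstGraphemeCluster( s ) {
    var seg = new Intl.Segmenter( undefined, { 'granularity': 'grapheme' } );
    var it = seg.segment( s )[ Symbol.iterator ]();
    var r = it.next();
    return ( r.done ) ? '' : r.value.segment;
}

console.log( firstCodeUnit( str ).length );        // 1 (lone high surrogate)
console.log( firstCodePoint( str ).length );       // 2 (thumbs up without the skin tone)
console.log( firstGraphemeCluster( str ).length ); // 4 (thumbs up with the skin tone)
```

Only the grapheme cluster result matches what a reader sees as the first character, and it is also the only variant that has to run a segmentation algorithm to get there.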
Solution
Rather than provide a single API which only processes text as a sequence of grapheme clusters, the proposed solution is to refactor top-level `@stdlib/string/*` packages which handle grapheme clusters to support different "modes" of operation, whereby a user can choose which type of processing is most appropriate for given input strings.
Internally, packages supporting different modes should rely on separate, specialized "base" packages (`@stdlib/string/base/*`) which implement appropriate algorithms for resolving code units, code points, and grapheme clusters, respectively.
Prior Art
For examples of refactorings, see
@stdlib/string/first
@stdlib/string/base/first
@stdlib/string/base/first-code-point
@stdlib/string/base/first-grapheme-cluster
@stdlib/string/for-each
@stdlib/string/base/for-each
@stdlib/string/base/for-each-code-point
@stdlib/string/base/for-each-grapheme-cluster
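As a rough illustration of the proposed split between a validating top-level function and unvalidated "base" functions, consider the sketch below. It is not the stdlib implementation: the function names, the `( str, n, options )` signature, the mode names (`'grapheme'`, `'code_point'`, `'code_unit'`), and the use of `Intl.Segmenter` are all assumptions made for the example.

```javascript
// Hypothetical "base" implementations: no argument validation, no options.
function baseFirstCodeUnit( str, n ) {
    return str.substring( 0, n );
}

function baseFirstCodePoint( str, n ) {
    // Array.from iterates a string by code point.
    return Array.from( str ).slice( 0, n ).join( '' );
}

function baseFirstGraphemeCluster( str, n ) {
    // Stand-in segmentation; assumes the runtime provides Intl.Segmenter.
    var seg = new Intl.Segmenter( undefined, { 'granularity': 'grapheme' } );
    var out = '';
    var i = 0;
    for ( var part of seg.segment( str ) ) {
        if ( i >= n ) {
            break;
        }
        out += part.segment;
        i += 1;
    }
    return out;
}

// Assumed mode names mapped to their base implementations.
var MODES = {
    'grapheme': baseFirstGraphemeCluster,
    'code_point': baseFirstCodePoint,
    'code_unit': baseFirstCodeUnit
};

// Hypothetical top-level function: validates input, then dispatches on `mode`.
function first( str, n, options ) {
    var mode = 'grapheme'; // default: most robust, slowest
    if ( typeof str !== 'string' ) {
        throw new TypeError( 'First argument must be a string.' );
    }
    if ( !Number.isInteger( n ) || n < 0 ) {
        throw new TypeError( 'Second argument must be a nonnegative integer.' );
    }
    if ( options !== void 0 && options.mode !== void 0 ) {
        mode = options.mode;
    }
    if ( !Object.prototype.hasOwnProperty.call( MODES, mode ) ) {
        throw new TypeError( 'Unrecognized mode: ' + mode );
    }
    return MODES[ mode ]( str, n );
}

var s = '\uD83D\uDC4D\uD83C\uDFFD!'; // thumbs up + skin tone modifier + '!'
console.log( first( s, 1 ).length );                          // 4
console.log( first( s, 1, { 'mode': 'code_point' } ).length ); // 2
console.log( first( s, 1, { 'mode': 'code_unit' } ).length );  // 1
```

Keeping validation in the top-level function means callers who already know their input is simple text can call a base function directly and skip both the checks and the grapheme cluster segmentation.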
Tasks
The following packages should be refactored to use the proposed solution:
@stdlib/string/first
@stdlib/string/for-each
@stdlib/string/left-trim-n
@stdlib/string/remove-first
- feat: refactor string package remove-first #1073
@stdlib/string/remove-last
- feat: refactor string package remove-last #1079
@stdlib/string/reverse
- feat: refactor string package reverse #1082
@stdlib/string/right-trim-n
@stdlib/string/truncate
- feat: refactor truncate string package #1097
@stdlib/string/truncate-middle
@stdlib/string/base/distances/levenshtein
The following package implementation needs to be rewritten:
@stdlib/string/base/prev-grapheme-cluster
Notes
In general, refactoring should happen in the following order:
1. Base package for grapheme clusters (`-grapheme-cluster` or `-grapheme-clusters` suffix). This is often similar to the top-level package, but stripped of input argument validation and optional arguments.
2. Base package for code points (`-code-point` or `-code-points` suffix).
3. Base package for code units ([…] `-code-unit` or `-code-units` suffix).
4. Top-level package with the `mode` option.