Display Width of Multi-Codepoint Graphemes #66

brodieG · 2020-06-01T14:21:56Z

Related to #62.

This should primarily affect emoji as most other multi-chararcter graphemes tend to involve zero width trailing characters, so the overall grapheme width is given by the base code point.

Ideally this would be resolved by support directly in R.

brodieG · 2021-05-15T17:15:24Z

We'll rely on R implementing this.

brodieG · 2021-06-11T11:49:13Z

It seems unlikely this is going to get resolved in R in the near term, so we'll contemplate doing it in fansi.

brodieG · 2021-06-19T15:12:54Z

For now we're going to implement a hack version that hopefully works well enough in most cases, and revert to hoping R does this properly for a full fix.

@krlmlr

# fansi Release Notes ## v1.0.4 CRAN compiled code warning suppression release. * Fix void function declarations and definitions. * Change `sprintf` to `snprintf`. ## v1.0.3 * Address problem uncovered by gcc-12 linters, although the issue itself could not manifest due to redundancy of checks in the code. ## v1.0.0-2 This is a major release and includes some behavior changes. ### Features * New functions: * [#26](brodieG/fansi#26) Replacement forms of `substr_cl` (i.e `substr_ctl<-`). * `state_at_end` to compute active state at end of a string. * `close_state` to generate a closing sequence given an active state. * [#31](brodieG/fansi#31) `trimws_ctl` as an equivalent to `trimws`. * [#64](brodieG/fansi#64) `normalize_sgr` converts compound _Control Sequences_ into normalized form (e.g. "ESC[44;31m" becomes "ESC[31mESC[44m") for better compatibility with [`crayon`](https://github.com/r-lib/crayon). Additionally, most functions gain a `normalize` parameter so that they may return their output in normalized form (h/t @krlmlr). * [#74](https://github.com/brodieG/fansi/issues/74)`substr_ctl` and related functions are now all-C instead of a combination of C offset computations and R level `substr` operations. This greatly improves performance, particularly for vectors with many distinct strings. Despite documentation claiming otherwise, `substr_ctl` was quite slow in that case. * [#66](brodieG/fansi#66) Improved grapheme support, including accounting for them in `type="width"` mode, as well as a `type="graphemes"` mode to measure in graphemes instead of characters. Implementation is based on heuristics designed to work in most common use cases. * `html_esc` gains a `what` parameter to indicate which HTML special characters should be escaped. * Many functions gain `carry` and `terminate` parameters to control how `fansi` generated substrings interact with surrounding formats. * [#71](brodieG/fansi#71) Functions that write SGR and OSC are now more parsimonious (see "Behavior Changes" below). * [#73](brodieG/fansi#73) Default parameter values retrieved with `getOption` now always have explicit fallback values defined (h/t @gadenbui). * Better warnings and error messages, including more granular messages for `unhandled_ctl` for adjacent _Control Sequences_. * `term.cap` parameter now accepts "all" as value, like the `ctl` parameter. ### Deprecated Functions * All the "sgr" functions (e.g., `substr_sgr`, `strwrap_sgr`) are deprecated. They will likely live on indefinitely, but they are of limited usefulness and with the added support for OSC hyperlinks their name is misleading. * `sgr_to_html` is now `to_html` with slight modifications to semantics; the old function remains and does not warn about unescaped "<" or ">" in the input string. ### Behavior Changes The major intentional behavior change is to default `fansi` to always recognize true color CSI SGR sequences (e.g. `"ESC[38;2;128;50;245m"`). The prior default was to match the active terminal capabilities, but it is unlikely that the intent of a user manipulating a string with truecolor sequences is to interpret them incorrectly, even if their terminal does. `fansi` will continue to warn in this case. To keep the pre-1.0 behavior add `"old"` to the `term.cap` parameter. Additionally, `to_html` will now warn if it encounters unescaped HTML special character "<" or ">" in the input string. Finally, the 1.0 release is an extensive refactoring of many parts of the SGR and OSC hyperlink controls (_Special Sequences_) intake and output algorithms. In some cases this means that some `fansi` functions will output _Special Sequences_ slightly differently than they did before. In almost all cases the rendering of the output should remain unchanged, although there are some corner cases with changes (e.g. in `strwrap_ctl` SGRs embedded in whitespace sequences don't break the sequence). The changes are a side effect of applying more consistent treatment of corner cases around leading and trailing control sequences and (partially) invalid control sequences. Trailing _Special Sequences_ in the output is now omitted as it would be immediately closed (assuming `terminate=TRUE`, the default). Leading SGR is interpreted and re-output. Normally output consistency alone would not be a reason to change behavior, but in this case the changes should be almost always undetectable in the **rendered** output, and maintaining old inconsistent behavior in the midst of a complete refactoring of the internals was beyond my patience. I apologize if these behavior changes adversely affect your programs. > WARNING: we will strive to keep rendered appearance of `fansi` outputs > consistent across releases, but the exact bytes used in the output of _Special > Sequences_ may change. Other changes: * Tests may no longer pass with R < 4.0 although the package should still function correctly. This is primarily because of changes to the character width Unicode Database that ships with R, and many of the newly added grapheme tests touch parts of that database that changed (emoji). * CSI sequences with more than one "intermediate" byte are now considered valid, even though they are likely to be very rare, and CSI sequences consume all subsequent bytes until a valid closing byte or end of string is encountered. * `strip_ctl` only warns with malformed CSI and OSC if they are reported as supported via the `ctl` parameter. If CSI and OSC are indicated as not supported, but two byte escapes are, the two initial bytes of CSI and OSCs will be stripped. * "unknown" encoded strings are no longer translated to UTF-8 in UTF-8 locales (they are instead assumed to be UTF-8). * `nchar_ctl` preserves `dim`, `dimnames`, and `names` as the base functions do. * UTF-8 known to be invalid should not be output, even if present in input (UTF-8 validation is not complete, only sequences that are obviously wrong are detected). ### Bug Fixes * Fix `tabs_as_spaces` to handle sequential tabs, and to perform better on very wide strings. * Strings with invalid UTF-8 sequences with "unknown" declared encoding in UTF-8 locales now cause errors instead of being silently translated into byte escaped versions (e.g. "\xf0\xc2" (2 bytes), used to be interpreted as "<f0><c2>" (four characters). These now cause errors as they would have if they had had "UTF-8" declared encoding. * In some cases true colors of form "38;2;x;x;x" and "48;2;x;x;x" would only be partially transcribed. ### Internal Changes * More aggressive UTF-8 validation, also, invalid UTF-8 code points now advance only one byte instead of their putative width based on the initial byte. * Reduce peak memory usage by making some intermediate buffers eligible for garbage collection prior to native code returning to R. * Reworked internals to simplify buffer size computation and synchronization, in some cases this might cause slightly reduced performance. Please report any significant performance regressions. * `nchar_ctl(...)` is no longer a wrapper for `nchar(strip_ctl(...))` so that it may correctly support grapheme width calculations.

brodieG added the enhancement label Jun 1, 2020

brodieG added this to the 0.5.0 milestone Jun 1, 2020

brodieG mentioned this issue Jun 1, 2020

Display width of some (?) emojis #62

Closed

brodieG added the wontfix label May 15, 2021

brodieG closed this as completed May 15, 2021

brodieG reopened this Jun 11, 2021

brodieG added fixed in dev and removed wontfix labels Jun 28, 2021

brodieG closed this as completed in ea44713 Jan 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Display Width of Multi-Codepoint Graphemes #66

Display Width of Multi-Codepoint Graphemes #66

brodieG commented Jun 1, 2020

brodieG commented May 15, 2021

brodieG commented Jun 11, 2021

brodieG commented Jun 19, 2021

Display Width of Multi-Codepoint Graphemes #66

Display Width of Multi-Codepoint Graphemes #66

Comments

brodieG commented Jun 1, 2020

brodieG commented May 15, 2021

brodieG commented Jun 11, 2021

brodieG commented Jun 19, 2021