Fix #66: graphemes, ...

Also more consistent treatment of leading and trailing spaces and SGR, and many other fixes/cleanup
brodieG · Jun 27, 2021 · ea44713
2 parents e9a7e55 + b4a2cfc

Showing 51 changed files with 1,141 additions and 644 deletions.
diff --git a/DEVNOTES.md b/DEVNOTES.Rmd
diff --git a/NEWS.md b/NEWS.md
@@ -4,16 +4,43 @@
 
 ### Features
 
+* [#66](https://github.com/brodieG/fansi/issues/66) Improved handling of
+  graphemes in `type="width"` mode.  Flags and well formed emoji sequences
+  should have widths computed correctly in most common use cases.
 * [#64](https://github.com/brodieG/fansi/issues/64) New function `normalize_sgr`
   converts compound SGR sequences into normalized form (e.g. "ESC[44;31m"
   becomes "ESC[31mESC[44m") for better compatibility with
   [`crayon`](https://github.com/r-lib/crayon).  Additionally, most functions
   gain a `normalize` parameter so that they may return their output in
   normalized form.
-* [#71](https://github.com/brodieG/fansi/issues/71) Functions that write SGR are
-  now more parsimonious.
 * `html_esc` gains a `what` parameter to indicate which HTML special characters
   should be escaped.
+* Many functions gain `carry` and `terminate` parameters to control how `fansi`
+  generated substrings interact with surrounding formats.
+* New function `state_at_end` to compute active SGR state at end of a string.
+* New function `close_sgr` to generate a closing SGR sequence given an active
+  SGR state.
+* [#71](https://github.com/brodieG/fansi/issues/71) Functions that write SGR are
+  now more parsimonious (see "Behavior Changes" below).
+
+### Behavior Changes
+
+A big part of the 1.0 release is an extensive refactoring of many parts of the
+ANSI CSI SGR intake and output algorithms.  In some cases this means that some
+`fansi` functions will output SGR slightly differently than they did before.  In
+almost all cases the rendering of the SGR should remain unchanged, although
+there are some corner cases with changes (e.g. in `strwrap_ctl` SGRs embedded in
+whitespace sequences don't break the sequence).
+
+The changes are a side effect of applying more consistent treatment of corner
+cases around leading and trailing SGR in substrings.  Trailing SGR in the output
+is now omitted as it would be immediately closed (assuming `terminate=TRUE`, the
+default).  Leading SGR is interpreted and re-output.
+
+Normally output consistency alone would not be a reason to change behavior, but
+in this case the changes should almost always be undetectable in the
+**rendered** output, and maintaining old behavior would further complicate
+finicky C string manipulation code.
 
 ### Bug Fixes
 
@@ -22,6 +49,8 @@
 
 ### Internal Changes
 
+* More aggressive UTF-8 validation.  Also, invalid UTF-8 sequences now advance
+  only one byte instead of their putative width based on a valid initial byte.
 * Reduce peak memory usage by making some intermediate buffers eligible for
   garbage collection prior to native code returning to R.
 * Reworked internals to simplify buffer computation and synchronization.
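
The "advance only one byte" behavior noted under Internal Changes can be illustrated with a rough base R sketch. This is not the `fansi` C implementation: `utf8_lead_len` and `utf8_steps` are hypothetical helpers written for illustration, and the validation is deliberately simplified (it checks lead and continuation byte ranges only, not overlong encodings or surrogates).

```r
# Putative sequence length implied by a lead byte (0 if not a valid lead).
# Hypothetical helper, for illustration only.
utf8_lead_len <- function(b) {
  if (b < 0x80) 1L
  else if (b >= 0xC2 && b <= 0xDF) 2L
  else if (b >= 0xE0 && b <= 0xEF) 3L
  else if (b >= 0xF0 && b <= 0xF4) 4L
  else 0L
}

# Walk a raw vector and record how many bytes are consumed at each step.
# A valid sequence consumes its full length; anything else consumes one byte.
utf8_steps <- function(bytes) {
  b <- as.integer(bytes)
  n <- length(b)
  i <- 1L
  steps <- integer()
  while (i <= n) {
    len <- utf8_lead_len(b[i])
    ok <- len >= 1L && i + len - 1L <= n
    if (ok && len > 1L) {
      cont <- b[(i + 1L):(i + len - 1L)]
      ok <- all(cont >= 0x80 & cont <= 0xBF)  # continuation bytes are 10xxxxxx
    }
    step <- if (ok) len else 1L
    steps <- c(steps, step)
    i <- i + step
  }
  steps
}

# "é" (C3 A9) followed by a 3-byte lead (E2) that is not followed by
# continuation bytes: the bad lead advances one byte, not three.
steps <- utf8_steps(as.raw(c(0xC3, 0xA9, 0xE2, 0x41, 0x42)))
print(steps)
```

Advancing by the putative width would instead swallow the two ASCII bytes after the bad lead byte.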

diff --git a/R/fansi-package.R b/R/fansi-package.R
@@ -97,6 +97,10 @@
 #' `fansi` will will warn if it encounters _Control Sequences_ that it cannot
 #' interpret or that might conflict with terminal capabilities.  You can turn
 #' off warnings via the `warn` parameter or via the "fansi.warn" global option.
+#' Any SGR codes that it interprets and re-outputs in substrings will be
+#' compatible with the specified terminal capabilities; however, some parts of
+#' substrings are copied as-is and those will retain the original unsupported
+#' SGR codes.
 #'
 #' `fansi` can work around "C0" tab control characters by turning them into
 #' spaces first with [`tabs_as_spaces`] or with the `tabs.as.spaces` parameter
@@ -116,38 +120,48 @@
 #' the effect is the same as replacement (e.g. if you have a color active and
 #' pick another one).
 #'
+#' While we try to minimize changes across `fansi` versions in how SGR sequences
+#' are output, we focus on minimizing the changes to rendered output, not
+#' necessarily the specific SGR sequences used to produce it.  To maximize the
+#' odds of getting stable SGR output use [`normalize_sgr`] and set `term.cap` to
+#' a specific set of capabilities.  In general it is likely best not to rely on
+#' the exact SGR encoding of `fansi` output.
+#'
+#' Note that `width` calculations may also change across R versions, locales,
+#' etc. (see "Encodings / UTF-8" below).
+#'
 #' @section SGR Interactions:
 #'
 #' The cumulative nature of SGR means that SGR in strings that are spliced will
 #' interact with each other.  Additionally, a substring does not inherently
 #' contain all the information required to recreate its formatting as it
 #' appeared in its source string.
 #'
-#' One form of possible interaction to consider is how a character vector
-#' provided to `fansi` functions interacts with itself.  By default, `fansi`
-#' assumes that each element in an input character vector is independent, but
-#' this is incorrect if the input is a single document with each element a line
-#' in it.  In that situation unterminated SGR from each line should bleed into
-#' subsequent ones.  Setting `carry = TRUE` enables the "single document"
-#' interpretation. [`sgr_to_html`] is the exception as for legacy reasons it
-#' defaults to `carry = TRUE`.
+#' One form of interaction to consider is how a character vector provided to
+#' `fansi` functions affects itself.  By default, `fansi` assumes that each
+#' element in an input character vector is independent, but this is incorrect if
+#' the input is a single document with each element a line in it.  In that
+#' situation unterminated SGR codes from each line should bleed into subsequent
+#' ones.  Setting `carry = TRUE` enables the "single document" interpretation.
+#' [`sgr_to_html`] is the exception as for legacy reasons it defaults to `carry
+#' = TRUE`.
 #'
 #' Another form of interaction is when substrings produced by `fansi` are
 #' spliced with or into other substrings.  By default `fansi` automatically
-#' terminates substrings it produces if they contain active SGR.  This prevents
-#' the SGR therein from affecting display of external strings, which is useful
-#' e.g. when arranging text in columns.  We can allow the SGR to bleed into
-#' appended strings by setting `terminate = FALSE`.  `carry` is unaffected by
-#' `terminate` as `fansi` records the ending SGR state prior to termination
-#' internally.
-#'
-#' Finally, `fansi` strings will be affected by any active SGR in strings they
-#' are appended to.  There are no parameters to control what happens
-#' automatically in this case, but `fansi` provides several functions that can
-#' help the user get their desired outcome.  `sgr_at_end` computes the active
-#' SGR at the end of a string, this can then be prepended onto the _input_ of
-#' `fansi` functions so that they are aware of what the active style at the
-#' beginning of the string.  Alternatively, one could use
+#' terminates substrings it produces if they contain active SGR formats.  This
+#' prevents the SGR formats therein from affecting display of external strings,
+#' which is useful e.g. when arranging text in columns.  We can allow the SGR
+#' formats to bleed into appended strings by setting `terminate = FALSE`.
+#' `carry` is unaffected by `terminate` as `fansi` records the ending SGR state
+#' prior to termination internally.
+#'
+#' Finally, `fansi` strings will be affected by any active SGR formats in
+#' strings they are appended to.  There are no parameters to control what
+#' happens automatically in this case, but `fansi` provides several functions
+#' that can help the user get their desired outcome.  `sgr_at_end` computes the
+#' active SGR at the end of a string; this can then be prepended onto the
+#' _input_ of `fansi` functions so that they are aware of the active style
+#' at the beginning of the string.  Alternatively, one could use
 #' `close_sgr(sgr_at_end(...))` and pre-pend that to the _output_ of `fansi`
 #' functions so they are unaffected by preceding SGR.  One could also just
 #' prepend "ESC[0m", but in some cases as described in
@@ -171,40 +185,27 @@
 #' These issues are most likely to occur with invalid UTF-8 sequences,
 #' combining character sequences, and emoji.  For example, whether special
 #' characters such as emoji are considered one or two wide evolves as software
-#' adopts newer versions of Unicode.  Do not expect the `fansi` width
-#' calculations to always work correctly with strings containing emoji.
+#' implements newer versions of the Unicode databases.  Do not expect `fansi`
+#' width calculations to always work correctly with strings containing emoji.
 #'
-#' Internally, `fansi` computes the width of every UTF-8 character sequence
+#' Internally, `fansi` computes the width of most UTF-8 character sequences
 #' outside of the ASCII range using the native `R_nchar` function.  This will
-#' cause such characters to be processed slower than ASCII characters.
-#' Additionally, `fansi` character width computations can differ from R width
-#' computations despite the use of `R_nchar`. `fansi` always computes width for
-#' each character individually, which assumes that the sum of the widths of each
-#' character is equal to the width of a sequence.  However, it is theoretically
-#' possible for a character sequence that forms a single grapheme to break that
-#' assumption. In informal testing we have found this to be rare because in the
-#' most common multi-character graphemes the trailing characters are computed as
-#' zero width.
-#'
-#' As of R 3.4.0 `substr` appears to use UTF-8 character byte sizes as indicated
-#' by the leading byte, irrespective of whether the subsequent bytes lead to a
-#' valid sequence.  Additionally, UTF-8 byte sequences as long as 5 or 6 bytes
-#' may be allowed, which is likely a holdover from older Unicode versions.
-#' `fansi` mimics this behavior.  It is likely `substr` will start failing with
-#' invalid UTF-8 byte sequences with R 3.6.0 (as per SVN r74488).  In general,
-#' you should assume that `fansi` may not replicate base R exactly when there
-#' are illegal UTF-8 sequences present.
-#'
-#' Our long term objective is to implement proper UTF-8 character width
-#' computations, but for simplicity and also because R and our terminal do not
-#' do it properly either we are deferring the issue for now.
+#' cause such characters to be processed slower than ASCII characters.  `fansi`
+#' also attempts to approximate the effect of emoji combining sequences on
+#' string widths, which R does not do, at least as of R 4.1.  The
+#' [`utf8`](https://cran.r-project.org/package=utf8) package provides a
+#' conforming grapheme parsing implementation.
+#'
+#' Because `fansi` implements its own internal UTF-8 parsing, it is possible
+#' that you will see results different from those that R produces even on
+#' strings without _Control Sequences_.
 #'
 #' @section Overflow:
 #'
 #' The maximum length of input character vector elements allowed by `fansi` is
-#' the 32 bit INT_MAX, excluding the terminating NULL.  This appears to be the
+#' the 32 bit INT_MAX, excluding the terminating NUL.  As of R 4.1 this is the
 #' limit for R character vector elements generally, but is enforced at the C
-#' level nonetheless.
+#' level by `fansi` nonetheless.
 #'
 #' It is possible that during processing strings that are shorter than INT_MAX
 #' would become longer than that. `fansi` checks for that overflow and will

diff --git a/R/sgr.R b/R/sgr.R
@@ -179,5 +179,8 @@ close_sgr <- function(
 ##
 ## This is to simulate what `strwrap` does, exposed for testing purposes.
 
-process <- function(x) .Call(FANSI_process, enc2utf8(x))
+process <- function(x, ctl="all")
+  .Call(
+    FANSI_process, enc2utf8(x), seq_along(VALID.TERM.CAP), match(ctl, VALID.CTL)
+  )
 
diff --git a/R/strwrap.R b/R/strwrap.R
@@ -50,7 +50,7 @@
 #' @param strip.spaces TRUE (default) or FALSE, if TRUE, extraneous white spaces
 #'   (spaces, newlines, tabs) are removed in the same way as [base::strwrap]
 #'   does.  When FALSE, whitespaces are preserved, except for newlines as those
-#'   are implicit in boundaries between vector elements.
+#'   are implicit boundaries between output vector elements.
 #' @param tabs.as.spaces FALSE (default) or TRUE, whether to convert tabs to
 #'   spaces.  This can only be set to TRUE if `strip.spaces` is FALSE.
 #' @note For the `strwrap*` functions the `carry` parameter affects whether

diff --git a/R/substr2.R b/R/substr2.R
@@ -63,6 +63,7 @@
 #'
 #' @note Non-ASCII strings are converted to and returned in UTF-8 encoding.
 #'   Width calculations will not work properly in R < 3.2.2.
+#' @note If `stop` < `start`, the return value is always an empty string.
 #' @inheritParams base::substr
 #' @export
 #' @seealso [`?fansi`][fansi] for details on how _Control Sequences_ are
@@ -72,15 +73,17 @@
 #'   to compute the SGR required to close active SGR.
 #' @param x a character vector or object that can be coerced to such.
 #' @param type character(1L) partial matching `c("chars", "width")`, although
-#'   `type="width"` only works correctly with R >= 3.2.2.  With "width", whether
-#'   C0 and C1 are treated as zero width may depend on R version and locale in
-#'   addition what the `ctl` parameter is set to.  For example, for R4.1 in
-#'   UTF-8 locales C0 and C1 will be zero width even if the value of `ctl` is
-#'   such that they wouldn't be so in other circumstances.
+#'   `type="width"` only works correctly with R >= 3.2.2.  See
+#'   [`?nchar`][base::nchar]. With "width", the results might be affected by
+#'   locale changes, Unicode database updates, and logic changes for processing
+#'   of complex graphemes.  Generally you should not rely on a specific output
+#'   e.g. by embedding it in unit tests.  For the most part `fansi` (currently)
+#'   uses the internals of `base::nchar(type='width')`, but there are exceptions
+#'   and this may change in the future.
 #' @param round character(1L) partial matching
 #'   `c("start", "stop", "both", "neither")`, controls how to resolve
 #'   ambiguities when a `start` or `stop` value in "width" `type` mode falls
-#'   within a multi-byte character or a wide display character.  See details.
+#'   within a wide display character.  See details.
 #' @param tabs.as.spaces FALSE (default) or TRUE, whether to convert tabs to
 #'   spaces.  This can only be set to TRUE if `strip.spaces` is FALSE.
 #' @param tab.stops integer(1:n) indicating position of tab stops to use
@@ -103,6 +106,9 @@
 #'   problematic _Control Sequences_ are encountered.  These could cause the
 #'   assumptions `fansi` makes about how strings are rendered on your display
 #'   to be incorrect, for example by moving the cursor (see [`?fansi`][fansi]).
+#'   If the problematic sequence is a tab, you can use the `tabs.as.spaces`
+#'   parameter on functions that have it, or the `tabs_as_spaces` function, to
+#'   turn the tabs to spaces and resolve the warning that way.
 #' @param term.cap character a vector of the capabilities of the terminal, can
 #'   be any combination of "bright" (SGR codes 90-97, 100-107), "256" (SGR codes
 #'   starting with "38;5" or "48;5"), and "truecolor" (SGR codes starting with
@@ -269,6 +275,9 @@ substr2_sgr <- function(
 
 ## @x must already have been converted to UTF8
 ## @param type.int is supposed to be the matched version of type, minus 1
+##
+## Increasingly, it seems trying to re-use the crayon method instead of doing
+## everything in C was a big mistake...
 
 substr_ctl_internal <- function(
   x, start, stop, type.int, round, tabs.as.spaces,
@@ -304,83 +313,92 @@ substr_ctl_internal <- function(
 
   x.scalar <- length(x) == 1
   x.u <- if(x.scalar) x else unique_chr(x)
+  ids <- if(x.scalar) seq_along(s.s.valid) else seq_along(x)
 
   for(u in x.u) {
     elems <- which(x == u & s.s.valid)
     elems.len <- length(elems)
-    e.start <- start[elems]
+    # we want to specify minimum number of position/width elements
+    e.start <- start[elems] - 1L
     e.stop <- stop[elems]
+    e.ids <- ids[elems]
     x.elems <- if(x.scalar) rep(x, length.out=elems.len) else x[elems]
 
     # note, for expediency we're currently assuming that there is no overlap
     # between starts and stops
-
     e.order <- forder(c(e.start, e.stop))
 
-    e.lag <- rep(c(round.start, round.stop), each=elems.len)[e.order]
-    e.ends <- rep(c(FALSE, TRUE), each=elems.len)[e.order]
+    e.keep <- rep(c(!round.start, round.stop), each=elems.len)[e.order]
     e.sort <- c(e.start, e.stop)[e.order]
 
     state <- .Call(
       FANSI_state_at_pos_ext,
-      u, e.sort - 1L, type.int,
-      e.lag,   # whether to include a partially covered multi-byte character
-      e.ends,  # whether it's a start or end position
+      u, e.sort, type.int,
+      e.keep,  # whether to include a partially covered multi-byte character
+      rep(c(TRUE, FALSE), each=length(elems))[e.order], # start or end of string
       warn, term.cap.int,
-      ctl.int, normalize, carry
+      ctl.int, normalize, terminate,
+      c(e.ids, e.ids)[e.order]
     )
     # Recover the matching values for e.sort
-
-    e.unsort.idx <- match(seq_along(e.order), e.order)
+    e.unsort.idx <- match(seq_along(e.order), e.order)  # e.order[e.order]?
     start.stop.ansi.idx <- .Call(FANSI_cleave, e.unsort.idx)
     start.ansi.idx <- start.stop.ansi.idx[[1L]]
     stop.ansi.idx <- start.stop.ansi.idx[[2L]]
 
     # And use those to substr with
-
-    start.ansi <- state[[2]][3, start.ansi.idx]
+    start.ansi <- state[[2]][3, start.ansi.idx] + 1L
     stop.ansi <- state[[2]][3, stop.ansi.idx]
     start.tag <- state[[1]][start.ansi.idx]
     stop.tag <- state[[1]][stop.ansi.idx]
 
-    # if there is any ANSI CSI at end then add a terminating CSI, warnings
-    # should have been issued on first read
-
-    end.csi <-
-      if(terminate) close_sgr(stop.tag, warn=FALSE, normalize)
-      else ""
-    tmp <- paste0(
-      start.tag,
-      substr(x.elems, start.ansi, stop.ansi)
-    )
-    res[elems] <- paste0(
-      if(normalize)
-        normalize_sgr(tmp, warn=warn, term.cap=VALID.TERM.CAP[term.cap.int])
-      else tmp, end.csi
-    )
-  }
+    # It's possible to end up with starts after stops because starts always
+    # ingest trailing SGR.
+    empty.req <- e.start >= e.stop
+    empty.res <- !empty.req & start.ansi > stop.ansi
+    if(!terminate) res[elems[empty.res]] <- start.tag[empty.res]
+
+    # Finalize real substrings
+    full <- !empty.res & !empty.req
+    if(any(full)) {
+      # if there is any ANSI CSI at end then add a terminating CSI, warnings
+      # should have been issued on first read
+      end.csi <-
+        if(terminate) close_sgr(stop.tag[full], warn=FALSE, normalize)
+        else ""
+
+      substring <- substr(x.elems[full], start.ansi[full], stop.ansi[full])
+      tmp <- paste0(start.tag[full], substring)
+      term.cap <- VALID.TERM.CAP[term.cap.int]
+      res[elems[full]] <- paste0(
+        if(normalize) normalize_sgr(tmp, warn=FALSE, term.cap=term.cap)
+        else tmp,
+        end.csi
+  ) } }
   res
 }
 
 ## Need to expose this so we can test bad UTF8 handling because substr will
-## behave different with bad UTF8 pre and post R 3.6.0
+## behave differently with bad UTF8 pre and post R 3.6.0.  Make sure things
+## are sorted properly given starts are input -1L.
 
 state_at_pos <- function(
   x, starts, ends, warn=getOption('fansi.warn'),
   normalize=getOption('fansi.normalize', FALSE),
-  carry=getOption('fansi.carry', FALSE)
+  terminate=getOption('fansi.terminate', FALSE)
 ) {
   is.start <- c(rep(TRUE, length(starts)), rep(FALSE, length(ends)))
   .Call(
     FANSI_state_at_pos_ext,
-    x, as.integer(c(starts, ends)) - 1L,
-    0L,      # character type
-    is.start,  # lags
-    !is.start, # ends
+    x, as.integer(c(starts - 1L, ends)),
+    0L,        # character type
+    is.start,  # keep if is.start
+    is.start,  # indicate that it's a start
     warn,
     seq_along(VALID.TERM.CAP),
     1L,        # ctl="all"
     normalize,
-    carry
+    terminate,
+    rep(seq_along(starts), 2)
   )
 }
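
The sort/unsort step in `substr_ctl_internal` (combine starts and stops, order them, then recover the original arrangement with `match(seq_along(e.order), e.order)`) can be checked in isolation. The sketch below is a standalone illustration, not the package code: it assumes the internal `forder` returns an ordering permutation like `base::order`, which is used here as a stand-in, and the positions are made up.

```r
# Hypothetical 0-based starts and stops for three substrings of one string
e.start <- c(4L, 0L, 2L)
e.stop  <- c(6L, 3L, 5L)

pos <- c(e.start, e.stop)
e.order <- order(pos)      # stand-in for the internal `forder` (assumption)
e.sort <- pos[e.order]     # positions in increasing order

# Inverse permutation: where each original element landed after sorting
e.unsort.idx <- match(seq_along(e.order), e.order)

# Indexing the sorted values with the inverse restores the input order
stopifnot(identical(e.sort[e.unsort.idx], pos))

# `order(e.order)` yields the same inverse permutation
stopifnot(identical(e.unsort.idx, order(e.order)))

# `e.order[e.order]` is generally not the inverse
stopifnot(!identical(e.order[e.order], e.unsort.idx))
```

In this example at least, the `e.order[e.order]` alternative floated in the inline comment does not recover the matching indices, whereas `match` (or `order(e.order)`) does.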
diff --git a/R/tohtml.R b/R/tohtml.R
@@ -312,7 +312,7 @@ in_html <- function(x, css=character(), pre=TRUE, display=TRUE, clean=display) {
     if(pre) "</pre>",
     "</body>", "</html>"
   )
-  f <- tempfile()
+  f <- paste0(tempfile(), ".html")
   writeLines(html, f)
   if(display) browseURL(f)  # nocov, can't do this in tests
   if(clean) {