Fix #66: graphemes, ...
Also more consistent treatment of leading and trailing
spaces and SGR, and many other fixes/cleanup
brodieG committed Jun 27, 2021
2 parents e9a7e55 + b4a2cfc commit ea44713
Showing 51 changed files with 1,141 additions and 644 deletions.
302 changes: 198 additions & 104 deletions DEVNOTES.md → DEVNOTES.Rmd

Large diffs are not rendered by default.

33 changes: 31 additions & 2 deletions NEWS.md
@@ -4,16 +4,43 @@

### Features

* [#66](https://github.com/brodieG/fansi/issues/66) Improved handling of
graphemes in `type="width"` mode. Flags and well formed emoji sequences
should have widths computed correctly in most common use cases.
* [#64](https://github.com/brodieG/fansi/issues/64) New function `normalize_sgr`
converts compound SGR sequences into normalized form (e.g. "ESC[44;31m"
becomes "ESC[31mESC[44m") for better compatibility with
[`crayon`](https://github.com/r-lib/crayon). Additionally, most functions
gain a `normalize` parameter so that they may return their output in
normalized form.
* [#71](https://github.com/brodieG/fansi/issues/71) Functions that write SGR are
now more parsimonious.
* `html_esc` gains a `what` parameter to indicate which HTML special characters
should be escaped.
* Many functions gain `carry` and `terminate` parameters to control how `fansi`
generated substrings interact with surrounding formats.
* New function `state_at_end` to compute active SGR state at end of a string.
* New function `close_sgr` to generate a closing SGR sequence given an active
SGR state.
* [#71](https://github.com/brodieG/fansi/issues/71) Functions that write SGR are
now more parsimonious (see "Behavior Changes" below).
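
To make the `normalize_sgr` entry concrete, here is a minimal base R sketch of splitting a compound SGR sequence into one sequence per parameter. It is not how `fansi` does it: `split_simple_sgr` is a hypothetical helper, it assumes simple single-parameter codes only (it would wrongly break up "38;5;n" style colors), and it keeps parameters in their original order, whereas the NEWS example shows `fansi` emitting the foreground before the background.

```r
# Illustrative only, not fansi's implementation. Splits each compound SGR
# sequence into one sequence per parameter, preserving parameter order.
# Assumes simple codes only (no "38;5;n" or "38;2;r;g;b" colors).
split_simple_sgr <- function(x) {
  m <- gregexpr("\033\\[[0-9;]*m", x)
  regmatches(x, m) <- lapply(
    regmatches(x, m),
    function(seqs) {
      vapply(
        seqs,
        function(s) {
          # extract the parameter list between "ESC[" and "m"
          params <- strsplit(
            sub("^\033\\[(.*)m$", "\\1", s), ";", fixed=TRUE
          )[[1]]
          paste0("\033[", params, "m", collapse="")
        },
        character(1), USE.NAMES=FALSE
      )
    }
  )
  x
}

compound <- "\033[44;31mhello\033[0m"
identical(split_simple_sgr(compound), "\033[44m\033[31mhello\033[0m")  # TRUE
```

Each resulting sequence carries a single code, which is the property the NEWS entry says helps compatibility with `crayon`.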

### Behavior Changes

A big part of the 1.0 release is an extensive refactoring of many parts of the
ANSI CSI SGR intake and output algorithms. In some cases this means that some
`fansi` functions will output SGR slightly differently than they did before. In
almost all cases the rendering of the SGR should remain unchanged, although
there are some corner cases with changes (e.g. in `strwrap_ctl` SGRs embedded in
whitespace sequences don't break the sequence).

The changes are a side effect of applying more consistent treatment of corner
cases around leading and trailing SGR in substrings. Trailing SGR in the output
is now omitted as it would be immediately closed (assuming `terminate=TRUE`, the
default). Leading SGR is interpreted and re-output.

Normally output consistency alone would not be a reason to change behavior, but
in this case the changes should be almost always undetectable in the
**rendered** output, and maintaining old behavior would further complicate
finicky C string manipulation code.

### Bug Fixes

@@ -22,6 +49,8 @@

### Internal Changes

* More aggressive UTF-8 validation; invalid UTF-8 sequences now advance only one
byte instead of their putative width based on a valid initial byte.
* Reduce peak memory usage by making some intermediate buffers eligible for
garbage collection prior to native code returning to R.
* Reworked internals to simplify buffer computation and synchronization.
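
The UTF-8 item above can be pictured with a small base R sketch. This is an assumption-laden illustration, not the C code in `fansi`: `char_advance` is a hypothetical helper, and its validation is deliberately simplified (it ignores overlong encodings and surrogates, which a real validator must also reject). It returns how many bytes to advance from position `i`: the putative width implied by the lead byte when the continuation bytes are well formed, otherwise one byte.

```r
# Illustrative only: simplified UTF-8 check, not fansi's implementation.
char_advance <- function(bytes, i) {
  b <- as.integer(bytes[i])
  # putative sequence length implied by the lead byte
  putative <-
    if(b < 0x80) 1L
    else if(b >= 0xC2 && b <= 0xDF) 2L
    else if(b >= 0xE0 && b <= 0xEF) 3L
    else if(b >= 0xF0 && b <= 0xF4) 4L
    else 1L                                    # not a valid lead byte
  if(putative == 1L) return(1L)
  tail.idx <- i + seq_len(putative - 1L)
  if(max(tail.idx) > length(bytes)) return(1L) # truncated sequence
  tail <- as.integer(bytes[tail.idx])
  # continuation bytes must look like 10xxxxxx
  if(all(bitwAnd(tail, 0xC0L) == 0x80L)) putative else 1L
}

good <- as.raw(c(0xC3, 0xA9))   # valid two-byte sequence (U+00E9)
bad  <- as.raw(c(0xC3, 0x41))   # two-byte lead followed by ASCII "A"
char_advance(good, 1L)          # 2
char_advance(bad, 1L)           # 1
```

Advancing a single byte on the invalid input means the "A" that follows is still read as its own character instead of being swallowed by the bad lead byte.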
99 changes: 50 additions & 49 deletions R/fansi-package.R
@@ -97,6 +97,10 @@
#' `fansi` will warn if it encounters _Control Sequences_ that it cannot
#' interpret or that might conflict with terminal capabilities. You can turn
#' off warnings via the `warn` parameter or via the "fansi.warn" global option.
#' Any SGR codes that it interprets and re-outputs in substrings will be
#' compatible with the specified terminal capabilities; however, some parts of
#' substrings are copied as-is and those will retain the original unsupported
#' SGR codes.
#'
#' `fansi` can work around "C0" tab control characters by turning them into
#' spaces first with [`tabs_as_spaces`] or with the `tabs.as.spaces` parameter
@@ -116,38 +120,48 @@
#' the effect is the same as replacement (e.g. if you have a color active and
#' pick another one).
#'
#' While we try to minimize changes across `fansi` versions in how SGR sequences
#' are output, we focus on minimizing the changes to rendered output, not
#' necessarily the specific SGR sequences used to produce it. To maximize the
#' odds of getting stable SGR output use [`normalize_sgr`] and set `term.cap` to
#' a specific set of capabilities. In general it is likely best not to rely on
#' the exact SGR encoding of `fansi` output.
#'
#' Note that `width` calculations may also change across R versions, locales,
#' etc. (see "Encodings / UTF-8" below).
#'
#' @section SGR Interactions:
#'
#' The cumulative nature of SGR means that SGR in strings that are spliced will
#' interact with each other. Additionally, a substring does not inherently
#' contain all the information required to recreate its formatting as it
#' appeared in its source string.
#'
#' One form of possible interaction to consider is how a character vector
#' provided to `fansi` functions interacts with itself. By default, `fansi`
#' assumes that each element in an input character vector is independent, but
#' this is incorrect if the input is a single document with each element a line
#' in it. In that situation unterminated SGR from each line should bleed into
#' subsequent ones. Setting `carry = TRUE` enables the "single document"
#' interpretation. [`sgr_to_html`] is the exception as for legacy reasons it
#' defaults to `carry = TRUE`.
#' One form of interaction to consider is how a character vector provided to
#' `fansi` functions affects itself. By default, `fansi` assumes that each
#' element in an input character vector is independent, but this is incorrect if
#' the input is a single document with each element a line in it. In that
#' situation unterminated SGR codes from each line should bleed into subsequent
#' ones. Setting `carry = TRUE` enables the "single document" interpretation.
#' [`sgr_to_html`] is the exception as for legacy reasons it defaults to `carry
#' = TRUE`.
#'
#' Another form of interaction is when substrings produced by `fansi` are
#' spliced with or into other substrings. By default `fansi` automatically
#' terminates substrings it produces if they contain active SGR. This prevents
#' the SGR therein from affecting display of external strings, which is useful
#' e.g. when arranging text in columns. We can allow the SGR to bleed into
#' appended strings by setting `terminate = FALSE`. `carry` is unaffected by
#' `terminate` as `fansi` records the ending SGR state prior to termination
#' internally.
#'
#' Finally, `fansi` strings will be affected by any active SGR in strings they
#' are appended to. There are no parameters to control what happens
#' automatically in this case, but `fansi` provides several functions that can
#' help the user get their desired outcome. `sgr_at_end` computes the active
#' SGR at the end of a string, this can then be prepended onto the _input_ of
#' `fansi` functions so that they are aware of what the active style at the
#' beginning of the string. Alternatively, one could use
#' terminates substrings it produces if they contain active SGR formats. This
#' prevents the SGR formats therein from affecting display of external strings,
#' which is useful e.g. when arranging text in columns. We can allow the SGR
#' formats to bleed into appended strings by setting `terminate = FALSE`.
#' `carry` is unaffected by `terminate` as `fansi` records the ending SGR state
#' prior to termination internally.
#'
#' Finally, `fansi` strings will be affected by any active SGR formats in
#' strings they are appended to. There are no parameters to control what
#' happens automatically in this case, but `fansi` provides several functions
#' that can help the user get their desired outcome. `sgr_at_end` computes the
#' active SGR at the end of a string; this can then be prepended onto the
#' _input_ of `fansi` functions so that they are aware of the active style
#' at the beginning of the string. Alternatively, one could use
#' `close_sgr(sgr_at_end(...))` and prepend that to the _output_ of `fansi`
#' functions so they are unaffected by preceding SGR. One could also just
#' prepend "ESC[0m", but in some cases as described in
@@ -171,40 +185,27 @@
#' These issues are most likely to occur with invalid UTF-8 sequences,
#' combining character sequences, and emoji. For example, whether special
#' characters such as emoji are considered one or two wide evolves as software
#' adopts newer versions of Unicode. Do not expect the `fansi` width
#' calculations to always work correctly with strings containing emoji.
#' implements newer versions of the Unicode databases. Do not expect the `fansi`
#' width calculations to always work correctly with strings containing emoji.
#'
#' Internally, `fansi` computes the width of every UTF-8 character sequence
#' Internally, `fansi` computes the width of most UTF-8 character sequences
#' outside of the ASCII range using the native `R_nchar` function. This will
#' cause such characters to be processed slower than ASCII characters.
#' Additionally, `fansi` character width computations can differ from R width
#' computations despite the use of `R_nchar`. `fansi` always computes width for
#' each character individually, which assumes that the sum of the widths of each
#' character is equal to the width of a sequence. However, it is theoretically
#' possible for a character sequence that forms a single grapheme to break that
#' assumption. In informal testing we have found this to be rare because in the
#' most common multi-character graphemes the trailing characters are computed as
#' zero width.
#'
#' As of R 3.4.0 `substr` appears to use UTF-8 character byte sizes as indicated
#' by the leading byte, irrespective of whether the subsequent bytes lead to a
#' valid sequence. Additionally, UTF-8 byte sequences as long as 5 or 6 bytes
#' may be allowed, which is likely a holdover from older Unicode versions.
#' `fansi` mimics this behavior. It is likely `substr` will start failing with
#' invalid UTF-8 byte sequences with R 3.6.0 (as per SVN r74488). In general,
#' you should assume that `fansi` may not replicate base R exactly when there
#' are illegal UTF-8 sequences present.
#'
#' Our long term objective is to implement proper UTF-8 character width
#' computations, but for simplicity and also because R and our terminal do not
#' do it properly either we are deferring the issue for now.
#' cause such characters to be processed slower than ASCII characters. `fansi`
#' also attempts to approximate the effect of emoji combining sequences on
#' string widths, which R does not do, at least as of R 4.1. The
#' [`utf8`](https://cran.r-project.org/package=utf8) package provides a
#' conforming grapheme parsing implementation.
#'
#' Because `fansi` implements its own internal UTF-8 parsing it is possible
#' that you will see results different from those that R produces even on
#' strings without _Control Sequences_.
#'
#' @section Overflow:
#'
#' The maximum length of input character vector elements allowed by `fansi` is
#' the 32 bit INT_MAX, excluding the terminating NULL. This appears to be the
#' the 32 bit INT_MAX, excluding the terminating NULL. As of R4.1 this is the
#' limit for R character vector elements generally, but is enforced at the C
#' level nonetheless.
#' level by `fansi` nonetheless.
#'
#' It is possible that during processing strings that are shorter than INT_MAX
#' would become longer than that. `fansi` checks for that overflow and will
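
The `carry` and `terminate` behavior described in the "SGR Interactions" section above can be modeled in a few lines of base R. This is a toy model under stated assumptions, not the package's algorithm: `active_sgr` and `carry_and_terminate` are hypothetical helpers, and the "active state" is taken to be every SGR sequence seen since the last full reset ("ESC[0m" or "ESC[m"), whereas `fansi` tracks individual styles and colors in C.

```r
# Toy model only, not fansi's implementation.
# Active state = all SGR sequences since the last full reset.
active_sgr <- function(x) {
  seqs <- regmatches(x, gregexpr("\033\\[[0-9;]*m", x))[[1]]
  resets <- which(seqs %in% c("\033[0m", "\033[m"))
  if(length(resets)) seqs <- seqs[-seq_len(max(resets))]
  paste0(seqs, collapse="")
}
carry_and_terminate <- function(lines, carry=TRUE, terminate=TRUE) {
  prev <- ""
  out <- character(length(lines))
  for(i in seq_along(lines)) {
    # carry: state left open by earlier elements bleeds into this one
    line <- if(carry) paste0(prev, lines[i]) else lines[i]
    # record the ending state *before* terminating, so that `carry`
    # is unaffected by `terminate`
    prev <- active_sgr(line)
    out[i] <- if(terminate && nzchar(prev)) paste0(line, "\033[0m") else line
  }
  out
}

doc <- c("\033[31mred starts", "still red\033[0m", "plain")
carry_and_terminate(doc)
# element 1: "ESC[31mred startsESC[0m"   (terminated)
# element 2: "ESC[31mstill redESC[0m"    (red carried in from element 1)
# element 3: "plain"
```

With `carry=FALSE` the second element comes back unchanged, which is the "each element is independent" default the documentation describes.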
5 changes: 4 additions & 1 deletion R/sgr.R
@@ -179,5 +179,8 @@ close_sgr <- function(
##
## This is to simulate what `strwrap` does, exposed for testing purposes.

process <- function(x) .Call(FANSI_process, enc2utf8(x))
process <- function(x, ctl="all")
.Call(
FANSI_process, enc2utf8(x), seq_along(VALID.TERM.CAP), match(ctl, VALID.CTL)
)

2 changes: 1 addition & 1 deletion R/strwrap.R
@@ -50,7 +50,7 @@
#' @param strip.spaces TRUE (default) or FALSE, if TRUE, extraneous white spaces
#' (spaces, newlines, tabs) are removed in the same way as [base::strwrap]
#' does. When FALSE, whitespaces are preserved, except for newlines as those
#' are implicit in boundaries between vector elements.
#' are implicit boundaries between output vector elements.
#' @param tabs.as.spaces FALSE (default) or TRUE, whether to convert tabs to
#' spaces. This can only be set to TRUE if `strip.spaces` is FALSE.
#' @note For the `strwrap*` functions the `carry` parameter affects whether
100 changes: 59 additions & 41 deletions R/substr2.R
@@ -63,6 +63,7 @@
#'
#' @note Non-ASCII strings are converted to and returned in UTF-8 encoding.
#' Width calculations will not work properly in R < 3.2.2.
#' @note If `stop` < `start`, the return value is always an empty string.
#' @inheritParams base::substr
#' @export
#' @seealso [`?fansi`][fansi] for details on how _Control Sequences_ are
@@ -72,15 +73,17 @@
#' to compute the SGR required to close active SGR.
#' @param x a character vector or object that can be coerced to such.
#' @param type character(1L) partial matching `c("chars", "width")`, although
#' `type="width"` only works correctly with R >= 3.2.2. With "width", whether
#' C0 and C1 are treated as zero width may depend on R version and locale in
#' addition what the `ctl` parameter is set to. For example, for R4.1 in
#' UTF-8 locales C0 and C1 will be zero width even if the value of `ctl` is
#' such that they wouldn't be so in other circumstances.
#' `type="width"` only works correctly with R >= 3.2.2. See
#' [`?nchar`][base::nchar]. With "width", the results might be affected by
#' locale changes, Unicode database updates, and logic changes for processing
#' of complex graphemes. Generally you should not rely on a specific output
#' e.g. by embedding it in unit tests. For the most part `fansi` (currently)
#' uses the internals of `base::nchar(type='width')`, but there are exceptions
#' and this may change in the future.
#' @param round character(1L) partial matching
#' `c("start", "stop", "both", "neither")`, controls how to resolve
#' ambiguities when a `start` or `stop` value in "width" `type` mode falls
#' within a multi-byte character or a wide display character. See details.
#' within a wide display character. See details.
#' @param tabs.as.spaces FALSE (default) or TRUE, whether to convert tabs to
#' spaces. This can only be set to TRUE if `strip.spaces` is FALSE.
#' @param tab.stops integer(1:n) indicating position of tab stops to use
@@ -103,6 +106,9 @@
#' problematic _Control Sequences_ are encountered. These could cause the
#' assumptions `fansi` makes about how strings are rendered on your display
#' to be incorrect, for example by moving the cursor (see [`?fansi`][fansi]).
#' If the problematic sequence is a tab, you can use the `tabs.as.spaces`
#' parameter on functions that have it, or the `tabs_as_spaces` function, to
#' turn the tabs to spaces and resolve the warning that way.
#' @param term.cap character a vector of the capabilities of the terminal, can
#' be any combination of "bright" (SGR codes 90-97, 100-107), "256" (SGR codes
#' starting with "38;5" or "48;5"), and "truecolor" (SGR codes starting with
@@ -269,6 +275,9 @@ substr2_sgr <- function(

## @x must already have been converted to UTF8
## @param type.int is supposed to be the matched version of type, minus 1
##
## Increasingly, it seems trying to re-use the crayon method instead of doing
## everything in C was a big mistake...

substr_ctl_internal <- function(
x, start, stop, type.int, round, tabs.as.spaces,
@@ -304,83 +313,92 @@ substr_ctl_internal <- function(

x.scalar <- length(x) == 1
x.u <- if(x.scalar) x else unique_chr(x)
ids <- if(x.scalar) seq_along(s.s.valid) else seq_along(x)

for(u in x.u) {
elems <- which(x == u & s.s.valid)
elems.len <- length(elems)
e.start <- start[elems]
# we want to specify minimum number of position/width elements
e.start <- start[elems] - 1L
e.stop <- stop[elems]
e.ids <- ids[elems]
x.elems <- if(x.scalar) rep(x, length.out=elems.len) else x[elems]

# note, for expediency we're currently assuming that there is no overlap
# between starts and stops

e.order <- forder(c(e.start, e.stop))

e.lag <- rep(c(round.start, round.stop), each=elems.len)[e.order]
e.ends <- rep(c(FALSE, TRUE), each=elems.len)[e.order]
e.keep <- rep(c(!round.start, round.stop), each=elems.len)[e.order]
e.sort <- c(e.start, e.stop)[e.order]

state <- .Call(
FANSI_state_at_pos_ext,
u, e.sort - 1L, type.int,
e.lag, # whether to include a partially covered multi-byte character
e.ends, # whether it's a start or end position
u, e.sort, type.int,
e.keep, # whether to include a partially covered multi-byte character
rep(c(TRUE, FALSE), each=length(elems))[e.order], # start or end of string
warn, term.cap.int,
ctl.int, normalize, carry
ctl.int, normalize, terminate,
c(e.ids, e.ids)[e.order]
)
# Recover the matching values for e.sort

e.unsort.idx <- match(seq_along(e.order), e.order)
e.unsort.idx <- match(seq_along(e.order), e.order) # e.order[e.order]?
start.stop.ansi.idx <- .Call(FANSI_cleave, e.unsort.idx)
start.ansi.idx <- start.stop.ansi.idx[[1L]]
stop.ansi.idx <- start.stop.ansi.idx[[2L]]

# And use those to substr with

start.ansi <- state[[2]][3, start.ansi.idx]
start.ansi <- state[[2]][3, start.ansi.idx] + 1L
stop.ansi <- state[[2]][3, stop.ansi.idx]
start.tag <- state[[1]][start.ansi.idx]
stop.tag <- state[[1]][stop.ansi.idx]

# if there is any ANSI CSI at end then add a terminating CSI, warnings
# should have been issued on first read

end.csi <-
if(terminate) close_sgr(stop.tag, warn=FALSE, normalize)
else ""
tmp <- paste0(
start.tag,
substr(x.elems, start.ansi, stop.ansi)
)
res[elems] <- paste0(
if(normalize)
normalize_sgr(tmp, warn=warn, term.cap=VALID.TERM.CAP[term.cap.int])
else tmp, end.csi
)
}
# It's possible to end up with starts after stops because starts always
# ingest trailing SGR.
empty.req <- e.start >= e.stop
empty.res <- !empty.req & start.ansi > stop.ansi
if(!terminate) res[elems[empty.res]] <- start.tag[empty.res]

# Finalize real substrings
full <- !empty.res & !empty.req
if(any(full)) {
# if there is any ANSI CSI at end then add a terminating CSI, warnings
# should have been issued on first read
end.csi <-
if(terminate) close_sgr(stop.tag[full], warn=FALSE, normalize)
else ""

substring <- substr(x.elems[full], start.ansi[full], stop.ansi[full])
tmp <- paste0(start.tag[full], substring)
term.cap <- VALID.TERM.CAP[term.cap.int]
res[elems[full]] <- paste0(
if(normalize) normalize_sgr(tmp, warn=FALSE, term.cap=term.cap)
else tmp,
end.csi
) } }
res
}

## Need to expose this so we can test bad UTF8 handling because substr will
## behave different with bad UTF8 pre and post R 3.6.0
## behave differently with bad UTF8 pre and post R 3.6.0. Make sure things
## are sorted properly given starts are input -1L.

state_at_pos <- function(
x, starts, ends, warn=getOption('fansi.warn'),
normalize=getOption('fansi.normalize', FALSE),
carry=getOption('fansi.carry', FALSE)
terminate=getOption('fansi.terminate', FALSE)
) {
is.start <- c(rep(TRUE, length(starts)), rep(FALSE, length(ends)))
.Call(
FANSI_state_at_pos_ext,
x, as.integer(c(starts, ends)) - 1L,
0L, # character type
is.start, # lags
!is.start, # ends
x, as.integer(c(starts - 1L, ends)),
0L, # character type
is.start, # keep if is.start
is.start, # indicate that it's a start
warn,
seq_along(VALID.TERM.CAP),
1L, # ctl="all"
normalize,
carry
terminate,
rep(seq_along(starts), 2)
)
}
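
The sort-then-unsort step in `substr_ctl_internal` above (combine starts and stops, order them, then recover which sorted entry belongs to which request) can be reproduced in base R. This is a sketch under assumptions: base `order` stands in for the internal `forder`, plain indexing stands in for `FANSI_cleave`, and the sample positions are made up.

```r
# Made-up sample positions; `order` is a stand-in for the internal `forder`.
starts <- c(5L, 1L, 3L)
stops  <- c(7L, 2L, 9L)
n      <- length(starts)

pos    <- c(starts, stops)
ord    <- order(pos)
sorted <- pos[ord]                    # positions in increasing order

# Inverse permutation: where each original entry landed in `sorted`
unsort <- match(seq_along(ord), ord)

# First half of `unsort` indexes the starts, second half the stops
start.idx <- unsort[seq_len(n)]
stop.idx  <- unsort[n + seq_len(n)]

identical(sorted[start.idx], starts)  # TRUE
identical(sorted[stop.idx], stops)    # TRUE
identical(unsort, order(ord))         # TRUE
```

The last line shows that `order(ord)` yields the same inverse permutation as the `match` call; indexing `ord` by itself would only do so when the permutation happens to be its own inverse.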
2 changes: 1 addition & 1 deletion R/tohtml.R
@@ -312,7 +312,7 @@ in_html <- function(x, css=character(), pre=TRUE, display=TRUE, clean=display) {
if(pre) "</pre>",
"</body>", "</html>"
)
f <- tempfile()
f <- paste0(tempfile(), ".html")
writeLines(html, f)
if(display) browseURL(f) # nocov, can't do this in tests
if(clean) {