crayon doesn't mark encoding on UTF-8 strings in some cases #136

kevinushey · 2022-03-30T21:50:22Z

For example:

library(crayon)
text <- "你好"
crayon::white(text)
crayon::white(crayon::white(text))

I see:

> crayon::white(text)
[1] "\033[37m你好\033[39m"
> crayon::white(crayon::white(text))
[1] "\033[37m\033[37mä½ å¥½\033[37m\033[39m"

Note that the text 你好 in the second example is no longer encoded correctly.

> Encoding(crayon::white(text))
[1] "UTF-8"
> Encoding(crayon::white(crayon::white(text)))
[1] "unknown"

Simply marking the encoding doesn't seem to be sufficient, though:

> white <- crayon::white(crayon::white(text))
> Encoding(white) <- "UTF-8"
> white
[1] "\033[37m\033[37m\xe4� 好\033[37m\033[39m"

so there might be something a little more fundamental going on.
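
One way to narrow this down is to compare raw bytes instead of printed output, since printing depends on the session locale. The snippet below is only a sketch and does not call crayon: it assumes the string is written with `\u` escapes so that the source file encoding does not matter, and it simulates a result that lost its encoding mark but kept its bytes. Applying the same `charToRaw()` comparison to the doubly wrapped crayon output should show whether the bytes themselves were altered.

```r
# "\u4f60\u597d" is the text "你好"; \u escapes are parsed as UTF-8.
text <- "\u4f60\u597d"
bytes <- charToRaw(text)     # raw UTF-8 bytes of the original string

# Simulate a result that kept its bytes but lost its encoding mark.
unmarked <- rawToChar(bytes)
Encoding(unmarked)           # "unknown"

# Re-marking is enough only when the bytes are still valid UTF-8.
remarked <- unmarked
Encoding(remarked) <- "UTF-8"
identical(charToRaw(remarked), bytes)
validUTF8(remarked)
```

If `validUTF8()` returns `FALSE` for the crayon result, or its bytes differ from the input beyond the added escape sequences, then re-marking the encoding cannot repair the string.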

This works as expected with crayon 1.4.2, so it appears to be a regression.

> sessionInfo()
R version 4.1.3 (2022-03-10)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 10 x64 (build 22581)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] crayon_1.5.1

loaded via a namespace (and not attached):
[1] compiler_4.1.3 tools_4.1.3


kevinushey · 2022-03-30T21:59:45Z

It might be related to some recent changes re: gsub(..., useBytes = TRUE):

> text1 <- "你好"     # no quotes
> text2 <- "'你好'"   # has quotes
> gsub("'", "", text1, useBytes = TRUE)
[1] "你好"
> gsub("'", "", text2, useBytes = TRUE)
[1] "ä½ å¥½"

However, marking the encoding post-hoc seems sufficient here:

> t2 <- gsub("'", "", text2, useBytes = TRUE)
> Encoding(t2) <- "UTF-8"
> t2
[1] "你好"
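
The post-hoc marking above could be wrapped in a small helper. This is a minimal sketch, not crayon's actual fix: `gsub_keep_utf8` is a hypothetical name, and it assumes that converting the input with `enc2utf8()` first is acceptable. It restores the UTF-8 mark only on results whose bytes are still valid UTF-8.

```r
# Hypothetical helper, not part of crayon or base R.
gsub_keep_utf8 <- function(pattern, replacement, x, ...) {
  x <- enc2utf8(x)                       # make sure the input bytes are UTF-8
  out <- gsub(pattern, replacement, x, ..., useBytes = TRUE)
  ok <- validUTF8(out)                   # byte-level validity check
  marked <- out[ok]
  Encoding(marked) <- "UTF-8"            # restore the mark on valid elements
  out[ok] <- marked
  out
}

# "\u4f60\u597d" is "你好", written with escapes to avoid locale issues.
res <- gsub_keep_utf8("'", "", "'\u4f60\u597d'")
Encoding(res)
identical(charToRaw(res), charToRaw("\u4f60\u597d"))
```

Elements that fail the validity check are returned as `gsub()` produced them, so a caller can still detect and handle them separately.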

kevinushey · 2022-05-31T18:48:01Z

The issue no longer occurs with R 4.2.0:

> library(crayon)
> text <- "你好"
> crayon::white(text)
[1] "\033[37m你好\033[39m"
> crayon::white(crayon::white(text))
[1] "\033[37m\033[37m你好\033[37m\033[39m"

and

> text1 <- "你好"     # no quotes
> text2 <- "'你好'"   # has quotes
> gsub("'", "", text1, useBytes = TRUE)
[1] "你好"
> gsub("'", "", text2, useBytes = TRUE)
[1] "你好"

I'm not sure whether supporting older versions of R on Windows is a priority.

gaborcsardi · 2022-09-28T10:05:24Z

I think this is fixed in dev crayon.

kevinushey mentioned this issue Mar 30, 2022

Encoding issue in notebook chunk output (attaching tidyverse packages) rstudio/rstudio#10789

gaborcsardi added the bug an unexpected problem or unintended behavior label Mar 31, 2022

gaborcsardi closed this as completed Sep 28, 2022
