Encoding error when parsing BibTeX file with multi-byte characters on Windows #20

Open
hongyuanjia opened this issue May 19, 2018 · 3 comments

Comments

@hongyuanjia

Thanks for this great package. I encountered a problem when using the bibtex package to parse BibTeX files containing Chinese characters on Windows:

# Get current locale info
Sys.getlocale()
#> [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

# Set locale to Chinese
Sys.setlocale(locale = "Chinese")
#> [1] "LC_COLLATE=Chinese (Simplified)_China.936;LC_CTYPE=Chinese (Simplified)_China.936;LC_MONETARY=Chinese (Simplified)_China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_China.936"

bib_text <- "
    @misc{text,
        title = {{你好}},
        language = {zh-CN},
        author = {{你好}},
        month = jun,
        year = {2013},
        pages = {163}
    }
"
# change encoding to "UTF-8"
bib_text_utf8 <- enc2utf8(bib_text)
Encoding(bib_text_utf8)
#> [1] "UTF-8"

# make sure the saved BibTeX file is UTF-8 encoded
con <- file("test.bib", encoding = "UTF-8")
writeLines(bib_text_utf8, con)
close(con)

readLines("test.bib", encoding = "UTF-8")
#> [1] ""                                "        @misc{text,"            
#> [3] "            title = {{你好}},"   "            language = {zh-CN},"
#> [5] "            author = {{你好}},"  "            month = jun,"       
#> [7] "            year = {2013},"      "            pages = {163}"      
#> [9] "        }"                       "    "                           

read.bib() could not parse the Chinese characters, regardless of whether encoding was set to "UTF-8":

str(bibtex::read.bib("test.bib"))
#> List of 1
#>  $ text:Class 'bibentry'  hidden list of 1
#>   ..$ text:List of 6
#>   .. ..$ title   : chr "{浣犲ソ}"
#>   .. ..$ language: chr "zh-CN"
#>   .. ..$ author  :Class 'person'  hidden list of 1
#>   .. .. ..$ :List of 5
#>   .. .. .. ..$ given  : NULL
#>   .. .. .. ..$ family : chr "浣犲ソ"
#>   .. .. .. ..$ role   : NULL
#>   .. .. .. ..$ email  : NULL
#>   .. .. .. ..$ comment: NULL
#>   .. ..$ month   : chr "jun"
#>   .. ..$ year    : chr "2013"
#>   .. ..$ pages   : chr "163"
#>   .. ..- attr(*, "bibtype")= chr "Misc"
#>   .. ..- attr(*, "key")= chr "text"
#>  - attr(*, "class")= chr "bibentry"
#>  - attr(*, "strings")= Named chr(0) 
#>   ..- attr(*, "names")= chr(0) 

str(bibtex::read.bib("test.bib", encoding = "UTF-8"))
#> List of 1
#>  $ text:Class 'bibentry'  hidden list of 1
#>   ..$ text:List of 6
#>   .. ..$ title   : chr "{浣犲ソ}"
#>   .. ..$ language: chr "zh-CN"
#>   .. ..$ author  :Class 'person'  hidden list of 1
#>   .. .. ..$ :List of 5
#>   .. .. .. ..$ given  : NULL
#>   .. .. .. ..$ family : chr "浣犲ソ"
#>   .. .. .. ..$ role   : NULL
#>   .. .. .. ..$ email  : NULL
#>   .. .. .. ..$ comment: NULL
#>   .. ..$ month   : chr "jun"
#>   .. ..$ year    : chr "2013"
#>   .. ..$ pages   : chr "163"
#>   .. ..- attr(*, "bibtype")= chr "Misc"
#>   .. ..- attr(*, "key")= chr "text"
#>  - attr(*, "class")= chr "bibentry"
#>  - attr(*, "strings")= Named chr(0) 
#>   ..- attr(*, "names")= chr(0) 
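
A minimal sketch of why the output may look like this, assuming the UTF-8 bytes of the file end up being interpreted in the locale's native code page (GBK/CP936 here). This is an illustration only, not a trace of what the package's parser actually does. It uses base R's iconv(), and assumes the local iconv implementation supports the names "GBK" and "UTF-8".

```r
# "你好" written with escapes so the snippet does not depend on this file's encoding
original <- "\u4f60\u597d"

# Take the raw UTF-8 bytes and drop the encoding mark
utf8_bytes <- rawToChar(charToRaw(enc2utf8(original)))

# Misread those bytes as GBK: the suspected source of the garbling
garbled <- iconv(utf8_bytes, from = "GBK", to = "UTF-8")

# Undo the damage: encode back to GBK bytes, then declare them to be UTF-8
restored <- iconv(garbled, from = "UTF-8", to = "GBK")
Encoding(restored) <- "UTF-8"

identical(restored, original)
```

The six UTF-8 bytes are consumed as three two-byte GBK characters, which is consistent with the three-character garbled strings in the str() output. Because the misreading is reversible, the bytes themselves appear to be intact and only their declared encoding is wrong.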

Here is my session info:

sessionInfo()
#> R version 3.5.0 (2018-04-23)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 17134)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 

After digging a little, I found that encoding the input of make.bib.entry as "UTF-8" solves this problem, but I am not sure whether this is a proper solution.

devtools::install_github("hongyuanjia/bibtex")
str(bibtex::read.bib("test.bib"))
#> List of 1
#>  $ text:Class 'bibentry'  hidden list of 1
#>   ..$ text:List of 6
#>   .. ..$ title   : chr "{你好}"
#>   .. ..$ language: chr "zh-CN"
#>   .. ..$ author  :Class 'person'  hidden list of 1
#>   .. .. ..$ :List of 5
#>   .. .. .. ..$ given  : NULL
#>   .. .. .. ..$ family : chr "你好"
#>   .. .. .. ..$ role   : NULL
#>   .. .. .. ..$ email  : NULL
#>   .. .. .. ..$ comment: NULL
#>   .. ..$ month   : chr "jun"
#>   .. ..$ year    : chr "2013"
#>   .. ..$ pages   : chr "163"
#>   .. ..- attr(*, "bibtype")= chr "Misc"
#>   .. ..- attr(*, "key")= chr "text"
#>  - attr(*, "class")= chr "bibentry"
#>  - attr(*, "strings")= Named chr(0) 
#>   ..- attr(*, "names")= chr(0) 
@mrustl

mrustl commented Dec 11, 2018

I have a similar problem with my bib file (kwb_dummy.txt) on Windows:

### Importing file with default encoding
bibtex::read.bib(file = "kwb_dummy.txt")

Grützmacher G, Kumar P, Rustler M, Hannappel S, Sauer U (2013). “Geogenic
groundwater contamination – definition, occurrence and relevance for drinking
water production.” _Zbl. Geol. Paläont. Teil I_, *1*(1), 69-75.

### Setting encoding to UTF-8 does not change result
bibtex::read.bib(file = "kwb_dummy.txt", encoding = "UTF-8")
Grützmacher G, Kumar P, Rustler M, Hannappel S, Sauer U (2013). “Geogenic
groundwater contamination – definition, occurrence and relevance for drinking
water production.” _Zbl. Geol. Paläont. Teil I_, *1*(1), 69-75.

> bibtex::read.bib(file = "kwb_dummy.txt")

Grützmacher G, Kumar P, Rustler M, Hannappel S, Sauer U (2013). “Geogenic
groundwater contamination – definition, occurrence and relevance for drinking
water production.” _Zbl. Geol. Paläont. Teil I_, *1*(1), 69-75.

### Correct import with readLines
readLines("kwb_dummy.txt", n = 3, encoding = "UTF-8")
[1] "@article{RN7335,"                                                                                                     
[2] "   author = {Grützmacher, Gesche and Kumar, P.J.Sajil and Rustler, Michael and Hannappel, Stephan and Sauer, U.},"    
[3] "   title = {Geogenic groundwater contamination – definition, occurrence and relevance for drinking water production},"

### System
sessioninfo::session_info()
- Session info ----------------------------------------------------------------------------
 setting  value                       
 version  R version 3.5.1 (2018-07-02)
 os       Windows 7 x64 SP 1          
 system   x86_64, mingw32             
 ui       RStudio                     
 language (EN)                        
 collate  English_United Kingdom.1252 
 ctype    English_United Kingdom.1252 
 tz       Europe/Berlin               
 date     2018-12-11                  

- Packages --------------------------------------------------------------------------------
 package     * version date       lib source        
 assertthat    0.2.0   2017-04-11 [1] CRAN (R 3.5.0)
 bibtex        0.4.2   2017-06-30 [1] CRAN (R 3.5.1)
 cli           1.0.1   2018-09-25 [1] CRAN (R 3.5.1)
 crayon        1.3.4   2017-09-16 [1] CRAN (R 3.5.0)
 digest        0.6.18  2018-10-10 [1] CRAN (R 3.5.1)
 evaluate      0.12    2018-10-09 [1] CRAN (R 3.5.1)
 htmltools     0.3.6   2017-04-28 [1] CRAN (R 3.5.0)
 httr          1.3.1   2017-08-20 [1] CRAN (R 3.5.0)
 jsonlite      1.6     2018-12-07 [1] CRAN (R 3.5.1)
 knitr         1.20    2018-02-20 [1] CRAN (R 3.5.0)
 lubridate     1.7.4   2018-04-11 [1] CRAN (R 3.5.0)
 magrittr      1.5     2014-11-22 [1] CRAN (R 3.5.1)
 packrat       0.4.9-3 2018-06-01 [1] CRAN (R 3.5.1)
 plyr          1.8.4   2016-06-08 [1] CRAN (R 3.5.1)
 R6            2.3.0   2018-10-04 [1] CRAN (R 3.5.1)
 Rcpp          1.0.0   2018-11-07 [1] CRAN (R 3.5.0)
 RefManageR    1.2.0   2018-04-25 [1] CRAN (R 3.5.1)
 rmarkdown     1.11    2018-12-08 [1] CRAN (R 3.5.1)
 rstudioapi    0.8     2018-10-02 [1] CRAN (R 3.5.1)
 sessioninfo   1.1.0   2018-09-25 [1] CRAN (R 3.5.1)
 stringi       1.2.4   2018-07-20 [1] CRAN (R 3.5.1)
 stringr       1.3.1   2018-05-10 [1] CRAN (R 3.5.1)
 withr         2.1.2   2018-03-15 [1] CRAN (R 3.5.0)
 xml2          1.2.0   2018-01-24 [1] CRAN (R 3.5.1)

[1] C:/Users/mrustl.KWB/Documents/R/win-library/3.5
[2] C:/Program Files/R/R-3.5.1/library
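
To rule out the file itself as the cause, one can check that its bytes are valid UTF-8 before blaming the parser. The sketch below is a hypothetical helper (is_utf8_file is not part of bibtex) built on base R's validUTF8(); it assumes the default options(encoding = "native.enc"), so that readLines() does not re-encode the input.

```r
# Hypothetical helper: TRUE if every line of the file is valid UTF-8
is_utf8_file <- function(path) {
  # With the default connection encoding, bytes are read as-is
  lines <- readLines(path, warn = FALSE)
  all(validUTF8(lines))
}

# Demo with two temporary files containing "Grützmacher"
utf8_file   <- tempfile(fileext = ".bib")
latin1_file <- tempfile(fileext = ".bib")

# "ü" is the two bytes C3 BC in UTF-8, but the single byte FC in Latin-1
writeBin(c(charToRaw("Gr"), as.raw(c(0xc3, 0xbc)), charToRaw("tzmacher\n")), utf8_file)
writeBin(c(charToRaw("Gr"), as.raw(0xfc), charToRaw("tzmacher\n")), latin1_file)

is_utf8_file(utf8_file)
is_utf8_file(latin1_file)
```

A file that passes this check but still comes back garbled from read.bib() points at the reading step rather than at the file.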

@GegznaV

GegznaV commented Mar 15, 2020

I can still confirm that there is an encoding issue in bibtex::do_read_bib() and bibtex::read.bib() on Windows:

file <- "book.bib"
encoding <- "UTF-8"
out <- bibtex::do_read_bib(file, encoding = encoding, srcfile(file, encoding = encoding))
out[[1]]

##                                                      address 
##                                                      "Vilnius" 
##                                                         author 
##   "{\\v{C}}ekanavi{\\v{c}}ius, Vydas and Murauskas, Gediminas" 
##                                                          title 
##      "{Taikomoji regresinÄ— analizÄ— socialiniuose tyrimuose}" 

The contents of "book.bib" file:

@book{Cekanavicius2014,
	address = {Vilnius},
	author = {{\v{C}}ekanavi{\v{c}}ius, Vydas and Murauskas, Gediminas},
	title = {{Taikomoji regresinė analizė socialiniuose tyrimuose}},
	year = {2014}
}

An RStudio project for further experimentation: bib-file--UTF-8--issue.zip

@romainfrancois This is quite an old issue. What can be done towards solving it? A fix would also resolve issues in packages that depend on bibtex, including ropensci/RefManageR#66 and crsh/citr#67.

@hongyuanjia
Author

hongyuanjia commented Nov 9, 2020

Some findings on this:

bibtex::read.bib() can read bib files on Windows if they were written in the native encoding (native.enc):

Sys.setlocale(locale = "Chinese")
#> [1] "LC_COLLATE=Chinese (Simplified)_China.936;LC_CTYPE=Chinese (Simplified)_China.936;LC_MONETARY=Chinese (Simplified)_China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_China.936"

bib_text <-
"
@misc{text,
    title = {{你好}},
    author = {{你好}},
    year = 2020
}
"

# native encoding which is the default on Windows
options(encoding = "native.enc")
writeLines(bib_text, "native.enc.bib")

readLines("native.enc.bib")
# [1] ""                       "@misc{text,"
# [3] "    title = {{你好}},"  "    author = {{你好}},"
# [5] "    year = 2020"        "}"
# [7] ""

# default encoding option "unknown" which is equivalent to "native.enc"
bibtex::read.bib("native.enc.bib", encoding = "unknown") 
# 你好 (2020). "你好."

bibtex::read.bib() cannot read bib files on Windows if they were written in UTF-8:

# UTF-8 encoding
# NOTE:
# 'native.enc' encoding option is still necessary on Windows to ensure
# writing as UTF-8. useBytes should also be set to TRUE to prevent re-encoding the
# text in the file() connection in writeLines()
# See https://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/
# and https://github.com/yihui/xfun/blob/12e77f58cbee106bfdfb0b288282f47cbf537937/R/io.R#L32
options(encoding = 'native.enc')
writeLines(enc2utf8(bib_text), "utf8.bib", useBytes = TRUE)

readLines("utf8.bib", encoding = "UTF-8")
# [1] ""                           "    @misc{text,"
# [3] "        title = {{你好}},"  "        author = {{你好}},"
# [5] "        year = 2020"        "    }"
# [7] ""

bibtex::read.bib("utf8.bib", encoding = "UTF-8")
# 浣犲ソ (2020). "浣犲ソ

The issue here is that even when UTF-8 is selected as the encoding, bibtex::do_read_bib() still returns the parsed text marked as native-encoded:

out_native.enc <- .External( "do_read_bib", file = "native.enc.bib", encoding = "unknown", srcfile = srcfile("native.enc.bib", "native.enc") )
out_native.enc
# [[1]]
#    title   author     year 
# "{你好}" "{你好}"   "2020" 
# attr(,"entry")
# [1] "misc"
# attr(,"key")
# [1] "text"
# 
# attr(,"include")
# character(0)
# attr(,"strings")
# named character(0)
# attr(,"preamble")
# character(0)

# native encoded which is expected
lapply(out_native.enc, Encoding)
# [[1]]
# [1] "unknown" "unknown" "unknown"
#

out_utf8 <- .External( "do_read_bib", file = "utf8.bib", encoding = "UTF-8", srcfile = srcfile("utf8.bib", "UTF-8") )
out_utf8
# [[1]]
#      title     author       year
# "{浣犲ソ}" "{浣犲ソ}"     "2020"
# attr(,"entry")
# [1] "misc"
# attr(,"key")
# [1] "text"
#
# attr(,"include")
# character(0)
# attr(,"strings")
# named character(0)
# attr(,"preamble")
# character(0)

# this is also native encoded
lapply(out_utf8, Encoding)
# [[1]]
# [1] "unknown" "unknown" "unknown"
#

Forcing the encoding to UTF-8 fixes this issue:

# change to UTF-8
lapply(out_utf8, `Encoding<-`, "UTF-8")
# [[1]]
#    title   author     year
# "{你好}" "{你好}"   "2020"
# attr(,"entry")
# [1] "misc"
# attr(,"key")
# [1] "text"
#
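
Note that the lapply() call above returns a plain list, so the top-level attributes ("include", "strings", "preamble") are dropped. A minimal sketch of a variant that keeps them follows. mark_utf8 is a hypothetical helper name, and the stand-in object only imitates the shape of the parser output printed above; it is not produced by do_read_bib().

```r
# Hypothetical helper: mark every field of every parsed entry as UTF-8
mark_utf8 <- function(parsed) {
  # Assigning into parsed[] keeps the attributes of the outer list
  parsed[] <- lapply(parsed, function(entry) {
    Encoding(entry) <- "UTF-8"  # changes only the declared encoding, not the bytes
    entry
  })
  parsed
}

# Stand-in for the parser output: UTF-8 bytes with no encoding mark
# (assumed shape, based on the printed result above)
title <- rawToChar(as.raw(c(0x7b, 0xe4, 0xbd, 0xa0, 0xe5, 0xa5, 0xbd, 0x7d)))  # "{你好}"
entry <- c(title = title, year = "2020")
attr(entry, "key") <- "text"
parsed <- structure(list(entry), strings = character(0))

fixed <- mark_utf8(parsed)
Encoding(fixed[[1]])
attributes(fixed)
```

Pure ASCII fields such as the year stay "unknown", because R never marks ASCII strings with an encoding.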

Since do_read_bib() is written in C, it is possible that the input stream defaults to the "C" locale and falls back to the native encoding on Windows. Unfortunately, I know little about C, so this is just a guess. It is supported by the fact that changing the encoding option for do_read_bib() produces the same parsed text and encoding:

Encoding(.External( "do_read_bib", file = "native.enc.bib", encoding = "latin1", srcfile = srcfile("native.enc.bib", "native.enc"))[[1]])
# [1] "unknown" "unknown" "unknown"

In summary, on Windows it is better to always use native.enc. Downstream packages that use bibtex::do_read_bib(), such as RefManageR::ReadBib(), should set the default encoding to "unknown" instead of "UTF-8".
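
As a user-side stopgap, a UTF-8 bib file can be re-encoded to the native encoding before it is parsed. The sketch below uses only base R; utf8_bib_to_native is a hypothetical helper name, and it assumes every character in the file is representable in the current locale (enc2native() substitutes <U+XXXX> escapes for those that are not).

```r
# Hypothetical helper: re-encode a UTF-8 .bib file into the session's native encoding
utf8_bib_to_native <- function(path, out = tempfile(fileext = ".bib")) {
  # Mark the input lines as UTF-8 (readLines does not re-encode them)
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE)
  # Convert to the native encoding and write the resulting bytes unchanged
  writeLines(enc2native(lines), out, useBytes = TRUE)
  out
}

# Usage (not run here, since it needs the bibtex package and the file from above):
# native_file <- utf8_bib_to_native("utf8.bib")
# bibtex::read.bib(native_file, encoding = "unknown")
```

This only helps when the locale can represent the file's characters, for example Chinese text under a Chinese locale. It would not rescue Chinese text under an English_United States.1252 locale.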

I will send a PR to provide a possible fix on the R side.

coatless pushed a commit that referenced this issue Jan 13, 2022
* Upgrade testing suite to testthat 3

* Add testing for previous issues

#45

* Add tests for standard bibtex entries

As defined in BibTEX version 0.99b
https://ctan.javinator9889.com/biblio/bibtex/base/btxdoc.pdf

* Update actions

* Add testing for examples

* Add snapshots for examples read.bib This may fail on some platforms

* Skip on non windows

Possibly a character problem related with
#20 and
#43

* Not test on R 3.4

Some changes in default  parsing (snapshot), but results are still ok

* Add more tests

* Try to increase coverage

* One more tests for do_read_bib

* Fix test for do_read_bib

* Revert actions

* Add devtools for testing

* Add more tests

* Add test for multiline string

* Add non standard field names

* Add myself as author

* Move issues to inst/bib files

* Refactor tests for avoiding cluttering
coatless added a commit that referenced this issue Sep 23, 2022
* Upgrade testing suite to testthat 3

* Add testing for previous issues

#45

* Add tests for standard bibtex entries

As defined in BibTEX version 0.99b
https://ctan.javinator9889.com/biblio/bibtex/base/btxdoc.pdf

* Update actions

* Add testing for examples

* Add snapshots for examples read.bib This may fail on some platforms

* Skip on non windows

Possibly a character problem related with
#20 and
#43

* Not test on R 3.4

Some changes in default  parsing (snapshot), but results are still ok

* Add more tests

* Try to increase coverage

* One more tests for do_read_bib

* Fix test for do_read_bib

* Revert actions

* Add devtools for testing

* Add more tests

* Add test for multiline string

* Add non standard field names

* Add myself as author

* Move issues to inst/bib files

* Refactor tests for avoiding cluttering

* Modify do_read_bib()

Also bump version

* Remove C code :)

* Recreate snapshots on Linux

* Check reverse dependencies

* Add backports

* Deprecate arguments

* Use seq_along

Co-authored-by: James J Balamuta <coatless@users.noreply.github.com>

* Use seq_along again

Co-authored-by: James J Balamuta <coatless@users.noreply.github.com>

* Remove completely header and footer

* Document internal functions

* Rename internal params

* Use vapply and new tests

* Rerun revdeps

* Update docs and snapshots

* Update revdep and check action

Some snapshots changes due to changes on toBibTex() on later versions of R, not related with the package

* Fix action and snapshots

Co-authored-by: James J Balamuta <coatless@users.noreply.github.com>