Better approaches to importing of the Zotero database #55

Open
Robinlovelace opened this issue Oct 16, 2019 · 5 comments

Comments

@Robinlovelace
Contributor

It's frustrating when citr freezes your session, so I thought I'd have a play with the future package. Results seem promising so far, so I thought I'd report back, having alluded several months ago to the potential utility of running the initial bib read in the background. The basic concept is demonstrated in the reprex below. Thoughts welcome!

```r
# exclude things to make reprex faster
exclude = c("My Library", "energy-and-transport")

# no future
tictoc::tic()
b = citr::load_betterbiblatex_bib(encoding = "UTF-8", exclude_betterbiblatex_library = exclude)
#> Importing 'LIDA-leeds'...
#> Importing 'tds'...
plot(1:9)
tictoc::toc()
#> 0.58 sec elapsed
tictoc::tic()
# do some other work
class(b)
#> [1] "BibEntry" "bibentry"
tictoc::toc()
#> 0.002 sec elapsed


# with future
tictoc::tic()
future::plan("multiprocess")
b = future::future(citr::load_betterbiblatex_bib(encoding = "UTF-8", exclude_betterbiblatex_library = exclude))
plot(1:9)

tictoc::toc()
#> 0.085 sec elapsed
tictoc::tic()
# do some other work
b = future::value(b)
#> Importing 'LIDA-leeds'...
#> Importing 'tds'...
class(b)
#> [1] "BibEntry" "bibentry"
tictoc::toc()
#> 0.322 sec elapsed
```

Created on 2019-10-16 by the reprex package (v0.3.0)
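
As a rough illustration of the same idea, here is a minimal sketch that checks whether the background read has finished before asking for its value, so the session is never blocked by `future::value()`. `slow_load()` is a hypothetical stand-in for `citr::load_betterbiblatex_bib()` (it just sleeps), and the polling interval is an arbitrary choice, not anything citr does.

```r
library(future)

# Hypothetical stand-in for citr::load_betterbiblatex_bib():
# sleeps for a second to mimic a slow import, then returns dummy keys
slow_load <- function() {
  Sys.sleep(1)
  c("ref_a", "ref_b")
}

plan(multisession)        # evaluate futures in background R sessions

f <- future(slow_load())  # returns immediately; the work runs elsewhere

# Poll without blocking; other work could happen between checks
n_checks <- 0
while (!resolved(f)) {
  n_checks <- n_checks + 1
  Sys.sleep(0.1)
}

refs <- value(f)          # the future is resolved, so this does not wait
plan(sequential)          # shut the background workers down again
```

In an addin, the `resolved()` check would presumably run when the user opens the citation dialog rather than in a loop.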

@Robinlovelace
Contributor Author

As a follow-on point, I've just tested out parsing files with the bib2df package and it seems fast.

Timings below on a .bib file with 2000+ entries, FYI.

```
> system.time({b = bib2df::bib2df("allrefs.bib")})
Some BibTeX entries may have been dropped.
            The result could be malformed.
            Review the .bib file and make sure every single entry starts
            with a '@'.
Column `YEAR` contains character strings.
              No coercion to numeric applied.
   user  system elapsed 
  2.098   0.003   2.112 
Warning message:
In bib2df_tidy(bib, separate_names) : NAs introduced by coercion
> nrow(b)
[1] 2755
> system.time({b2 = citr:::read_bib_catch_error("allrefs.bib")})
<simpleError in RefManageR::ReadBib(x, check = FALSE, .Encoding = encoding): argument "encoding" is missing, with no default>
   user  system elapsed 
  0.108   0.000   0.108 
> system.time({b2 = citr:::read_bib_catch_error("~/uaf/allrefs.bib", )})
x=         encoding=  
> system.time({b2 = citr:::read_bib_catch_error("~/uaf/allrefs.bib", "UTF-8")})
   user  system elapsed 
  7.179   0.093   7.272 
```
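
Since bib2df warns that entries may have been dropped, one quick sanity check is to count the lines that start with `@` and compare that with `nrow(b)`. The sketch below uses base R only and a tiny made-up file. `count_bib_entries()` is a hypothetical helper, and it will also count `@comment`, `@string` and `@preamble` blocks, so treat the number as an upper bound.

```r
# Count lines that open a BibTeX entry, i.e. start with '@'
# (leading whitespace is tolerated)
count_bib_entries <- function(path) {
  lines <- readLines(path, warn = FALSE, encoding = "UTF-8")
  sum(grepl("^\\s*@", lines))
}

# Tiny made-up file, for illustration only
tmp <- tempfile(fileext = ".bib")
writeLines(c(
  "@article{a2019,",
  "  title = {First},",
  "  year = {2019}",
  "}",
  "",
  "@book{b2018,",
  "  title = {Second},",
  "  year = {2018}",
  "}"
), tmp)

n_entries <- count_bib_entries(tmp)  # 2 for the file above
```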

@Robinlovelace
Contributor Author

Update: FYI, I think the output from bib2df is not production-ready yet. Just food for thought...

@crsh
Owner

crsh commented Oct 16, 2019

Hi Robin, thanks for sharing your results. This is actually one of the top two issues I want to tackle next. This looks promising.

Here are some of my thoughts on this. I think there are two major options here to speed up reading from Zotero:

  1. Improve the current approach by possibly speeding up the reading of the bibliography file exposed by BBT by trying bib2df and using future or promises to enable loading the database in the background.

Have you, by chance, looked at promises? They seem to be an alternative to future, but I haven't understood the strengths of each approach well enough to decide which way to go on this. bib2df also looks like a promising alternative to RefManageR and bibtex!

  2. Search the Zotero database directly by using the BBT CAYW search (see below) and require users to use the pandoc-zotxt Lua filter with their R Markdown document format (e.g., using rmdfiltr). However, if I understand correctly, this would require installation of zotxt, another Zotero plugin.

I haven't tried zotxt and pandoc-zotxt, but if the bibliography export is fast(er than BBT), this could be the easiest and fastest way to address slow loading of the Zotero database. Hence, I'm leaning towards the second option. This would require some testing and some user interface considerations (would this be a separate addin or could it be integrated with the existing one?).
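
For the second option, a minimal sketch of what talking to the BBT "cite as you write" picker from R might look like. The base URL, port and `format` parameter are assumptions based on the Better BibTeX documentation as best I recall it, and `cayw_url()` is a hypothetical helper, so please verify against the BBT docs before relying on any of this.

```r
# Assumption: Better BibTeX exposes the CAYW picker on Zotero's local
# connector port; base URL and `format` parameter should be double-checked
cayw_url <- function(format = "pandoc",
                     base = "http://127.0.0.1:23119/better-bibtex/cayw") {
  paste0(base, "?format=", utils::URLencode(format, reserved = TRUE))
}

picker_url <- cayw_url()

# With Zotero running, something like the following would open the picker
# and return the chosen citation as text (not run here):
# citation <- readLines(picker_url, warn = FALSE)
```

The request itself is left commented out because it only works with Zotero and BBT running locally.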

Just to link to the previous issue on background loading of the Zotero database: #36
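
On the promises question from the first option: as far as I understand it, promises builds on top of future rather than replacing it, adding callbacks that fire once the background work is done. A minimal sketch under that assumption follows. `slow_load()` is again a hypothetical stand-in for the bibliography import, and the manual `later::run_now()` loop is only needed in a plain script, where no event loop is running.

```r
library(future)
library(promises)

plan(multisession)  # evaluate futures in background R sessions

# Hypothetical stand-in for the slow bibliography import
slow_load <- function() {
  Sys.sleep(1)
  c("ref_a", "ref_b")
}

refs <- NULL
failed <- FALSE

# Register callbacks instead of blocking on future::value()
p <- then(
  future(slow_load()),
  onFulfilled = function(value) refs <<- value,
  onRejected = function(err) failed <<- TRUE
)

# In a plain script nothing drives the event loop, so run it by hand
# until one of the callbacks has fired
while (is.null(refs) && !failed) {
  later::run_now(timeoutSecs = 0.1)
}

plan(sequential)  # shut the background workers down again
```

The callback style would suit an addin that updates its reference list once loading completes, whereas plain future is enough if the result is only fetched on demand.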

@crsh changed the title from "Support for reading zotero file in background" to "Better approaches to importing of the Zotero database" on Oct 16, 2019
@Robinlovelace
Contributor Author

I haven't tried promises. In my experience bib2df is buggy. All approaches sound good. I'm excited for this new behaviour and happy to test anything you come up with. Many thanks.

@crsh
Owner

crsh commented Oct 16, 2019

After playing around with pandoc-zotxt a little I've come to understand that it requires the global pandoc variable PANDOC_STATE, which was introduced in pandoc 2.4. Currently, RStudio is shipping version 2.3.1, so I'll wait until they ship a newer version before starting to implement and test this.
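
A version guard along these lines could decide whether the filter-based approach is available. This is only a sketch using the two version numbers quoted above. In a real session the installed version would presumably come from `rmarkdown::pandoc_version()` or a check such as `rmarkdown::pandoc_available("2.4")` instead of being hard-coded.

```r
# Versions quoted in the comment above
required <- numeric_version("2.4")    # first pandoc with PANDOC_STATE
bundled  <- numeric_version("2.3.1")  # version shipped with RStudio at the time

# numeric_version objects compare component by component
filter_supported <- bundled >= required  # FALSE: 2.3.1 is older than 2.4

if (!filter_supported) {
  message("pandoc ", bundled, " is older than ", required,
          "; the pandoc-zotxt filter cannot be used yet.")
}
```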
