Limit re-fetches of the Zotero library #58

retorquere · 2020-01-13T08:37:12Z

It looks like the number of times citr requests the full library from BBT can be optimized. For large libraries this should yield a performance improvement.

I'm open to adding an endpoint in BBT that would allow testing whether the library has changed since the last fetch, but to do this effectively, I must understand what triggers a re-read of the BBT-produced bib file, and whether it's cached on the citr end.

crsh · 2020-01-15T21:09:03Z

That would be great; the performance of that connection does, indeed, leave room for improvement. ;)

Currently, the bib-file exposed by BBT is only read on-demand if the user connects to BBT for the first time or subsequently requests to reload the library (e.g., because of modified or added references). This gif may give you a rough idea (notice the Reload libraries action link). In between manual requests the library is cached in R and accessed directly. Does that help?

Is it possible that the multiple requests you are seeing target different libraries (main library and group libraries)?

Full disclosure, I'm currently exploring whether a different approach to searching Zotero may be better (in short, BBT CAYW search, zotxt, and pandoc-zotxt). I'm not sure whether this is feasible, though.

retorquere · 2020-01-15T23:24:21Z

The behavior described here sounds reasonable and then I'd see no reason to change anything, but @jrennstich describes clicking connect once, and I see 3 requests for the full library, all for the main user library. I've looked in his DB and there are no groups set up in the copy I have.

The problem for him is exacerbated by a yet-unfixed problem that full library requests take unreasonably long -- this is on me to fix, but his computer took an unfortunate moment (always unfortunate for @jrennstich of course, but unfortunate in the sense that I don't like having open unsolved problems on my plate) to demand repairs.

WRT speeding up bib access -- I've done some recent (5.2.X) performance work that should make fetches substantially less painful, but perhaps not enough for your use-case. BBT exports are relatively heavyweight, and even with a fully filled cache, 24k items take 10-15 seconds to lay out on disk.

pandoc-zotxt should work I think. I can't see why I'd object to this -- BBT is good at solving some problems, not others, and I hold no illusions on how speedy it is 🙄 .

Another option would be to expose an endpoint where citr can test whether an auto-export has been set up for a specific path, and set one up if not. That would fully decouple the two while keeping the cooperation in place; a potential problem is that you would have to detect when the file on disk changes. The write to the file by BBT is atomic (I write to a temp file and once done it is renamed to the target) so you'd not get partial results, but still. OTOH, in that connect screen it shouldn't be too hard to detect that the file time has been updated since the last check.
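A minimal sketch of both halves of that idea, assuming a plain file on a local disk. The names `write_atomically` and `CachedBibFile` are hypothetical and exist in neither BBT nor citr; the first mimics the temp-file-then-rename write described above, the second re-reads the export only when its modification time has changed.

```python
import os
import tempfile


def write_atomically(path, text):
    """Write text to a temp file next to path, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        # The rename is atomic when both paths are on the same filesystem,
        # so a reader sees either the old file or the complete new one.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


class CachedBibFile:
    """Hypothetical reader that re-reads the export only when its mtime changes."""

    def __init__(self, path):
        self.path = path
        self.last_mtime_ns = None
        self.content = None
        self.reads = 0  # counts actual disk reads, for illustration

    def load(self):
        mtime_ns = os.stat(self.path).st_mtime_ns
        if mtime_ns != self.last_mtime_ns:
            with open(self.path, encoding="utf-8") as handle:
                self.content = handle.read()
            self.last_mtime_ns = mtime_ns
            self.reads += 1
        return self.content
```

Calling `load()` repeatedly costs one `stat` per call and reads the file again only after a new export has landed. One caveat of this sketch: on filesystems with coarse timestamps, two writes in quick succession can share an mtime, so comparing file size as well would make the check more robust.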

crsh · 2020-01-23T08:33:51Z

> The behavior described here sounds reasonable and then I'd see no reason to change anything, but @jrennstich describes clicking connect once, and I see 3 requests for the full library, all for the main user library. I've looked in his DB and there are no groups set up in the copy I have.

Hmm, I'll have to dig into this. Unfortunately, I'm completely swamped right now and won't get around to it before April.

> Another option would be to expose an endpoint where citr can test whether an auto-export has been set up for a specific path, and set one up if not. That would fully decouple the two while keeping the cooperation in place; a potential problem is that you would have to detect when the file on disk changes. The write to the file by BBT is atomic (I write to a temp file and once done it is renamed to the target) so you'd not get partial results, but still. OTOH, in that connect screen it shouldn't be too hard to detect that the file time has been updated since the last check.

This also sounds like a useful way to decouple reloading the bibliography from the addin. Checking when the file changed on disk should be easy enough. Do you think an additional speed-up could be gained from supporting CSL JSON rather than relying on BibTeX, as suggested in #59?

retorquere · 2020-01-23T11:40:27Z

I thought CSL JSON was going to easily beat the TeX export formats on speed, but that turns out to be false at the moment. For context, my CSL exporters do barely anything but re-use the existing Zotero CSL converters, yet the combination looks to be slower than BBT TeX. That is strange, because the cold-cache version does a lot less than the TeX formats, and the hot-cache scenario should be roughly the same. In any case, there are still benefits to using CSL:

- much easier, more reliable, and faster to parse for stuff like citekeys, titles, etc.
- if you can use CSL in your pipeline then you're probably using either citeproc or pandoc, and in both cases using .bib as a format is actually undesirable -- most likely you're translating zotero -> bibtex -> csl -> bibliography anyhow, and each step before the bibliography is lossy. Much better to just go zotero -> csl -> bibliography
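To illustrate the first point, here is a rough sketch of pulling citekeys and titles out of a CSL JSON export with nothing but a JSON parser. It assumes the export is a JSON array of CSL items whose `id` field carries the citekey; the sample entries and the helper name `citekeys_and_titles` are made up for illustration.

```python
import json

# Made-up sample data in the general shape of a CSL JSON export.
SAMPLE_EXPORT = """
[
  {"id": "doe2019", "type": "book", "title": "An Example Book"},
  {"id": "roe2020", "type": "article-journal", "title": "An Example Article"}
]
"""


def citekeys_and_titles(csl_json_text):
    """Map each item's id (assumed to be the citekey) to its title."""
    items = json.loads(csl_json_text)
    return {item["id"]: item.get("title", "") for item in items}


index = citekeys_and_titles(SAMPLE_EXPORT)
print(sorted(index))
```

Getting the same fields from a .bib file requires a BibTeX parser that handles nested braces, string macros, and TeX escapes, which is where the extra cost and fragility come from.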

Simple, non-scientific test: export of 24k items:

Better BibTeX, cold cache: 120s
Better BibTeX, hot cache: 17s
Better CSL JSON, cold cache: 278s
Better CSL JSON, hot cache: 41s
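For a quick read of these numbers, the ratios can be worked out as below. This only restates the four timings above; the dictionary layout and variable names are illustrative.

```python
# Export timings in seconds for 24k items, copied from the list above.
timings = {
    ("bibtex", "cold"): 120,
    ("bibtex", "hot"): 17,
    ("csl", "cold"): 278,
    ("csl", "hot"): 41,
}

# How much a filled cache helps each format.
cache_speedup = {
    fmt: round(timings[(fmt, "cold")] / timings[(fmt, "hot")], 1)
    for fmt in ("bibtex", "csl")
}

# How much slower CSL JSON is than BibTeX in each cache state.
csl_slowdown = {
    state: round(timings[("csl", state)] / timings[("bibtex", state)], 1)
    for state in ("cold", "hot")
}

print(cache_speedup)  # {'bibtex': 7.1, 'csl': 6.8}
print(csl_slowdown)   # {'cold': 2.3, 'hot': 2.4}
```

So the cache gives both formats a roughly 7x speedup, while CSL JSON is a bit more than 2x slower than BibTeX with the cache either cold or hot.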

I'm going to look into the performance problem with CSL. This should not be the case.

crsh · 2020-01-24T20:25:28Z

Thanks for the benchmark, that's interesting.

I agree using JSON would avoid lossy conversion between formats. I currently use BibTeX because it works with pandoc-citeproc but also with biblatex or natbib, which some users prefer. In this sense, it's a format that's applicable to a wider set of use cases that I have come across.

retorquere · 2020-02-01T10:29:19Z

We've been able to implement some substantial speedups in retorquere/zotero-better-bibtex#1389; I'm doing some tidying up, and then I'll cut a new release in the next few days. But I'm still open to creating an endpoint that citr can talk to in order to set up an auto-export automatically.

It'd also be possible to create an endpoint to query for collections so that the entire library doesn't need to be fetched, which would net a performance benefit but would make the UI on the citr side more involved.

retorquere · 2020-02-01T10:39:03Z

> I agree using JSON would avoid lossy conversion between formats. I currently use BibTeX because it works with pandoc-citeproc but also with biblatex or natbib, which some users prefer. In this sense, it's a format that's applicable to a wider set of use cases that I have come across.

I don't mean to be (too) pedantic about this, but that's a flexibility win at the cost of a quality loss.

retorquere · 2020-02-10T21:43:39Z

The CSL performance issue has been fixed in 5.2.16.

crsh · 2020-02-10T22:04:19Z

Thanks, I'll take a look the next chance I get!

retorquere mentioned this issue Jan 13, 2020

citr taking long time to access Zotero database with large database retorquere/zotero-better-bibtex#1391

Closed

crsh added the enhancement label Jan 15, 2020

jrennstich mentioned this issue Feb 5, 2020

slow import/reload of Zotero library #64

Closed
