This repository has been archived by the owner on Apr 30, 2021. It is now read-only.

performance of citation processing #190

Closed
aw-bib opened this issue Nov 9, 2015 · 10 comments
Comments

@aw-bib

aw-bib commented Nov 9, 2015

I recently tried to convert a book typeset in LaTeX to docx using pandoc. Everything worked nicely except the references from BibTeX. I was able to strip the original BibTeX input from some 1500 references down to the ones actually used, but even with ~400 references pandoc-citeproc does not seem to finish processing. Since the number of BibTeX entries cannot be reduced any further, it would be nice if there were some other way to handle such bibliographies.

I could have gotten by with converting chapter by chapter, but with an input of ~400 entries it did not finish even for a chapter with only 25 references. (I stopped it after some 15 min at 100% CPU.)

Besides theses and other scientific books, pandoc would also come in handy for producing bibliographies in a number of formats, e.g. something along the lines of \nocite{*} with a given bibtex input. However, annual reporting schemes easily reach several hundred publications. #71 does not seem to gain enough here.

I tried pandoc 1.15.1 on Linux.

@njbart
Contributor

njbart commented Nov 9, 2015

Well, this is what it looks like on my Mid-2011 MacBook Air, with a 1683-item biblatex file:

$ time -p pandoc -s -F pandoc-citeproc -o test.html << EOT
---
bibliography: test.bib
nocite: '@*'
...
EOT

real 58.08
user 56.97
sys 0.86
$ 

So this doesn’t look quite as bad as your report suggests.

Can you process your bib(la)tex files with latex/pdflatex/xelatex/… and bibtex/biber?

Any error messages with pandoc-citeproc -y yourfile.bib?

Any error messages with biber --tool -V yourfile.bib?

@aw-bib
Author

aw-bib commented Nov 9, 2015

This sounds interesting indeed.

> Can you process your bib(la)tex files with latex/pdflatex/xelatex/… and bibtex/biber?

Yes, in LaTeX everything compiles nicely and I get a bibliography as well.

Are there any known cases where pandoc-citeproc is a bit more picky than e.g. bibtex?

I'll check the suggested tools tonight.

@aw-bib
Author

aw-bib commented Nov 9, 2015

> Any error messages with biber --tool -V yourfile.bib?

I did indeed fix an error with an invalid key.

As for pandoc-citeproc -y yourfile.bib, I see no error message as such. However, Debian's version of pandoc (1.12) throws a

Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.

It does not allow for the RTS options, so I tried the latest and greatest deb from pandoc (1.15.1). This one starts running, eats some 9 GB of RAM and sits there. Any ideas what may eat up the RAM? To me it sounds a bit like a parsing error, but since I don't know what pandoc-citeproc is trying to accomplish, I have no idea what to look for.

@jgm
Owner

jgm commented Nov 9, 2015

+++ Alexander Wagner [Nov 09 15 10:25 ]:

> Any error messages with biber --tool -V yourfile.bib?
>
> I did indeed fix an error with an invalid key.
>
> As for pandoc-citeproc -y yourfile.bib, I see no error message as such.
> However, Debian's version of pandoc (1.12) throws a
> Stack space overflow: current size 8388608 bytes.
> Use `+RTS -Ksize -RTS' to increase it.
>
> It does not allow for the RTS options, so I tried the latest and
> greatest deb from pandoc (1.15.1). This one starts running, eats some
> 9 GB of RAM and sits there. Any ideas what may eat up the RAM? To me it
> sounds a bit like a parsing error, but since I don't know what
> pandoc-citeproc is trying to accomplish, I have no idea what to look
> for.

Can you upload your bibtex file somewhere so we can test?

@aw-bib
Author

aw-bib commented Nov 10, 2015

> Can you upload your bibtex file somewhere so we can test?

Sure. Feel free to fetch it from http://www.desy.de/~arwagner/pandoc-citeproc.bib

@njbart
Contributor

njbart commented Nov 10, 2015

Delete CROSSREF = {Walden-2008}, from

@BOOK{Walden-2008,
  CROSSREF  = {Walden-2008},
  EDITION   = {1. publ.},
  EDITOR    = {Scott Walden},
  ISBN      = {9781405139243},
  LOCATION  = {Malden, MA},
  PAGETOTAL = {XII, 325},
  PPN_gvk   = {566382393},
  PUBLISHER = {Blackwell},
  SERIES    = {New directions in aesthetics},
  SUBTITLE  = {essays on the pencil of nature},
  TITLE     = {{P}hotography and philosophy},
  YEAR      = {2008},
}

… and try again.

Quite clearly something you should not have in your data. Not sure whether it’s possible (or worth trying) for pandoc-citeproc to catch this.

@aw-bib
Author

aw-bib commented Nov 11, 2015

Ah! A loop, indeed. And of course you're right. How did you find it? I have some dealings with other people's bibliographies, and knowledge about "how to detect errors" comes in handy.

@aw-bib
Author

aw-bib commented Nov 11, 2015

@nickbart1980 you made my day. :)

300 pages later I can report a working conversion including all bibliographic references. And indeed there is no performance issue; it was just the looping crossref.

Maybe you can comment here on how to find such errors or how you did it.

@aw-bib aw-bib closed this as completed Nov 11, 2015
@njbart
Contributor

njbart commented Nov 11, 2015

No special tools, I’m afraid, just vgrep :-)
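
For larger files, a scripted check can stand in for reading through the entries by eye. The following is only a minimal sketch in Python, not a tool used in this thread: the function names `crossref_map` and `crossref_cycles` are made up for illustration, and the regular expressions assume a conventionally formatted file (one `@TYPE{key,` per entry, `crossref` values in braces or double quotes) rather than implementing a full BibTeX parser.

```python
import re

# Matches the start of an entry such as "@BOOK{Walden-2008," and captures the key.
ENTRY_START = re.compile(r"@\s*(\w+)\s*\{\s*([^,\s]+)\s*,")
# Matches a field such as "CROSSREF = {Walden-2008}" or crossref = "key".
CROSSREF = re.compile(r"\bcrossref\s*=\s*[{\"]\s*([^}\"]+?)\s*[}\"]", re.IGNORECASE)


def crossref_map(bib_text):
    """Return {entry key: crossref target} for entries that have a crossref field."""
    refs = {}
    starts = list(ENTRY_START.finditer(bib_text))
    for i, match in enumerate(starts):
        # An entry body runs from the end of its header to the next entry header.
        end = starts[i + 1].start() if i + 1 < len(starts) else len(bib_text)
        found = CROSSREF.search(bib_text[match.end():end])
        if found:
            refs[match.group(2)] = found.group(1)
    return refs


def crossref_cycles(refs):
    """Return each crossref loop once, as a list of keys in loop order."""
    cycles = []
    done = set()  # keys whose chain has already been followed to its end
    for start in refs:
        path = []
        key = start
        while key in refs and key not in done and key not in path:
            path.append(key)
            key = refs[key]
        if key in path:  # the chain ran back into itself
            cycles.append(path[path.index(key):])
        done.update(path)
    return cycles
```

Applied to the text of a file containing the `Walden-2008` entry above, `crossref_cycles(crossref_map(text))` would report `['Walden-2008']` as a loop of length one; longer loops (A refers to B, B refers back to A) are listed the same way.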

@jgm
Owner

jgm commented Nov 11, 2015

We should probably fix pandoc-citeproc so it doesn't go into an infinite loop even with a loopy bibtex file. So I'll reopen this as a reminder to do that.
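
One common way to make crossref resolution terminate on loopy data is to remember which keys have already been visited while following the chain. The sketch below is for illustration only: it is not the code from the eventual fix (pandoc-citeproc itself is written in Haskell), the function name `resolve_crossrefs` and the dictionary-based data model are assumptions, and the inheritance rule (a child simply takes any field it lacks from its parents) is a simplification of what BibTeX and biblatex actually do.

```python
def resolve_crossrefs(entries):
    """Return a copy of entries in which each entry inherits missing fields
    from its crossref chain.

    entries maps each key to a dict of fields (hypothetical data model).
    A set of visited keys guarantees termination even if the data loops.
    """
    resolved = {}
    for key, fields in entries.items():
        merged = dict(fields)
        seen = {key}  # keys already visited on this chain
        parent = fields.get("crossref")
        while parent in entries and parent not in seen:
            seen.add(parent)
            for name, value in entries[parent].items():
                merged.setdefault(name, value)  # the child's own fields win
            parent = entries[parent].get("crossref")
        resolved[key] = merged
    return resolved
```

With this guard, an entry whose crossref points at itself is returned unchanged instead of being followed forever, and ordinary parent/child chains are still resolved.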

@jgm jgm reopened this Nov 11, 2015
@jgm jgm closed this as completed in 6aca99f Nov 11, 2015