
Version 0.6.2

@kermitt2 kermitt2 released this 20 Mar 01:23


  • Docker image covering both Deep Learning and CRF models, with GPU detection and preloading of embeddings
  • For Deep Learning models, labeling is now done in batches: applying the citation DL model is 4 times faster for BidLSTM-CRF (with or without features) and 6 times faster for SciBERT
  • More tests for sentence segmentation
  • Add the ORCID of persons when available from the PDF or via consolidation (i.e. if present in the CrossRef metadata)
  • Add BidLSTM-CRF-FEATURES header model (with feature channel)
  • Add bioRxiv end-to-end evaluation
  • Bounding boxes for optional section title coordinates
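
The batch labeling item above can be illustrated with a minimal Python sketch. It is not GROBID's or DeLFT's actual code: `batched`, `label_in_batches`, and the `predict_batch` callable are hypothetical names, and the batch size of 32 is an arbitrary assumption. The idea shown is only that the model is called once per group of inputs rather than once per input.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def label_in_batches(
    citations: Sequence[str],
    predict_batch: Callable[[Sequence[str]], List[str]],  # hypothetical model call
    batch_size: int = 32,  # arbitrary value chosen for illustration
) -> List[str]:
    """Label all citations, calling the model once per batch instead of once per item."""
    labels: List[str] = []
    for batch in batched(citations, batch_size):
        # One model invocation covers the whole batch; output order follows input order.
        labels.extend(predict_batch(batch))
    return labels
```

With five inputs and `batch_size=2`, `predict_batch` is invoked three times (on 2, 2, and 1 items) instead of five times; any real speed-up depends on the model and hardware.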


  • Reduce the size of Docker images
  • Improve end-to-end evaluation: multithreaded processing of PDFs, progress bar, and evaluation report output in Markdown format
  • Update several models covering CRF, BidLSTM-CRF and BidLSTM-CRF-FEATURES, mainly improving citation and author recognition
  • OpenNLP is the default optional sentence segmenter (in benchmarks, results similar to Pragmatic Segmenter on scholarly documents, but 30 times faster)
  • Refine sentence segmentation to exploit layout information and predicted reference callouts
  • Update jep version to 3.9.1


  • Ignore invalid UTF-8 sequences
  • Update CrossRef multithreaded calls to avoid using the unreliable time interval returned by the CrossRef REST API service, update usage of Crossref-Plus-API-Token and update the deprecated CrossRef field query.title
  • Fix missing last table or figure when generating training data for the fulltext model
  • Fix an error related to the feature value for the reference callout in the fulltext model
  • Review and correct the DeLFT configuration documentation, adding a step-by-step configuration guide
  • Other minor fixes
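
As an illustration of the "ignore invalid UTF-8 sequences" fix listed above, here is a minimal Python sketch of lenient decoding. It does not reproduce GROBID's implementation; the function name `decode_ignoring_invalid_utf8` is hypothetical. It uses the standard `errors="ignore"` option of `bytes.decode`, which drops undecodable bytes instead of raising an exception.

```python
def decode_ignoring_invalid_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, silently dropping any byte sequence that is not valid UTF-8."""
    return raw.decode("utf-8", errors="ignore")


# A stray 0xFF byte can never appear in valid UTF-8, so it is skipped.
sample = b"caf\xc3\xa9 \xff ok"
cleaned = decode_ignoring_invalid_utf8(sample)
```

Dropping bytes loses information, so a pipeline that needs to flag the damage could use `errors="replace"` instead, which substitutes U+FFFD for each invalid sequence.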