Accept application/x-bibtex for processHeaderDocument #800
Implemented an Importer that queries Grobid for the metadata of a PDF. The necessary Grobid functionality (retrieving BibTeX for a PDF) is not yet available in Grobid, but we opened a PR that implements it (kermitt2/grobid#800).
This PR is part of #GSoC and is a follow-up to #532.
Hello @btut! Thanks for the PR. Did you look at what the BibTeX results for the header metadata look like? I vaguely remember that the extracted header fields are not mapped to exactly the same fields as for a citation string in the BiblioItem object (because they are handled and normalized differently), so the BibTeX serialization might not work correctly for headers. Of course, many of the interesting extracted fields will be lost anyway in the very limited BibTeX format as compared to the XML version. So given your feedback, we might need to extend/review the
Hi!
I checked a handful of PDFs and the results seem fine. The only necessary field that seems to be missing is the date. I'll check the toBibTeX method to see what else might be lost.
Sure, but I think the goal for someone who needs BibTeX is to cite the paper. Aside from the date, the BibTeX entry only lacks author details like e-mail and affiliation. There is no need for these details when citing.
I'll have a look!
Hello again! Unfortunately, for many of the papers that I used for testing, Grobid did not detect a date at all. This was the case for both BibTeX and TEI.
I tested this extensively with many PDFs and it looks good! Could we move on with this PR? We would like to use this feature in JabRef.
grobid-service/src/main/java/org/grobid/service/process/GrobidRestProcessFiles.java
Hello @btut! I submitted a review requesting changes: 1) reuse the existing ISO date normalization (it is more complete; it has been changed/moved recently, so you might need to merge your PR branch with the current master) and 2) review the returned MIME type. Apart from that, I've seen one bug: when the surname is incorrectly recognized as a forename, we have a
We might want to output no author at all in the BibTeX in this case? The rest looks very good indeed, everything is well mapped as expected!
…ure/bibtexHeaderAndFulltext
If only a firstname is detected, use it without lastname.
Thanks for your review, @kermitt2!
Great catch!
I think just outputting the first name is better, as it at least preserves that information. I implemented that in 2a69fa4, let me know if you disagree!
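The fallback discussed here (keep a lone first name instead of dropping the author) can be sketched roughly as follows. This is only an illustration under assumptions: the class and method names are hypothetical and this is not the code from 2a69fa4.

```java
// Hypothetical helper illustrating the fallback; not Grobid's actual implementation.
final class AuthorNameSketch {

    // Returns "Last, First" when both parts exist, otherwise whichever part is present.
    static String formatAuthor(String firstName, String lastName) {
        String first = firstName == null ? "" : firstName.trim();
        String last = lastName == null ? "" : lastName.trim();
        if (!first.isEmpty() && !last.isEmpty()) {
            return last + ", " + first;
        }
        if (!last.isEmpty()) {
            return last;
        }
        // Only a first name was detected: keep it rather than losing the author entirely.
        return first;
    }

    public static void main(String[] args) {
        System.out.println(formatAuthor("Jane", "Doe"));
        System.out.println(formatAuthor("Jane", null));
    }
}
```

Returning an empty string when neither part is present leaves it to the caller to decide whether to omit the author field.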
Thanks a lot @btut for the changes!
Thanks for your great work! Grobid seems to work very well!
Too fast (as often): the change is actually not passing the tests because of the way dates are now output in the BibTeX:
we don't have a normalized date, because of the day range, so we would have something like this in the BibTeX:
and apparently the support of these BibTeX flavors depends on the style?
Is something like this acceptable, with both
note: implemented in PR #807
I added some thoughts in #807.
I think this would be a good way to go as it gives all possible data and the user can decide on what to use. I would go the other way around though (split
TLDR:
Please try to output the following:
    year = {2014},
    month = 4,
    date = {2014-04-07/2014-04-10},
Long answer:
This is done by the bibtex-variant "biblatex". See https://ftp.rrzn.uni-hannover.de/pub/mirror/tex-archive/macros/latex/contrib/biblatex/doc/biblatex.pdf "iso8601-2 Extended Format specification level 1", which is a "yes" to your answer. Nevertheless, date ranges can be specified: However, normal BibTeX does not use that, it only uses
This should not happen. :)
Yes, they do. However,
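The suggested output can be sketched in Java as below. It assumes that `year` and `month` are taken from the start of the range, which is not stated in the thread; the class and method names are hypothetical and not taken from Grobid.

```java
import java.time.LocalDate;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not Grobid's actual BibTeX serializer.
final class BibtexDateFieldsSketch {

    // Builds year/month for plain BibTeX plus an ISO 8601 date (or date range) for biblatex.
    static Map<String, String> fromRange(LocalDate start, LocalDate end) {
        Map<String, String> fields = new LinkedHashMap<>();
        // Assumption: year and month come from the first day of the range.
        fields.put("year", "{" + start.getYear() + "}");
        fields.put("month", String.valueOf(start.getMonthValue()));
        // LocalDate.toString() yields yyyy-MM-dd; a range is written as start/end.
        String date = start.equals(end) ? start.toString() : start + "/" + end;
        fields.put("date", "{" + date + "}");
        return fields;
    }

    // Renders one "name = value," line per field, in insertion order.
    static String render(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("  ").append(e.getKey()).append(" = ").append(e.getValue()).append(",\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(render(fromRange(LocalDate.of(2014, 4, 7), LocalDate.of(2014, 4, 10))));
    }
}
```

Emitting all three fields lets a plain BibTeX style fall back on `year` and `month`, while biblatex can use the full `date` range.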
* GrobidPdfMetadataImporter implemented
  Implemented an Importer that queries Grobid for metadata of a pdf. The necessary Grobid functionality (retrieving BibTeX for a pdf) is not yet available in Grobid, but we opened a PR that implements it (kermitt2/grobid#800).
* Fixed class when accessing resources
* Use FileHelper method to get extension
* Use jsoup to issue POST request
* Removed unnecessary field
* Reverted URLDownload
  It's no longer necessary to set the POST data by bytes as we use JSoup for that.
* Changelog entry
* Add pdf link to imported entry
* Remove citationkey from Grobid
  Grobid cannot predict a citationkey
* FirstPageImporter
* Fixed grammar mistake in CHANGELOG.md
  Co-authored-by: Christoph <siedlerkiller@gmail.com>
* Fixed Grobid tests
* Fixed Grobid URL
* Checkstyle
* Fixed doc
* Checkstyle
* Use JSoup for plaintext citations as well
* Renamed FirstPageImporter to PdfVerbatimBibTextImporter
* Fixed getName (no importer)
* Renamed Grobid importer to match convention
* PdfEmbeddedBibTeXImporter
* Renamed PdfEmbeddedBibTeXImporter to PdfEmbeddedBibFileImporter
* Checkstyle
* Remove debug output
* Checkstyle
* PdfMergeMetadataImporter
* Add DOI and ISBN fetching in PdfMergeMetadataImporter
* Fixed concurrent list access
* Adapted tests to contain fetchable ID's
* Derive XMP preferences from importFormatPreferences
* Localization
* Use Importers in JabRef
* Remove unnecessary test documents
* Checkstyle
* Grobid Timeout
* Null-check
* Use MergeImporter as WebFetcher
  Users can perform a PDF import on already imported pdf's to improve the quality of the entry
* Only force BibTeX import if everything else fails
  Fixes #7984
* Prioritize non-bruteforce importers that
  When importing, try importers that can tell if they are suitable for a certain file format or not. Some importers only check if a file is present, not if it is in the correct format (isRecognizedFormat is always true if an existing file is given). They are used last.
  The List of importers now reflects that prioritization. It is not sorted by importer names anymore. The getter-methods getImportFormats and getImportFormatList still sort the List by name for the View.
* Checkstyle
* Fixed WebFetchersTest
* Grobid does not need localization
* Followup on removed Grobid localization
* Fixed tests
* Checkstyle
* Grobid Fetcher and Tests adapted to updated Grobid
* Adapted GrobidServiceTest to updated Grobid
  Co-authored-by: Christoph <siedlerkiller@gmail.com>
I extended the API to return BibTeX for the processHeaderDocument service.
Previously, AFAIK, this was only possible for the /api/processReferences and /api/processCitation services.
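A client could request the BibTeX output roughly as sketched below, using the JDK's `java.net.http` client (Java 11+). Several details are assumptions rather than part of this PR: a local Grobid service at `http://localhost:8070`, the multipart field name `input` as in Grobid's documented curl examples, and the hypothetical class name.

```java
import java.io.ByteArrayOutputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical client sketch; not part of Grobid or JabRef.
final class GrobidHeaderRequestSketch {

    // Builds a multipart POST that asks processHeaderDocument for BibTeX via the Accept header.
    static HttpRequest buildRequest(String baseUrl, String fileName, byte[] pdf) {
        String boundary = "----sketch" + System.nanoTime();
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        // Assumption: the PDF is sent in a form field named "input".
        String partHeader = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"input\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/pdf\r\n\r\n";
        body.writeBytes(partHeader.getBytes(StandardCharsets.UTF_8));
        body.writeBytes(pdf);
        body.writeBytes(("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8));

        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/processHeaderDocument"))
                .header("Accept", "application/x-bibtex")
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(body.toByteArray()))
                .build();
    }

    // Usage: pass the path of a PDF; requires a running Grobid service (assumed on port 8070).
    public static void main(String[] args) throws Exception {
        Path pdf = Path.of(args[0]);
        HttpRequest request = buildRequest("http://localhost:8070",
                pdf.getFileName().toString(), Files.readAllBytes(pdf));
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```

Building the request separately from sending it keeps the header logic easy to check without a running server.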