Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for application/x-bibtex type #532

Merged
merged 11 commits into from
Mar 9, 2020
Merged

Add support for application/x-bibtex type #532

merged 11 commits into from
Mar 9, 2020

Conversation

koppor
Copy link
Contributor

@koppor koppor commented Jan 8, 2020

As I caller, I want to have BibTeX as output (instead of XML). This PR adds that functionality. A caller has to send the HTTP header Accept: application/x-bibtex

The implementation tries to keep the existing code as much as possible. Thus, the modification of the processCitation method with an additional parameter.

koppor referenced this pull request in NikodemKch/grobid Jan 8, 2020
@coveralls
Copy link

coveralls commented Jan 9, 2020

Coverage Status

Coverage increased (+0.2%) to 38.082% when pulling 0b86f3f on koppor:add-bibtex-accept into 5a325e3 on kermitt2:master.

@kermitt2
Copy link
Owner

Sorry Oliver for being so slow to review the PR!

No problem to add the bibtex response type of course.

Just two questions,

  • what is your impression about the current quality of the generated BibTeX? It's something I've written quickly like 7-8 years ago and my worry is that it's not comprehensive enough.

  • do we want, for consistency, to add this response type also to the service /api/processReferences, so extracting all the bibliographical references of a PDF into bibTeX? We can do it later of course.

@koppor
Copy link
Contributor Author

koppor commented Mar 6, 2020

what is your impression about the current quality of the generated BibTeX? It's something I've written quickly like 7-8 years ago and my worry is that it's not comprehensive enough.

It's good enough to integrate it in JabRef.

We have one major issue: The names are not in the format Lastname, Firstname, which the JabRef team thinks is the more consistent way to represent names.

We see that the date parsing is not that easy.

So following example

Kolb, S., Wirtz G.: Towards Application Portability in Platform as a Service
Proceedings of the 8th IEEE International Symposium on Service-Oriented System Engineering (SOSE), Oxford, United Kingdom, April 7 - 10, 2014.

appears as follows in JabRef:

@Article{KolbApril7102014,
  author    = {S Kolb and G Wirtz},
  year      = {April 7 - 10, 2014},
  address   = {Oxford, United Kingdom},
  booktitle = {Towards Application Portability in Platform as a Service Proceedings of the 8th IEEE International Symposium on Service-Oriented System Engineering (SOSE)},
}

We have no issues in converting ". As soon as we offer JabRef functionalities as library (refs JabRef/jabref#110), we'll come up with a PR updating the BibTeX writing.

Nevertheless, we experienced no issues in parsing the result and integrating it in JabRef.

do we want, for consistency, to add this response type also to the service /api/processReferences, so extracting all the bibliographical references of a PDF into bibTeX? We can do it later of course.

Done. 😅

@koppor
Copy link
Contributor Author

koppor commented Mar 6, 2020

Think, my tests are too straight-forward. Maybe, I should start a REST server and then use http://rest-assured.io/ ^^.

org.grobid.service.process.GrobidRestProcessStringTest > processCitationReturnsBibTeX FAILED

    java.lang.UnsatisfiedLinkError: Native Library /home/travis/build/kermitt2/grobid/grobid-home/lib/lin-64/libwapiti.so already loaded in another classloader

Would it be OK to disable the tests for now to include the functionality in the main branch?

@koppor
Copy link
Contributor Author

koppor commented Mar 8, 2020

I think, in the long run, org.grobid.core.data.BiblioItem#toTEI(int, int, org.grobid.core.engines.config.GrobidAnalysisConfig) should be used - the result then be converted to BibTeX (maybe using XSLT). Reason: That method does much magic - whereas toBibTeX "just" outputs the core data.

@kermitt2
Copy link
Owner

kermitt2 commented Mar 8, 2020

Thank you @koppor for making the test passing!

I think, in the long run, org.grobid.core.data.BiblioItem#toTEI(int, int, org.grobid.core.engines.config.GrobidAnalysisConfig) should be used - the result then be converted to BibTeX (maybe using XSLT). Reason: That method does much magic - whereas toBibTeX "just" outputs the core data.

yes actually this is the whole idea of using TEI as unique output format for GROBID. Then the other legion of "degraded" formats can simply be derived from the TEI via XSLT. It's probably a bad idea to output directly BibTeX given how ambiguous and presentation-oriented the format is (it's why I didn't touch the toBibTeX() method since 7-8 years, and kept it superficial), but I recognize that BibTeX is very useful for researchers and I was happy yo have BibTeX references when I was still writing research papers.

Probably the last thing to be done is to update the web service documentation, under doc/Grobid-service.md :)

@koppor
Copy link
Contributor Author

koppor commented Mar 8, 2020

I wonder whether MODS should be used as XML format. Is there an ADR for TEI ^^.

Meanwhile, I also worked on the BibTeX output heuristics.

I also added/enabled basic tests for the URLs provided at "Service checks"

Linted the markdown file using https://github.com/DavidAnson/markdownlint (the visual studio code plugin: https://github.com/DavidAnson/vscode-markdownlint).

@koppor
Copy link
Contributor Author

koppor commented Mar 8, 2020

See https://github.com/koppor/grobid/blob/add-bibtex-accept/doc/Grobid-service.md#apiprocessreferences for the documentation added. Tried to use monospaced text for parameter names etc.

@kermitt2
Copy link
Owner

kermitt2 commented Mar 8, 2020

I wonder whether MODS should be used as XML format.

Well MODS covers only the biblio, so as GROBID requires to encode a complete text body it's not a valid choice. For the biblio, TEI is actually more comprehensive too, MODS has almost nothing to encode the affiliation information (there are other small issues with MODS).

The only other valid candidate would be JATS, but it's a bit a mess with many flavors/freedom, and it's focusing on article only. For GROBID, we need to cover, full monograph, patents, standards, etc. and TEI is simply more comprehensive. TEI has also nice customization mechanism to define a non-ambiguous encoding.

end of ADR :D

@koppor
Copy link
Contributor Author

koppor commented Mar 8, 2020

Thank you for the ADR ^^.

I did not do any more changes, since both travis and coveralls is green, I think, the update is complete 😇

@kermitt2
Copy link
Owner

kermitt2 commented Mar 8, 2020

I did not do any more changes, since both travis and coveralls is green, I think, the update is complete

Yes and thank you so much for the extensive corrections in the Grobid-service.md file !

@kermitt2
Copy link
Owner

kermitt2 commented Mar 8, 2020

Doing some test, the person.getFirstName() in toBibTeX() can be null apparently:

lopez@work:~/grobid$ curl -X POST -H "Accept: application/x-bibtex" -d "citations=Graff, Expert. Opin. Ther. Targets (2002) 6(1): 103-113" localhost:8070/api/processCitation
@article{-1,
  author = {Graff, null},
  journal = {Expert. Opin. Ther. Targets},
  year = {2002},
  pages = {103--113},
  volume = {6},
  number = {1}
}

@koppor
Copy link
Contributor Author

koppor commented Mar 9, 2020

Fixed. Added test using your example. Added the example to Grobit-service.md.

Also refined Grobit-service.md with consistent command highlighting.

@kermitt2
Copy link
Owner

kermitt2 commented Mar 9, 2020

Thanks a lot Oliver for all the improvements you introduced and the feature contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants