processFulltextAssetDocument graphic url incorrectly including temporary (sub-)directory #836
Comments
Thank you @de-code! I think this is due to a change I made in the last few months.
Maybe that wouldn't be as much of an issue if you only included images that you also created.
Any update on this? I do not want to use pdfalto. I want to use the awesomeness of grobid only. But processFulltextAssetDocument gives incorrect temporary directory URLs (which I am correcting manually), and sometimes it includes images in the zip but misses them in the TEI XML (the main problem). I am using the latest docker image, grobid/grobid:0.7.1-SNAPSHOT.
Hi @suyogricha! Thanks for the feedback :) If I remember well, I fixed the specific sub-directory problem in the branch https://github.com/kermitt2/grobid/tree/fix-vector-graphics. However, this branch introduces some very significant changes in the way figures and tables are detected and structured (we will start from every graphic element) and introduces a new figure-segmenter model, so it will take several months to have this branch merged. The problem of images in the zip not being referenced in the TEI XML is a different problem, and improving figure/table recognition is the main purpose of this new branch. Also note that it's very common to have quite a lot of images and vector graphics that are not part of any figure or table, so not all graphics will be referenced in the TEI XML anyway.
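To see which files in a downloaded zip are not referenced by any graphic element, a rough diagnostic sketch in Python could look like the following. It assumes (this is not confirmed anywhere in the thread) that the TEI file is the only entry in the zip ending in .xml and that every other entry is an image asset; unreferenced_images is a hypothetical helper name, not part of GROBID.

```python
import io
import posixpath
import re
import zipfile

# Captures the url attribute of <graphic> elements; the attribute layout is an
# assumption based on the snippet quoted in this thread.
GRAPHIC_URL = re.compile(r'<graphic\b[^>]*?\burl="([^"]*)"')


def unreferenced_images(zip_bytes: bytes) -> list:
    """List zip entries whose file name is not used by any <graphic url>."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = archive.namelist()
        # Assumption: entries ending in .xml are TEI, everything else is an asset.
        tei_names = [n for n in names if n.endswith(".xml")]
        tei = "".join(archive.read(n).decode("utf-8") for n in tei_names)
    # Compare on bare file names so a temporary sub-directory in the URL
    # does not hide a match.
    referenced = {posixpath.basename(url) for url in GRAPHIC_URL.findall(tei)}
    return sorted(
        n for n in names
        if n not in tei_names and posixpath.basename(n) not in referenced
    )
```

As noted above, a non-empty result is not necessarily a bug, since many graphics legitimately belong to no figure or table.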
Hi, I have tried the latest version and the problem is still there: <graphic url="8u5Yhm3d6D.lxml_data/image-1.png"
Hi @suyogricha and @ayhama16! I've quickly added the fix to the current master with 621f5a1.
Thanks a lot, Patrice. You are amazing. Any chance of updating this on https://hub.docker.com/r/grobid/ ? Actually, I wish to use https://grobid.readthedocs.io/en/latest/Deep-Learning-models/ too, and I am on an M1 MacBook Max (Apple Silicon). It would be too complex for me to install everything.
Thanks a lot, Patrice! It works just great!
Patrice, is it possible for you to update this on https://hub.docker.com/r/grobid/grobid/ ?
@ayhama16 @suyogricha Hello! The docker images have been updated with the fix.
Using processFulltextAssetDocument we can download a zip with the XML and related image resources.
The URL in the XML seems to include a temporary sub-directory, e.g. QwmZmbXzY9.lxml_data/image-1.png instead of image-1.png.
Example document 003525v1 (bioRxiv 10k training) generates the following partial TEI XML. (The zip file doesn't contain any subdirectories.)
Example command:
(I have only tried that with the cloud.science-miner.com instance.)
BTW the FAQ seems to list the API as deprecated. Not sure if that is still correct?
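Until a fixed version is deployed, the URLs can be corrected on the client side. The following is only a minimal Python sketch of such a workaround, not GROBID's own fix: strip_graphic_dirs is a hypothetical helper, and the regular expression assumes the url attribute is written with double quotes as in the example above.

```python
import posixpath
import re

# Splits a <graphic ... url="..."> attribute into prefix, value and closing
# quote; the attribute layout is an assumption based on the example above.
GRAPHIC_URL = re.compile(r'(<graphic\b[^>]*?\burl=")([^"]*)(")')


def strip_graphic_dirs(tei_xml: str) -> str:
    """Reduce every <graphic url="..."> value to its bare file name."""
    def fix(match):
        # Drop any directory part, e.g. "QwmZmbXzY9.lxml_data/".
        file_name = posixpath.basename(match.group(2))
        return match.group(1) + file_name + match.group(3)
    return GRAPHIC_URL.sub(fix, tei_xml)


sample = '<graphic url="QwmZmbXzY9.lxml_data/image-1.png"/>'
print(strip_graphic_dirs(sample))
```

Since the zip file is flat, taking the base name should be enough to make the references resolve; any URL that legitimately pointed into a sub-directory would also be flattened by this sketch.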