processFulltextAssetDocument graphic url incorrectly including temporary (sub-)directory #836

Closed
de-code opened this issue Sep 17, 2021 · 10 comments
Labels: bug (From Hemiptera and especially its suborder Heteroptera), implemented (The issue has been implemented)

Comments

@de-code
Collaborator

de-code commented Sep 17, 2021

Using processFulltextAssetDocument we can download a zip with the XML and related image resources.

The URL in the XML seems to include a temporary sub-directory,
e.g. QwmZmbXzY9.lxml_data/image-1.png instead of image-1.png

The example document 003525v1 (bioRxiv 10k training) generates the following partial TEI XML.

<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Network-based analysis of omic data to model the processes connecting genetic variation to disease.</figDesc><graphic url="QwmZmbXzY9.lxml_data/image-1.png" coords="12,72.00,178.57,319.30,504.00" type="bitmap" /></figure>

(The zip file doesn't contain any subdirectories)

Example command:

curl -v \
  --output "003525v1.zip" \
  --form input=@003525v1.pdf \
  https://<host>:<port>/api/processFulltextAssetDocument

(I have only tried that with the cloud.science-miner.com instance)

BTW the FAQ seems to list the API as deprecated. Not sure if that is still correct?

kermitt2 added the bug label Oct 18, 2021
@kermitt2
Owner

kermitt2 commented Oct 18, 2021

Thank you @de-code !

I think this is due to a change I made in the last few months.
This is deprecated indeed (it can lead to issues when there are thousands of embedded images), so I forgot to update the zip stuff. But I will try to keep it working until I have a better solution.

@de-code
Collaborator Author

de-code commented Oct 19, 2021

This is deprecated indeed (it can lead to issues when there are thousands of embedded images)

Maybe that wouldn't be as much of an issue if you only included images that you also created graphic elements for?
(At least that is how I have implemented it, although I currently also include an unmatched_graphics note section, mostly for debugging purposes.)
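
A minimal sketch of that idea, assuming the TEI XML and the list of extracted image names are both at hand. It is not GROBID's code and not the implementation mentioned above; the function name `split_images` is hypothetical. It only separates images that a graphic element refers to from those that nothing refers to.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def split_images(tei_xml: str, image_names: list) -> tuple:
    """Split image names into (referenced, unmatched) by TEI graphic URLs.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(tei_xml)
    # Collect every url attribute found on a TEI <graphic> element.
    referenced_urls = {g.get("url") for g in root.iter(f"{{{TEI_NS}}}graphic")}
    referenced = [n for n in image_names if n in referenced_urls]
    unmatched = [n for n in image_names if n not in referenced_urls]
    return referenced, unmatched
```

Only the `referenced` list would then go into the zip, while `unmatched` could be reported separately for debugging.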

@officialsuyogdixit

officialsuyogdixit commented Feb 2, 2022

Any update on this? I do not want to use pdfalto. I want to use the awesomeness of grobid only. But processFulltextAssetDocument gives incorrect temporary directory URLs (which I am correcting manually) and sometimes it includes images in the zip but misses them in the TEI XML (main problem).

I am using the latest Docker image, grobid/grobid:0.7.1-SNAPSHOT.
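
For anyone applying the same manual correction, here is a minimal sketch of how it could be automated on an affected version. It assumes the only problem is the extra directory prefix in the `url` attribute of `graphic` elements; the function name `strip_graphic_dirs` is hypothetical and this is a workaround, not the fix made in GROBID itself.

```python
import posixpath
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def strip_graphic_dirs(tei_xml: str) -> str:
    """Return TEI XML with directory prefixes removed from graphic URLs.

    Hypothetical helper for illustration only.
    """
    # Keep the TEI namespace as the default namespace when serializing.
    ET.register_namespace("", TEI_NS)
    root = ET.fromstring(tei_xml)
    for graphic in root.iter(f"{{{TEI_NS}}}graphic"):
        url = graphic.get("url")
        if url:
            # "QwmZmbXzY9.lxml_data/image-1.png" becomes "image-1.png"
            graphic.set("url", posixpath.basename(url))
    return ET.tostring(root, encoding="unicode")
```

Note that ElementTree may change incidental formatting of the XML when it re-serializes, so the output is equivalent but not byte-identical to the input.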

@kermitt2
Owner

kermitt2 commented Feb 2, 2022

Hi @suyogricha!

Thanks for the feedback :)

If I remember correctly, I fixed the specific sub-directory problem in the branch https://github.com/kermitt2/grobid/tree/fix-vector-graphics. However, this branch introduces some very significant changes in the way figures and tables are detected and structured (we will start from every graphic element) and introduces a new figure-segmenter model, so it will take several months to have this branch merged.

The problem of images in the zip not being referenced in the TEI XML is a different problem, and improving the figure/table recognition is the main purpose of this new branch. Also note that it's very common to have quite a lot of images and vector graphics that are not part of any figure or table, so not all graphics will be referenced in the TEI XML anyway.

@ayhama16

Hi, I have tried the latest version and the problem is still there: <graphic url="8u5Yhm3d6D.lxml_data/image-1.png"
Can you help me fix it? Thank you so much!

kermitt2 added a commit that referenced this issue Feb 17, 2022
@kermitt2
Owner

Hi @suyogricha and @ayhama16 !

I've quickly added the fix to the current master with 621f5a1.

<graphic url="image-2.png" coords="9,317.96,86.60,246.75,221.90" type="bitmap" />
$ unzip -l out.zip 
Archive:  out.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
    92804  2022-02-17 05:07   tei.xml
   196898  2022-02-17 05:07   image-2.png
   112218  2022-02-17 05:07   image-1.png
---------                     -------
   401920                     3 files
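
A quick way to confirm the same thing programmatically is sketched below. It assumes the TEI file inside the archive is named `tei.xml`, as in the listing above; the function name `unresolved_graphics` is hypothetical. It returns every graphic URL in the TEI that does not match an entry in the zip, so an empty list means all references resolve.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def unresolved_graphics(zip_bytes: bytes, tei_name: str = "tei.xml") -> list:
    """Return graphic URLs in the TEI that have no matching zip entry.

    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = set(archive.namelist())
        root = ET.fromstring(archive.read(tei_name))
    urls = [g.get("url") for g in root.iter(f"{{{TEI_NS}}}graphic")]
    return [u for u in urls if u and u not in names]
```

With the earlier behaviour, a URL such as QwmZmbXzY9.lxml_data/image-1.png would be reported here because the archive contains no sub-directories.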

@officialsuyogdixit

Thanks a lot, Patrice. You are amazing. Any chance of getting this update onto https://hub.docker.com/r/grobid/ ?

Actually, I would like to use https://grobid.readthedocs.io/en/latest/Deep-Learning-models/ too, and I am on an M1 MacBook Max (Apple Silicon). It would be too complex for me to install everything myself.

@ayhama16

Thanks a lot, Patrice! It works just great!

@officialsuyogdixit

Patrice, is it possible for you to update this on https://hub.docker.com/r/grobid/grobid/ ?

kermitt2 added the implemented label Apr 16, 2022
@kermitt2
Owner

@ayhama16 @suyogricha hello! The Docker images have been updated with the fix.
