pdfalto error: Syntax Warning: Invalid entry in bfchar block in ToUnicode CMap #923
Comments
I'm not sure how Grobid is calling pdfalto, but I just tried with plain pdfalto:
@artturimatias which image are you using? There are two
Sorry, I didn't realise there were two different images. I was using lfoppiano/grobid:0.7.1. I tried grobid/grobid:0.7.1 and it gives the same warnings and then times out.
About your first message, could you share the whole log? Do you get a timeout at the end, or does it just fail? As for the timeout mentioned afterwards: the paper is very long, so with the default settings a timeout is likely. You can see more details in other similar issues: #642, #690. Just note that, since the configuration file is no longer a properties file, you should modify the YAML config file (which I guess you mounted as a volume when you ran Docker) by increasing the timeout in seconds:
pdf:
  pdfalto:
    # path relative to the grobid-home path (e.g. grobid-home/pdfalto), you don't want to change this normally
    path: "pdfalto"
    # security for PDF parsing
    memoryLimitMb: 6096
    timeoutSec: 60
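As a minimal sketch of the change described above, the following Python snippet rewrites the `timeoutSec` value under `pdfalto:` in the text of a config file while leaving comments and indentation untouched. The function name `set_pdfalto_timeout`, the sample text, and the value 300 are illustrative assumptions, not part of GROBID; editing the file by hand works just as well.

```python
import re


def set_pdfalto_timeout(config_text: str, seconds: int) -> str:
    """Return config_text with the first timeoutSec after 'pdfalto:' set to seconds.

    Simplification (assumption): the first timeoutSec line that follows the
    'pdfalto:' line is taken to belong to the pdfalto block.
    """
    lines = config_text.splitlines()
    in_pdfalto = False
    for i, line in enumerate(lines):
        if re.match(r"\s*pdfalto:\s*$", line):
            in_pdfalto = True
            continue
        if in_pdfalto:
            m = re.match(r"(\s*timeoutSec:\s*)\d+(.*)$", line)
            if m:
                # Keep the original indentation and any trailing comment.
                lines[i] = f"{m.group(1)}{seconds}{m.group(2)}"
                trailing = "\n" if config_text.endswith("\n") else ""
                return "\n".join(lines) + trailing
    raise ValueError("no timeoutSec entry found under pdfalto")


# Sample text mirroring the fragment quoted in this thread.
SAMPLE = """\
pdf:
  pdfalto:
    path: "pdfalto"
    # security for PDF parsing
    memoryLimitMb: 6096
    timeoutSec: 60
"""

# 300 is an arbitrary illustrative value, not a recommendation.
updated = set_pdfalto_timeout(SAMPLE, 300)
print(updated)
```

The line-based approach is used here instead of a YAML parser so that the comments in the config file survive the edit.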
Hello! I did a quick test and it's not related to Docker. It seems that there is an issue in the I don't know what the reason for this is, and it would require more exploration. The GROBID batch command does not launch pdfalto in another external sub-process and it will work:
Note that theses are not supported by the current GROBID models (which cover articles, book chapters, ...), so the header metadata in particular is not correct.
@lfoppiano here is the rest of the error message:
@kermitt2 Yes, the paper is very long, and I just found out that Grobid is not trained for dissertations (maybe the docs should mention this?). I'll try some heuristics with pdfalto and see what I can get. Maybe I'll extract only the references sections from the dissertations and then train Grobid with these mini-PDFs. Thanks for the help!
Yes, you're right about the docs! I'm also re-opening the issue to keep track of the problem with Thank you @artturimatias for reporting the issue!
Fixed with e06acf4. The PDF now works fine in server mode too.
@artturimatias could you please test the latest master? It should work fine and you could tune the timeout directly in the configuration file. Important note, as mentioned by Patrice in the pull request comment:
I pulled master and built with Dockerfile.crf, and everything seems to work fine. Thanks for the fix!
I'm closing this issue. Should anything arise, please feel free to reopen.
Hi,
I'm getting the following error with a certain PDF:
This is the problematic PDF:
https://jyx.jyu.fi/bitstream/handle/123456789/81469/978-951-39-9321-4_vaitos10062022.pdf?sequence=1&isAllowed=y
It's a dissertation with multiple articles in it.
I'm calling grobid with httpie like this:
http -f POST :8070/api/processReferences input@'./978-951-39-9321-4_vaitos10062022.pdf;type=application/pdf'
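For anyone trying to reproduce this without httpie, the same request can be sketched in Python with only the standard library. The endpoint `/api/processReferences`, port 8070, and the form field `input` come from the command above; the helper names `build_multipart` and `process_references`, the `localhost` host, and the 300-second client timeout are assumptions made for illustration.

```python
import os
import urllib.request
import uuid


def build_multipart(field, filename, data, content_type="application/pdf"):
    """Encode one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def process_references(pdf_path, base_url="http://localhost:8070", timeout=300):
    """POST a PDF to the processReferences endpoint and return the response text.

    base_url and timeout are assumptions; timeout only bounds how long this
    client waits and does not change any server-side limit.
    """
    with open(pdf_path, "rb") as fh:
        body, ctype = build_multipart("input", os.path.basename(pdf_path), fh.read())
    req = urllib.request.Request(
        base_url + "/api/processReferences",
        data=body,
        headers={"Content-Type": ctype},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

With a server running, `process_references("./978-951-39-9321-4_vaitos10062022.pdf")` would send the same file as the httpie call. A failing request surfaces as `urllib.error.HTTPError`, which makes it easier to capture the status code when reporting the problem.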
The same problem also happens via the web UI.
OS: Debian 11
Grobid version: 0.7.1 (Docker image)
Any clues what might be causing this?