pdfalto error: Syntax Warning: Invalid entry in bfchar block in ToUnicode CMap #923

Closed
artturimatias opened this issue Jun 7, 2022 · 11 comments
Labels: bug, implemented, pdfalto

artturimatias commented Jun 7, 2022

Hi,
I'm getting the following error with a certain PDF:

ERROR [2022-06-07 08:02:33,838] org.grobid.core.process.ProcessPdfToXml: pdfalto process finished with error code: 143. [/opt/grobid/grobid-home/pdfalto/lin-64/pdfalto_server, -fullFontName, -noLineNumbers, -noImage, -annotation, -filesLimit, 2000, /opt/grobid/grobid-home/tmp/origin3690432459378499723.pdf, /opt/grobid/grobid-home/tmp/czDhswmAVc.lxml]
ERROR [2022-06-07 08:02:33,838] org.grobid.core.process.ProcessPdfToXml: pdfalto return message: 
Syntax Warning: Invalid entry in bfchar block in ToUnicode CMap
Syntax Warning: Invalid entry in bfchar block in ToUnicode CMap
... LOT of these lines

This is the problematic PDF:
https://jyx.jyu.fi/bitstream/handle/123456789/81469/978-951-39-9321-4_vaitos10062022.pdf?sequence=1&isAllowed=y
It's a dissertation with multiple articles in it.

I'm calling grobid with httpie like this:
http -f POST :8070/api/processReferences input@'./978-951-39-9321-4_vaitos10062022.pdf;type=application/pdf'

The same problem also happens via the web UI.

OS: Debian 11
Grobid version: 0.7.1 (Docker image)

Any clues what might be causing this?

@artturimatias
Author

I'm not sure how GROBID calls pdfalto, but I just tried plain pdfalto:
pdfalto file.pdf
It produces the same syntax warnings, but it successfully creates the XML file(s).

@lfoppiano
Collaborator

@artturimatias which image are you using? There are two: grobid/grobid:0.7.1 and lfoppiano/grobid:0.7.1.

@artturimatias
Author

Sorry, I didn't realise there were two different images. I was using lfoppiano/grobid:0.7.1. I tried grobid/grobid:0.7.1 and it gives the same warnings and then times out.


ERROR [2022-06-09 13:11:31,000] org.grobid.service.process.GrobidRestProcessFiles: An unexpected exception occurs. 
! org.grobid.core.exceptions.GrobidException: [TIMEOUT] PDF to XML conversion timed out
! at org.grobid.core.document.DocumentSource.processPdfaltoServerMode(DocumentSource.java:238)
! at org.grobid.core.document.DocumentSource.pdfalto(DocumentSource.java:147)
! at org.grobid.core.document.DocumentSource.fromPdf(DocumentSource.java:64)
! at org.grobid.core.document.DocumentSource.fromPdf(DocumentSource.java:50)
! at org.grobid.core.document.DocumentSource.fromPdf(DocumentSource.java:42)
! at org.grobid.core.engines.CitationParser.processingReferenceSection(CitationParser.java:396)

@lfoppiano
Collaborator

About your first message, could you share the whole log? Do you get a timeout at the end, or does it just fail?

Regarding the timeout in your later message: the document is very long, so a timeout is likely with the default settings.
The timeout protects GROBID from stalling on very large processing jobs, with data on the order of thousands, hundreds of thousands, etc.

You can see more details in other similar issues: #642, #690.

Note that since the configuration file is no longer a properties file, you should modify the YAML config file (which I guess you mounted as a volume when you ran Docker) and increase the timeout in seconds:

  pdf:
    pdfalto:
      # path relative to the grobid-home path (e.g. grobid-home/pdfalto), you don't want to change this normally
      path: "pdfalto"
      # security for PDF parsing
      memoryLimitMb: 6096
      timeoutSec: 60

lfoppiano added the pdfalto and docker labels on Jun 10, 2022
lfoppiano self-assigned this on Jun 10, 2022
@kermitt2
Owner

kermitt2 commented Jun 10, 2022

Hello !

I did a quick test and it's not related to Docker. With this PDF, there seems to be an issue in the pdfalto_server script: it does not properly get the termination signal of the child process (the process appears idle). So the timeout in pdfalto_server does its job and stops waiting for the process to finish (the timeout defined in pdfalto_server is triggered here at 20 seconds; it is not related to the timeout in the GROBID config).

I don't know the reason for this; it would require more exploration.
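
For readers puzzling over the `error code: 143` in the first log: exit codes above 128 conventionally mean "128 + signal number", so 143 corresponds to signal 15 (SIGTERM). That is consistent with the pdfalto process being terminated once a timeout fired, although the thread does not say which component sent the signal. A minimal Java sketch of that decoding follows; `ExitCodes`, `signalOf` and `describe` are hypothetical names and are not part of GROBID or pdfalto.

```java
// Hypothetical helper, NOT part of GROBID or pdfalto: interprets a child-process
// exit code using the common shell convention "128 + signal number".
final class ExitCodes {
    private ExitCodes() {}

    /** Returns the signal number encoded in the exit code, or -1 if there is none. */
    static int signalOf(int exitCode) {
        return (exitCode > 128 && exitCode < 256) ? exitCode - 128 : -1;
    }

    /** Returns a short human-readable description of the exit code. */
    static String describe(int exitCode) {
        int signal = signalOf(exitCode);
        if (signal == 15) return "terminated by SIGTERM (signal 15)";
        if (signal == 9) return "killed by SIGKILL (signal 9)";
        if (signal > 0) return "terminated by signal " + signal;
        return exitCode == 0 ? "success" : "failed with exit code " + exitCode;
    }
}
```

Calling `ExitCodes.describe(143)` on the code from the log reports a SIGTERM termination, which points at a timeout or kill rather than a crash inside the PDF parsing itself.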

The GROBID batch command does not launch pdfalto in another external sub-process, and it works:

cp /home/lopez/Downloads/978-951-39-9321-4_vaitos10062022.pdf ~/tmp/test101
java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.7.2-SNAPSHOT-onejar.jar -gH grobid-home -dIn /home/lopez/tmp/test101/ -dOut /home/lopez/tmp/test101/ -exe processFullText
ll ~/tmp/test101/
total 6.9M
drwxrwxr-x  3 lopez lopez 4.0K Jun 10 06:50 ./
drwxrwxr-x 83 lopez lopez 4.0K Jun 10 06:48 ../
drwxrwxr-x  2 lopez lopez 4.0K Jun 10 06:49 978-951-39-9321-4_vaitos10062022_assets/
-rw-rw-r--  1 lopez lopez 6.4M Jun 10 06:49 978-951-39-9321-4_vaitos10062022.pdf
-rw-rw-r--  1 lopez lopez 424K Jun 10 06:50 978-951-39-9321-4_vaitos10062022.tei.xml

978-951-39-9321-4_vaitos10062022.tei.xml is the GROBID result.

Note that theses are not supported by the current GROBID models (which cover articles, book chapters, ...), so the header metadata in particular is not correct.

@artturimatias
Author

@lfoppiano here is the rest of the error message:

ERROR [2022-06-09 13:11:31,000] org.grobid.service.process.GrobidRestProcessFiles: An unexpected exception occurs. 
! org.grobid.core.exceptions.GrobidException: [TIMEOUT] PDF to XML conversion timed out
! at org.grobid.core.document.DocumentSource.processPdfaltoServerMode(DocumentSource.java:238)
! at org.grobid.core.document.DocumentSource.pdfalto(DocumentSource.java:147)
! at org.grobid.core.document.DocumentSource.fromPdf(DocumentSource.java:64)
! at org.grobid.core.document.DocumentSource.fromPdf(DocumentSource.java:50)
! at org.grobid.core.document.DocumentSource.fromPdf(DocumentSource.java:42)
! at org.grobid.core.engines.CitationParser.processingReferenceSection(CitationParser.java:396)
! at org.grobid.core.engines.Engine.processReferences(Engine.java:254)
! at org.grobid.service.process.GrobidRestProcessFiles.processStatelessReferencesDocument(GrobidRestProcessFiles.java:550)
! at org.grobid.service.GrobidRestService.processStatelessReferencesDocumentReturnXml_post(GrobidRestService.java:679)
! at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
! at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
! at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
! at java.lang.reflect.Method.invoke(Method.java:498)
! at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$1.invoke(ResourceMethodInvocationHandlerFactory.java:81)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:144)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:161)
! at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:160)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:99)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:347)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102)
! at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:326)
! at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271)
! at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:315)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:297)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:267)
! at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:317)
! at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:305)
! at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:1154)
! at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:473)
! at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:427)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
! at io.dropwizard.jetty.NonblockingServletHolder.handle(NonblockingServletHolder.java:49)
! at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623)
! at io.dropwizard.servlets.ThreadNameFilter.doFilter(ThreadNameFilter.java:35)
! at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
! at io.dropwizard.jersey.filter.AllowedMethodsFilter.handle(AllowedMethodsFilter.java:45)
! at io.dropwizard.jersey.filter.AllowedMethodsFilter.doFilter(AllowedMethodsFilter.java:39)
! at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
! at org.eclipse.jetty.servlets.CrossOriginFilter.handle(CrossOriginFilter.java:311)
! at org.eclipse.jetty.servlets.CrossOriginFilter.doFilter(CrossOriginFilter.java:265)
! at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
! at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:89)
! at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:120)
! at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:135)
! at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
! at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
! at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
! at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
! at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)
! at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
! at com.codahale.metrics.jetty9.InstrumentedHandler.handle(InstrumentedHandler.java:239)
! at io.dropwizard.jetty.RoutingHandler.handle(RoutingHandler.java:52)
! at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:703)
! at io.dropwizard.jetty.BiDiGzipHandler.handle(BiDiGzipHandler.java:67)
! at org.eclipse.jetty.server.handler.RequestLogHandler.handle(RequestLogHandler.java:56)
! at org.eclipse.jetty.server.handler.StatisticsHandler.handle(StatisticsHandler.java:174)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
! at org.eclipse.jetty.server.Server.handle(Server.java:505)
! at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
! at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
! at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
! at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
! at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
! at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698)
! at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804)
! at java.lang.Thread.run(Thread.java:748)
10.0.2.100 - - [09/Jun/2022:13:11:31 +0000] "POST /api/processReferences HTTP/1.1" 500 41 "-" "HTTPie/2.6.0" 33432

@kermitt2 Yes, the paper is very long, and I just found out that GROBID is not trained for dissertations (maybe the docs should mention this?)

I'll try some heuristics with pdfalto and see what I can get. Maybe I'll extract only the references section from dissertations and then train GROBID with these mini-PDFs.

Thanks for the help!
I'll close this since pdfalto is doing its job in spite of the warnings, and dissertations are currently out of scope for stock GROBID.

@kermitt2
Owner

> GROBID is not trained for dissertations (maybe the docs should mention this?)

Yes, you're right about the docs!

I'm re-opening the issue to also keep track of the problem with pdfalto_server not properly catching the termination signal (it should work correctly, as it does for the batch command).

Thank you @artturimatias for reporting the issue !

lfoppiano added the bug label and removed the docker label on Jun 13, 2022
kermitt2 added a commit that referenced this issue Jun 21, 2022
@kermitt2
Owner

> I'm re-opening the issue to also keep track of the problem with pdfalto_server not properly catching the termination signal (it should work correctly, as it does for the batch command).

Fixed with e06acf4.
The stderr from pdfalto, which is now redirected through pdfalto_server, was not "gobbled" by the Java ProcessBuilder call.

The PDF now works fine in server mode too.
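
For readers unfamiliar with the "gobbling" mentioned above: when a child process writes more to stdout or stderr than the OS pipe buffer holds and nobody reads it, the child blocks on the write and looks idle. A PDF that emits a very large number of syntax warnings is a plausible trigger. The following is a minimal sketch of the general pattern, not the code from commit e06acf4; `GobbleSketch`, `gobble` and `run` are hypothetical names, and the timeout handling is an assumption for illustration only.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, NOT GROBID's implementation: drain both output streams
// of a child process so that it can never block on a full pipe buffer.
final class GobbleSketch {
    private GobbleSketch() {}

    /** Reads a stream to its end on a background thread, appending lines to sink. */
    static Thread gobble(InputStream in, StringBuilder sink) {
        Thread t = new Thread(() -> {
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(in, StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    synchronized (sink) {
                        sink.append(line).append('\n');
                    }
                }
            } catch (IOException e) {
                // The stream is closed when the process is destroyed; nothing left to read.
            }
        });
        t.setDaemon(true);
        t.start();
        return t;
    }

    /** Runs the command; returns its exit code, or -1 on timeout or interruption. */
    static int run(List<String> command, long timeoutSec,
                   StringBuilder out, StringBuilder err) {
        Process process;
        try {
            process = new ProcessBuilder(command).start();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Start draining stdout and stderr BEFORE waiting for the process.
        Thread outThread = gobble(process.getInputStream(), out);
        Thread errThread = gobble(process.getErrorStream(), err);
        try {
            if (!process.waitFor(timeoutSec, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                return -1;
            }
            // Give the readers a bounded amount of time to consume the remaining output.
            outThread.join(2000);
            errThread.join(2000);
            return process.exitValue();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            process.destroyForcibly();
            return -1;
        }
    }
}
```

If the two streams do not need to be kept apart, `ProcessBuilder.redirectErrorStream(true)` merges stderr into stdout so that a single reader is enough.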

kermitt2 added the implemented label on Jun 21, 2022
@lfoppiano
Collaborator

@artturimatias could you please test the latest master? It should work fine, and you can tune the timeout directly in the configuration file.

Important note, as mentioned by Patrice in the pull request comment:

> Note that it breaks compatibility of grobid-home version 0.7.1 when running version 0.7.2-SNAPSHOT: grobid-home will need to be up to date too, but there's no other way to make that happen.

@artturimatias
Author

I pulled master and built with Dockerfile.crf, and everything seems to work fine. Thanks for the fix!

@lfoppiano
Collaborator

I'm closing this issue. Should anything arise, please feel free to reopen.
