NumberFormatException #339

Closed
dbaehrens-averbis opened this issue Aug 20, 2018 · 9 comments

@dbaehrens-averbis

Dear Grobid developers,

as an integrator, we use the Grobid annotator component in our text mining platform running at the European Patent Office (EPO) to process patent and literature references in over 100 million patent publications.

For some patent documents from the EPO repository, Grobid cannot finish processing and throws a NumberFormatException. The numbers mentioned in the exceptions do not seem to appear in the content of the documents, though.

Below, I attach an example in UIMA XMI format (zipped) for which the error occurs.

378875770.zip

Please tell me whether you can reproduce the issue with the example, or whether you need more information to fix the exception in Grobid.

Thank you very much,
David

--
David Baehrens
Project Manager Averbis GmbH
Freiburg, Germany

0:15:21,486 SEVERE [org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl] ([UIMA AS ThreadPool 9] 31553bbd-fb6f-4e51-ba27-cecb7d8d8bfd Process Thread - 828) Exception occurred: org.apache.uima.analysis_engine.AnalysisEngineProcessException
at de.averbis.textanalysis.components.grobidannotator.GrobidAnnotator.process(GrobidAnnotator.java:164) [grobid-annotator-1.2.0.jar:]
at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48) [uimaj-core-2.9.0.jar:2.9.0]
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:396) [uimaj-core-2.9.0.jar:2.9.0]
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:314) [uimaj-core-2.9.0.jar:2.9.0]
at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:570) [uimaj-core-2.9.0.jar:2.9.0]
at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.<init>(ASB_impl.java:412) [uimaj-core-2.9.0.jar:2.9.0]
at org.apache.uima.analysis_engine.asb.impl.ASB_impl.process(ASB_impl.java:344) [uimaj-core-2.9.0.jar:2.9.0]
at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.processAndOutputNewCASes(AggregateAnalysisEngine_impl.java:265) [uimaj-core-2.9.0.jar:2.9.0]
at org.apache.uima.aae.controller.PrimitiveAnalysisEngineController_impl.process(PrimitiveAnalysisEngineController_impl.java:813) [uimaj-as-core-2.9.0.jar:2.9.0]
at org.apache.uima.aae.handler.HandlerBase.invokeProcess(HandlerBase.java:121) [uimaj-as-core-2.9.0.jar:2.9.0]
at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient(ProcessRequestHandler_impl.java:572) [uimaj-as-core-2.9.0.jar:2.9.0]
at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handle(ProcessRequestHandler_impl.java:1090) [uimaj-as-core-2.9.0.jar:2.9.0]
at org.apache.uima.aae.handler.input.MetadataRequestHandler_impl.handle(MetadataRequestHandler_impl.java:78) [uimaj-as-core-2.9.0.jar:2.9.0]
at org.apache.uima.adapter.jms.activemq.JmsInputChannel.onMessage(JmsInputChannel.java:731) [uimaj-as-activemq-2.9.0.jar:2.9.0]
at org.springframework.jms.listener.AbstractMessageListenerContainer.doInvokeListener(AbstractMessageListenerContainer.java:721) [spring-jms-4.3.2.RELEASE.jar:4.3.2.RELEASE]
at org.springframework.jms.listener.AbstractMessageListenerContainer.invokeListener(AbstractMessageListenerContainer.java:681) [spring-jms-4.3.2.RELEASE.jar:4.3.2.RELEASE]
at org.springframework.jms.listener.AbstractMessageListenerContainer.doExecuteListener(AbstractMessageListenerContainer.java:651) [spring-jms-4.3.2.RELEASE.jar:4.3.2.RELEASE]
at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.java:319) [spring-jms-4.3.2.RELEASE.jar:4.3.2.RELEASE]
at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.java:257) [spring-jms-4.3.2.RELEASE.jar:4.3.2.RELEASE]
at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:1166) [spring-jms-4.3.2.RELEASE.jar:4.3.2.RELEASE]
at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:1060) [spring-jms-4.3.2.RELEASE.jar:4.3.2.RELEASE]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_72]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_72]
at org.apache.uima.aae.UimaAsThreadFactory$1.run(UimaAsThreadFactory.java:132) [uimaj-as-core-2.9.0.jar:2.9.0]
at java.lang.Thread.run(Thread.java:745) [rt.jar:1.7.0_72]
Caused by: org.grobid.core.exceptions.GrobidException: [GENERAL] An exception occured while running Grobid.
at org.grobid.core.engines.patent.ReferenceExtractor.extractAllReferencesString(ReferenceExtractor.java:730) [grobid-core-0.4.4.jar:]
at org.grobid.core.engines.Engine.processAllCitationsInPatent(Engine.java:1053) [grobid-core-0.4.4.jar:]
at de.averbis.textanalysis.components.grobidannotator.GrobidAnnotator.process(GrobidAnnotator.java:162) [grobid-annotator-1.2.0.jar:]
... 24 more
Caused by: java.lang.NumberFormatException: For input string: "6801412249"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) [rt.jar:1.7.0_72]
at java.lang.Integer.parseInt(Integer.java:495) [rt.jar:1.7.0_72]
at java.lang.Integer.parseInt(Integer.java:527) [rt.jar:1.7.0_72]
at org.grobid.core.engines.patent.PatentRefParser.processRawRefText(PatentRefParser.java:843) [grobid-core-0.4.4.jar:]
at org.grobid.core.engines.patent.ReferenceExtractor.extractAllReferencesString(ReferenceExtractor.java:649) [grobid-core-0.4.4.jar:]
... 26 more

@kermitt2
Owner

Hello @dbaehrens-averbis !

Thanks for reporting the error. This number is outside the Java integer range (32 bit), hence the exception. For some reason the number looks like a concatenation of two numbers as strings, but it's hard to know why. The content might be present in the document, but with some "sugar" between the digits that is cleaned up in a pre-processing step.
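To illustrate the range problem, here is a minimal sketch using only the standard library. The class and method names (`IntRangeCheck`, `fitsInInt`) are made up for this example and are not part of Grobid.

```java
// Illustrative only: IntRangeCheck and fitsInInt are hypothetical names, not Grobid code.
class IntRangeCheck {

    /** Returns true if the decimal string fits in a 32-bit signed int (assumes it fits in a long). */
    static boolean fitsInInt(String digits) {
        long value = Long.parseLong(digits);
        return value >= Integer.MIN_VALUE && value <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println(Integer.MAX_VALUE);       // 2147483647
        System.out.println(fitsInInt("6801412249")); // false
        try {
            Integer.parseInt("6801412249");          // same input as in the stack trace
        } catch (NumberFormatException e) {
            System.out.println("parseInt rejected the value: " + e.getMessage());
        }
    }
}
```

The reported value has ten digits like `Integer.MAX_VALUE`, but it is larger, so `Integer.parseInt` rejects it even though it consists of digits only.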

I think to reproduce the error, I would need the patent full text that raised this error. I don't see the patent publication number in the XMI file (but I don't know this format).

It's very easy to fix: we just need to catch the exception, treat the number as not well-formed, and continue processing. I can fix it, but an error case to validate the correction would be helpful.
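A rough sketch of that idea follows. It is not the actual Grobid change: `SafeIntParser`, `parseOrDefault`, and the fallback value are assumptions chosen for illustration, and `java.util.logging` stands in for whatever logger the project uses.

```java
import java.util.logging.Logger;

// Illustrative only: SafeIntParser and parseOrDefault are hypothetical names, not Grobid code.
class SafeIntParser {
    private static final Logger LOGGER = Logger.getLogger(SafeIntParser.class.getName());

    /** Parses text as an int; returns fallback when the text is null, malformed, or out of range. */
    static int parseOrDefault(String text, int fallback) {
        if (text == null) {
            return fallback;
        }
        try {
            return Integer.parseInt(text.trim());
        } catch (NumberFormatException e) {
            // Log and carry on instead of aborting the whole document.
            LOGGER.warning("Skipping number that is not well-formed: " + text);
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrDefault("801446", -1));     // 801446
        System.out.println(parseOrDefault("6801412249", -1)); // -1 (out of int range)
    }
}
```

The caller can then check for the fallback value and skip the affected reference while the rest of the document is still processed.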

@dbaehrens-averbis
Author

dbaehrens-averbis commented Aug 21, 2018 via email

@dbaehrens-averbis
Author

It seems that the attachment was removed from my last comment via email.

For your validation, I also attach the example in the original XML format
from the EPO patent repository.

006271747_JPH0586974B2_Description_EN_378875770.zip

@kermitt2
Owner

Hi David,

Looking at it now, I have to say that I was a bit too optimistic with all the Integer.parseInt() at the time I wrote this ;)

With commit 639cb4e, I have added some checks, and the exception should now always be caught and logged so that processing can continue. After testing, it works fine now.

As for the error, it happens for charming patterns like this:

US Patent Application10/801,446, 10/801, and 429

This is really weird as a patent citation, and the parser takes the sequence 10/801,446, 10/801 as one number, which is too long for a Java integer... Given the millions of citations, this kind of error will certainly happen many times... anyway, the process is now robust to such problems.
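To see why such a sequence overflows, here is a small sketch that assumes a cleanup step which simply drops every non-digit character. That step is a guess made for illustration; the real pre-processing in Grobid may work differently, and `SeparatorStripDemo` and its methods are hypothetical names.

```java
// Illustrative only: the digit-stripping step is an assumption, not Grobid's actual pre-processing.
class SeparatorStripDemo {

    /** Hypothetical cleanup: keep ASCII digits only, dropping slashes, commas, and spaces. */
    static String digitsOnly(String raw) {
        return raw.replaceAll("[^0-9]", "");
    }

    /** Returns true if the digit string can be parsed as a 32-bit int. */
    static boolean parsesAsInt(String digits) {
        try {
            Integer.parseInt(digits);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String merged = digitsOnly("10/801,446, 10/801");
        System.out.println(merged);              // 1080144610801
        System.out.println(parsesAsInt(merged)); // false: 13 digits, beyond the int range

        String single = digitsOnly("10/801,446");
        System.out.println(single);              // 10801446
        System.out.println(parsesAsInt(single)); // true
    }
}
```

Under this assumption, a single application number stays well within the int range, while two numbers merged into one token do not.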

Since I had my nose in it anyway, I also updated the interpretation of the US prefix "serial code" for the year 2016.

@cgaege

cgaege commented Oct 9, 2018

Dear Grobid developers,

do you already have a release plan for a new grobid release that contains the fix for this issue?

Thanks in advance,

Christian Gaege (Averbis GmbH)

@lfoppiano
Collaborator

Hi @cgaege, version 0.5.2 has been released. Let us know if the issue can be closed.

@lfoppiano
Collaborator

@cgaege can we close this issue?

@cgaege

cgaege commented Oct 25, 2018

We are currently verifying the new grobid release on our side. This may take a couple of days. But feel free to close the issue.

Thank you very much

@lfoppiano
Collaborator

@cgaege feel free to reopen or comment

de-code pushed a commit to elifesciences/grobid that referenced this issue Nov 29, 2019