NumberFormatException #339
Comments
Hello @dbaehrens-averbis! Thanks for reporting the error.

This number is outside the Java integer range (32 bit), hence the exception. For some reason the number looks like a concatenation of two numbers as strings, but it's hard to know why. The content might be present in the text, but with some "sugar" between the digits that gets cleaned in a pre-processing step.

To reproduce the error, I think I would need the full text of the patent that raised it. I don't see the patent publication number in the XMI file (but I don't know this format).

It's very easy to fix: we just need to catch the exception, treat the number as not well-formed, and continue processing. I can fix it, but an error case to validate the correction would be helpful.
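The fix described above (catch the exception, treat the number as malformed, continue) could look roughly like the following. This is only a minimal sketch, not Grobid's actual code: the class `SafeNumberParsing`, the method `parseIntOrEmpty`, and the sample digit strings are hypothetical and chosen purely for illustration.

```java
import java.util.OptionalInt;

// Hypothetical helper for illustration only; not taken from the Grobid code base.
class SafeNumberParsing {

    // Returns the parsed value, or an empty OptionalInt when the input is
    // null, not a number, or outside the 32-bit int range.
    static OptionalInt parseIntOrEmpty(String raw) {
        if (raw == null) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(raw.trim()));
        } catch (NumberFormatException e) {
            // Log and carry on instead of aborting the whole document.
            System.err.println("Skipping malformed number: " + raw);
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        // Two digit strings glued together (made-up values) exceed
        // Integer.MAX_VALUE (2147483647), so Integer.parseInt rejects them.
        String concatenated = "2016" + "0123456";

        System.out.println(parseIntOrEmpty("2016").isPresent());       // in range
        System.out.println(parseIntOrEmpty(concatenated).isPresent()); // out of range
    }
}
```

Returning an empty `OptionalInt` rather than a sentinel such as `-1` makes the caller decide explicitly what to do with a malformed number, which fits the "not well-formed, continue processing" behaviour described in the comment.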
Hello Patrice,
thanks for your quick reply!
The full text is contained in the XMI output, in the attribute sofaString of
the tag cas:Sofa.
For your validation I also attach the example in its original XML format
from the EPO patent repository.
Best regards,
David
* Patrice Lopez on 2018-08-21 02:22:
…
It seems that the attachment was removed from my last comment via email.
Hi David,

Looking at it now, I have to say that I was a bit too optimistic with all the …

With commit 639cb4e, I have added some checks, and the exception should now always be caught and logged so that processing can continue. After testing, it works fine.

As for the error, it happens for charming patterns like this:

This is really weird as a patent citation, and it takes the sequence …

Given that I had my nose in it, I also updated the interpretation of the US prefix "serial code" for the year 2016.
Dear Grobid developers, do you already have a release plan for a new Grobid release that contains the fix for this issue? Thanks in advance, Christian Gaege (Averbis GmbH)
Hi @cgaege, version 0.5.2 has been released. Let us know if the issue can be closed.
@cgaege can we close this issue?
We are currently verifying the new Grobid release on our side. This may take a couple of days, but feel free to close the issue. Thank you very much.
@cgaege feel free to reopen or comment.
Former-commit-id: 639cb4e
Dear Grobid developers,
as an integrator, we use the Grobid annotator component in our text mining platform running at the European Patent Office (EPO) to process patent and literature references in over 100 million patent publications.
For some patent documents from the EPO repository, Grobid cannot finish processing and throws a NumberFormatException. The numbers mentioned in the exceptions do not seem to appear in the content of the documents, though.
Below, I attach an example in UIMA XMI format (zipped) for which the error occurs.
378875770.zip
Please let me know whether you can reproduce the issue with the example, or whether you need more information to fix the exception in Grobid.
Thank you very much,
David
--
David Baehrens
Project Manager Averbis GmbH
Freiburg, Germany