Avoid splitting URLs between sentences #1097
Merged
…"postprocessed" text
This was referenced Apr 15, 2024
This issue was tested by processing all PMC and bioRxiv documents, with no errors or failures during processing. I also tested a number of problematic PDF documents.
…rectly matching the real text (dehypenised)
This PR addresses an issue where the sentence segmenter might split a URL across two sentences.
Updating the urlPattern regex is hard to do without a high risk of introducing new bugs (some experiments/attempts here).
The original grobid method that exploits the URI PDF annotations was extended to support cases where the text resulting from the layout tokens differs from the provided postprocessed text, which was leading to OutOfBoundException.
We have added/modified the following methods:
public static List<OffsetPosition> characterPositionsUrlPatternWithPdfAnnotations(List<LayoutToken> layoutTokens, List<PDFAnnotation> pdfAnnotations)
returns the character offset positions with respect to the layout token string (which can be obtained with LayoutTokenUtil.toText(tokens)).
public static List<OffsetPosition> tokenPositionsUrlPatternWithPdfAnnotations(List<LayoutToken> layoutTokens, List<PDFAnnotation> pdfAnnotations)
returns the token offset positions.
public static List<OffsetPosition> characterPositionsUrlPatternWithPdfAnnotations(List<LayoutToken> layoutTokens, List<PDFAnnotation> pdfAnnotations, String text)
returns the character offset positions with respect to the text string passed as input. The text string and the string aggregated from the layout tokens often do not match (e.g. the text string is dehyphenised), and this causes OutOfBoundException when applying substring.
The last method (characterPositionsUrlPatternWithPdfAnnotations(List<LayoutToken> layoutTokens, List<PDFAnnotation> pdfAnnotations, String text)) is called when the sentence segmenter is running, so that we avoid splitting a sentence in the middle of a URL.
PR #1099 will improve the recognition: with the fix applied in this PR to the sentenceSegmenter, which takes the text as a string, the process is applied to the layout tokens and not to the text, which might be dehyphenised and desynchronised with the layout tokens.
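To illustrate the idea of mapping URL offsets from the layout-token string to a dehyphenised text, and then using them to protect sentence boundaries, here is a minimal, self-contained sketch. It is not the code of this PR: the class UrlOffsetSketch, the Span type and all method names are hypothetical, and it assumes the text differs from the layout-token string only by deleted characters (for example a hyphen and a line break) and by whitespace substitutions.

```java
import java.util.ArrayList;
import java.util.List;

class UrlOffsetSketch {

    /** Hypothetical stand-in for an offset pair; this is not GROBID's OffsetPosition. */
    static final class Span {
        final int start;
        final int end; // exclusive
        Span(int start, int end) { this.start = start; this.end = end; }
    }

    /**
     * Greedy alignment: for every index of raw (plus one past the end), the index
     * reached in text. Characters of raw with no counterpart in text are skipped.
     */
    static int[] alignRawToText(String raw, String text) {
        int[] map = new int[raw.length() + 1];
        int j = 0;
        for (int i = 0; i < raw.length(); i++) {
            map[i] = j;
            if (j < text.length()) {
                char r = raw.charAt(i);
                char t = text.charAt(j);
                boolean bothSpace = Character.isWhitespace(r) && Character.isWhitespace(t);
                if (r == t || bothSpace) {
                    j++;
                }
            }
        }
        map[raw.length()] = j;
        return map;
    }

    /** Converts spans over raw into spans over text; results never exceed text.length(). */
    static List<Span> toTextSpans(List<Span> rawSpans, String raw, String text) {
        int[] map = alignRawToText(raw, text);
        List<Span> out = new ArrayList<>();
        for (Span s : rawSpans) {
            int start = map[Math.max(0, Math.min(s.start, raw.length()))];
            int end = map[Math.max(0, Math.min(s.end, raw.length()))];
            out.add(new Span(start, end));
        }
        return out;
    }

    /** Drops candidate sentence boundaries that fall strictly inside a protected span. */
    static List<Integer> keepBoundariesOutsideSpans(List<Integer> boundaries, List<Span> spans) {
        List<Integer> kept = new ArrayList<>();
        for (int b : boundaries) {
            boolean inside = false;
            for (Span s : spans) {
                if (b > s.start && b < s.end) {
                    inside = true;
                    break;
                }
            }
            if (!inside) {
                kept.add(b);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Layout-token string: the URL is hyphenated across a line break.
        String raw = "See https://exam-\nple.org/a.b for details. Next sentence.";
        // Postprocessed (dehyphenised) text handed to the sentence segmenter.
        String text = "See https://example.org/a.b for details. Next sentence.";

        // URL span in raw, as a PDF annotation might provide it (assumed here).
        List<Span> textSpans = toTextSpans(List.of(new Span(4, 29)), raw, text);
        Span url = textSpans.get(0);
        System.out.println(text.substring(url.start, url.end));

        // Candidate boundaries: one inside the URL (after "example."), one after "details.".
        System.out.println(keepBoundariesOutsideSpans(List.of(20, 40), textSpans));
    }
}
```

Because every mapped offset is bounded by the length of the text, calling substring with the converted spans cannot run past the end of the string, which is the failure mode described above when the raw offsets are applied directly to a shorter, dehyphenised text.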