Avoid splitting URLs between sentences #1097

lfoppiano · 2024-04-12T01:02:20Z

This PR addresses an issue where the sentence segmenter may split a URL across two sentences.
Updating the urlPattern regex is hard to do without a high risk of introducing new bugs (some experiments/attempts here).

The original GROBID method that exploits the URI annotations of the PDF was extended to support cases where the text built from the layout tokens differs from the provided postprocessed text, which was leading to an OutOfBoundException.

We have added/modified the following methods:

New method public static List<OffsetPosition> characterPositionsUrlPatternWithPdfAnnotations(List<LayoutToken> layoutTokens, List<PDFAnnotation> pdfAnnotations): returns the character offset positions relative to the layout token string (which can be obtained with LayoutTokenUtil.toText(tokens)).
New method public static List<OffsetPosition> tokenPositionsUrlPatternWithPdfAnnotations(List<LayoutToken> layoutTokens, List<PDFAnnotation> pdfAnnotations): returns the token offset positions.
Modified the original public static List<OffsetPosition> characterPositionsUrlPatternWithPdfAnnotations(List<LayoutToken> layoutTokens, List<PDFAnnotation> pdfAnnotations, String text): returns the character offset positions relative to the text string passed as input.
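To illustrate the difference between token offsets and character offsets, here is a minimal sketch, not the GROBID implementation. It assumes tokens are plain strings, that the layout token string is their plain concatenation, and that both offset kinds use an inclusive start and an exclusive end. The `Span` class and `toCharSpan` method are hypothetical names introduced only for this example.

```java
import java.util.List;

class TokenToCharOffsetSketch {

    // Hypothetical stand-in for an offset pair.
    // Assumed convention: start inclusive, end exclusive.
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Convert a token-level span into a character-level span over the
    // string obtained by concatenating all tokens in order.
    static Span toCharSpan(List<String> tokens, Span tokenSpan) {
        int charStart = 0;
        for (int i = 0; i < tokenSpan.start; i++) {
            charStart += tokens.get(i).length();
        }
        int charEnd = charStart;
        for (int i = tokenSpan.start; i < tokenSpan.end; i++) {
            charEnd += tokens.get(i).length();
        }
        return new Span(charStart, charEnd);
    }
}
```

Under these assumptions, a URL covering tokens 2 to 7 of a tokenised sentence maps to a character range that can be passed to substring on the concatenated token string.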

There are often cases where the text string and the string aggregated from the layout tokens do not match (e.g. the text string is dehyphenised), and this causes an OutOfBoundException when applying substring.
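The mismatch can be sketched as follows. This is an illustration only, under the simplifying assumption that postprocessing only deletes characters from the raw layout token string (for example the hyphen and line break of a hyphenated word). The class and method names are hypothetical, and the actual matching logic in the PR may differ.

```java
class OffsetAlignSketch {

    // Map an offset in `raw` to the corresponding offset in `processed`,
    // assuming `processed` is `raw` with some characters deleted.
    static int mapOffset(String raw, String processed, int rawOffset) {
        int p = 0;
        for (int r = 0; r < rawOffset && r < raw.length(); r++) {
            if (p < processed.length() && raw.charAt(r) == processed.charAt(p)) {
                p++; // this raw character survived postprocessing
            }
            // otherwise the raw character was dropped: the processed offset stays put
        }
        return p;
    }
}
```

With a raw string such as "see https://exam-\nple.org today" and its dehyphenised form "see https://example.org today", the URL spans characters 4 to 25 in the raw string but 4 to 23 in the processed one. Using the raw offsets directly on the processed string returns the wrong text, and raw offsets near the end of the string exceed its length.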

The last method (characterPositionsUrlPatternWithPdfAnnotations(List<LayoutToken> layoutTokens, List<PDFAnnotation> pdfAnnotations, String text)) is called when the sentence segmenter is running, so that we avoid splitting a sentence in the middle of a URL.
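One way to use URL character positions during segmentation is to merge two candidate sentences whenever the boundary between them falls inside a URL. The sketch below shows that idea under stated assumptions: spans are `int[]{start, end}` pairs with an exclusive end, sentences are sorted, and only the end of the previous sentence is checked. The names are hypothetical and this is not necessarily how the GROBID segmenter does it.

```java
import java.util.ArrayList;
import java.util.List;

class UrlAwareMergeSketch {

    // Merge consecutive sentence spans when the boundary between them
    // falls strictly inside one of the URL spans.
    static List<int[]> mergeAcrossUrls(List<int[]> sentences, List<int[]> urls) {
        List<int[]> merged = new ArrayList<>();
        for (int[] sentence : sentences) {
            if (!merged.isEmpty()) {
                int[] previous = merged.get(merged.size() - 1);
                if (boundaryInsideUrl(previous[1], urls)) {
                    // The split point is in the middle of a URL: extend the previous sentence.
                    previous[1] = sentence[1];
                    continue;
                }
            }
            // Copy the span so the input list is never modified.
            merged.add(new int[]{sentence[0], sentence[1]});
        }
        return merged;
    }

    static boolean boundaryInsideUrl(int boundary, List<int[]> urls) {
        for (int[] url : urls) {
            if (url[0] < boundary && boundary < url[1]) {
                return true;
            }
        }
        return false;
    }
}
```

For the text "See www.example.org/a.B for details. Thanks.", a segmenter that splits after "a." produces a boundary at character 22, inside the URL span 4 to 23, so the first two candidate sentences are merged back into one.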

PR #1099 will improve the recognition: whereas the fix in this PR is applied in the sentenceSegmenter, which takes the text as a string, there the process is applied to the layout tokens rather than to a text that might be dehyphenised and desynchronised with the layout tokens.

coveralls · 2024-04-12T01:06:19Z

coverage: 40.116% (+0.2%) from 39.924%
when pulling 5bcb8b1 on feature/preserve-urls
into 83f2c81 on master.

lfoppiano · 2024-05-07T22:47:03Z

This was tested by processing all PMC and bioRxiv documents, with no errors or failures during processing.
I then inspected the URLs with regexes and verified that no URL was split across sentences.

I also tested a bunch of problematic PDF documents.

lfoppiano added 7 commits April 9, 2024 07:59

add URL detection to avoid split them when running the sentence segmenter

9d9455a

update lexicon and add more integration tests

cff8138

typos

dcda0dc

Add test

a3cc84e

improvements

ddd9336

add method to match the offset from the layout token raw string to a "postprocessed" text

ca3c352

Use a lexicon normal test for static methods

fbbf254

fix consistency in method names

621d1da

lfoppiano marked this pull request as ready for review April 12, 2024 01:18

lfoppiano added the enhancement label Apr 12, 2024

Update tests

9607391

This was referenced Apr 15, 2024

Funding, acknowledgement statements are not split into sentences #1090

Closed

Identify URLs and output them in TEI #1099

Merged

lfoppiano added 3 commits April 17, 2024 08:30

keep convention on the token/character calculation

6ff15ee

update test to follow the convention

3900dc2

get fixes on matchTokenAndString from PR #1099

ec52f13

lfoppiano and others added 4 commits May 8, 2024 10:24

Merge branch 'master' into feature/preserve-urls

322bf23

Add additional test and fix to the method so that the offsets are correctly matching the real text (dehypenised)

f983f25

Apply url preservation also in tables description and notes

617aa16

Merge branch 'master' into feature/preserve-urls

5bcb8b1

lfoppiano mentioned this pull request May 9, 2024

processFulltextDocument fails on 0.23% arXiv PDFs #1113

Closed

lfoppiano added this to the 0.8.1 milestone May 21, 2024

lfoppiano merged commit 76fd16f into master Jun 9, 2024
9 checks passed

lfoppiano deleted the feature/preserve-urls branch June 9, 2024 20:54
