Some incomplete coordinates for sentence elements #811
Error case via https://github.com/DataSeer/dataseer-web/issues/441
It seems that the PDF is not reachable 😺
Sorry, poor internet connection :(
Normally the text to be segmented includes the references (all text including descendant elements):
and we only keep track of the positions of the references to pass the "forbidden positions" to the segmenter:
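To illustrate the "forbidden positions" idea, here is a minimal sketch in Python. It is not the actual GROBID code: the `segment` function, the naive punctuation rule and the example text are all assumptions made for illustration. The point is only that a candidate sentence boundary falling inside a reference span is skipped.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def segment(text: str, forbidden: List[Span]) -> List[Span]:
    """Split text into sentence spans, never cutting inside a forbidden span.

    Hypothetical helper: a naive rule-based segmenter standing in for the real one.
    """
    spans: List[Span] = []
    start = 0
    # Candidate boundary: sentence-final punctuation followed by whitespace.
    for match in re.finditer(r"[.!?]\s+", text):
        boundary = match.start() + 1  # offset just after the punctuation mark
        # Skip the candidate if it falls strictly inside a reference span.
        if any(f_start < boundary < f_end for f_start, f_end in forbidden):
            continue
        spans.append((start, boundary))
        start = match.end()
    if start < len(text):
        spans.append((start, len(text)))
    return spans


# Made-up paragraph text including the text of a reference marker.
text = "As shown by Smith et al. (2020), it works. The end."
ref = "Smith et al. (2020)"
ref_start = text.index(ref)
forbidden = [(ref_start, ref_start + len(ref))]

sentences = [text[a:b] for a, b in segment(text, forbidden)]
```

With the forbidden span, the "al." inside the reference no longer triggers a split; without it, the same text would be cut in the middle of the reference.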
It seems that everything works fine up to that step: the texts of the sentences look good. The problem probably occurs when we then try to group the LayoutToken objects corresponding to each sentence in
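For readers unfamiliar with this grouping step, a rough sketch of the idea follows, in Python. It is an assumption-laden illustration, not GROBID's implementation: the `Token` class, `tokens_for_span` and `line_boxes` are hypothetical names, and the coordinates are made up. Tokens are assigned to a sentence by character offset, then merged into one box per line. Under this scheme, any token that fails to be matched to its sentence would leave that sentence with incomplete boxes.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


@dataclass
class Token:
    """Hypothetical stand-in for a layout token with its position on the page."""
    text: str
    offset: int  # character offset of the token in the paragraph text
    x: float
    y: float
    width: float
    height: float


def tokens_for_span(tokens: List[Token], start: int, end: int) -> List[Token]:
    """Return the tokens whose character range overlaps [start, end)."""
    return [t for t in tokens if t.offset < end and t.offset + len(t.text) > start]


def line_boxes(tokens: List[Token]) -> List[Box]:
    """Merge tokens into one bounding box per line (tokens sharing the same y)."""
    lines: Dict[float, List[Token]] = {}
    for t in tokens:
        lines.setdefault(t.y, []).append(t)
    boxes: List[Box] = []
    for y, line in sorted(lines.items()):
        x0 = min(t.x for t in line)
        x1 = max(t.x + t.width for t in line)
        boxes.append((x0, y, x1 - x0, max(t.height for t in line)))
    return boxes


# Made-up tokens for the paragraph "Hello world. Bye."
tokens = [
    Token("Hello", 0, 10.0, 100.0, 30.0, 10.0),
    Token("world.", 6, 45.0, 100.0, 35.0, 10.0),
    Token("Bye.", 13, 10.0, 115.0, 25.0, 10.0),
]
sentence_spans = [(0, 12), (13, 17)]
coords = [line_boxes(tokens_for_span(tokens, s, e)) for s, e in sentence_spans]
```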
PR #821 fixes the problem, which was due to a leftover from the reference pattern (the year pattern) that was missing in the XML. All the coordinates for sentence elements now look good:
Also good for https://github.com/DataSeer/dataseer-web/issues/461
For this example (preprint):
Uploading document_sentence_segmentation_issues.pdf…
we get some incomplete bounding boxes for the sentence-level coordinates; see the last 5 sentences of this paragraph: