feat: Create a backend to transform PubMed XML files to DoclingDocument #557

lucas-morin · 2024-12-10T14:57:45Z

feat: Create a backend to transform PubMed XML files to DoclingDocument

Introduce PubMedDocumentBackend to (1) parse elements from PubMed XML files and to (2) convert them to a DoclingDocument. Convert to DoclingDocument the authors, the abstract, the main-body text, the tables, the tables captions, the figures captions and the references. The hierarchy of the main-body text is preserved.
Add end-to-end tests for PubMed XML conversion (./tests/test_backend_xml.py).

Known limitations:

Compared to the original PDF documents, some sections are missing: page footers, page headers, figures, sections outside the main-body (for instance: Authors’ Contribution, Funding, Conflict of Interest, Acknowledgment, or Keywords).
Column headers for tables with multi-column headers are duplicated.
Some PubMed XML tables (stored as HTML) are occasionally not properly converted to Docling tables and then ignored.

Issue resolved by this Pull Request:
Resolves #446 (Partially)

Checklist:

Documentation has been updated, if necessary.
Examples have been added, if necessary.
Tests have been added, if necessary.

mergify · 2024-12-10T14:59:03Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

#approved-reviews-by >= 2

pyproject.toml

docling/datamodel/base_models.py

docling/backend/pubmed_backend.py

tests/test_backend_pubmed.py

ceberam · 2024-12-13T13:10:58Z

Some additional comments:

Rename the input format objects from PUBMED to XML_PUBMED for verity clarity (first the type, then the collection)
rename test files from nxml to xml (since the .xml is the only extension for PubMed, if InputFormat object is not provided)
remove opencv-python = "^4.10.0.84" from pyproject.toml. It's already a dependency of rapidocr-onnxruntime.
rebase to latest main
style suggestions (but it's up to you):
- no need to import Dict, List, .... from typing library, you can use the built-in types dict, list, ... respectively
- you can shorten expressions like if len(reference_node.xpath("article-title")) == 0 by using falsy and truthy values: if not reference_node.xpath("article-title"), and instead of if reference["author_names"] != "" you can use if reference["author_names"]

docling/datamodel/document.py

dolfim-ibm

I would like to understand how this is affecting the backend/type selection:

What happens if we input is a .xml file not matching the PubMed schema?
What happens if the input is .nxml instead of .xml?

dolfim-ibm · 2024-12-16T09:11:07Z

I think we should switch all function typing to Pydantic BaseModel.

lucas-morin · 2024-12-16T09:14:09Z

If we input a .xml file which do not match the JATS schema, docling raises an error:

docling.exceptions.ConversionError: File format not allowed: /home/lum/Documents/public/docling/tests/data/pubmed/test.xml

If we rename one of the test documents from .xml to .nxml, the conversion remains the same.

dolfim-ibm · 2024-12-16T09:23:47Z

If we input a .xml file which do not match the JATS schema, docling raises an error:

docling.exceptions.ConversionError: File format not allowed: /home/lum/Documents/public/docling/tests/data/pubmed/test.xml

2. If we rename one of the test documents from `.xml` to `.nxml`, the conversion remains the same.

This is good. My last question on this topic, do we have already a way for a different XML format?

ceberam · 2024-12-16T09:47:29Z

This is good. My last question on this topic, do we have already a way for a different XML format?

@dolfim-ibm This is addressed in #606 , which should be rebased (and therefore slightly refactored) once this PR #557 is merged.

@lucas-morin as discussed, let's give a more readable and type-safe schema to the parsed content instead of generic dictionaries (dict[str, Any]) by leveraging pydantic models. Also, we should rebase to pick-up the commits from last week.

lucas-morin · 2024-12-16T15:05:25Z

Now, I improved the typing by removing Any types in dictionaries. I used TypedDict instead of Pydantic BaseModel, is it also fine?

PeterStaar-IBM

let's group all xml-related backends together in a subfolder,

from docling.backend.pubmed_backend import PubMedDocumentBackend

should become,

from docling.backend.xml.pubmed_backend import PubMedDocumentBackend

docling/backend/pubmed_backend.py

docling/datamodel/base_models.py

docling/datamodel/document.py

docling/document_converter.py

tests/test_backend_pubmed.py

Signed-off-by: lucas-morin <lucas.morin222@gmail.com>

ceberam

🏆

The code was updated accordingly.

lucas-morin requested review from ceberam and dolfim-ibm December 10, 2024 14:57

lucas-morin self-assigned this Dec 10, 2024

lucas-morin mentioned this pull request Dec 10, 2024

feat: Create a backend to transform PubMed XML files to DoclingDocument #517

Closed

dolfim-ibm previously requested changes Dec 10, 2024

View reviewed changes

ceberam reviewed Dec 10, 2024

View reviewed changes

tests/test_backend_pubmed.py Outdated Show resolved Hide resolved

lucas-morin force-pushed the dev/pubmed-xml-backend branch 2 times, most recently from 9e03088 to b7696c3 Compare December 12, 2024 13:41

lucas-morin force-pushed the dev/pubmed-xml-backend branch 2 times, most recently from 8b68d01 to 4b02895 Compare December 13, 2024 14:29

ceberam reviewed Dec 13, 2024

View reviewed changes

docling/datamodel/document.py Outdated Show resolved Hide resolved

lucas-morin force-pushed the dev/pubmed-xml-backend branch from 4b02895 to 4044515 Compare December 13, 2024 15:43

dolfim-ibm reviewed Dec 16, 2024

View reviewed changes

ceberam mentioned this pull request Dec 16, 2024

feat: create a backend to parse USPTO patents into DoclingDocument #606

Merged

3 tasks

lucas-morin force-pushed the dev/pubmed-xml-backend branch from 4044515 to ebeffec Compare December 16, 2024 15:00

lucas-morin force-pushed the dev/pubmed-xml-backend branch from ebeffec to 2903490 Compare December 16, 2024 16:54

PeterStaar-IBM previously requested changes Dec 17, 2024

View reviewed changes

dolfim-ibm reviewed Dec 17, 2024

View reviewed changes

docling/backend/pubmed_backend.py Outdated Show resolved Hide resolved

docling/backend/pubmed_backend.py Outdated Show resolved Hide resolved

dolfim-ibm reviewed Dec 17, 2024

View reviewed changes

docling/datamodel/base_models.py Outdated Show resolved Hide resolved

lucas-morin force-pushed the dev/pubmed-xml-backend branch 3 times, most recently from c488cd7 to 22733c0 Compare December 17, 2024 17:18

ceberam reviewed Dec 17, 2024

View reviewed changes

docling/datamodel/document.py Outdated Show resolved Hide resolved

ceberam reviewed Dec 17, 2024

View reviewed changes

docling/document_converter.py Outdated Show resolved Hide resolved

ceberam reviewed Dec 17, 2024

View reviewed changes

tests/test_backend_pubmed.py Outdated Show resolved Hide resolved

feat: Create a backend to transform PubMed XML files to DoclingDocument

2490a09

Signed-off-by: lucas-morin <lucas.morin222@gmail.com>

lucas-morin force-pushed the dev/pubmed-xml-backend branch from 22733c0 to 2490a09 Compare December 17, 2024 17:46

ceberam approved these changes Dec 17, 2024

View reviewed changes

cau-git approved these changes Dec 17, 2024

View reviewed changes

lucas-morin merged commit fd03480 into DS4SD:main Dec 17, 2024
7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Create a backend to transform PubMed XML files to DoclingDocument #557

feat: Create a backend to transform PubMed XML files to DoclingDocument #557

lucas-morin commented Dec 10, 2024 •

edited

Loading

mergify bot commented Dec 10, 2024 •

edited

Loading

ceberam commented Dec 13, 2024 •

edited

Loading

dolfim-ibm left a comment

dolfim-ibm commented Dec 16, 2024

lucas-morin commented Dec 16, 2024

dolfim-ibm commented Dec 16, 2024

ceberam commented Dec 16, 2024 •

edited

Loading

lucas-morin commented Dec 16, 2024

PeterStaar-IBM left a comment

ceberam left a comment

feat: Create a backend to transform PubMed XML files to DoclingDocument #557

feat: Create a backend to transform PubMed XML files to DoclingDocument #557

Conversation

lucas-morin commented Dec 10, 2024 • edited Loading

mergify bot commented Dec 10, 2024 • edited Loading

Merge Protections

🟢 Enforce conventional commit

🟢 Require two reviewer for test updates

ceberam commented Dec 13, 2024 • edited Loading

dolfim-ibm left a comment

Choose a reason for hiding this comment

dolfim-ibm commented Dec 16, 2024

lucas-morin commented Dec 16, 2024

dolfim-ibm commented Dec 16, 2024

ceberam commented Dec 16, 2024 • edited Loading

lucas-morin commented Dec 16, 2024

PeterStaar-IBM left a comment

Choose a reason for hiding this comment

ceberam left a comment

Choose a reason for hiding this comment

lucas-morin commented Dec 10, 2024 •

edited

Loading

mergify bot commented Dec 10, 2024 •

edited

Loading

ceberam commented Dec 13, 2024 •

edited

Loading

ceberam commented Dec 16, 2024 •

edited

Loading