Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Create a backend to transform PubMed XML files to DoclingDocument #557

Merged
merged 1 commit into from
Dec 17, 2024

Conversation

lucas-morin
Copy link
Contributor

@lucas-morin lucas-morin commented Dec 10, 2024

feat: Create a backend to transform PubMed XML files to DoclingDocument

  • Introduce PubMedDocumentBackend to (1) parse elements from PubMed XML files and to (2) convert them to a DoclingDocument. Convert to DoclingDocument the authors, the abstract, the main-body text, the tables, the tables captions, the figures captions and the references. The hierarchy of the main-body text is preserved.
  • Add end-to-end tests for PubMed XML conversion (./tests/test_backend_xml.py).

Known limitations:

  • Compared to the original PDF documents, some sections are missing: page footers, page headers, figures, sections outside the main-body (for instance: Authors’ Contribution, Funding, Conflict of Interest, Acknowledgment, or Keywords).
  • Column headers for tables with multi-column headers are duplicated.
  • Some PubMed XML tables (stored as HTML) are occasionally not properly converted to Docling tables and then ignored.

Issue resolved by this Pull Request:
Resolves #446 (Partially)

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

@lucas-morin lucas-morin self-assigned this Dec 10, 2024
Copy link

mergify bot commented Dec 10, 2024

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

pyproject.toml Outdated Show resolved Hide resolved
pyproject.toml Outdated Show resolved Hide resolved
docling/datamodel/base_models.py Outdated Show resolved Hide resolved
docling/backend/pubmed_backend.py Outdated Show resolved Hide resolved
docling/backend/pubmed_backend.py Outdated Show resolved Hide resolved
docling/backend/pubmed_backend.py Outdated Show resolved Hide resolved
@lucas-morin lucas-morin force-pushed the dev/pubmed-xml-backend branch 2 times, most recently from 9e03088 to b7696c3 Compare December 12, 2024 13:41
@ceberam
Copy link
Contributor

ceberam commented Dec 13, 2024

Some additional comments:

  • Rename the input format objects from PUBMED to XML_PUBMED for verity clarity (first the type, then the collection)
  • rename test files from nxml to xml (since the .xml is the only extension for PubMed, if InputFormat object is not provided)
  • remove opencv-python = "^4.10.0.84" from pyproject.toml. It's already a dependency of rapidocr-onnxruntime.
  • rebase to latest main
  • style suggestions (but it's up to you):
    • no need to import Dict, List, .... from typing library, you can use the built-in types dict, list, ... respectively
    • you can shorten expressions like if len(reference_node.xpath("article-title")) == 0 by using falsy and truthy values: if not reference_node.xpath("article-title"), and instead of if reference["author_names"] != "" you can use if reference["author_names"]

@lucas-morin lucas-morin force-pushed the dev/pubmed-xml-backend branch 2 times, most recently from 8b68d01 to 4b02895 Compare December 13, 2024 14:29
@lucas-morin lucas-morin force-pushed the dev/pubmed-xml-backend branch from 4b02895 to 4044515 Compare December 13, 2024 15:43
Copy link
Contributor

@dolfim-ibm dolfim-ibm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to understand how this is affecting the backend/type selection:

  1. What happens if we input is a .xml file not matching the PubMed schema?
  2. What happens if the input is .nxml instead of .xml?

@dolfim-ibm
Copy link
Contributor

I think we should switch all function typing to Pydantic BaseModel.

@lucas-morin
Copy link
Contributor Author

  1. If we input a .xml file which do not match the JATS schema, docling raises an error:
docling.exceptions.ConversionError: File format not allowed: /home/lum/Documents/public/docling/tests/data/pubmed/test.xml
  1. If we rename one of the test documents from .xml to .nxml, the conversion remains the same.

@dolfim-ibm
Copy link
Contributor

  1. If we input a .xml file which do not match the JATS schema, docling raises an error:
docling.exceptions.ConversionError: File format not allowed: /home/lum/Documents/public/docling/tests/data/pubmed/test.xml
2. If we rename one of the test documents from `.xml` to `.nxml`, the conversion remains the same.

This is good. My last question on this topic, do we have already a way for a different XML format?

@ceberam
Copy link
Contributor

ceberam commented Dec 16, 2024

This is good. My last question on this topic, do we have already a way for a different XML format?

@dolfim-ibm This is addressed in #606 , which should be rebased (and therefore slightly refactored) once this PR #557 is merged.

@lucas-morin as discussed, let's give a more readable and type-safe schema to the parsed content instead of generic dictionaries (dict[str, Any]) by leveraging pydantic models. Also, we should rebase to pick-up the commits from last week.

@lucas-morin
Copy link
Contributor Author

Now, I improved the typing by removing Any types in dictionaries. I used TypedDict instead of Pydantic BaseModel, is it also fine?

@lucas-morin lucas-morin force-pushed the dev/pubmed-xml-backend branch from ebeffec to 2903490 Compare December 16, 2024 16:54
Copy link
Contributor

@PeterStaar-IBM PeterStaar-IBM left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's group all xml-related backends together in a subfolder,

from docling.backend.pubmed_backend import PubMedDocumentBackend

should become,

from docling.backend.xml.pubmed_backend import PubMedDocumentBackend

docling/backend/pubmed_backend.py Outdated Show resolved Hide resolved
docling/backend/pubmed_backend.py Outdated Show resolved Hide resolved
@lucas-morin lucas-morin force-pushed the dev/pubmed-xml-backend branch 3 times, most recently from c488cd7 to 22733c0 Compare December 17, 2024 17:18
Signed-off-by: lucas-morin <lucas.morin222@gmail.com>
@lucas-morin lucas-morin force-pushed the dev/pubmed-xml-backend branch from 22733c0 to 2490a09 Compare December 17, 2024 17:46
Copy link
Contributor

@ceberam ceberam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🏆

@lucas-morin lucas-morin dismissed stale reviews from PeterStaar-IBM and dolfim-ibm December 17, 2024 18:25

The code was updated accordingly.

@lucas-morin lucas-morin merged commit fd03480 into DS4SD:main Dec 17, 2024
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create a backend to transform XML files to DoclingDocument
5 participants