feat: Create a backend to transform PubMed XML files to DoclingDocument #557
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit
Wonderful, this rule succeeded.
Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
🟢 Require two reviewer for test updates
Wonderful, this rule succeeded.
When test data is updated, we require two reviewers
Force-pushed from 9e03088 to b7696c3.
Some additional comments:
Force-pushed from 8b68d01 to 4b02895.
Force-pushed from 4b02895 to 4044515.
I would like to understand how this affects the backend/type selection:
- What happens if the input is an .xml file that does not match the PubMed schema?
- What happens if the input is .nxml instead of .xml?
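To make the two questions above concrete, here is a minimal sketch of how a PubMed-style file could be recognized by content rather than by extension alone. It is not the selection logic used in this PR; the function names `looks_like_pubmed_xml` and `is_candidate_file` are hypothetical, and the check relies only on the JATS convention that an article has an `<article>` root containing `<front>/<article-meta>`.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def is_candidate_file(path: str) -> bool:
    """Accept both .xml and .nxml extensions (hypothetical helper)."""
    return Path(path).suffix.lower() in {".xml", ".nxml"}


def looks_like_pubmed_xml(xml_text: str) -> bool:
    """Heuristic content check (hypothetical helper).

    Returns True only for well-formed XML whose root is <article>
    and which contains <front>/<article-meta>, as JATS articles do.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False  # not well-formed XML at all
    if root.tag != "article":
        return False  # some other XML vocabulary
    return root.find("front/article-meta") is not None
```

With a check of this kind, an .xml file from another schema would be rejected before any PubMed-specific parsing starts, and .nxml files would be treated the same way as .xml files.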
I think we should switch all function typing to Pydantic BaseModel.
This is good. My last question on this topic: do we already have a way to handle a different XML format?
@dolfim-ibm This is addressed in #606, which should be rebased (and therefore slightly refactored) once this PR #557 is merged. @lucas-morin as discussed, let's give a more readable and type-safe schema to the parsed content instead of generic dictionaries (…)
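As an illustration of what a typed schema in place of generic dictionaries might look like, here is a minimal sketch. It uses standard-library dataclasses rather than Pydantic so that it runs without extra dependencies; a Pydantic BaseModel version would have the same shape. All class and field names (`Author`, `Table`, `ParsedArticle`) are hypothetical and are not taken from this PR.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Author:
    """One author entry (hypothetical schema)."""
    name: str
    affiliations: list[str] = field(default_factory=list)


@dataclass
class Table:
    """One table with its caption (hypothetical schema)."""
    label: str
    caption: str
    content: str  # raw table markup as found in the XML


@dataclass
class ParsedArticle:
    """Typed container replacing a generic dict of parsed content."""
    title: str
    abstract: str
    authors: list[Author] = field(default_factory=list)
    tables: list[Table] = field(default_factory=list)


# Usage: attribute access is checked by type checkers, unlike dict keys.
article = ParsedArticle(
    title="Sample title",
    abstract="Short abstract.",
    authors=[Author(name="Jane Doe")],
)
first_author = article.authors[0].name
```

Compared with `parsed["authors"][0]["name"]`, a misspelled field here is reported by a type checker or fails immediately with an AttributeError instead of surfacing later as a missing key.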
Force-pushed from 4044515 to ebeffec.
I have now improved the typing by removing …
Force-pushed from ebeffec to 2903490.
Let's group all XML-related backends together in a subfolder, so that
from docling.backend.pubmed_backend import PubMedDocumentBackend
becomes
from docling.backend.xml.pubmed_backend import PubMedDocumentBackend
Force-pushed from c488cd7 to 22733c0.
Signed-off-by: lucas-morin <lucas.morin222@gmail.com>
Force-pushed from 22733c0 to 2490a09.
🏆
The code was updated accordingly.
feat: Create a backend to transform PubMed XML files to DoclingDocument
Adds PubMedDocumentBackend to (1) parse elements from PubMed XML files and to (2) convert them to a DoclingDocument.
- Converts to DoclingDocument the authors, the abstract, the main-body text, the tables, the table captions, the figure captions and the references. The hierarchy of the main-body text is preserved.
- Tests: ./tests/test_backend_xml.py
Known limitations:
- … (Authors’ Contribution, Funding, Conflict of Interest, Acknowledgment, or Keywords).
Issue resolved by this Pull Request:
Resolves #446 (Partially)
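For readers unfamiliar with the input format, the parsing step (1) in the description above can be sketched with the standard library alone. This is not the code of PubMedDocumentBackend; `parse_article`, `SAMPLE`, and the returned dictionary layout are hypothetical, and the sketch assumes the usual JATS element names (`article-title`, `contrib`, `abstract`, nested `sec` with `title`). It covers only the title, authors, abstract, and section hierarchy.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written JATS-like document (hypothetical sample data).
SAMPLE = """<article>
  <front><article-meta>
    <title-group><article-title>Sample title</article-title></title-group>
    <contrib-group><contrib contrib-type="author">
      <name><surname>Doe</surname><given-names>Jane</given-names></name>
    </contrib></contrib-group>
    <abstract><p>Short abstract.</p></abstract>
  </article-meta></front>
  <body>
    <sec><title>Introduction</title><p>Text.</p>
      <sec><title>Background</title><p>More.</p></sec>
    </sec>
    <sec><title>Methods</title></sec>
  </body>
</article>"""


def _text(elem):
    """Collapse all text under elem into one whitespace-normalized string."""
    if elem is None:
        return ""
    return " ".join("".join(elem.itertext()).split())


def _sections(parent, level=1):
    """Return (level, title) pairs, preserving the nesting of <sec>."""
    out = []
    for sec in parent.findall("sec"):
        out.append((level, _text(sec.find("title"))))
        out.extend(_sections(sec, level + 1))
    return out


def parse_article(xml_text):
    root = ET.fromstring(xml_text)
    meta = root.find("front/article-meta")
    if meta is None:
        raise ValueError("no <front>/<article-meta>: not a JATS-style article")
    authors = []
    for name in meta.findall("contrib-group/contrib/name"):
        given = _text(name.find("given-names"))
        surname = _text(name.find("surname"))
        authors.append(f"{given} {surname}".strip())
    body = root.find("body")
    return {
        "title": _text(meta.find("title-group/article-title")),
        "authors": authors,
        "abstract": _text(meta.find("abstract")),
        "sections": _sections(body) if body is not None else [],
    }


parsed = parse_article(SAMPLE)
```

The `(level, title)` pairs are one simple way to carry the section hierarchy forward to a later conversion step; the actual backend may represent it differently.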