Chunking Hierarchy Identification #287
Comments
+1
@Shubhamkumar782 Docling produces a DoclingDocument data structure, which can be passed to the HierarchicalChunker. I think this solves your problem: just load the HierarchicalChunker from docling-core and call its chunk method. Close the issue if this helps you!
@PeterStaar-IBM I see the chunk method in the HierarchicalChunker from docling-core, but the level of each section-header comes from the dl_doc: DLDocument. And in this #77
@qianyue76 Yes, this is correct. We need to add a new feature to identify the level of section-headers in PDFs (for docx, html and pptx, this comes for free). In a PDF it is hard to identify the section-header level, but we have some ideas for how to tackle it. Of course, if you want to collaborate on that, please let us know!
@PeterStaar-IBM Sounds great. I'm also trying to figure out how to solve this problem now, so collaboration might be a good way forward.
OK, please look at the email address in MAINTAINERS.md. We can set up a quick sync and discuss next steps!
@PeterStaar-IBM Got it! Should I do something first?
Just write us an email, and we will follow up!
I have tried HierarchicalChunker, but section headings are still not being captured. I am also trying to solve this problem. I tried analysing font size and bold information, but that is not enough to decide whether a line is a section heading. For now I have written regex-based chunking, which is very efficient, but it only works for the documents in my current use case; it is patchwork and tough to generalize. There are some possible approaches with vision-based models, but I have not tried them yet; I have only seen a few examples. If I find something that works, I will post an update.
@Shubhamkumar782 We have ideas to solve this holistically, and are open to collaboration. Regex and similar methods will only work for very specific use-cases: not bad, but also not very satisfying. If you have a working regex, you could always update the section-headers in the DoclingDocument with the levels it identifies.
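For illustration, here is a minimal sketch of the kind of regex-based level inference mentioned above. It assumes headings carry dotted section numbers such as "1.1.2 Scope"; the pattern and the `infer_level` helper are hypothetical and not part of Docling. It works on plain strings only, so writing the levels back into a DoclingDocument is not shown.

```python
import re

# Hypothetical pattern: a heading that starts with dotted numbering,
# e.g. "2 Methods", "2.1 Data" or "2.1.3. Results".
NUMBERED_HEADING = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+\S")


def infer_level(heading_text: str, default: int = 1) -> int:
    """Return the nesting depth implied by a dotted section number."""
    match = NUMBERED_HEADING.match(heading_text)
    if match is None:
        # Unnumbered heading: fall back to the default level.
        return default
    # "1" -> level 1, "1.1" -> level 2, "1.1.2" -> level 3, and so on.
    return match.group(1).count(".") + 1


headings = ["1 Introduction", "1.1 Background", "1.1.2 Scope", "Appendix"]
levels = [infer_level(h) for h in headings]
for text, level in zip(headings, levels):
    print(level, text)
```

As noted in this thread, a pattern like this only fits documents that follow one numbering convention. A line such as "2024 Annual report" would also match, so treat the result as a heuristic.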
Good to hear. Actually, the regex I have only works for specific documents, so even if I use it to update the DoclingDocument, I will not be able to achieve what I want.
I'm also motivated to solve this problem; not just for PDFs but also docx input. I used to use |
Question
I am working on a custom chunking method where I need to identify headings, subheadings, and child headings separately. Here's the detailed explanation:
Current Issue:
I am using Docling to tag headings in a PDF.
Currently, all nested headings (subheadings and child headings) are marked as regular headings with ##.
There is no differentiation between parent headings and sub-level headings.
Objective:
I want to store section headings and titles as metadata for the content under each subheading.
Example: For a PDF with 3 chapters, each having multiple subheadings, the chunk should have:
Chapter Name as the Title.
Subheading as the Section Heading.
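The objective above can be sketched in a few lines, assuming heading levels are already known. Everything here is hypothetical (the `Chunk` class, `attach_headings`, and the tuple format) and independent of Docling's own chunker. It keeps a stack of the headings currently in scope and copies it onto each piece of content.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    headings: list[str] = field(default_factory=list)  # outermost first


def attach_headings(items):
    """items: ("heading", level, text) or ("text", None, text) tuples."""
    stack = []   # (level, heading text) pairs currently in scope
    chunks = []
    for kind, level, text in items:
        if kind == "heading":
            # A new heading closes every heading at the same or a deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, text))
        else:
            chunks.append(Chunk(text=text, headings=[h for _, h in stack]))
    return chunks


items = [
    ("heading", 1, "Chapter 1"),
    ("heading", 2, "Overview"),
    ("text", None, "First paragraph."),
    ("heading", 2, "Details"),
    ("text", None, "Second paragraph."),
    ("heading", 1, "Chapter 2"),
    ("text", None, "Third paragraph."),
]
chunks = attach_headings(items)
for chunk in chunks:
    print(chunk.headings, "->", chunk.text)
```

With this layout, `headings[0]` plays the role of the chapter title and `headings[-1]` the nearest section heading. The hard part, as the rest of the issue explains, is getting reliable levels in the first place.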
Current Limitation:
While I can extract the lowest level of headings, I am unable to identify the parent headings since the tags do not differentiate between them.
Assumption for Hierarchy:
I assume that chapter names are typically larger in font size compared to subheadings and child headings.
A hierarchy based on text size or boldness could be useful to identify different levels of headings.
Question:
Is there a way to distinguish headings, subheadings, and child headings separately based on these characteristics (e.g., font size, boldness)?
Any solution or guidance to achieve this would be highly appreciated.
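To make the font-size assumption concrete, here is a minimal sketch that ranks headings by size, with the largest tier becoming level 1. It assumes a font size is available for each heading from some other source; the input format, the `levels_from_font_sizes` helper, and the `tolerance` parameter are all hypothetical and not Docling API.

```python
def levels_from_font_sizes(headings, tolerance=0.5):
    """headings: list of (text, font_size) pairs.

    Returns (text, level) pairs in the original order, where level 1 is
    the largest font tier. Sizes within `tolerance` of the top of a tier
    are treated as the same tier.
    """
    sizes = sorted({size for _, size in headings}, reverse=True)
    level_by_size = {}
    tier_top = None
    level = 0
    for size in sizes:
        # Open a new tier when the gap to the current tier's top is too large.
        if tier_top is None or tier_top - size > tolerance:
            tier_top = size
            level += 1
        level_by_size[size] = level
    return [(text, level_by_size[size]) for text, size in headings]


sample = [
    ("Chapter 1", 18.0),
    ("Overview", 14.0),
    ("Key terms", 12.0),
    ("Chapter 2", 18.2),
    ("Methods", 14.0),
]
result = levels_from_font_sizes(sample)
for text, level in result:
    print(level, text)
```

This is only a heuristic. It breaks down when different heading levels share a size and differ only in weight or numbering, so it may need to be combined with other signals.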