Chunking Hierarchy Identification #287

Open
Shubhamkumar782 opened this issue Nov 9, 2024 · 12 comments
Labels: PDF parsing, question (Further information is requested)

Comments

@Shubhamkumar782

Question
I am working on a custom chunking method where I need to identify headings, subheadings, and child headings separately. Here's the detailed explanation:

Current Issue:

- I am using Docling to tag headings in a PDF.
- Currently, all nested headings (subheadings and child headings) are marked as regular headings with ##.
- There is no differentiation between parent headings and sub-level headings.

Objective:

- I want to store section headings and titles as metadata for the content under each subheading.
- Example: For a PDF with 3 chapters, each having multiple subheadings, the chunk should have:
  - Chapter Name as the Title.
  - Subheading as the Section Heading.

Current Limitation:

- While I can extract the lowest level of headings, I am unable to identify the parent headings since the tags do not differentiate between them.

Assumption for Hierarchy:

- I assume that chapter names are typically larger in font size compared to subheadings and child headings.
- A hierarchy based on text size or boldness could be useful to identify different levels of headings.

Question:

- Is there a way to distinguish headings, subheadings, and child headings separately based on these characteristics (e.g., font size, boldness)?
- Any solution or guidance to achieve this would be highly appreciated.
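
To make the assumption above concrete, here is a minimal sketch of ranking headings by font size, with boldness as a tie-breaker. It is not part of Docling: the `Span` class and `assign_heading_levels` function are hypothetical names, and the font sizes are assumed to have been extracted already by some other tool. The heuristic only holds for documents where heading levels really do differ in size or weight.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical container for one detected heading (not a Docling type)."""
    text: str
    font_size: float
    bold: bool = False


def assign_heading_levels(headings):
    """Return (level, text) pairs: the largest/boldest style gets level 1."""
    # Each distinct (font_size, bold) style becomes one level.
    # Sorting in reverse puts larger sizes first, and bold before non-bold
    # when the sizes are equal.
    styles = sorted({(h.font_size, h.bold) for h in headings}, reverse=True)
    level_by_style = {style: i for i, style in enumerate(styles, start=1)}
    return [(level_by_style[(h.font_size, h.bold)], h.text) for h in headings]


headings = [
    Span("Chapter 1", 18.0, True),
    Span("1.1 Scope", 14.0, True),
    Span("1.1.1 Details", 14.0, False),
    Span("Chapter 2", 18.0, True),
]
levels = assign_heading_levels(headings)
```

Treating every distinct style as its own level is a deliberate simplification; real PDFs often need sizes that differ by a fraction of a point to be merged.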

@Shubhamkumar782 added the question label Nov 9, 2024
@PeterStaar-IBM self-assigned this Nov 11, 2024
@AlessandroSpallina

+1

@PeterStaar-IBM
Contributor

PeterStaar-IBM commented Nov 13, 2024

@Shubhamkumar782 Docling produces a DoclingDocument data-structure, which can be used in the HierarchicalChunker.

I think this solves your problem: just load the HierarchicalChunker from docling-core and use its chunk method.

Close the issue if this helps you!
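
A rough sketch of what that suggestion might look like in practice, assuming both `docling` and `docling-core` are installed. The wrapper function `chunk_with_headings` is a hypothetical name; the imports are done inside it so the snippet can be read without the libraries present. Note that for PDF input the headings metadata may contain only a single level, as the replies in this thread discuss.

```python
def chunk_with_headings(source):
    """Convert `source` with Docling and return (headings, text) per chunk.

    `source` is a path or URL to a document. This is an illustrative
    wrapper, not an official Docling helper.
    """
    # Lazy imports: only needed when the function is actually called.
    from docling.document_converter import DocumentConverter
    from docling_core.transforms.chunker import HierarchicalChunker

    # Parse the input into a DoclingDocument.
    doc = DocumentConverter().convert(source).document

    # Each chunk carries the headings it sits under in its metadata.
    chunker = HierarchicalChunker()
    return [(chunk.meta.headings, chunk.text) for chunk in chunker.chunk(dl_doc=doc)]
```

Calling `chunk_with_headings("report.pdf")` (the file name is a placeholder) would then give one tuple per chunk, which can be inspected to see how many heading levels were actually detected.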

@qianyue76

@PeterStaar-IBM I see the chunk method in the HierarchicalChunker from docling-core, but the section-header level comes from the dl_doc: DLDocument. And #77 says:

"At the moment, the system is detecting only one level of section headers."

Are deeper levels of section headers supported now, or did I miss something?

@PeterStaar-IBM
Contributor

@qianyue76 Yes, this is correct. We need to add a new feature to identify the level of the section-headers in PDF (for docx, html and pptx, this comes for free). Within a PDF, it is hard to identify the section-header level, but we have some ideas on how to tackle it.

Of course, if you want to collaborate on that, please let us know!

@qianyue76

@PeterStaar-IBM Sounds great. I'm also trying to figure out how to solve this problem right now, so collaboration might be a good way forward.

@PeterStaar-IBM
Contributor

OK, please look at the email address in MAINTAINERS.md. We can set up a quick sync and discuss next steps!

@qianyue76

@PeterStaar-IBM I get it! Should I do something first?

@PeterStaar-IBM
Contributor

Just write us an email, and we will follow up!

@Shubhamkumar782
Author

@PeterStaar-IBM,

I have tried HierarchicalChunker, but the section headings are still not being captured.

I am also trying to solve this problem. I tried analysing the font size and bold information, but that is not enough to tell whether a line is a section heading: in some documents the heading is the same size as the body text and is not bold either.

For now I have written regex-based chunking, which is very efficient, but it only works for the documents in my current use case. It is patchwork and tough to generalize.

There are some possible approaches with vision-based models, but I have not tried them yet; I have only seen a few examples.

If I find a way, I will post an update.

@PeterStaar-IBM
Contributor

@Shubhamkumar782 we have ideas to solve this holistically, and are open to collaboration. Regex and other methods will only work for very specific use-cases, not bad but also not very satisfying.

If you have a working regex, you could always use it to update the level of the section-headers in the DoclingDocument.
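
As an illustration of this kind of post-processing, the sketch below infers a level from dotted numbering such as "1.4.2 Title" and writes it back to each header. It assumes headings carry such numbering, which many documents do not. The `Header` class is a hypothetical stand-in for whatever section-header objects are being updated (anything with `text` and `level` attributes), and `numbering_depth` and `relevel` are illustrative names, not Docling functions.

```python
import re
from dataclasses import dataclass

# A leading number, then optional ".number" or ".letter" parts, an optional
# trailing dot, and whitespace before the heading title.
NUMBERING = re.compile(r"^\s*(\d+(?:\.(?:\d+|[A-Za-z]))*)\.?\s+\S")


@dataclass
class Header:
    """Hypothetical stand-in for a section-header item."""
    text: str
    level: int = 1


def numbering_depth(text):
    """Return the depth implied by the numbering, or None if there is none."""
    match = NUMBERING.match(text)
    if match is None:
        return None
    # "1.4.2" has two dots, so three components, so depth 3.
    return match.group(1).count(".") + 1


def relevel(headers):
    """Overwrite `level` in place wherever the text carries a numbering."""
    for header in headers:
        depth = numbering_depth(header.text)
        if depth is not None:
            header.level = depth
    return headers


headers = relevel([
    Header("1 Introduction"),
    Header("1.4 Scope"),
    Header("1.4.b.15 Example"),
    Header("Appendix"),  # no numbering: level is left unchanged
])
```

Headers without a recognizable numbering keep whatever level they already had, so the pass is safe to run on mixed documents, but it cannot recover a hierarchy that the text itself does not encode.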

@Shubhamkumar782
Author

@PeterStaar-IBM,

Good to hear.

Actually, the regex I have only works for specific documents, so even if I use it to update the DoclingDocument, I will not be able to achieve what I want.

@JonZeolla

I'm also motivated to solve this problem, not just for PDFs but also for docx input. I used to use pandoc for this, and it maintained the headings/hierarchy so I could use the langchain chunkers, but it would lose details: for example, a heading such as 1.4.b.15 Example would be replaced with something like ##### Example, whereas in my use case I need the original information, like ##### 1.4.b.15 Example.
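
Once heading levels are known from any source, attaching them to chunks while keeping the original heading text (numbering included) can be sketched with a simple stack. This is a generic illustration, not how Docling or pandoc work internally; the function name `attach_heading_paths`, the `(level, text)` input format, and the metadata keys are all assumptions made for the example.

```python
def attach_heading_paths(blocks):
    """Attach the enclosing headings to every body-text block.

    `blocks` is a sequence of (level, text) pairs in reading order, where
    `level` is an int for headings and None for body text.
    """
    stack = []   # (level, heading text) pairs, outermost heading first
    chunks = []
    for level, text in blocks:
        if level is None:
            path = [heading for _, heading in stack]
            chunks.append({
                "title": path[0] if path else None,
                "section_heading": path[-1] if len(path) > 1 else None,
                "heading_path": path,
                "text": text,
            })
        else:
            # A new heading closes every open heading at the same or a
            # deeper level before it is opened itself.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, text))
    return chunks


blocks = [
    (1, "1 Safety"),
    (2, "1.4 Handling"),
    (None, "Wear gloves."),
    (4, "1.4.b.15 Example"),
    (None, "See figure."),
    (1, "2 Storage"),
    (None, "Keep dry."),
]
chunks = attach_heading_paths(blocks)
```

Because headings are stored verbatim, a label such as "1.4.b.15 Example" survives intact in the metadata instead of being reduced to "Example".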

6 participants