Add text extraction with Docling #814

Closed
davidmezzetti opened this issue Nov 20, 2024 · 4 comments

@davidmezzetti
Member

davidmezzetti commented Nov 20, 2024

Docling looks like a promising text extraction library that could possibly augment or replace Apache Tika.

The main integration issue was that it only supported Python 3.10+.
Update: Docling added Python 3.9 support, so this is a go!

@davidmezzetti davidmezzetti self-assigned this Nov 20, 2024
@davidmezzetti davidmezzetti added this to the v8.1.0 milestone Nov 20, 2024
@yukiman76
Contributor

I started to look at this, as I think it would be a great addition, but there seems to be a library incompatibility.

Here is some test code that shows the problem:

import logging
from txtai import Embeddings # remove this line and this will run 
# from chonkie import WordChunker, SemanticChunker, SentenceChunker, SDPMChunker
from docling.document_converter import DocumentConverter

# Configure logger
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

# Create console handler with a debug level
console_handler = logging.StreamHandler()
console_handler.setLevel(logging.DEBUG)

# Create formatter and add it to the handler
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
console_handler.setFormatter(formatter)

# Add the handler to the logger
logger.addHandler(console_handler)

# Send debug logs from the "docling" library to the same console handler
docling_logger = logging.getLogger("docling")
docling_logger.setLevel(logging.DEBUG)  # Adjust level as needed (DEBUG, INFO, WARNING)
docling_logger.addHandler(console_handler)  # Attach the same console handler
docling_logger.propagate = True  # Ensure messages propagate to the root logger


def process_docling(sfile):
    markdown_content = ""
    try:
        logger.info("@@@@" * 10)
        logger.info(f"Processing file: {sfile}")
        docling_converter = DocumentConverter()
        logger.info("Converting file...")
        result = docling_converter.convert(sfile)
        if result is not None:
            logger.info("Exporting to markdown...")
            markdown_content = result.document.export_to_markdown()
    except Exception as e:
        logger.error(f"An error occurred: {e}")

    return markdown_content

if __name__ == "__main__":
    logger.info("Starting the script...")
    file_url = "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
    logger.info(f"Input file: {file_url}")
    output = process_docling(file_url)
    logger.info(f"Output:\n{output}")

The script crashes as soon as the txtai library is imported:

segmentation fault  python test.py
# multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

@davidmezzetti
Member Author

Thanks for the report!

Given you are using macOS, I wonder if this is related to Faiss. Setting the environment variable OMP_NUM_THREADS=1 is usually the workaround that works.

Just mentioned this a couple days ago: #813

I wonder if calling faiss.omp_set_num_threads(1) can solve this programmatically on Macs.
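
As a rough illustration of that idea, here is a minimal sketch. The helper name limit_faiss_threads and its force parameter are hypothetical and not part of txtai; the only Faiss call used is faiss.omp_set_num_threads, the one mentioned above. It assumes the limit is only wanted on macOS, and it also sets OMP_NUM_THREADS, which should only matter for OpenMP runtimes that have not been initialized yet.

```python
import os
import platform


def limit_faiss_threads(force=False):
    """Restrict OpenMP to one thread on macOS (or whenever force=True).

    Hypothetical helper for illustration. Returns True when the limit
    was applied, False when the platform check skipped it.
    """
    if not force and platform.system() != "Darwin":
        return False

    # Environment variable workaround from this thread
    os.environ["OMP_NUM_THREADS"] = "1"

    try:
        import faiss
    except ImportError:
        # Faiss is not installed, so only the environment variable is set
        return True

    # Programmatic equivalent suggested in the comment above
    faiss.omp_set_num_threads(1)
    return True
```

Calling limit_faiss_threads() before any index is built or searched would be the natural place for it, although whether this actually avoids the crash is untested here.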

@yukiman76
Contributor

Running OMP_NUM_THREADS=1 python test.py worked.
But it seems it was part of the upsert, not search.
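
For anyone who wants to check the workaround without editing their shell setup, the same run can be scripted. This is only a sketch: run_single_threaded and crashed_with_segfault are hypothetical helper names, and the segfault check relies on the POSIX convention that subprocess reports a signal-terminated child with a negative return code.

```python
import os
import signal
import subprocess
import sys


def run_single_threaded(command):
    """Run command with OMP_NUM_THREADS=1 added to the current environment."""
    env = dict(os.environ, OMP_NUM_THREADS="1")
    return subprocess.run(command, env=env, capture_output=True, text=True)


def crashed_with_segfault(result):
    """On POSIX, a negative return code is the number of the terminating signal."""
    return result.returncode == -signal.SIGSEGV
```

Passing [sys.executable, "test.py"] to run_single_threaded mirrors the command above, and crashed_with_segfault on the result tells a segmentation fault apart from an ordinary non-zero exit.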

@davidmezzetti
Member Author

I'm thinking of adding faiss.omp_set_num_threads(1) to ann/faiss.py. I feel like macOS users could be lost due to this nagging issue. It would be good to try something.

@davidmezzetti davidmezzetti changed the title from "Investigate integration with Docling" to "Add text extraction with Docling" Dec 3, 2024