Add text extraction with Docling #814

davidmezzetti · 2024-11-20T02:05:09Z

Docling looks like a promising text extraction library that could possibly augment or replace Apache Tika.

Update: Docling added Python 3.9 support, so this is a go!
~~The main integration issue is that it only supports Python 3.10+.~~

yukiman76 · 2024-11-21T19:03:46Z

I started looking at this because I think it would be a great addition, but there seems to be a library incompatibility.

Here is some test code that shows the problem:

import logging
from txtai import Embeddings  # remove this import and the script runs
# from chonkie import WordChunker, SemanticChunker, SentenceChunker, SDPMChunker
from docling.document_converter import DocumentConverter

# Configure logger
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

# Create console handler with a debug level
console_handler = logging.StreamHandler()
console_handler.setLevel(logging.DEBUG)

# Create formatter and add it to the handler
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
console_handler.setFormatter(formatter)

# Add the handler to the logger
logger.addHandler(console_handler)

# Configure debug logging for the "docling" library
docling_logger = logging.getLogger("docling")
docling_logger.setLevel(logging.DEBUG)  # Adjust level as needed (DEBUG, INFO, WARNING)
docling_logger.addHandler(console_handler)  # Attach the same console handler
docling_logger.propagate = True  # Ensure messages propagate to the root logger


def process_docling(sfile):
    markdown_content = ""
    try:
        logger.info("@@@@" * 10)
        logger.info(f"Processing file: {sfile}")
        docling_converter = DocumentConverter()
        logger.info("Converting file...")
        result = docling_converter.convert(sfile)
        if result is not None:
            logger.info("Exporting to markdown...")
            markdown_content = result.document.export_to_markdown()
    except Exception as e:
        logger.error(f"An error occurred: {e}")

    return markdown_content

if __name__ == "__main__":
    logger.info("Starting the script...")
    file_url = "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
    logger.info(f"Input file: {file_url}")
    output = process_docling(file_url)
    logger.info(f"Output:\n{output}")

You get a crash when you import the txtai library:

segmentation fault  python test.py
# multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

davidmezzetti · 2024-11-21T19:09:56Z

Thanks for the report!

Given you are using macOS, I wonder if this is related to Faiss. Setting OMP_NUM_THREADS=1 is usually the workaround that works.

I just mentioned this a couple of days ago in #813.

I wonder if calling faiss.omp_set_num_threads(1) can solve this programmatically on macOS.

yukiman76 · 2024-11-21T21:19:25Z

Running OMP_NUM_THREADS=1 python test.py worked.
But it seems it was part of the upsert, not the search.
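
For scripts where passing the variable on the command line is inconvenient, the same setting can be sketched in Python. This is a minimal sketch, assuming the OpenMP runtime reads OMP_NUM_THREADS when it first initializes, so the variable has to be set before txtai or faiss is imported. cap_openmp_threads is a hypothetical helper name, not part of txtai.

```python
import os


def cap_openmp_threads(count=1):
    """Hypothetical helper: set OMP_NUM_THREADS unless it is already set."""
    # setdefault keeps any value that was already exported in the shell
    os.environ.setdefault("OMP_NUM_THREADS", str(count))
    return os.environ["OMP_NUM_THREADS"]


# Call this before the heavy imports; changing the variable after the
# OpenMP runtime has started is assumed to have no effect.
effective = cap_openmp_threads(1)
print(f"OMP_NUM_THREADS={effective}")

# from txtai import Embeddings  # import only after the variable is set
```

Using setdefault rather than plain assignment means a value chosen on the command line still wins.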

davidmezzetti · 2024-11-22T02:04:01Z

I'm thinking of adding faiss.omp_set_num_threads(1) to ann/faiss.py. I feel like macOS users could be lost because of this nagging issue, so it would be good to try something.
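
A rough idea of what such a programmatic workaround could look like is below. This is only a sketch and not necessarily how txtai implements it: limit_faiss_threads is a hypothetical helper, and the macOS check via platform.system() is an assumption about when the limit should apply. The only Faiss call used is faiss.omp_set_num_threads, the one named above.

```python
import platform


def limit_faiss_threads(threads=1):
    """Hypothetical helper: cap Faiss OpenMP threads on macOS only.

    Returns True if the limit was applied, False otherwise.
    """
    # Assumption: the crash is macOS-specific, so other platforms are left alone
    if platform.system() != "Darwin":
        return False

    try:
        import faiss
    except ImportError:
        # Faiss is not installed, nothing to configure
        return False

    faiss.omp_set_num_threads(threads)
    return True


applied = limit_faiss_threads(1)
print(f"Faiss thread limit applied: {applied}")
```

Keeping the platform check inside the helper leaves Linux and Windows users with the default multi-threaded behavior.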

davidmezzetti self-assigned this Nov 20, 2024

davidmezzetti added this to the v8.1.0 milestone Nov 20, 2024

davidmezzetti mentioned this issue Nov 22, 2024

Add programmatic workaround for Faiss + macOS #818

Merged

davidmezzetti changed the title ~~Investigate integration with Docling~~ Add text extraction with Docling Dec 3, 2024

davidmezzetti closed this as completed in 9efc9ba Dec 3, 2024

davidmezzetti added a commit that referenced this issue Dec 5, 2024

Add HTML normalization rules to Docling #814

26bd35d
