Add text extraction with Docling #814
I started to look at this as I think it would be a great addition, but there seems to be a library incompatibility. Here is some test code to show the problem:

```python
import logging

from txtai import Embeddings  # remove this line and this will run
# from chonkie import WordChunker, SemanticChunker, SentenceChunker, SDPMChunker
from docling.document_converter import DocumentConverter

# Configure logger
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

# Create console handler with a debug level
console_handler = logging.StreamHandler()
console_handler.setLevel(logging.DEBUG)

# Create formatter and add it to the handler
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
console_handler.setFormatter(formatter)

# Add the handler to the logger
logger.addHandler(console_handler)

# Suppress lower-priority logs from the "docling" library
# logging.getLogger("docling").setLevel(logging.DEBUG)

# Configure logging for "docling"
docling_logger = logging.getLogger("docling")
docling_logger.setLevel(logging.DEBUG)  # Adjust level as needed (DEBUG, INFO, WARNING)
docling_logger.addHandler(console_handler)  # Attach the same console handler
docling_logger.propagate = True  # Ensure messages propagate to the root logger


def process_docling(sfile):
    markdown_content = ""
    try:
        logger.info("@@@@" * 10)
        logger.info(f"Processing file: {sfile}")

        docling_converter = DocumentConverter()

        logger.info("Converting file...")
        result = docling_converter.convert(sfile)

        if result is not None:
            logger.info("Exporting to markdown...")
            markdown_content = result.document.export_to_markdown()
    except Exception as e:
        logger.error(f"An error occurred: {e}")

    return markdown_content


if __name__ == "__main__":
    logger.info("Starting the script...")
    file_url = "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
    logger.info(f"Input file: {file_url}")
    output = process_docling(file_url)
    logger.info(f"Output:\n{output}")
```

You get a crash when you import the txtai library:

```
segmentation fault  python test.py
# multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
```
Thanks for the report! Given you are using macOS, I wonder if this is related to Faiss. Usually setting `OMP_NUM_THREADS=1` is the one that works. I just mentioned this a couple of days ago: #813. I wonder if setting that helps here.
Running `OMP_NUM_THREADS=1 python test.py`
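For reference, the same workaround can also be applied inside the script instead of on the command line, by setting the environment variable before txtai (and therefore Faiss) is imported. This is just a minimal sketch of that idea, not part of the reported test case:

```python
import os

# Limit OpenMP to a single thread before Faiss is loaded.
# This must run before any import that pulls in Faiss (txtai's ANN backend here).
os.environ["OMP_NUM_THREADS"] = "1"

from txtai import Embeddings
from docling.document_converter import DocumentConverter
```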
I'm thinking I'm going to add
Docling looks like a promising text extraction library that could possibly augment or replace Apache Tika.
The main integration issue is that it only supports Python 3.10+.

Update: Docling added Python 3.9 support, this is a go!
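For context, here is a rough sketch of how Docling-based extraction could feed a txtai index. This is only an illustration of the idea, not the actual txtai pipeline integration; the `extract` helper, the sample URL, and the paragraph-splitting step are assumptions made for the example:

```python
from docling.document_converter import DocumentConverter
from txtai import Embeddings


def extract(path):
    """Convert a document with Docling and return its Markdown text."""
    converter = DocumentConverter()
    result = converter.convert(path)
    return result.document.export_to_markdown()


# Split the extracted Markdown into paragraphs and index them with txtai
text = extract("https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf")
paragraphs = [p for p in text.split("\n\n") if p.strip()]

embeddings = Embeddings(content=True)
embeddings.index(paragraphs)

print(embeddings.search("what is this document about", 1))
```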