Convert files into Markdown.
Supported file types (and processing engine):
- docx (python-docx)
- excel (python-calamine)
- pdf (pypdf / vision-language-models)
- eml (email, from the Python standard library)
from pathlib import Path
from markmagic import convert_any
with Path("tests/data/docx/msft_pr.docx").open("rb") as f:
    convert_any(filename="msft_pr.docx", file=f)
To OCR a PDF with a vision language model (VLM):
Create a .env file in the root directory containing your API key:
API_KEY="REPLACE"
from pathlib import Path
# Assumption: Settings is importable from the package root, like convert_any.
from markmagic import Settings, convert_any
with Path("tests/data/pdf/msft_ar.pdf").open("rb") as f:
    settings = Settings(use_vlm=True)
    convert_any(filename="msft_ar.pdf", file=f, settings=settings)
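Before running the VLM example, it can help to confirm that the .env file is readable and that the placeholder has been replaced. The following is a minimal standard-library sketch. `read_env_file` and `api_key_is_set` are hypothetical helpers, not part of markmagic, and markmagic itself may load the file differently.

```python
from pathlib import Path


def read_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines (hypothetical helper, not markmagic API)."""
    values: dict[str, str] = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes, as in API_KEY="REPLACE".
        values[key.strip()] = value.strip().strip("\"'")
    return values


def api_key_is_set(path: Path) -> bool:
    """Return True when API_KEY exists and is not the REPLACE placeholder."""
    key = read_env_file(path).get("API_KEY", "")
    return bool(key) and key != "REPLACE"
```

Calling `api_key_is_set(Path(".env"))` from the root directory returns `False` while the file still holds the `REPLACE` placeholder.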
- markmagic only looks at the file extension to decide how to convert a file; it does not inspect the file's contents
- markmagic uses python-docx, so it cannot extract text from shapes or images (consider python-mammoth + markdownify instead)
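The extension-only dispatch described above can be sketched roughly as follows. This is an illustration, not markmagic's actual code: `CONVERTERS`, `pick_converter`, and `sniff_kind` are hypothetical names, and the `.xlsx` mapping is an assumption about which Excel extension is handled. The sketch also shows one way a caller could catch a mislabeled file by checking its leading bytes before converting.

```python
from pathlib import Path

# Hypothetical extension table mirroring the supported types listed above.
# The ".xlsx" entry is an assumption; other Excel extensions may apply.
CONVERTERS = {
    ".docx": "docx",
    ".xlsx": "excel",
    ".pdf": "pdf",
    ".eml": "eml",
}


def pick_converter(filename: str) -> str:
    """Choose a converter from the file extension alone (case-insensitive)."""
    suffix = Path(filename).suffix.lower()
    if suffix not in CONVERTERS:
        raise ValueError(f"unsupported file type: {filename!r}")
    return CONVERTERS[suffix]


def sniff_kind(head: bytes) -> str:
    """Guess the container format from the first bytes of a file."""
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith(b"PK\x03\x04"):
        # docx and xlsx are both ZIP archives, so they look identical here.
        return "zip"
    return "unknown"
```

With this kind of dispatch, a PDF saved as `report.docx` is routed to the docx converter. Comparing `pick_converter(filename)` with `sniff_kind(f.read(8))` before converting is one way to detect the mismatch early.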
- Markdown is the most consistent format for sending data to LLMs
- Understand the Python tooling landscape and what a good set of lightweight options looks like
- OCR is just neural nets, so why not use vision language models for OCR?
- OCRBenchmark https://github.com/open-compass/VLMEvalKit?tab=readme-ov-file
- TBD