markmagic

Convert files into Markdown.

Supported file types (and processing engine):

docx (python-docx)
excel (python-calamine)
pdf (pypdf / vision-language-models)
eml (email)
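
To illustrate what a standard-library engine can do for the eml case, here is a minimal sketch of turning a raw email into Markdown with Python's email package. It is not markmagic's actual implementation: the function name eml_to_markdown and the output layout are assumptions made for this example.

```python
from email import message_from_bytes, policy


def eml_to_markdown(raw: bytes) -> str:
    """Render the headers and plain-text body of a raw .eml as Markdown.

    Hypothetical helper for illustration only; not part of markmagic.
    """
    # policy.default yields an EmailMessage, which provides get_body().
    msg = message_from_bytes(raw, policy=policy.default)

    # Prefer the plain-text part; fall back to an empty body if none exists.
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content().strip() if body is not None else ""

    lines = [
        f"# {msg['subject'] or '(no subject)'}",
        "",
        f"- **From:** {msg['from']}",
        f"- **To:** {msg['to']}",
        "",
        text,
    ]
    return "\n".join(lines)


raw = (
    b"From: a@example.com\r\n"
    b"To: b@example.com\r\n"
    b"Subject: Hello\r\n"
    b"\r\n"
    b"Body text\r\n"
)
print(eml_to_markdown(raw))
```

Attachments and HTML-only messages are ignored in this sketch; a fuller converter would need to handle both.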

Getting started

from pathlib import Path
from markmagic import convert_any

with Path("tests/data/docx/msft_pr.docx").open("rb") as f:
    convert_any(filename="msft_pr.docx", file=f)

To OCR a PDF with a vision language model:

Create a .env file in the project root directory containing your API key

API_KEY="REPLACE"

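from pathlib import Path
# Assumption: Settings is exported from the package root alongside convert_any;
# adjust this import if it lives in a submodule.
from markmagic import Settings, convert_any

with Path("tests/data/pdf/msft_ar.pdf").open("rb") as f:
    settings = Settings(use_vlm=True)  # enable the vision language model path
    convert_any(filename="msft_ar.pdf", file=f, settings=settings)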

Design / Limitations

markmagic looks only at the file extension to decide how to convert a file; it does not inspect the file's contents.
markmagic uses python-docx for docx files, so it cannot extract text from shapes or images (consider python-mammoth + markdownify if you need that).
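
The extension-only dispatch described above can be sketched roughly as follows. This is an illustration under stated assumptions, not markmagic's code: the ENGINES table and pick_engine function are hypothetical, and mapping "excel" to the .xlsx suffix is a guess.

```python
from pathlib import Path

# Hypothetical table mapping file extensions to the engines listed above.
# The ".xlsx" key is an assumption; the README only says "excel".
ENGINES = {
    ".docx": "python-docx",
    ".xlsx": "python-calamine",
    ".pdf": "pypdf",
    ".eml": "email",
}


def pick_engine(filename: str) -> str:
    """Choose an engine from the file extension alone (contents are never read)."""
    suffix = Path(filename).suffix.lower()
    if suffix not in ENGINES:
        raise ValueError(f"Unsupported file extension: {suffix!r}")
    return ENGINES[suffix]


print(pick_engine("msft_pr.docx"))
print(pick_engine("MSFT_AR.PDF"))
```

One consequence of this design is that a mislabeled file, such as a docx renamed to .pdf, is routed to the wrong engine, because nothing checks the actual contents.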

Goals / Motivation

Markdown is the most consistent format for sending data to LLMs.
Understand the Python tooling landscape and what a good set of lightweight options looks like.
OCR is already done with neural networks, so why not use vision language models for OCR directly?
OCR benchmark: https://github.com/open-compass/VLMEvalKit?tab=readme-ov-file

TODOs:

TBD

yeungadrian/markmagic
