EDS-PDF provides a modular framework to extract text information from PDF documents.
You can use it out-of-the-box, or extend it to fit your specific use case. We provide a pipeline system and various utilities for visualizing and processing PDFs, as well as multiple components to build complex models:
- 📄 Extractors to parse PDFs (based on pdfminer, mupdf or poppler)
- 🎯 Classifiers to perform text box classification, in order to segment PDFs
- 🧩 Aggregators to produce an aggregated output from the detected text boxes
- 🧠 Trainable layers to incorporate machine learning in your pipeline (e.g., embedding building blocks or a trainable classifier)
Visit the 📖 documentation for more information!
Install the library with pip:
pip install edspdf
Let's build a simple PDF extractor that uses a rule-based classifier. There are two ways to do this: with the configuration system or with the pipeline API.
Create a configuration file:
[pipeline]
pipeline = ["extractor", "classifier", "aggregator"]
[components.extractor]
@factory = "pdfminer-extractor"
[components.classifier]
@factory = "mask-classifier"
x0 = 0.2
x1 = 0.9
y0 = 0.3
y1 = 0.6
threshold = 0.1
[components.aggregator]
@factory = "simple-aggregator"
and load it from Python:
import edspdf
from pathlib import Path
model = edspdf.load("config.cfg")
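To make the structure of the configuration above easier to see, here is a minimal sketch that reads it with Python's standard library only. It is not how `edspdf.load` works internally; it merely assumes that each name listed under `[pipeline]` maps to a `[components.<name>]` section and that values are JSON-like literals. The `describe_pipeline` helper and `CONFIG_TEXT` constant are hypothetical names introduced for this illustration.

```python
import configparser
import json

# Same content as the configuration file shown above.
CONFIG_TEXT = """
[pipeline]
pipeline = ["extractor", "classifier", "aggregator"]

[components.extractor]
@factory = "pdfminer-extractor"

[components.classifier]
@factory = "mask-classifier"
x0 = 0.2
x1 = 0.9
y0 = 0.3
y1 = 0.6
threshold = 0.1

[components.aggregator]
@factory = "simple-aggregator"
"""


def describe_pipeline(config_text):
    """Return (name, factory, params) for each component, in pipeline order."""
    # Disable interpolation so that values are read verbatim.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    # Assumption: values are JSON-like literals, so json.loads can decode them.
    names = json.loads(parser["pipeline"]["pipeline"])
    steps = []
    for name in names:
        section = parser[f"components.{name}"]
        values = {key: json.loads(value) for key, value in section.items()}
        factory = values.pop("@factory")
        steps.append((name, factory, values))
    return steps


for name, factory, params in describe_pipeline(CONFIG_TEXT):
    print(name, factory, params)
```

Each section name becomes a pipeline step, `@factory` selects the registered component, and the remaining keys are passed to it as parameters.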
Or create a pipeline directly from Python:
from edspdf import Pipeline
model = Pipeline()
model.add_pipe("pdfminer-extractor")
model.add_pipe(
    "mask-classifier",
    config=dict(
        x0=0.2,
        x1=0.9,
        y0=0.3,
        y1=0.6,
        threshold=0.1,
    ),
)
model.add_pipe("simple-aggregator")
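To give an intuition for the `mask-classifier` parameters used above, here is a minimal sketch of mask-based classification that does not depend on EDS-PDF. It assumes that `x0`, `x1`, `y0` and `y1` are relative page coordinates between 0 and 1, and that `threshold` is the minimal fraction of a text box's area that must fall inside the mask; the library's exact semantics may differ, so check the documentation. The `Box` class, the helper functions and the `other` label are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Rectangle in relative page coordinates (assumed to range from 0 to 1)."""

    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self):
        return max(self.x1 - self.x0, 0.0) * max(self.y1 - self.y0, 0.0)


def overlap_ratio(box, mask):
    """Fraction of the box's area that lies inside the mask."""
    width = min(box.x1, mask.x1) - max(box.x0, mask.x0)
    height = min(box.y1, mask.y1) - max(box.y0, mask.y0)
    if width <= 0 or height <= 0 or box.area == 0:
        return 0.0
    return (width * height) / box.area


def classify(box, mask, threshold, label="body", default="other"):
    """Assign `label` when enough of the box overlaps the mask, else `default`."""
    return label if overlap_ratio(box, mask) >= threshold else default


# Mask and threshold taken from the example pipeline.
mask = Box(x0=0.2, y0=0.3, x1=0.9, y1=0.6)
inside = Box(x0=0.3, y0=0.4, x1=0.5, y1=0.5)
outside = Box(x0=0.0, y0=0.0, x1=0.1, y1=0.1)

print(classify(inside, mask, threshold=0.1))
print(classify(outside, mask, threshold=0.1))
```

With a low threshold such as 0.1, a box is labelled `body` as soon as a small part of it falls inside the mask; raising the threshold makes the rule stricter.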
This pipeline can then be applied (for instance with this PDF):
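from pathlib import Path

# Read the PDF as raw bytes
pdf = Path("/Users/perceval/Development/edspdf/tests/resources/letter.pdf").read_bytes()
# Apply the pipeline to the PDF
pdf = model(pdf)
# Retrieve the text and style properties of the "body" section
body = pdf.aggregated_texts["body"]
text, style = body.text, body.properties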
See the rule-based recipe for a step-by-step explanation of what is happening.
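As a rough illustration of the aggregation step, the sketch below groups classified text boxes by label and joins their text in reading order. It is not the `simple-aggregator` implementation: the dictionary keys (`page`, `x0`, `y0`, `label`, `text`), the `aggregate` function and the sample boxes are hypothetical, and it assumes that `y0` increases from the top of the page to the bottom.

```python
from collections import defaultdict


def aggregate(boxes):
    """Join the text of boxes sharing a label, top to bottom then left to right."""
    grouped = defaultdict(list)
    # Assumption: sorting by (page, y0, x0) approximates reading order.
    for box in sorted(boxes, key=lambda b: (b["page"], b["y0"], b["x0"])):
        grouped[box["label"]].append(box["text"])
    return {label: "\n".join(texts) for label, texts in grouped.items()}


# Hypothetical classified text boxes, deliberately out of order.
boxes = [
    {"page": 0, "x0": 0.2, "y0": 0.5, "label": "body", "text": "Second line"},
    {"page": 0, "x0": 0.2, "y0": 0.4, "label": "body", "text": "First line"},
    {"page": 0, "x0": 0.4, "y0": 0.95, "label": "other", "text": "Page 1"},
]

aggregated = aggregate(boxes)
print(aggregated["body"])
```

The result maps each label to a single string, which mirrors the way the example above reads the `body` entry from `aggregated_texts`.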
If you use EDS-PDF, please cite us as below.
@software{edspdf,
  author  = {Dura, Basile and Wajsburt, Perceval and Calliger, Alice and Gérardin, Christel and Bey, Romain},
  doi     = {10.5281/zenodo.6902977},
  license = {BSD-3-Clause},
  title   = {{EDS-PDF: Smart text extraction from PDF documents}},
  url     = {https://github.com/aphp/edspdf}
}
We would like to thank Assistance Publique – Hôpitaux de Paris and AP-HP Foundation for funding this project.