Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Easily extract data from tables trapped in PDFs.
Works best on machine-generated, rather than scanned, PDFs. Built on pdfminer
and pdfminer.six
.
pip install pdfplumber
curl "https://cdn.rawgit.com/jsvine/pdfplumber/master/examples/pdfs/background-checks.pdf" > background-checks.pdf
pdfplumber < background-checks.pdf > background-checks.csv
The output will be a CSV containing info about every character, line, and rectangle in the PDF.
--format [format]
:csv
orjson
- The
json
format returns slightly more information; it includes PDF-level metadata and height/width information about each page.
- The
--pages [list of pages]
: A space-delimited,1
-indexed list of pages or hyphenated page ranges. E.g.,1, 11-15
, which would return data for pages 1, 11, 12, 13, 14, and 15.--types [list of object types to extract]
: Choices arechar
,anno
,line
,rect
,rect_edge
. Defaults tochar
,anno
,line
,rect
.
import pdfplumber
with pdfplumber.open("path/to/file.pdf") as pdf:
first_page = pdf.pages[0]
print(first_page.chars[0])
pdfplumber
provides two main ways to load a PDF:
pdfplumber.open("path/to/file.pdf")
pdfplumber.load(file_like_object)
Both methods return an instance of the pdfplumber.PDF
class.
The top-level pdfplumber.PDF
class represents a single PDF and has two main properties:
-
.metadata
: A dictionary of metadata key/value pairs, drawn from the PDF'sInfo
trailers. Typically includes "CreationDate," "ModDate," "Producer," et cetera. -
.pages
: A list containing onepdfplumber.Page
instance per page loaded.
The pdfplumber.Page
class is at the core of pdfplumber
. Most things you'll do with pdfplumber
will revolve around this class. It has these main methods and properties:
-
.pageid
: The internal ID assigned to this page. Often, this number will correspond to the page number, but not necessarily. -
.width
: The page's width. -
.height
: The page's height. -
.objects
/.chars
/.lines
/.rects
: Each of these properties is a list, and each list contains one dictionary for each such object embedded on the page. For more detail, see "Objects" below. -
.crop(bounding_box, strict=False)
: Returns a version of the page cropped to the bounding box, which should be expressed as 4-tuple with the values(x0, top, x1, bottom)
.- By default, the cropped page retains objects that fall at least partly within the bounding box. If an object falls only partly within the box, its dimensions are sliced to fit the bounding box.
- Calling
.crop
withstrict=True
, however, retains only objects that fall entirely within the bounding box.
-
.filter(test_function)
: Returns a version of the page with only the.objects
for whichtest_function(obj)
returnsTrue
. -
.extract_text(x_tolerance=0, y_tolerance=0)
: Collates all of the page's character objects into a single string. Adds spaces where the difference between thex1
of one character and thex0
of the next is greater thanx_tolerance
. Adds newline characters where the difference between thedoctop
of one character and thedoctop
of the next is greater thany_tolerance
. -
.extract_words(x_tolerance=0, y_tolerance=0)
: Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where the difference between thex1
of one character and thex0
of the next is less than or equal tox_tolerance
and where thedoctop
of one character and thedoctop
of the next is less than or equal toy_tolerance
. -
.extract_table(...)
: Extracts tabular data from the page. For more details see "Extracting tables" below.
Each instance of pdfplumber.PDF
and pdfplumber.Page
provides access to four types of PDF objects. The following properties each return a Python list of the matching objects:
.chars
, each representing a single text character..annos
, each representing a single annotation-text character..lines
, each representing a single 1-dimensional line..rects
, each representing a single 2-dimensional rectangle.
Each object is represented as a simple Python dict
, with the following properties:
-
char
/anno
:pageid
: Page ID on which this character was found.text
: E.g., "z", or "Z" or " ".fontname
: Name of the character's font face.size
: Font size.adv
: Equal to text width * the font size * scaling factor.upright
: Whether the character is upright.height
: Height of the character.width
: Width of the character.x0
: Distance of left side of character from left side of page.x1
: Distance of right side of character from left side of page.y0
: Distance of bottom of character from bottom of page.y1
: Distance of top of character from bottom of page.top
: Distance of top of character from top of page.bottom
: Distance of bottom of the character from top of page.doctop
: Distance of top of character from top of document.object_type
: "char" / "anno"
-
line
:pageid
: Page ID on which this line was found.height
: Height of line.width
: Width of line.x0
: Distance of left-side extremity from left side of page.x1
: Distance of right-side extremity from left side of page.y0
: Distance of bottom extremity from bottom of page.y1
: Distance of top extremity bottom of page.top
: Distance of top of line from top of page.bottom
: Distance of bottom of the line from top of page.doctop
: Distance of top of line from top of document.linewidth
: Thickness of line.object_type
: "line"
-
rect
:pageid
: Page ID on which this rectangle was found.height
: Height of rectangle.width
: Width of rectangle.x0
: Distance of left side of rectangle from left side of page.x1
: Distance of right side of rectangle from left side of page.y0
: Distance of bottom of rectangle from bottom of page.y1
: Distance of top of rectangle from bottom of page.top
: Distance of top of rectangle from top of page.bottom
: Distance of bottom of the rectangle from top of page.doctop
: Distance of top of rectangle from top of document.linewidth
: Thickness of line.object_type
: "rect"
Additionally, both pdfplumber.PDF
and pdfplumber.Page
provide access to two derived lists of objects: .rect_edges
(which decomposes each rectangle into its four lines) and .edges
(which combines .rect_edges
with .lines
).
You can think of Page.extract_table(...)
as a sort of scriptable Tablula. Given a page or cropped page, this method will return a list of lists representing the extracted table. By default, extract_table
uses the page's vertical and horizontal lines (or rectangle edges) as cell-separators. But the method is highly customizable via these keyword arguments:
-
v=[strategy_name]
: Strategy for locating vertical dividers. Defaults to "lines," which uses lines and rectangle edges. Other options: "lines_strict," which only uses lines (and not rectangle edges), and "gutters," which looks for vertical sections of the document with no text in them. -
h=[strategy_name]
: Strategy for locating horizontal dividers. Same defaults/options asv
. -
line_min_height=[number]
: The minimum height a line must be before being used in the vertical "lines" strategy. Defaults to1
. -
line_min_width=[number]
: The minimum width a line must be before being used in the horizontal "lines" strategy. Defaults to1
. -
gutter_min_width=[number]
: Minimum size of a character "gutter" to be used in the horizontal "gutters" strategy. Defaults to5
. -
gutter_min_height=[number]
: Minimum size of a character "gutter" to be used in the vertical "gutters" strategy. Defaults to5
. -
x_tolerance=[number]
: The maximum horizontal distance between two consecutive characters to consider them part of the same word. (Otherwise, a space is inserted between them.) Defaults to0
. -
x_tolerance=[number]
: The maximum vertical distance between two consecutive characters to consider them part of the same line. (Otherwise, a newline is inserted between them.) Defaults to0
.
Note: Often it's helpful to crop a page before trying to extract the table. See the first demo below for an example.
- Using
extract_table
on the FBI's National Instant Criminal Background Check System PDFs - Using
extract_table
on California's Worker Adjustment and Retraining Notification (WARN) reports — via Dan Nguyen
Support for Python 3 is decent, but rough around the edges and largely dependent on the progress of pdfminer.six
.
Currently tested on Python 2.7, 3.1, 3.3, 3.4, and 3.5.
Special thanks to Jacob Fenton.
Issues and pull requests welcome.