How to extract text in natural reading order (top to bottom, left to right)

Aaron Taylor edited this page Jun 11, 2023 · 2 revisions

Easiest way

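First of all, sort the words by position. The built-in sorted() is enough; a SortedCollection (which is not part of the Python standard library) with the same key gives the same order.

from operator import itemgetter
from itertools import groupby

import fitz  # PyMuPDF

doc = fitz.open( 'mydocument.pdf' )

for page in doc:
  # Each word is a tuple: (x0, y0, x1, y1, text, block_no, line_no, word_no)
  text_words = page.get_text_words()

  # The words should be ordered by y1 (bottom edge), then by x0 (left edge)
  sorted_words = sorted( text_words, key = itemgetter( 3, 0 ) )

  # At this point you already have an ordered list. If you need to
  # group the content by lines, use groupby with y1 as a key.
  # Note: groupby returns a lazy iterator and only merges words whose
  # y1 values are exactly equal.
  lines = groupby( sorted_words, key = itemgetter( 3 ) )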

  # Enjoy!
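
Because y1 is a float, words that sit on the same visual line can have slightly different y1 values, so grouping on exact equality may split a line or put its words out of order. The following is a minimal sketch of one way to handle that, assuming a fixed vertical tolerance is acceptable for your document. The function name group_words_into_lines, the y_tolerance parameter, its default of 3.0 points, and the sample tuples are all hypothetical and not part of PyMuPDF; only the word tuple layout (x0, y0, x1, y1, text, ...) comes from the code above.

```python
from operator import itemgetter


def group_words_into_lines(words, y_tolerance=3.0):
    """Group word tuples (x0, y0, x1, y1, text, ...) into lines of text.

    y_tolerance is an assumed threshold in points: a word starts a new
    line when its y1 is more than this far below the line's first word.
    """
    lines = []          # each entry is a list of word tuples on one line
    current_y1 = None   # y1 of the first word of the line being built
    for word in sorted(words, key=itemgetter(3, 0)):
        if current_y1 is None or word[3] - current_y1 > y_tolerance:
            lines.append([])
            current_y1 = word[3]
        lines[-1].append(word)
    # Within each line, restore left-to-right order and join the texts
    return [
        " ".join(w[4] for w in sorted(line, key=itemgetter(0)))
        for line in lines
    ]


# Hypothetical sample data shaped like PyMuPDF word tuples
sample = [
    (60.0, 10.0, 90.0, 20.2, "world", 0, 0, 1),
    (10.0, 10.0, 50.0, 20.0, "Hello", 0, 0, 0),
    (10.0, 30.0, 40.0, 40.0, "next", 0, 1, 0),
    (45.0, 30.0, 70.0, 39.9, "line", 0, 1, 1),
]
print(group_words_into_lines(sample))
```

With real pages you would pass the word list returned for each page instead of sample. A suitable tolerance depends on font size and line spacing, so treat the default as a starting point rather than a recommended value.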