The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.

It provides programmatic access to the contents of a PDF file with a high degree of flexibility.

The PDF 1.7 specification is a weighty document and not all aspects are currently supported. We welcome submission of PDF files that exhibit unsupported aspects of the spec to assist with improving our support.

Installation¶ ↑

The recommended installation method is via Rubygems.

gem install pdf-reader

Usage¶ ↑

PDF::Reader is designed with a callback-style architecture. The basic concept is to build a receiver class and pass that into PDF::Reader along with the PDF to process.

As PDF::Reader walks the file and encounters various objects (pages, text, images, shapes, etc.) it will call methods on the receiver class. What those methods do is entirely up to you: save the text, extract images, count pages, read metadata, whatever.
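
The dispatch idea can be sketched without the library at all. The snippet below is only an illustration under assumptions: ToyWalker and its hard-coded event list are hypothetical stand-ins for the parser, and the real internals of PDF::Reader may well differ. The callback names (begin_page, show_text, end_page) are borrowed from the examples later in this document.

```ruby
# Hypothetical stand-in for the parser: it "walks" a fixed list of events
# instead of a real PDF file. Not part of PDF::Reader.
class ToyWalker
  EVENTS = [
    [:begin_page, []],
    [:show_text,  ["Chunky"]],
    [:end_page,   []],
    [:begin_page, []],
    [:show_text,  ["Bacon"]],
    [:end_page,   []]
  ].freeze

  def initialize(receiver)
    @receiver = receiver
  end

  # Call each callback only if the receiver chose to implement it.
  def walk
    EVENTS.each do |name, args|
      @receiver.send(name, *args) if @receiver.respond_to?(name)
    end
  end
end

# A receiver that only cares about text; page callbacks are simply skipped.
class TextOnlyReceiver
  attr_reader :strings

  def initialize
    @strings = []
  end

  def show_text(string, *params)
    @strings << string
  end
end

receiver = TextOnlyReceiver.new
ToyWalker.new(receiver).walk
puts receiver.strings.inspect
```

The point of the design is that a receiver implements only the callbacks it is interested in and ignores everything else.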

For a full list of the supported callback methods and a description of when they will be called, refer to PDF::Reader::Content. See the code examples below for a way to print a list of all the callbacks generated by a file to STDOUT.
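
To give a feel for how a receiver can log every callback it is sent, here is a minimal sketch. RecordingReceiver is a hypothetical class written for this illustration; it is not the source of PDF::Reader::RegisterReceiver, and the shape of each recorded entry (a hash with :name and :args) is an assumption.

```ruby
# Hypothetical recording receiver: every callback it is sent gets logged.
# A sketch of the idea only, not the library's own RegisterReceiver.
class RecordingReceiver
  attr_reader :callbacks

  def initialize
    @callbacks = []
  end

  # Claim to handle every callback so a dispatcher never skips one.
  def respond_to_missing?(name, include_private = false)
    true
  end

  # Record the callback name and its arguments instead of acting on them.
  def method_missing(name, *args)
    @callbacks << { :name => name, :args => args }
  end
end

# Simulate a parser by calling a few callbacks by hand.
receiver = RecordingReceiver.new
receiver.begin_page
receiver.show_text("Chunky")
receiver.end_page

receiver.callbacks.each do |cb|
  puts "#{cb[:name]} #{cb[:args].inspect}"
end
```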

Text Encoding¶ ↑

Internally, text can be stored inside a PDF in various encodings, including ZapfDingbats, Windows-1252, MacRoman and a form of Unicode. To avoid confusion, all text will be converted to UTF-8 before it is passed back from PDF::Reader.
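
As a rough picture of what such a conversion involves, the sketch below turns bytes from two single-byte encodings into UTF-8 using Ruby's built-in String#encode. It assumes Ruby 2.0 or later (for String#b) and is not PDF::Reader's own conversion code; the helper name to_utf8 and the sample bytes are made up for the illustration.

```ruby
# Convert raw bytes from a known single-byte encoding to UTF-8.
# Uses Ruby's built-in transcoding; this is not PDF::Reader code.
def to_utf8(bytes, source_encoding)
  bytes.dup.force_encoding(source_encoding).encode("UTF-8")
end

win_bytes = "\x93quoted\x94".b   # curly double quotes in Windows-1252
mac_bytes = "caf\x8E".b          # "cafe" with an acute accent in MacRoman

puts to_utf8(win_bytes, "Windows-1252")
puts to_utf8(mac_bytes, "macRoman")
```

Whatever the source encoding, the caller only ever sees UTF-8 strings, which is the behaviour described above.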

Exceptions¶ ↑

There are two key exceptions that you will need to watch out for when processing a PDF file:

MalformedPDFError - The PDF appears to be corrupt in some way. If you believe the file should be valid, or that a corrupt file didn’t raise an exception, please forward a copy of the file to the maintainers (preferably via the Google group) and we can attempt to improve the code.

UnsupportedFeatureError - The PDF uses a feature that PDF::Reader doesn’t currently support. Again, we welcome submissions of PDF files that exhibit these features to help us with future code improvements.

MalformedPDFError has some subclasses if you want to detect finer grained issues. If you don’t, ‘rescue MalformedPDFError’ will catch all the subclassed errors as well.

Any other exceptions should be considered bugs in either PDF::Reader (please report them!) or your receiver (please don’t report them!).
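
The rescue behaviour described above is plain Ruby and can be tried without the gem. In the sketch below every class is a hypothetical stand-in (the StandIn prefix marks them as such, and StandInBrokenXrefError is a made-up subclass); deriving them from RuntimeError is also an assumption. With the library loaded you would rescue its own exception classes instead.

```ruby
# Hypothetical stand-ins that mirror the hierarchy described above.
class StandInMalformedPDFError < RuntimeError; end
class StandInBrokenXrefError < StandInMalformedPDFError; end  # made-up subclass
class StandInUnsupportedFeatureError < RuntimeError; end

# Run a block and report which kind of problem, if any, it raised.
def classify
  yield
  :ok
rescue StandInMalformedPDFError
  :malformed      # also reached for every subclass of the parent error
rescue StandInUnsupportedFeatureError
  :unsupported
end

puts classify { raise StandInBrokenXrefError, "bad xref table" }
puts classify { raise StandInUnsupportedFeatureError, "unknown filter" }
puts classify { 1 + 1 }
```

Rescuing the parent class is enough when you only need to know that the file is broken; rescue a subclass first if you want to treat one kind of corruption differently.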

Maintainers¶ ↑

James Healy <jimmy@deefa.com>

Mailing List¶ ↑

Any questions or feedback should be sent to the PDF::Reader Google group. It’s better that any answers be available for others instead of hiding in someone’s inbox.

groups.google.com/group/pdf-reader

Examples¶ ↑

The easiest way to explain how this works in practice is to show some examples.

Naïve Page Counter¶ ↑

A simple app to count the number of pages in a PDF file.

require 'rubygems'
require 'pdf/reader'

class PageReceiver
  attr_accessor :counter

  def initialize
    @counter = 0
  end

  # Called when page parsing ends
  def end_page
    @counter += 1
  end
end

receiver = PageReceiver.new
pdf = PDF::Reader.file("somefile.pdf", receiver)
puts "#{receiver.counter} pages"

List all callbacks generated by a single PDF¶ ↑

WARNING: this will generate a lot of output, so you probably want to pipe it through less or to a text file.

require 'rubygems'
require 'pdf/reader'

receiver = PDF::Reader::RegisterReceiver.new
pdf = PDF::Reader.file("somefile.pdf", receiver)
receiver.callbacks.each do |cb|
  puts cb
end

Extract all text from a single PDF¶ ↑

require 'rubygems'
require 'pdf/reader'

class PageTextReceiver
  attr_accessor :content

  def initialize
    @content = []
  end

  # Called when page parsing starts
  def begin_page(arg = nil)
    @content << ""
  end

  # record text that is drawn on the page
  def show_text(string, *params)
    @content.last << string.strip
  end

  # there's a few text callbacks, so make sure we process them all
  alias :super_show_text :show_text
  alias :move_to_next_line_and_show_text :show_text
  alias :set_spacing_next_line_show_text :show_text

  # this final text callback takes slightly different arguments
  def show_text_with_positioning(*params)
    params = params.first
    params.each { |str| show_text(str) if str.kind_of?(String)}
  end
end

receiver = PageTextReceiver.new
pdf = PDF::Reader.file("somefile.pdf", receiver)
puts receiver.content.inspect

Extract metadata only¶ ↑

require 'rubygems'
require 'pdf/reader'

class MetaDataReceiver
  attr_accessor :regular
  attr_accessor :xml

  def metadata(data)
    @regular = data
  end

  def metadata_xml(data)
    @xml = data
  end
end

receiver = MetaDataReceiver.new
pdf = PDF::Reader.file(ARGV.shift, receiver, :pages => false, :metadata => true)
puts receiver.regular.inspect
puts receiver.xml.inspect

Improved Page Counter¶ ↑

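A simple app to display the number of pages in a PDF file. Rather than counting pages as they are parsed, it uses the page_count callback and skips page processing with :pages => false.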

require 'rubygems'
require 'pdf/reader'

class PageReceiver
  attr_accessor :pages

  # Called with the number of pages in the document
  def page_count(arg)
    @pages = arg
  end
end

receiver = PageReceiver.new
pdf = PDF::Reader.file("somefile.pdf", receiver, :pages => false)
puts "#{receiver.pages} pages"

Basic RSpec of a generated PDF¶ ↑

require 'rubygems'
require 'pdf/reader'
require 'pdf/writer'
require 'spec'

class PageTextReceiver
  attr_accessor :content

  def initialize
    @content = []
  end

  # Called when page parsing starts
  def begin_page(arg = nil)
    @content << ""
  end

  def show_text(string, *params)
    @content.last << string.strip
  end

  # there's a few text callbacks, so make sure we process them all
  alias :super_show_text :show_text
  alias :move_to_next_line_and_show_text :show_text
  alias :set_spacing_next_line_show_text :show_text

  def show_text_with_positioning(*params)
    params = params.first
    params.each { |str| show_text(str) if str.kind_of?(String)}
  end
end

context "My generated PDF" do
  specify "should have the correct text on 2 pages" do

    # generate our PDF
    pdf = PDF::Writer.new
    pdf.text "Chunky", :font_size => 32, :justification => :center
    pdf.start_new_page
    pdf.text "Bacon", :font_size => 32, :justification => :center
    pdf.save_as("chunkybacon.pdf")

    # process the PDF
    receiver = PageTextReceiver.new
    PDF::Reader.file("chunkybacon.pdf", receiver)

    # confirm the text appears on the correct pages
    receiver.content.size.should eql(2)
    receiver.content[0].should eql("Chunky")
    receiver.content[1].should eql("Bacon")
  end
end

Known Limitations¶ ↑

The order of the callbacks is unpredictable, and is dependent on the internal layout of the file, not the order in which objects are displayed to the user. As a consequence, it is highly unlikely that text will be completely in order.

Occasionally some text cannot be extracted properly due to the way it has been stored, or the use of invalid bytes. In these cases PDF::Reader will output a little UTF-8 friendly box to indicate an unrecognisable character.
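
The effect can be imitated in plain Ruby with String#scrub (Ruby 2.1 or later), as in the sketch below. This is not how PDF::Reader itself does it: the helper mark_unreadable is hypothetical, and the choice of U+25AF as the box character is an assumption made for this illustration.

```ruby
# Replace bytes that are not valid UTF-8 with a visible box character.
# U+25AF is an assumed choice for this sketch.
BOX = "\u25AF"

def mark_unreadable(text)
  text.scrub(BOX)
end

damaged = "Chunky \xFF Bacon"   # \xFF can never appear in valid UTF-8
puts mark_unreadable(damaged)
```

If you post-process extracted text, searching for the box character is a quick way to find passages that did not survive extraction.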

Resources¶ ↑

PDF::Reader Code Repository: github.com/yob/pdf-reader
PDF::Reader Rubyforge Page: rubyforge.org/projects/pdf-reader/
PDF Specification: www.adobe.com/devnet/pdf/pdf_reference.html
PDF Tutorial Slide Presentations: home.comcast.net/~jk05/presentations/PDFTutorials.html
