Eliot Jones edited this page May 9, 2021 · 1 revision

PDF files can contain two types of images:

  • Inline: these images appear inline in the page's content stream (the sequence of PostScript-like operators that defines the appearance of the page). They are generally used for small images.
  • XObjects: larger images are defined outside the content stream and referenced from it by name using the Do operator.

The information stored differs somewhat depending on how the image is defined. PdfPig defines both InlineImage and XObjectImage, which both implement IPdfImage.

The images for a page are accessed via the Page.GetImages() method, which returns the set of images on the page.

IPdfImage has a Bounds property giving the placement rectangle of the image on the page, as well as WidthInSamples and HeightInSamples properties giving the width and height of the original image before any PDF transforms are applied (samples are usually pixels).
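The difference between the placement rectangle and the size in samples can be illustrated with a little arithmetic. The following is a minimal, library-independent sketch in Python, not PdfPig code. It assumes the placement rectangle is measured in default user space units (1/72 inch) and that the image is not rotated; the function name and parameters are hypothetical.

```python
def effective_dpi(width_in_samples, height_in_samples,
                  bounds_width_pt, bounds_height_pt):
    """Return (horizontal, vertical) samples per inch for a placed image.

    Assumption: the bounds are in default user space units, where one
    unit is 1/72 inch, and the image is axis-aligned on the page.
    """
    horizontal = width_in_samples * 72.0 / bounds_width_pt
    vertical = height_in_samples * 72.0 / bounds_height_pt
    return (horizontal, vertical)
```

For example, a 600 by 300 sample image placed in a 144 by 72 point rectangle is drawn at 300 samples per inch in both directions.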

PDF content may define many different ColorSpaces for rendering, and not all of these are supported by PdfPig yet. Where the ColorSpace is common (e.g. DeviceGray, DeviceRGB or DeviceCMYK), decoding the image to a PNG is supported. Other ColorSpaces are either unsupported or only partially supported. IPdfImage defines the ColorSpace and ColorSpaceDetails properties, which give more information about the ColorSpace that was active when the image was rendered to the page.
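To show why the ColorSpace matters when reading image data, here is a small Python sketch (independent of PdfPig) that computes how many bytes an unfiltered PDF bitmap should occupy for the three device ColorSpaces named above. It relies on the PDF convention that each row of samples is padded to a whole number of bytes; the names used are hypothetical.

```python
# Number of colour components per sample for the common device ColorSpaces.
COMPONENTS = {"DeviceGray": 1, "DeviceRGB": 3, "DeviceCMYK": 4}


def expected_bitmap_size(color_space, width, height, bits_per_component):
    """Return the expected byte length of an unfiltered PDF bitmap."""
    components = COMPONENTS[color_space]
    bits_per_row = width * components * bits_per_component
    # Each row is padded up to the next whole byte.
    bytes_per_row = (bits_per_row + 7) // 8
    return bytes_per_row * height
```

A mismatch between this value and the length of the decoded data is a useful hint that the ColorSpace is not one of the simple device spaces, for instance an indexed or ICC-based one.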

The actual content of the image bytes is either:

  • A PDF format bitmap based on the ColorSpace.
  • A JPEG file directly embedded in the file.

Where the image is a JPEG, decoding the bytes is not supported directly (IPdfImage.TryGetBytes(out var bytes) will return false), but IPdfImage.RawBytes is a valid JPEG file. Where the image is in PDF format, RawBytes is usually the bitmap with one or more PDF filters applied (FlateDecode etc.), and IPdfImage.TryGetBytes(out var bytes) will return the PDF format bitmap after reversing these filters. The resulting bytes are then subject to interpretation based on the ColorSpace, the bits per component, the width and height in samples, etc.
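The two cases can be sketched outside PdfPig as follows. This Python example is an illustration only: it assumes a single FlateDecode filter with no predictor, and an 8 bits per component DeviceRGB bitmap. All function names are hypothetical and are not part of the PdfPig API.

```python
import zlib

# Every JPEG file begins with the two-byte Start Of Image marker.
JPEG_SOI = b"\xff\xd8"


def looks_like_jpeg(raw: bytes) -> bool:
    """Check whether the raw image bytes start like a JPEG file."""
    return raw[:2] == JPEG_SOI


def reverse_flate(raw: bytes) -> bytes:
    """Reverse a FlateDecode filter (zlib format), assuming no predictor."""
    return zlib.decompress(raw)


def rgb_sample(bitmap: bytes, width: int, x: int, y: int) -> tuple:
    """Read one (R, G, B) sample from an 8-bit DeviceRGB bitmap."""
    offset = (y * width + x) * 3
    return tuple(bitmap[offset:offset + 3])
```

With other bit depths or ColorSpaces the offset calculation changes, which is why the decoded bytes cannot be interpreted without the accompanying image properties.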

For common image types, IPdfImage.TryGetPng(out byte[] bytes) will take the raw bytes, reverse the filters and convert the resulting PDF bitmap into a valid PNG file. Since not all ColorSpaces are supported yet, this will not work for every image. Where PNG creation is successful, the resulting bytes are a valid PNG image.
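As a rough idea of what converting a PDF bitmap into a PNG involves, the following Python sketch wraps an 8 bits per component DeviceRGB bitmap in the PNG container using only the standard library. It is not how PdfPig implements TryGetPng; it handles only this one simple case, and the function names are hypothetical.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def _chunk(kind: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, CRC over type and data."""
    body = kind + data
    crc = zlib.crc32(body) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + body + struct.pack(">I", crc)


def rgb_bitmap_to_png(bitmap: bytes, width: int, height: int) -> bytes:
    """Wrap an 8-bit RGB bitmap (3 bytes per sample) in a PNG file."""
    stride = width * 3
    if len(bitmap) != stride * height:
        raise ValueError("bitmap size does not match width and height")
    # Each PNG scanline starts with a filter-type byte; 0 means no filter.
    scanlines = b"".join(
        b"\x00" + bitmap[row * stride:(row + 1) * stride]
        for row in range(height)
    )
    # IHDR: width, height, bit depth 8, colour type 2 (truecolour),
    # then compression, filter and interlace methods, all 0.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    return (
        PNG_SIGNATURE
        + _chunk(b"IHDR", ihdr)
        + _chunk(b"IDAT", zlib.compress(scanlines))
        + _chunk(b"IEND", b"")
    )
```

Other ColorSpaces need an extra conversion step before this point (DeviceCMYK samples, for example, have no direct PNG colour type), which is one reason PNG conversion cannot cover every image.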
