Releases: UglyToad/PdfPig
Put Together In The Same Factory
This release fixes a major regression in 0.0.7 which broke consuming documents via streams. It also adds new features:
- Document Layout Analysis: Adds the `Docstrum` (Doc Spectrum) algorithm for page segmentation.
- Document segmentation approaches (`Docstrum` and `RecursiveXYCut`) implement the `IPageSegmenter` interface which now returns a list of `TextBlock`s. `XYLeaf` and `XYNode` are now internal.
- `TextEdgesExtractor` is a new class which can be used to detect shared alignment in sections of text.
- Letters now have a `Color` property. This is one of the types implementing `IColor`. These are `GrayColor`, `RGBColor` and `CMYKColor`; other color spaces are not currently supported and default to `GrayColor.Black`.
- `PdfDocument` now has a `TryGetXmpMetadata(out XmpMetadata metadata)` method which will retrieve the XML XMP Metadata object from the document if one is present.
And You Have The Milk
This release primarily focuses on more bug-fixing to improve stability of extracting text content. The main new features are full support for encrypted documents, Document Layout Analysis tools and early-access path information.
- Fix a bug using `DefaultWordExtractor` where the `Letters` collection on all words would be empty.
- Supports UTF-16 encoded strings in document content, such as document information dictionaries, and in `HexToken` based strings.
- Supports all forms of document encryption up to and including revision 6 in PDF 2.0 spec.
- Prevents crashes where PDF contains circular object references.
- The new `DocumentLayoutAnalysis` namespace supports nearest-neighbour word extraction and recursive X-Y cut document segmentation. `RecursiveXYCut.GetBlocks` implements the Recursive X-Y cut algorithm (https://en.wikipedia.org/wiki/Recursive_X-Y_cut). `NearestNeighbourWordExtractor` can be provided to `Page.GetWords` for a different word extraction technique.
- Fix bug where some letters had a width or height of zero.
- More tolerant search for cross-reference offsets: if the cross-reference offsets are incorrect we search for the corresponding object.
- Handle a case where CidFonts contained hex rather than string tokens for registry-ordering-supplement information.
- Support cross-reference tables even if they appear after the first `%%EOF` end of file marker.
- Support rotated pages. `Page` now contains a `Rotation` property indicating if the page is rotated at the top level. Valid values for rotation are 0, 90, 180 and 270. The currently reported `PageSize` does not take rotation into account yet. This also adds support for properly rotating letters and page content.
- Change internal letter point size calculation. `Page.ExperimentalAccess.GetPointSize(Letter letter)` now reports the point size with an updated calculation which handles rotated letters.
- Map character codes directly to ASCII character values where there's no corresponding Unicode value. This matches PDFBox 1.8/9 behaviour where if no Unicode value can be found, the integer value is mapped directly to a character.
- Expose `PdfPath` information from the page's content stream. Early access to path/geometry information parsed from the page's content. Use `Page.ExperimentalAccess.Paths` to access lines, rectangles, curves, etc. declared by the page.
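For readers unfamiliar with the Recursive X-Y cut idea mentioned in the list above, the following is a minimal Python sketch of the general technique. It is not PdfPig's C# implementation of `RecursiveXYCut.GetBlocks`: the `Box` tuple layout, the `xy_cut` and `_split_on_gaps` names and the `min_gap` parameter are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical bounding box layout: (left, bottom, right, top).
Box = Tuple[float, float, float, float]


def _split_on_gaps(boxes: List[Box], lo: int, hi: int, min_gap: float) -> List[List[Box]]:
    """Group boxes along one axis wherever their projection leaves a gap wider than min_gap."""
    ordered = sorted(boxes, key=lambda b: b[lo])
    groups = [[ordered[0]]]
    reach = ordered[0][hi]  # furthest extent covered so far on this axis
    for box in ordered[1:]:
        if box[lo] - reach > min_gap:
            groups.append([box])  # whitespace gap found: start a new group
        else:
            groups[-1].append(box)
        reach = max(reach, box[hi])
    return groups


def xy_cut(boxes: List[Box], min_gap: float = 5.0) -> List[List[Box]]:
    """Recursively split boxes into blocks separated by whitespace gaps."""
    if len(boxes) <= 1:
        return [list(boxes)] if boxes else []
    # Try a vertical cut (gaps along x) first, then a horizontal cut (gaps along y).
    for lo, hi in ((0, 2), (1, 3)):
        groups = _split_on_gaps(boxes, lo, hi, min_gap)
        if len(groups) > 1:
            blocks: List[List[Box]] = []
            for group in groups:
                blocks.extend(xy_cut(group, min_gap))
            return blocks
    return [list(boxes)]  # no cut possible on either axis: this is a leaf block
```

Each successful cut produces at least two smaller groups, so the recursion always terminates. A real segmenter would typically also derive the gap threshold from the text itself (for example from dominant font size) rather than take a fixed value.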
Cows In The North
This release focuses on stability improvements and has been tested on far more document types than previous releases. The two main new features are support for full framework versions of .NET back to .NET 4.5, making this library available to more users, and initial support for encrypted documents using the most basic form of document encryption.
The release may contain a bug in System Font loading which has not been replicated but may make the library crash on some systems. Please file a bug report if you encounter an error on this package version.
- Adds the ability to access all raw operations in a page's content stream. This is the set of instructions which form the graphical features on the page. Access using `page.Operations`.
- Supports defining operations on a `PdfPageBuilder` directly using `builder.Advanced.Operations`.
- Support for full framework .NET versions back to .NET 4.5.
- Support for Compact Font Format CID fonts.
- Support for Standard 14 fonts which are incorrectly declared as TrueType fonts.
- Performance improvements for System Fonts, where the document relies on fonts installed on the host operating system. This has only been tested on Windows.
- Many stability fixes for all font types and parsing documents.
- Text direction added to letter and word. Indicates the rotation of the text.
- Add support for encrypted documents. Documents using the newer AES encryption will still throw but RC4 encryption is now supported. A password may be supplied in `ParsingOptions`.
- Support for LZW filters which were the last filter left to be implemented.
Cows In The South
Adds new document creation and provides access to per-page annotations.
Red Cake with Great Big Red Cherries
- Reworks the public API of Letter to provide height information. See the Letters page on the wiki.
- Adds support for Type 1 fonts with Compact Font Format fonts and retrieving height information.
- Bug fixes, stability improvements and performance improvements.
- PdfDocument now has a Structure property. This is an UglyToad.PdfPig.Structure object which provides access to the tokenized content of the PDF file and the merged Cross Reference Table in the document. Any objects in the PDF file may be accessed by object reference number allowing consumers to work around missing functionality. All tokens used internally when interpreting PDF documents are available on the public API.
- Page now has an `IEnumerable<Word> GetWords()` method which uses a default word extractor to attempt merging letters into words based on heuristics using letter positions. Consumers may provide their own IWordExtractor to the method to improve on the very basic approach used in this release or continue using the raw letters.
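To illustrate what merging letters into words by position can look like, here is a minimal Python sketch of a gap-based heuristic. It is not the library's default word extractor: the `(value, left, right)` letter layout, the `merge_letters` name and the `max_gap` threshold are assumptions for illustration, and the sketch only handles letters on a single horizontal line.

```python
from typing import List, Optional, Tuple

# Hypothetical letter layout: (value, left, right) for letters on one text line.
Letter = Tuple[str, float, float]


def merge_letters(letters: List[Letter], max_gap: float = 1.5) -> List[str]:
    """Join letters into words, breaking on whitespace or on a horizontal gap above max_gap."""
    words: List[str] = []
    current = ""
    previous_right: Optional[float] = None
    for value, left, right in sorted(letters, key=lambda letter: letter[1]):
        starts_new_word = value.isspace() or (
            previous_right is not None and left - previous_right > max_gap
        )
        if starts_new_word and current:
            words.append(current)
            current = ""
        if not value.isspace():
            current += value
        previous_right = right
    if current:
        words.append(current)
    return words
```

A fixed `max_gap` is the weakest part of this approach, since sensible spacing depends on font size; a custom extractor could scale the threshold by letter width or height instead.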
Version 0.0.1
The first non pre-release version.
Alpha 002
Fixes an issue where the only encoding present is embedded in the font program.
Supports reading from streams.
Very Stable Genius
The initial alpha release