diff --git a/lib/PDF/Data.pm b/lib/PDF/Data.pm
index 27188dd..1a0945c 100644
--- a/lib/PDF/Data.pm
+++ b/lib/PDF/Data.pm
@@ -1365,22 +1365,21 @@ structures that can be readily manipulated.

   my $pdf = PDF::Data->new(-compress => 1, -minify => 1);

-Constructor to create an empty PDF::Data object instance. Any arguments
-passed to the constructor are treated as key/value pairs, and included in
-the C<$pdf> hash object returned from the constructor. When the PDF file
-data is generated, this hash is written to the PDF file as the trailer
-dictionary. However, hash keys starting with "-" are ignored when writing
-the PDF file, as they are considered to be flags or metadata.
+Constructor to create an empty PDF::Data object instance. Any arguments passed
+to the constructor are treated as key/value pairs, and included in the C<$pdf>
+hash object returned from the constructor. When the PDF file data is generated,
+this hash is written to the PDF file as the trailer dictionary. However, hash
+keys starting with "-" are ignored when writing the PDF file, as they are
+considered to be flags or metadata.

 For example, C<$pdf-E<gt>{-compress}> is a flag which controls whether or not
-streams will be compressed when generating PDF file data. This flag can be
-set in the constructor (as shown above), or set directly on the object.
+streams will be compressed when generating PDF file data. This flag can be set
+in the constructor (as shown above), or set directly on the object.

 The C<$pdf-E<gt>{-minify}> flag controls whether or not to save space in the
-generated PDF file data by removing comments and extra whitespace from
-content streams. This flag can be used along with C<$pdf-E<gt>{-compress}>
-to make the generated PDF file data even smaller, but this transformation
-is not reversible.
+generated PDF file data by removing comments and extra whitespace from content
+streams.
+This flag can be used along with C<$pdf-E<gt>{-compress}> to make the
+generated PDF file data even smaller, but this transformation is not reversible.

 =head2 clone

@@ -1390,9 +1389,15 @@ Deep copy the entire PDF::Data object itself.

 =head2 new_page

+  my $page = $pdf->new_page;
+  my $page = $pdf->new_page('LETTER');
   my $page = $pdf->new_page(8.5, 11);

-Create a new page object with the specified size.
+Create a new page object with the specified size (in inches). Alternatively,
+certain page sizes may be specified using one of the known keywords: "LETTER"
+for U.S. Letter size (8.5" x 11"), "LEGAL" for U.S. Legal size (8.5" x 14"), or
+"A0" through "A8" for ISO A-series paper sizes. The default page size is U.S.
+Letter size (8.5" x 11").

 =head2 copy_page

@@ -1411,36 +1416,36 @@ Append the specified page object to the end of the PDF page tree.
   my $pdf = PDF::Data->read_pdf($file, %args);

 Read a PDF file and parse it with C<$pdf-E<gt>parse_pdf()>, returning a new
-object instance. Any streams compressed with the /FlateDecode filter
-will be automatically decompressed. Unless the C<$pdf-E<gt>{-decompress}>
-flag is set, the same streams will also be automatically recompressed
-again when generating PDF file data.
+object instance. Any streams compressed with the /FlateDecode filter will be
+automatically decompressed. Unless the C<$pdf-E<gt>{-decompress}> flag is set,
+the same streams will also be automatically recompressed again when generating
+PDF file data.

 =head2 parse_pdf

   my $pdf = PDF::Data->parse_pdf($data, %args);

-Used by C<$pdf-E<gt>read_pdf()> to parse the raw PDF file data and create
-a new object instance. This method can also be called directly instead
-of calling C<$pdf-E<gt>read_pdf()> if the PDF file data comes another source
-instead of a regular file.
+Used by C<$pdf-E<gt>read_pdf()> to parse the raw PDF file data and create a new
+object instance.
+This method can also be called directly instead of calling
+C<$pdf-E<gt>read_pdf()> if the PDF file data comes from another source instead
+of a regular file.

 =head2 write_pdf

   $pdf->write_pdf($file, $time);

-Generate and write a new PDF file from the current state of the PDF data.
+Generate and write a new PDF file from the current state of the PDF::Data
+object.

-The C<$time> parameter is optional; if not defined, it defaults to the
-current time. If C<$time> is defined but false (zero or empty string),
-no timestamp will be set.
+The C<$time> parameter is optional; if not defined, it defaults to the current
+time. If C<$time> is defined but false (zero or empty string), no timestamp
+will be set.

 The optional C<$time> parameter may be used to specify the modification
-timestamp to save in the PDF metadata and to set the file modification
-timestamp of the output file. If not specified, it defaults to the
-current time. If a false value is specified, this method will skip
-setting the modification time in the PDF metadata, and skip setting the
-timestamp on the output file.
+timestamp to save in the PDF metadata and to set the file modification timestamp
+of the output file. If not specified, it defaults to the current time. If a
+false value is specified, this method will skip setting the modification time in
+the PDF metadata, and skip setting the timestamp on the output file.

 =head2 pdf_file_data

@@ -1448,26 +1453,29 @@ timestamp on the output file.

 Generate PDF file data from the current state of the PDF data structure,
 suitable for writing to an output PDF file. This method is used by the
-C<write_pdf> method to generate the raw string of bytes to be written
-to the output PDF file. This data can be directly used (e.g. as a MIME
+C<$pdf-E<gt>write_pdf()> method to generate the raw string of bytes to be
+written to the output PDF file. This data can be directly used (e.g. as a MIME
 attachment) without the need to actually write a PDF file to disk.

 The optional C<$time> parameter may be used to specify the modification
-timestamp to save in the PDF metadata. If not specified, it defaults to
-the current time. If a false value is specified, this method will skip
-setting the modification time in the PDF metadata.
+timestamp to save in the PDF metadata. If not specified, it defaults to the
+current time. If a false value is specified, this method will skip setting the
+modification time in the PDF metadata.

 =head2 dump_pdf

-  $pdf->dump_pdf($file);
+  $pdf->dump_pdf($file, $mode);

-Dump the PDF internal structure and data for debugging.
+Dump the PDF internal structure and data for debugging. If the C<$mode>
+parameter is "outline", dump only the PDF internal structure without the data.

 =head2 dump_outline

   $pdf->dump_outline($file);

-Dump an outline of the PDF internal structure for debugging.
+Dump an outline of the PDF internal structure for debugging. (This method
+simply calls the C<$pdf-E<gt>dump_pdf()> method with the C<$mode> parameter
+specified as "outline".)

 =head2 merge_content_streams

@@ -1477,15 +1485,35 @@ Merge multiple content streams into a single content stream.

 =head2 find_bbox

-  $pdf->find_bbox($content_stream);
+  $pdf->find_bbox($content_stream, $new);
+
+Analyze a content stream to determine the correct bounding box for the content
+stream. The current implementation was purpose-built for a specific use case
+and should not be expected to work correctly for most content streams.
+
+The C<$content_stream> parameter may be a stream object or a string containing
+the raw content stream data.
+
+The current algorithm breaks the content stream into lines, skips over various
+"neutral" lines and examines the coordinates specified for certain PDF drawing
+operators: "m" (moveto), "l" (lineto), "v" (curveto, initial point replicated),
+"y" (curveto, final point replicated), and "c" (curveto, all points specified).

-Find bounding box by analyzing a content stream. This is only partially implemented.
+The minimum and maximum X and Y coordinates seen for these drawing operators are
+used to determine the bounding box (left, bottom, right, top) for the content
+stream. The bounding box and equivalent rectangle (left, bottom, width, height)
+are printed.
+
+If the C<$new> boolean parameter is set, an updated content stream is generated
+with the coordinates adjusted to move the lower left corner of the bounding box
+to (0, 0). This would be better done by translating the transformation matrix.

 =head2 new_bbox

   $new_content = $pdf->new_bbox($content_stream);

-Find bounding box by analyzing a content stream. This is only partially implemented.
+This method simply calls the C<$pdf-E<gt>find_bbox()> method above with C<$new>
+set to 1.

 =head2 timestamp

@@ -1500,29 +1528,30 @@ Generate timestamp in PDF internal format.

   my @numbers = $pdf->round(@numbers);

-Round numeric values to 12 significant digits to avoid floating-point rounding error and
-remove trailing zeroes.
+Round numeric values to 12 significant digits to avoid floating-point rounding
+error and remove trailing zeroes.

 =head2 concat_matrix

   my $matrix = $pdf->concat_matrix($transformation_matrix, $original_matrix);

-Concatenate a transformation matrix with an original matrix, returning a new matrix.
-This is for arrays of 6 elements representing standard 3x3 transformation matrices
-as used by PostScript and PDF.
+Concatenate a transformation matrix with an original matrix, returning a new
+matrix. This is for arrays of 6 elements representing standard 3x3
+transformation matrices as used by PostScript and PDF.

 =head2 invert_matrix

   my $inverse = $pdf->invert_matrix($matrix);

-Calculate the inverse of a matrix, if possible. Returns undef if not invertible.
+Calculate the inverse of a matrix, if possible. Returns C<undef> if the matrix
+is not invertible.

 =head2 translate

   my $matrix = $pdf->translate($x, $y);

-Returns a 6-element transformation matrix representing translation of the origin to
-the specified coordinates.
+Returns a 6-element transformation matrix representing translation of the origin
+to the specified coordinates.

 =head2 scale

@@ -1535,8 +1564,8 @@ space by the specified horizontal and vertical scaling factors.

   my $matrix = $pdf->rotate($angle);

-Returns a 6-element transformation matrix representing counterclockwise rotation of
-the coordinate system by the specified angle (in degrees).
+Returns a 6-element transformation matrix representing counterclockwise rotation
+of the coordinate system by the specified angle (in degrees).

 =head1 INTERNAL METHODS

@@ -1544,95 +1573,293 @@ the coordinate system by the specified angle (in degrees).

   $pdf->validate;

-Used by C<new>, C<parse_pdf> and C<write_pdf> to validate some parts of
-the PDF structure.
+Used by C<$pdf-E<gt>new()>, C<$pdf-E<gt>parse_pdf()> and
+C<$pdf-E<gt>write_pdf()> to validate some parts of the PDF structure.
+Currently, C<$pdf-E<gt>validate()> uses C<$pdf-E<gt>validate_key()> to verify
+that the document catalog and page tree root node exist and have the correct
+type, and that the page tree root node has no parent node. Then it calls
+C<$pdf-E<gt>validate_page_tree()> to validate the entire page tree.
+
+By default, validation errors are output as warnings, but the
+C<$pdf-E<gt>{-validate}> flag can be set to make the errors fatal.
+
+=head2 validate_page_tree
+
+  my $count = $pdf->validate_page_tree($path, $page_tree_node);
+
+Used by C<$pdf-E<gt>validate()>, and called by itself recursively, to validate
+the PDF page tree and its subtrees. The C<$path> parameter specifies the
+logical path from the root of the PDF::Data object to the page subtree, and the
+C<$page_tree_node> parameter specifies the actual page tree node data structure
+represented by that logical path. C<$pdf-E<gt>validate()> initially calls
+C<$pdf-E<gt>validate_page_tree()> with "Root/Pages" for C<$path> and
+C<$pdf-E<gt>{Root}{Pages}> for C<$page_tree_node>.
+
+Each child of the page tree node (in C<$page_tree_node-E<gt>{Kids}>) should be
+another page tree node for a subtree or a single page node. In either case, the
+parameters used for the next method call will be C<"$path\[$i]"> for C<$path>
+(e.g. "Root/Pages[0][1]") and C<$page_tree_node-E<gt>{Kids}[$i]> for
+C<$page_tree_node> (e.g. C<$pdf-E<gt>{Root}{Pages}{Kids}[0]{Kids}[1]>). These
+parameters are passed to either C<$pdf-E<gt>validate_page_tree()> recursively
+(if the child is a page tree node) or to C<$pdf-E<gt>validate_page()> (if the
+child is a page node).
+
+After validating the page tree, C<$pdf-E<gt>validate_resources()> will be called
+to validate the page tree's resources, if any.
+
+If the count of pages in the page tree is incorrect, it will be fixed. This
+method returns the total number of pages in the specified page tree.
+
+=head2 validate_page
+
+  $pdf->validate_page($path, $page);
+
+Used by C<$pdf-E<gt>validate_page_tree()> to validate a single page of the PDF.
+The C<$path> parameter specifies the logical path from the root of the PDF::Data
+object to the page, and the C<$page> parameter specifies the actual page data
+structure represented by that logical path.
+
+This method will call C<$pdf-E<gt>merge_content_streams()> to merge the content
+streams into a single content stream (if C<$page-E<gt>{Contents}> is an array),
+then it will call C<$pdf-E<gt>validate_content_stream()> to validate the page's
+content stream.
+
+After validating the page, C<$pdf-E<gt>validate_resources()> will be called to
+validate the page's resources, if any.
+
+=head2 validate_resources
+
+  $pdf->validate_resources($path, $resources);
+
+Used by C<$pdf-E<gt>validate_page_tree()>, C<$pdf-E<gt>validate_page()> and
+C<$pdf-E<gt>validate_xobject()> to validate associated resources. The C<$path>
+parameter specifies the logical path from the root of the PDF::Data object to
+the resources, and the C<$resources> parameter specifies the actual resources
+data structure represented by that logical path.
+
+This method will call C<validate_xobjects> for C<$resources-E<gt>{XObject}>, if
+set.
+
+=head2 validate_xobjects
+
+  $pdf->validate_xobjects($path, $xobjects);
+
+Used by C<$pdf-E<gt>validate_resources()> to validate form XObjects in the
+resources. The C<$path> parameter specifies the logical path from the root of
+the PDF::Data object to the hash of form XObjects, and the C<$xobjects>
+parameter specifies the actual hash of form XObjects represented by that logical
+path.
+
+This method simply loops across all the form XObjects in C<$xobjects> and calls
+C<$pdf-E<gt>validate_xobject()> for each of them.
+
+=head2 validate_xobject
+
+  $pdf->validate_xobject($path, $xobject);
+
+Used by C<$pdf-E<gt>validate_xobjects()> to validate a form XObject. The
+C<$path> parameter specifies the logical path from the root of the PDF::Data
+object to the form XObject, and the C<$xobject> parameter specifies the actual
+form XObject represented by that logical path.
+
+This method verifies that C<$xobject> is a stream and C<$xobject-E<gt>{Subtype}>
+is "/Form", then calls C<$pdf-E<gt>validate_content_stream()> with C<$xobject>
+to validate the form XObject content stream, then calls
+C<$pdf-E<gt>validate_resources()> to validate the form XObject's resources, if
+any.
+
+=head2 validate_content_stream
+
+  $pdf->validate_content_stream($path, $stream);
+
+Used by C<$pdf-E<gt>validate_page()> and C<$pdf-E<gt>validate_xobject()> to
+validate a content stream. The C<$path> parameter specifies the logical path
+from the root of the PDF::Data object to the content stream, and the C<$stream>
+parameter specifies the actual content stream represented by that logical path.
+
+This method calls C<$pdf-E<gt>parse_objects()> to make sure that the content
+stream can be parsed. If the C<$pdf-E<gt>{-minify}> flag is set,
+C<$pdf-E<gt>minify_content_stream()> will be called with the array of parsed
+objects to minify the content stream.
+
+=head2 minify_content_stream
+
+  $pdf->minify_content_stream($stream, $objects);
+
+Used by C<$pdf-E<gt>validate_content_stream()> to minify a content stream. The
+C<$stream> parameter specifies the content stream to be modified, and the
+optional C<$objects> parameter specifies a reference to an array of parsed
+objects as returned by C<$pdf-E<gt>parse_objects()>.
+
+This method calls C<$pdf-E<gt>parse_objects()> to populate the C<$objects>
+parameter if unspecified, then it calls C<$pdf-E<gt>generate_content_stream()>
+to generate a minimal content stream for the array of objects, with no comments
+and only the minimum amount of whitespace necessary to parse the content stream
+correctly. (Obviously, this means that this transformation is not reversible.)
+
+Currently, this method also performs a sanity check by running the replacement
+content stream through C<$pdf-E<gt>parse_objects()> and comparing the entire
+list of objects returned against the original list of objects to ensure that the
+replacement content stream is equivalent to the original content stream.
+
+=head2 generate_content_stream
+
+  my $data = $pdf->generate_content_stream($objects);
+
+Used by C<$pdf-E<gt>minify_content_stream()> to generate a minimal content
+stream to replace the original content stream. The C<$objects> parameter
+specifies a reference to an array of parsed objects as returned by
+C<$pdf-E<gt>parse_objects()>. These objects will be used to generate the new
+content stream.
+
+For each object in the array, this method will call an appropriate serialization
+method: C<$pdf-E<gt>serialize_dictionary()> for dictionary objects,
+C<$pdf-E<gt>serialize_array()> for array objects, or
+C<$pdf-E<gt>serialize_object()> for other objects. After serializing all the
+objects, the newly-generated content stream data is returned.
+
+=head2 serialize_dictionary
+
+  $pdf->serialize_dictionary($stream, $hash);
+
+Used by C<$pdf-E<gt>generate_content_stream()>,
+C<$pdf-E<gt>serialize_dictionary()> (recursively) and
+C<$pdf-E<gt>serialize_array()> to serialize a hash as a dictionary object. The
+C<$stream> parameter specifies a reference to a string containing the data for
+the new content stream being generated, and the C<$hash> parameter specifies the
+hash reference to be serialized.
+
+This method will serialize all the key-value pairs of C<$hash>, prefixing each
+key in the hash with "/" to serialize the key as a name object, and calling an
+appropriate serialization routine for each value in the hash:
+C<$pdf-E<gt>serialize_dictionary()> for dictionary objects (recursive call),
+C<$pdf-E<gt>serialize_array()> for array objects, or
+C<$pdf-E<gt>serialize_object()> for other objects.
+
+=head2 serialize_array
+
+  $pdf->serialize_array($stream, $array);
+
+Used by C<$pdf-E<gt>generate_content_stream()>,
+C<$pdf-E<gt>serialize_dictionary()> and C<$pdf-E<gt>serialize_array()>
+(recursively) to serialize an array. The C<$stream> parameter specifies a
+reference to a string containing the data for the new content stream being
+generated, and the C<$array> parameter specifies the array reference to be
+serialized.
+
+This method will serialize all the array elements of C<$array>, calling an
+appropriate serialization routine for each element of the array:
+C<$pdf-E<gt>serialize_dictionary()> for dictionary objects,
+C<$pdf-E<gt>serialize_array()> for array objects (recursive call), or
+C<$pdf-E<gt>serialize_object()> for other objects.
+
+=head2 serialize_object
+
+  $pdf->serialize_object($stream, $object);
+
+Used by C<$pdf-E<gt>generate_content_stream()>,
+C<$pdf-E<gt>serialize_dictionary()> and C<$pdf-E<gt>serialize_array()>
+to serialize a simple object.
+The C<$stream> parameter specifies a reference to
+a string containing the data for the new content stream being generated, and the
+C<$object> parameter specifies the pre-serialized object to be serialized to the
+specified content stream data.
+
+This method will strip leading and trailing whitespace from the pre-serialized
+object if the C<$pdf-E<gt>{-minify}> flag is set, then append a newline
+to C<${$stream}> if appending the pre-serialized object would exceed 255
+characters for the last line, then append a space to C<${$stream}> if necessary
+to parse the object correctly, then append the pre-serialized object to
+C<${$stream}>.

 =head2 validate_key

   $pdf->validate_key($hash, $key, $value, $label);

-Used by C<validate> to validate specific hash key values.
+Used by C<$pdf-E<gt>validate()> to validate specific hash key values.

 =head2 get_hash_node

   my $hash = $pdf->get_hash_node($path);

-Used by C<validate_key> to get a hash node from the PDF structure by path.
+Used by C<$pdf-E<gt>validate_key()> to get a hash node from the PDF structure by
+path.

 =head2 parse_objects

   my @objects = $pdf->parse_objects($objects, $data, $offset);

-Used by C<parse_pdf> to parse PDF objects into Perl representations.
+Used by C<$pdf-E<gt>parse_pdf()> to parse PDF objects into Perl representations.

-=head2 parse_content
+=head2 parse_data

   my @objects = $pdf->parse_data($data);

-Uses C<parse_objects> to parse PDF objects from standalone PDF data.
+Uses C<$pdf-E<gt>parse_objects()> to parse PDF objects from standalone PDF data.

 =head2 filter_stream

   $pdf->filter_stream($stream);

-Used by C<parse_objects> to inflate compressed streams.
+Used by C<$pdf-E<gt>parse_objects()> to inflate compressed streams.

 =head2 compress_stream

   $new_stream = $pdf->compress_stream($stream);

-Used by C<write_object> to compress streams if enabled. This is controlled
-by the C<$pdf-E<gt>{-compress}> flag, which is set automatically when reading a
-PDF file with compressed streams, but must be set manually for PDF files
-created from scratch, either in the constructor arguments or after the fact.
+Used by C<$pdf-E<gt>write_object()> to compress streams if enabled. This is
+controlled by the C<$pdf-E<gt>{-compress}> flag, which is set automatically when
+reading a PDF file with compressed streams, but must be set manually for PDF
+files created from scratch, either in the constructor arguments or after the
+fact.

 =head2 resolve_references

   $object = $pdf->resolve_references($objects, $object);

-Used by C<parse_pdf> to replace parsed indirect object references with
-direct references to the objects in question.
+Used by C<$pdf-E<gt>parse_pdf()> to replace parsed indirect object references
+with direct references to the objects in question.

 =head2 write_indirect_objects

   my $xrefs = $pdf->write_indirect_objects($pdf_file_data, $objects, $seen);

-Used by C<write_pdf> to write all indirect objects to a string of new
-PDF file data.
+Used by C<$pdf-E<gt>write_pdf()> to write all indirect objects to a string of
+new PDF file data.

 =head2 enumerate_indirect_objects

   $pdf->enumerate_indirect_objects($objects);

-Used by C<write_indirect_objects> to identify which objects in the PDF
-data structure need to be indirect objects.
+Used by C<$pdf-E<gt>write_indirect_objects()> to identify which objects in the
+PDF data structure need to be indirect objects.

 =head2 enumerate_shared_objects

   $pdf->enumerate_shared_objects($objects, $seen, $ancestors, $object);

-Used by C<enumerate_indirect_objects> to find objects which are already
-shared (referenced from multiple objects in the PDF data structure).
+Used by C<$pdf-E<gt>enumerate_indirect_objects()> to find objects which are
+already shared (referenced from multiple objects in the PDF data structure).

 =head2 add_indirect_objects

   $pdf->add_indirect_objects($objects, @objects);

-Used by C<enumerate_indirect_objects> and C<enumerate_shared_objects>
-to add objects to the list of indirect objects to be written out.
+Used by C<$pdf-E<gt>enumerate_indirect_objects()> and
+C<$pdf-E<gt>enumerate_shared_objects()> to add objects to the list of indirect
+objects to be written out.

 =head2 write_object

   $pdf->write_object($pdf_file_data, $objects, $seen, $object, $indent);

-Used by C<write_indirect_objects>, and called by itself recursively, to
-write direct objects out to the string of new PDF file data.
+Used by C<$pdf-E<gt>write_indirect_objects()>, and called by itself recursively,
+to write direct objects out to the string of new PDF file data.

 =head2 dump_object

   my $output = $pdf->dump_object($object, $label, $seen, $indent, $mode);

-Used by C<dump_pdf>, and called by itself recursively, to dump (or outline)
-the specified PDF object.
+Used by C<$pdf-E<gt>dump_pdf()>, and called by itself recursively, to dump (or
+outline) the specified PDF object.

 =cut
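
Note on the matrix helpers documented in this patch (concat_matrix, invert_matrix, translate, scale): the 6-element arrays they describe can be illustrated with a minimal standalone sketch. The sub names below (sketch_translate, sketch_scale, sketch_concat, sketch_invert) are hypothetical and are not PDF::Data's own code. The sketch assumes the usual PDF/PostScript convention that [a b c d e f] stands for the 3x3 matrix with rows (a b 0), (c d 0), (e f 1), and it assumes that the transformation matrix is applied before the original matrix; the module's actual argument order and rounding behaviour may differ.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration only (not PDF::Data's implementation).
# A matrix is a reference to 6 numbers [a, b, c, d, e, f], standing for:
#   | a b 0 |
#   | c d 0 |
#   | e f 1 |

# Translation of the origin to ($x, $y).
sub sketch_translate { my ($x, $y) = @_; return [1, 0, 0, 1, $x, $y]; }

# Scaling by horizontal and vertical factors.
sub sketch_scale { my ($sx, $sy) = @_; return [$sx, 0, 0, $sy, 0, 0]; }

# Concatenation: the 3x3 product (t x m), keeping only the 6 variable entries.
# Assumed order: $t is applied to a point first, then $m.
sub sketch_concat {
    my ($t, $m) = @_;
    my ($a1, $b1, $c1, $d1, $e1, $f1) = @$t;
    my ($a2, $b2, $c2, $d2, $e2, $f2) = @$m;
    return [
        $a1 * $a2 + $b1 * $c2,
        $a1 * $b2 + $b1 * $d2,
        $c1 * $a2 + $d1 * $c2,
        $c1 * $b2 + $d1 * $d2,
        $e1 * $a2 + $f1 * $c2 + $e2,
        $e1 * $b2 + $f1 * $d2 + $f2,
    ];
}

# Inverse of a matrix, or undef when the determinant is zero.
sub sketch_invert {
    my ($m) = @_;
    my ($ma, $mb, $mc, $md, $me, $mf) = @$m;
    my $det = $ma * $md - $mb * $mc;
    return undef if $det == 0;
    return [
         $md / $det,
        -$mb / $det,
        -$mc / $det,
         $ma / $det,
        ($mc * $mf - $md * $me) / $det,
        ($mb * $me - $ma * $mf) / $det,
    ];
}

# Usage: translate by (10, 20), then scale by (2, 3).
my $combined = sketch_concat(sketch_translate(10, 20), sketch_scale(2, 3));
print join(' ', @$combined), "\n";    # prints "2 0 0 3 20 60"
```

Under these assumptions, concatenating a matrix with its inverse yields the identity matrix [1 0 0 1 0 0], which is a convenient sanity check when experimenting with transformation order.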