
Using setMetadata() and setToC()

Jorj X. McKie edited this page Oct 18, 2017 · 4 revisions

These methods change the meta information of a document; they work for PDF documents only. Like the previously introduced method select(), they are methods of the Document class. Both also support the incremental save technique.

Standard Metadata

For every MuPDF-supported document type, doc.metadata is a Python dictionary with keys format, author, creator, producer, creationDate, modDate, subject, title, encryption and keywords. This is true whether or not this information (completely) exists for any given document.

Except for format and encryption, all of these data can be changed if the document is a PDF.

All you have to do is prepare a Python dictionary m with some or all of the above key-value pairs and invoke doc.setMetadata(m).

Any of the above keys not contained in this dictionary will receive a value of none. If you provide an empty dictionary m = {}, all information will be cleared in this way.

If you want to clear metadata for data protection / data security reasons, make sure you save your PDF to a new file using the save option garbage. This ensures that the old information is physically removed from the file (an incremental save does not do that).
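
The clearing procedure above could be wrapped up as in the following minimal sketch. The helper name save_without_metadata and the garbage level 4 are assumptions made for illustration; the text above only names the garbage option, not a particular level. The doc argument is expected to be an already opened PDF document.

```python
def save_without_metadata(doc, new_path, garbage=4):
    """Clear the standard metadata of `doc` and write it to a NEW file.

    Hypothetical helper; `garbage=4` is an assumed level, adjust as needed.
    """
    doc.setMetadata({})                  # empty dictionary: every key is cleared
    doc.save(new_path, garbage=garbage)  # full rewrite, not an incremental save


# Usage with a real document (file names are placeholders):
#     import fitz
#     doc = fitz.open("input.pdf")
#     save_without_metadata(doc, "cleaned.pdf")
```

Writing to a new file is the point of the sketch: an incremental save would append the change and leave the old values in the file.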

If you want to change selected values only (and keep the others), take a modified copy of doc.metadata and use it directly as the parameter. The format and encryption keys present in m will be silently ignored.

Except for the date keys (which must be strings), any Unicode value is acceptable. See the section PDF String Handling.
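
As a rough illustration of changing selected values only, the sketch below copies a metadata dictionary and overrides a few keys. The helper name updated_metadata is hypothetical, and the sample dictionary merely stands in for the doc.metadata of a real document (its values are made up).

```python
def updated_metadata(current, **changes):
    """Return a copy of a metadata dictionary with selected values replaced.

    `current` stands in for doc.metadata; the helper name is hypothetical.
    """
    m = dict(current)   # copy, so the original dictionary stays untouched
    m.update(changes)   # overwrite only the selected keys
    return m


# Stand-in for doc.metadata of an opened PDF (made-up values).
sample = {
    "format": "PDF 1.4",
    "encryption": None,
    "author": "A. Author",
    "title": "Old title",
    "subject": None,
    "keywords": None,
    "creator": None,
    "producer": None,
    "creationDate": None,
    "modDate": None,
}

m = updated_metadata(sample, title="New title", keywords="pdf, metadata")

# With a real document this would be followed by:
#     doc.setMetadata(m)
```

The format and encryption entries are left in m on purpose, since they are ignored by setMetadata() anyway.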

The examples directory contains a pair of utilities, csv2meta.py and meta2csv.py, which import / export metadata from / to a CSV file.

XML Metadata

Apart from standard metadata, XML-based metadata have been supported since PDF version 1.4. PDF maintenance software often uses this feature to store more complex information than is possible with standard metadata.

PyMuPDF contains no XML processing logic and therefore does not directly support maintaining such data. However, you can delete, extract and replace XML metadata (inserting new XML metadata is currently not supported).

  • Document._delXmlMetadata() deletes the XML metadata, if any (no exception is raised if there are none). Can be used to enhance data privacy or reduce file size.
  • Use xref = Document._getXmlMetadataXref() to get the xref number (int) of XML metadata. If zero, none exist.
  • Use data = Document._getXrefStream(xref) to retrieve the data (a bytes object). Then interpret or change these data with a package like lxml.
  • Use Document._updateStream(xref, data) to update the metadata.
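
The extract / change / update steps listed above might be combined as in this minimal sketch. The helper name rewrite_xml_metadata and the transform callback are assumptions for illustration; only the three Document methods named in the list are used, and the actual XML processing is left to the callback (for example, code based on lxml).

```python
def rewrite_xml_metadata(doc, transform):
    """Apply `transform` (bytes -> bytes) to the XML metadata of `doc`.

    Hypothetical helper. Returns False if the document has no XML
    metadata (nothing is changed), True after a successful update.
    """
    xref = doc._getXmlMetadataXref()
    if xref == 0:                     # zero means: no XML metadata exist
        return False
    data = doc._getXrefStream(xref)   # the XML as a bytes object
    doc._updateStream(xref, transform(data))
    return True


# Usage with a real document (placeholder file name and replacement):
#     import fitz
#     doc = fitz.open("input.pdf")
#     rewrite_xml_metadata(doc, lambda xml: xml.replace(b"Draft", b"Final"))
```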

Maintaining Bookmarks

Bookmarks or outlines form a quite complex forward-backward chained set of objects in PDFs. Together they are known as the table of contents (TOC).

A TOC structure as found in books is much simpler: it just contains a list of lines with titles, page references and hierarchy levels. The relationship between such lines is established only implicitly by their sequence of occurrence.

Maintaining a book-like TOC (instead of single, separate bookmark items) is therefore exactly what we have decided to implement in PyMuPDF. Changing anything in a TOC means changing the complete TOC. A TOC will be inserted, changed or deleted as one single item by setToC(). We believe that this approach meets both practical requirements and the wish for intuitive handling:

  • everyone knows what TOCs in books are and how to use them
  • hierarchy relations between lines in a TOC can simply be expressed by the entry's hierarchy level
  • forward / backward relationships between entries are established implicitly by the sequence in which they occur

In addition, the existing method doc.getToC() already provides an intuitive picture of all bookmark items of a document in exactly the way described above. So, maintaining the TOC of a PDF can be done in the following simple steps:

  1. toc = doc.getToC(simple = True or False)
  2. Modify toc as required ...
  3. doc.setToC(toc)

In step 3, behind the scenes, a new outline chain will be created using toc to completely replace the old one. If you wish to delete an existing TOC, you can also set toc = [].

If you wish to give a PDF a completely new TOC, provide a list of lists like toc = [[lvl1, title1, page1], [lvl2, title2, page2], ...].
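
To make the list-of-lists format concrete, here is a minimal sketch that modifies a TOC as a plain Python list. The helper name insert_toc_entry and the sample entries are made up for illustration; with a real document, the list would come from doc.getToC() and be written back with doc.setToC(toc).

```python
def insert_toc_entry(toc, index, level, title, page):
    """Return a new TOC list with [level, title, page] inserted at `index`.

    Hypothetical helper; the input list is not modified.
    """
    new_toc = [list(entry) for entry in toc]   # copy every entry
    new_toc.insert(index, [level, title, page])
    return new_toc


# Stand-in for the result of doc.getToC() (made-up entries).
toc = [
    [1, "Introduction", 1],
    [1, "Methods", 5],
    [2, "Setup", 6],
]

# Add a sub-entry below "Introduction"; its position in the list
# alone determines where it appears in the outline.
toc = insert_toc_entry(toc, 1, 2, "Background", 2)

# With a real document this would be followed by:
#     doc.setToC(toc)
```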

As with the metadata above, title entries may be provided using the full Unicode character set (see the following section).

Example program PDFoutline.py implements all of the above using the wxPython GUI.

A pair of utilities, toc2csv.py and csv2toc.py, can be used to export / import a TOC to / from a CSV file.

PDF String Handling

Outside document content text, PDF supports two character encodings, namely PDFDocEncoding and Unicode (see appendix D of the Adobe manual). Both are now fully implemented in PyMuPDF for use in the methods setMetadata() and setToC() in the following way (this applies to the above-mentioned metadata fields and to the TOC title entries):

  • if an entry contains only ASCII characters (ord(c) <= 127), it will be used unchanged / as is;
  • else, if no character exceeds ord(c) = 255, every character with 127 < ord(c) <= 255 will be replaced by the string \nnn, where nnn is the octal representation of ord(c); the resulting string will be used;
  • else (the string contains at least one character with ord(c) > 255), the complete string is encoded using UTF-16BE, prefixed with 0xfeff, and this result, converted to its hexadecimal representation, will be used.
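
The three rules above can be mimicked in a few lines of Python 3. This is only a sketch of the rules as described here, not PyMuPDF's actual code: the function name pdf_string_sketch is hypothetical, and any delimiters the library puts around the result when writing it to the PDF are not covered.

```python
def pdf_string_sketch(text):
    """Encode `text` following the three rules described above (sketch only)."""
    # Rule 1: pure ASCII is used unchanged.
    if all(ord(c) <= 127 for c in text):
        return text
    # Rule 2: nothing above 255, so replace each character in 128..255
    # by a backslash followed by its three-digit octal code.
    if all(ord(c) <= 255 for c in text):
        return "".join(
            c if ord(c) <= 127 else "\\%03o" % ord(c) for c in text
        )
    # Rule 3: at least one character above 255, so encode the complete
    # string as UTF-16BE, prefix 0xfeff and return the hex digits.
    return (b"\xfe\xff" + text.encode("utf-16-be")).hex()


for sample_text in ("plain", "M\u00fcller", "\u20ac 5"):
    print(repr(sample_text), "->", pdf_string_sketch(sample_text))
```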

Differences and similarities of string handling between Python 2 and Python 3 are covered in the following way:

  • The argument will be decoded with UTF-8.
  • If it was bytes or bytearray, it will be converted to unicode (Python 2 and Python 3)
  • A str in Python 2 will become unicode, a unicode (Python 2) and a str (Python 3) will remain unaffected (i.e. stay unicode).
  • The resulting str / unicode will then be treated as mentioned above.
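
For Python 3 only, the normalization described in this list could look like the following sketch. The function name to_unicode is hypothetical, and the Python 2 branch (str becoming unicode) is deliberately left out.

```python
def to_unicode(value):
    """Return `value` as a Python 3 str (sketch of the rules above).

    bytes and bytearray objects are decoded with UTF-8; a str is
    returned unchanged.
    """
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).decode("utf-8")
    return value


# The same title given as bytes, bytearray and str.
as_bytes = to_unicode(b"M\xc3\xbcller")
as_bytearray = to_unicode(bytearray(b"M\xc3\xbcller"))
as_str = to_unicode("M\u00fcller")
```

All three calls yield the same str, which would then be treated according to the encoding rules described earlier in this section.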

All of the above results in considerable flexibility: metadata and title fields can be provided as strings, unicode, bytes or bytearray objects!
