This repository has been archived by the owner on Nov 9, 2020. It is now read-only.

Office Documents

PunKeel edited this page Mar 29, 2017 · 1 revision

How are Office Documents sanitized?

There are two kinds of Office documents: OLE2 and OOXML, aka "2007+". The two formats have nothing in common, so we handle each of them separately.

Parsing and writing are both done by Apache POI.
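As an illustration of how the two formats can be told apart before choosing a code path, here is a minimal sketch that looks at the first bytes of a file. It is not taken from DocBleach: the class `OfficeFormatSniffer` and its `Format` enum are hypothetical names. It relies only on the well-known OLE2 signature and on the fact that OOXML files are ZIP archives, so a ZIP header only makes a file an OOXML candidate, not a confirmed Office document.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of DocBleach): guesses the container format from a file's first bytes. */
final class OfficeFormatSniffer {

    enum Format { OLE2, OOXML_CANDIDATE, UNKNOWN }

    // OLE2 compound files start with this 8-byte signature.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature; any ZIP archive starts this way, OOXML included.
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    static Format sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Format.OLE2;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Format.OOXML_CANDIDATE;
        }
        return Format.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

Passing the first eight bytes of a file to `OfficeFormatSniffer.sniff` is enough. A real implementation would still need to open the ZIP to confirm it is an OOXML package.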

OLE2 - the original format

The first format used by Office documents is OLE2; OLE stands for "Object Linking and Embedding". It is a binary format that can embed objects (hence the name), including other OLE2 objects. Once parsed, the structure looks like a file system: the header of the file tells us the paths, sizes and locations of the blocks of data. Because of this, it is not practical to change the file in place: that would mean rewriting the header and all its pointers. Instead, we read the file and copy only what we want to preserve.

OLE2 files usually have a short, simple extension, like .doc, .ppt or .xls.

Word and other Office apps use a convention to store dynamic content: all the macros are in a "directory" named "Macros". We filter it out.

That's all. Neither hard nor strange.
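The copy-and-filter idea can be sketched as follows. This is a simplified model, not DocBleach's code: the real work goes through Apache POI, while the `Dir` class below is a hypothetical in-memory stand-in for the OLE2 "file system". The sketch also assumes that a "Macros" directory is dropped at any depth of the tree.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Simplified model of "copy what we want to preserve"; not the Apache POI API. */
final class Ole2CopyFilterSketch {

    /** Hypothetical stand-in for a "directory" of the OLE2 tree. */
    static final class Dir {
        final Map<String, Dir> dirs = new LinkedHashMap<>();
        final Map<String, byte[]> streams = new LinkedHashMap<>();
    }

    /** Builds a fresh tree from the source, leaving out every directory named "Macros". */
    static Dir copyWithoutMacros(Dir source) {
        Dir target = new Dir();
        // Data streams are copied as-is.
        for (Map.Entry<String, byte[]> stream : source.streams.entrySet()) {
            target.streams.put(stream.getKey(), stream.getValue().clone());
        }
        // Sub-directories are copied recursively, except the macro storage.
        for (Map.Entry<String, Dir> child : source.dirs.entrySet()) {
            if ("Macros".equals(child.getKey())) {
                continue; // assumption: filtered wherever it appears in the tree
            }
            target.dirs.put(child.getKey(), copyWithoutMacros(child.getValue()));
        }
        return target;
    }
}
```

The source tree is never modified. The sanitized document is a new structure, which mirrors the point above that the file is rewritten rather than patched in place.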

OOXML - the Open format

According to Wikipedia:

Office Open XML (also informally known as OOXML or OpenXML) is a zipped, XML-based file format developed by Microsoft for representing spreadsheets, charts, presentations and word processing documents.

OOXML is the default format for files created with Office 2007 or later. The file extension usually ends with x or m: .docx and .docm, and the same goes for .xlsx/.xlsm and .pptx/.pptm.

This format has two interesting properties: a list of elements and a list of their relationships. Using the list of elements, we know exactly what is what: a macro has to tell us it is a macro, which makes macros very easy to detect and remove. The relationships are used to clean up dependencies, and may help improve the sanitization process in the future.
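To make the "list of elements" concrete: in an OOXML package it lives in the `[Content_Types].xml` entry of the ZIP, where each part declares its content type. The sketch below scans that XML for parts declared as a VBA project. It is not DocBleach's implementation: the class `ContentTypesMacroScan` is a hypothetical name, and checking only the `application/vnd.ms-office.vbaProject` content type is a simplifying assumption (a real sanitizer would cover more types and also remove the matching relationships).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper: lists the parts that declare themselves as macros. */
final class ContentTypesMacroScan {

    // Content type used for VBA project parts (assumption: the only type checked here).
    static final String VBA_CONTENT_TYPE = "application/vnd.ms-office.vbaProject";

    /** Takes the text of [Content_Types].xml and returns the macro part names it declares. */
    static List<String> findMacroParts(String contentTypesXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Refuse DOCTYPE declarations: the input is untrusted.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(contentTypesXml.getBytes(StandardCharsets.UTF_8)));

            List<String> macroParts = new ArrayList<>();
            // <Override> declares the type of one specific part.
            NodeList overrides = doc.getElementsByTagNameNS("*", "Override");
            for (int i = 0; i < overrides.getLength(); i++) {
                Element e = (Element) overrides.item(i);
                if (VBA_CONTENT_TYPE.equals(e.getAttribute("ContentType"))) {
                    macroParts.add(e.getAttribute("PartName"));
                }
            }
            // <Default> declares the type of every part with a given extension.
            NodeList defaults = doc.getElementsByTagNameNS("*", "Default");
            for (int i = 0; i < defaults.getLength(); i++) {
                Element e = (Element) defaults.item(i);
                if (VBA_CONTENT_TYPE.equals(e.getAttribute("ContentType"))) {
                    macroParts.add("*." + e.getAttribute("Extension"));
                }
            }
            return macroParts;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a readable [Content_Types].xml", e);
        }
    }
}
```

The returned names are exactly the entries a sanitizer would want to drop from the ZIP. Extension-wide declarations are reported as `*.ext` so the caller can decide how to match them.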

This is the approach Microsoft recommends in its knowledge base. See "Developing Solutions Using the Office XML Formats > Document Security" in "Introducing the Office (2007) Open XML File Formats".


DocBleach might break macros and ActiveX objects; this is the intended behavior.
