This repository has been archived by the owner on Nov 9, 2020. It is now read-only.

Advanced usage

PunKeel edited this page Apr 19, 2017 · 2 revisions

DocBleach is split into multiple modules: an API, a CLI and different bleaches.

API

The Java API is available in api/src/main/java/xyz/docbleach/api. Maven packages are published on OSS Sonatype.

This API lets you define your own bleaches by implementing xyz.docbleach.api.bleach.Bleach. As of today (2017-04-20), DocBleach has not reached a stable state, so feel free to play around, but please don't depend on it.

The same API lets you build applications that use DocBleach; this is what the CLI does.

The API is built around three main concepts:

  • BleachSession. Represents a "session": the state of the sanitisation process. It stores the threats found and the actions taken.
  • Bleach. Defines a sanitiser: a class that accepts an InputStream and a BleachSession, and writes the sanitised content to an OutputStream.
  • Threat. Together with ThreatSeverity and ThreatType, describes a threat: a piece of harmful content in the file.

As an app developer depending on DocBleach, the minimal code required is:

// Define your inputStream and outputStream here
BleachSession session = new BleachSession();
new DefaultBleach().sanitize(inputStream, outputStream, session);

DefaultBleach is a special Bleach: it discovers the available bleaches through Java's ServiceLoader. For instance, the OLE2 bleach is declared in module/module-office/src/main/resources/META-INF/services/xyz.docbleach.api.bleach.Bleach.
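The discovery mechanism can be sketched with the JDK's java.util.ServiceLoader alone. In the example below, Sanitiser is a hypothetical stand-in for xyz.docbleach.api.bleach.Bleach, and SanitiserDiscovery is an illustrative name; none of this is DocBleach's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

final class SanitiserDiscovery {
    // Collects every Sanitiser implementation that a jar on the classpath
    // registers in a META-INF/services file named after the interface.
    static List<Sanitiser> discover() {
        List<Sanitiser> found = new ArrayList<>();
        for (Sanitiser sanitiser : ServiceLoader.load(Sanitiser.class)) {
            found.add(sanitiser);
        }
        return found;
    }

    public static void main(String[] args) {
        List<Sanitiser> sanitisers = discover();
        System.out.println(sanitisers.size() + " sanitiser(s) discovered");
        for (Sanitiser sanitiser : sanitisers) {
            System.out.println(" - " + sanitiser.name());
        }
    }
}

// Hypothetical stand-in for xyz.docbleach.api.bleach.Bleach (assumption, not DocBleach's API).
interface Sanitiser {
    String name();
}
```

A provider is registered by listing its implementation class name, one per line, in a META-INF/services file named after the interface's fully qualified name. With no such file on the classpath, the returned list is simply empty.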


Command Line Interface

The easiest way to use DocBleach is through the Java app:

$ java -jar docbleach.jar -in ./original.pdf -out ./sane.pdf
WARN Sanitized file has been saved, 2 potential threat(s) removed.

Verbosity

It is possible to get a more verbose output by adding -v or -vv.

$ java -jar docbleach.jar -in ./original.pdf -out ./sane.pdf -v
[main] DEBUG xyz.docbleach.cli.Main - Log Level: DEBUG
[main] DEBUG xyz.docbleach.cli.Main - Checking input name : ./original.pdf
[main] DEBUG xyz.docbleach.cli.Main - Checking output name : ./sane.pdf
[main] DEBUG xyz.docbleach.modules.pdf.PdfBleach - Password was guessed: 'null'
[main] DEBUG xyz.docbleach.modules.pdf.PdfBleach - No AcroForms found
[main] DEBUG xyz.docbleach.modules.pdf.PdfBleach - Found and removed Additionnal Actions
[main] DEBUG xyz.docbleach.modules.pdf.PdfBleach - Found and removed Additionnal Actions
[main] WARN xyz.docbleach.cli.Main - Sanitized file has been saved, 2 potential threat(s) removed.

The first column ([main]) shows the thread name, the second the log level, and the third the Java class that produced the log line.

Advanced -in and -out

If you pass a dash (-) as the argument to -in, DocBleach reads the input document from STDIN.

$ java -jar docbleach.jar -in - -out ./sane.pdf < ./original.pdf 
...

If you pass a dash (-) as the argument to -out, the sanitised file is written to STDOUT.

$ java -jar docbleach.jar -in original.pdf -out - > ./sane.pdf 
...

The two options can be combined, at the cost of uglier command lines. 👎

$ java -jar docbleach.jar -in - -out - < original.pdf > ./sane.pdf 
...

This also allows you to curl documents directly into DocBleach:

$ curl https://----/document.pdf | java -jar docbleach.jar -in - -out - > ./sane.pdf 
...

JSON Output

The default output format is meant for humans. With the -json toggle, DocBleach instead outputs a JSON object containing all the useful information about the bleach process.

$ java -jar docbleach.jar -in ./original.pdf -out ./sane.pdf -json
{"threats":[{"type":"ACTIVE_CONTENT","severity":"HIGH","location":"?","details":"Additional Actions","action":"REMOVE"},{"type":"ACTIVE_CONTENT","severity":"HIGH","location":"?","details":"Additional Actions","action":"REMOVE"}]}

This JSON output is sent to STDERR. You may redirect it to STDOUT and pipe it to other commands (such as jq):

$ java -jar docbleach.jar -in ./original.pdf -out ./sane.pdf -json 2>&1
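The report can also be consumed from a program. Here is a minimal sketch that tallies threats by severity from a report shaped like the sample above. The JDK ships no JSON parser, so it relies on a regular expression and assumes the compact, unescaped layout of that sample; a real application should use a proper JSON library. ThreatTally and countBySeverity are illustrative names, not part of DocBleach.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ThreatTally {
    // Matches "severity":"<VALUE>" pairs. Assumes no whitespace around the
    // colon and upper-case values, as in the sample output.
    private static final Pattern SEVERITY =
            Pattern.compile("\"severity\":\"([A-Z_]+)\"");

    // Returns how many threats were reported for each severity level.
    static Map<String, Integer> countBySeverity(String json) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher matcher = SEVERITY.matcher(json);
        while (matcher.find()) {
            counts.merge(matcher.group(1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Same report as the sample output, split for readability.
        String threat = "{\"type\":\"ACTIVE_CONTENT\",\"severity\":\"HIGH\","
                + "\"location\":\"?\",\"details\":\"Additional Actions\","
                + "\"action\":\"REMOVE\"}";
        String report = "{\"threats\":[" + threat + "," + threat + "]}";
        System.out.println(countBySeverity(report)); // prints {HIGH=2}
    }
}
```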

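Warning! When using -out - together with -json, place the 2>&1 redirection before the > ./sane.pdf. The shell applies redirections from left to right, so 2>&1 must come first: it points STDERR at the terminal before STDOUT is redirected to the file. Otherwise, both the sanitised file and the JSON output end up in sane.pdf.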

✅ Good:

$ java -jar docbleach.jar -in ./original.pdf -out - -json 2>&1 > ./sane.pdf
{.....}

❌ Bad:

$ java -jar docbleach.jar -in ./original.pdf -out - -json > ./sane.pdf 2>&1 
(no output)
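When DocBleach is launched from Java rather than from a shell, the ordering pitfall disappears, because each stream is configured separately. The sketch below uses the JDK's ProcessBuilder with the -in, -out and -json options documented on this page; DocBleachRunner, prepare and run are illustrative names, and the jar location is an assumption.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

final class DocBleachRunner {
    // Prepares "java -jar <jar> -in <input> -out - -json" with STDOUT sent
    // to the output file. STDERR stays a pipe (the ProcessBuilder default),
    // so the JSON report never mixes with the sanitised document.
    static ProcessBuilder prepare(String jar, String input, File output) {
        ProcessBuilder builder = new ProcessBuilder(
                "java", "-jar", jar, "-in", input, "-out", "-", "-json");
        builder.redirectOutput(output);
        return builder;
    }

    // Starts the process, reads everything written to STDERR and waits for exit.
    static String run(ProcessBuilder builder) throws IOException, InterruptedException {
        Process process = builder.start();
        String report = new String(
                process.getErrorStream().readAllBytes(), StandardCharsets.UTF_8);
        process.waitFor();
        return report;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("usage: DocBleachRunner <jar> <input> <output>");
            return;
        }
        System.out.println(run(prepare(args[0], args[1], new File(args[2]))));
    }
}
```

Reading STDERR to the end before calling waitFor() keeps the child process from blocking on a full pipe.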

Enjoy! 😄
