Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dangerzone may exclude the last page of a document #560

Closed
apyrgio opened this issue Sep 26, 2023 · 0 comments · Fixed by #561
Closed

Dangerzone may exclude the last page of a document #560

apyrgio opened this issue Sep 26, 2023 · 0 comments · Fixed by #561
Assignees
Labels
bug Something isn't working container
Milestone

Comments

@apyrgio
Copy link
Contributor

apyrgio commented Sep 26, 2023

Background

Dangerzone uses a tool called pdftoppm in order to convert a PDF document into pixels:

await self.run_command(
[
"pdftoppm",
pdf_filename,
page_base,
"-progress",
],

The way we use pdftoppm is the following:

  1. Provide the PDF document as an argument.
  2. Provide a temporary directory where pdftoppm will write PPM files (one per page).
  3. Ask pdftoppm to report progress in stderr. Progress reports are lines with the following format: <page> <num_pages> <file>
  4. Register a callback handler for the stderr of pdftoppm, which will be called with each line of the progress report as an argument.
  5. (callback handler) For each progress line, read the file with the PPM data and get the following info: page width, page height, RGB data
  6. (callback handler) Write this data as separate files under /tmp/dangerzone, which is the directory that the second phase will use to convert pixels to PDF.

Bug 🐞

Let's dig deeper into how the callback handler is called. We have a generic function that reads lines from a command's output:

while True:
line = await sr.readline()
if sr.at_eof():
break
self.captured_output += line
if callback is not None:
callback(line)
buf += line
return buf

For each line, it appends it to a buffer, and then calls a callback function with the line as an argument.

What's the bug here? We first read the line, and then check if we reached EOF 🤦. So, it's possible that we read the last line of the stream, and then immediately discard it, because we have indeed reached EOF. This means that the callback handler will not be called, and we will not create the necessary files under /tmp/dangerzone for the last page.

Impact

This bug was introduced in version 0.4.1 (aeeed41). Users of affected versions may have documents with the last page missing. This bug should not have any impact on the security of the sanitization.

During our QA we never stubled into this bug, and we don't have any report from our users hinting to such an issue. I only managed to trigger it today, while working on something that made the callback handler run twice as slow.

If you have been affected though, please let us know. In any case, we will fix this issue ASAP.

Remediation

Change the order of the checks: first check if we are at EOF, and then read the line. Note that we can't check the output of readline() for EOF (i.e, if line == ""), because it will detect empty lines as EOF.

The fix for this bug will be included in the upcoming 0.5.0 release.

@apyrgio apyrgio self-assigned this Sep 26, 2023
@apyrgio apyrgio added bug Something isn't working container labels Sep 26, 2023
@apyrgio apyrgio added this to the 0.5.0 milestone Sep 26, 2023
apyrgio added a commit that referenced this issue Sep 26, 2023
Do not read a line from the command output and then check if
we are at EOF, because it's possible that the writer immediately exited
after writing the last line of output. Instead, switch the order of
actions.

This is a very serious bug that can lead to Dangerzone excluding the
last page of the document. It should have bit us right from the start
(see aeeed41), but it seems that the
small period of time it takes the kernel to close the file descriptors
was hiding this bug.

Fixes #560
apyrgio added a commit that referenced this issue Sep 27, 2023
Add a sanity check at the end of the conversion from doc to pixels, to
ensure that the resulting document will have the same number of pages as
the original one.

Refs #560
apyrgio added a commit that referenced this issue Sep 28, 2023
Do not read a line from the command output and then check if
we are at EOF, because it's possible that the writer immediately exited
after writing the last line of output. Instead, switch the order of
actions.

This is a very serious bug that can lead to Dangerzone excluding the
last page of the document. It should have bit us right from the start
(see aeeed41), but it seems that the
small period of time it takes the kernel to close the file descriptors
was hiding this bug.

Fixes #560
apyrgio added a commit that referenced this issue Sep 28, 2023
Add a sanity check at the end of the conversion from doc to pixels, to
ensure that the resulting document will have the same number of pages as
the original one.

Refs #560
apyrgio added a commit that referenced this issue Sep 28, 2023
Add a sanity check at the end of the conversion from doc to pixels, to
ensure that the resulting document will have the same number of pages as
the original one.

Refs #560
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working container
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant