Perform on-host conversion for the pixels to PDF stage #748

apyrgio · 2024-03-14T11:37:27Z

This PR introduces a fundamental change in the way Dangerzone processes documents. Instead of first grabbing all of the pixel data from the first container, storing it on disk, and then reconstructing the PDF in a second container, Dangerzone now reconstructs the PDF on the host immediately, while the doc to pixels conversion is still running in the first container. The sanitization is no less safe, since the boundaries between the sandbox and the host are still respected.

What we gain is that we no longer use mounts, and we have much faster conversions, especially on Windows and macOS.

Fixes #625

Note

This PR still has some rough edges. Off the top of my head, we need to:

- Test the changes across all of our supported platforms, and fix all of our CI errors.
- ~~Remove tool.poetry.group.container.dependencies section from pyproject.toml, as it's duplicated info.~~
  - Actually, it still has its uses
- Remove --userns keep-id option in Podman.
- Make download-tessdata.py cacheable in our CI runs.
- Turn OCR language deps into recommendations in Linux systems, and handle the case where some are not installed.
- Improve our Dummy isolation provider, so that the steps that run in the host actually run in our Windows / macOS CI runners.
- Update our packaging logic so that we don't include share/tessdata in our .debs / .rpms.
- Update our wording in various places, so that we no longer refer to using two containers for the sanitization.
- Draft an ARCHITECTURE.md, which will be the source of truth on how Dangerzone works now.

Not all of these can be tackled in a single PR, but before merging this one we need at least to open issues for the ones we won't tackle immediately.

install/linux/dangerzone.spec

deeplow

This is pretty incredible. Congrats! 🥳 A lot of work went before this and now this feels like the cherry on top. I have some minor code improvement suggestions.

What I still have to do:

- test on Windows and macOS

Other observations:

- thanks for removing the dead code!
- the GUI code is crashing on me. I think the latest PySide6 version on PyPI is broken.
- Ubuntu Focal: how do we solve the lack of support? PyMuPDF in a virtualenv?
- dummy can have pixels_to_pdf removed
- Progress text improvements: because the conversion to PDF is now native and so fast, maybe we could replace the two log lines with one saying "converting page X". And when OCR is used we could say "making page X searchable"

dangerzone/isolation_provider/base.py

dangerzone/conversion/common.py

dangerzone/isolation_provider/base.py

dangerzone/conversion/pixels_to_pdf.py

dangerzone/isolation_provider/container.py

dangerzone/isolation_provider/base.py

apyrgio · 2024-03-27T15:03:25Z

I'll reply to some of your observations as well:

> the GUI code is crashing on me. I think the latest PySide6 version on PyPI is broken.

In my Fedora 39 dev environment, the GUI seems to work. Can you provide the error log?

> Ubuntu Focal: how do we solve the lack of support? PyMuPDF in a virtualenv?

I was thinking of either reusing PyMuPDF within the container, or using Tesseract just for Ubuntu Focal. I'll let you know.

> dummy can have pixels_to_pdf removed

Yep, you're right.

> Progress text improvements: because the conversion to PDF is now native and so fast, maybe we could replace the two log lines with one saying "converting page X". And when OCR is used we could say "making page X searchable"

Yep, you're right.

install/linux/dangerzone.spec

deeplow · 2024-03-28T17:23:34Z

> Update our packaging logic so that we don't include share/tessdata in our .debs / .rpms.

I worked on this. The code is in the branch 625-host-stream-tessdata-packaging. A lot of stuff had to be moved and I didn't manage to finish testing this week. I tested on Fedora and Debian and it seems to be building fine. The only issue is that it includes the .gitkeep in share/container.

On macOS it seems to be failing, but I haven't had time to investigate. If you have the chance before me, feel free to continue where I left off, @apyrgio.

stdeb.cfg

dangerzone/isolation_provider/base.py

dangerzone/util.py

apyrgio · 2024-10-08T18:10:43Z

The PR is ready for review once more. The commit messages may require a bit more ❤️ and `make lint` still complains, but other than that, it's as ready and tested as it can be.

almet

Awesome work Alex! I've tested the branch locally and it works (macOS M1), congrats 👍 🎉

In addition to the review comments I left inline, I believe we could check that the Tesseract data is present before asking PyMuPDF to use it, and disable OCR if it is not. Right now, the conversion fails if the data is not installed (which should not happen, but I believe this is the right time to handle it).

I see two ways of doing this:

- Show a warning next to the OCR setting, mentioning that the Tesseract data is not installed (for the selected language?)
- If no Tesseract data is detected, remove the OCR setting and put a warning instead.
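As a rough illustration of the check suggested above, here is a minimal sketch. The function name `ocr_setting_state` and the warning texts are hypothetical and not part of Dangerzone; the only thing taken as given is Tesseract's convention of naming language files `<lang>.traineddata`.

```python
from pathlib import Path
from typing import Optional, Tuple


def ocr_setting_state(
    tessdata_dir: Optional[Path], lang: str
) -> Tuple[bool, Optional[str]]:
    """Return (ocr_enabled, warning) for the given language.

    Hypothetical helper: the GUI or CLI could call it before offering OCR,
    instead of letting the conversion fail later.
    """
    # No Tesseract data directory at all: disable OCR entirely.
    if tessdata_dir is None or not tessdata_dir.is_dir():
        return False, "Tesseract data is not installed; OCR is disabled."
    # Directory exists, but the selected language file is missing.
    if not (tessdata_dir / f"{lang}.traineddata").is_file():
        return False, f"Tesseract data for '{lang}' is not installed; OCR is disabled."
    return True, None
```

Returning the warning alongside the boolean lets the caller pick either of the two options above: show the warning next to the setting, or replace the setting with it.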

.github/workflows/ci.yml

dangerzone/isolation_provider/base.py

dangerzone/isolation_provider/dummy.py

dangerzone/util.py

install/common/download-tessdata.py

install/linux/vendor-pymupdf.py

tests/isolation_provider/base.py

Add a Python script that can run in all supported platforms, and can download and extract the Tesseract language data from GitHub, while also:
1. Checking that the expected hash matches.
2. Informing the user if the language data have already been downloaded.
3. Extracting only the subset of language data that Dangerzone needs.
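The three steps in this commit message could look roughly like the sketch below. It is not the actual `download-tessdata.py`: the download itself is left out, and the function names, the archive layout, and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
import tarfile
from pathlib import Path
from typing import List


def sha256_of(path: Path) -> str:
    """Hash a file in chunks, so large archives are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def extract_languages(
    archive: Path, expected_sha256: str, dest: Path, langs: List[str]
) -> List[str]:
    """Verify the archive, then extract only the requested language files.

    Returns the names of the files written in this run; files that already
    exist in `dest` are skipped.
    """
    # Step 1: refuse to touch an archive whose hash does not match.
    if sha256_of(archive) != expected_sha256:
        raise ValueError(f"Hash mismatch for {archive}")

    dest.mkdir(parents=True, exist_ok=True)
    wanted = {f"{lang}.traineddata" for lang in langs}
    extracted = []
    with tarfile.open(archive) as tar:
        for member in tar.getmembers():
            # Use only the base name, so archive paths cannot escape `dest`.
            name = Path(member.name).name
            # Step 3: keep only the subset of languages that was asked for.
            if not member.isfile() or name not in wanted:
                continue
            target = dest / name
            # Step 2: skip language data that is already present.
            if target.exists():
                continue
            source = tar.extractfile(member)
            target.write_bytes(source.read())
            extracted.append(name)
    return sorted(extracted)
```

A caller can treat an empty return value as "already downloaded" and report that to the user.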

Add a new way to detect where the Tesseract data are stored in a user's system. On Linux, the Tesseract data should be installed via the package manager. On macOS and Windows, they should be bundled with the Dangerzone application. There is also the exception of running Dangerzone locally, where even on Linux, we should get the Tesseract data from the Dangerzone share/ folder.
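The lookup rules described here can be summarized in a short sketch. The function name, its parameters, and the idea of passing a list of candidate system directories are assumptions for illustration; the real helper in Dangerzone may be structured differently.

```python
from pathlib import Path
from typing import Iterable, Optional


def find_tessdata_dir(
    platform: str,
    share_dir: Path,
    running_from_source: bool,
    system_candidates: Iterable[Path],
) -> Optional[Path]:
    """Pick the Tesseract data directory following the rules above.

    `platform` is expected to look like `sys.platform` ("linux", "darwin",
    "win32"). `system_candidates` is a hypothetical list of directories where
    a Linux package manager may have installed the data.
    """
    # macOS / Windows bundles, and local development runs on any platform,
    # use the data shipped in the application's share/ folder.
    if platform != "linux" or running_from_source:
        bundled = share_dir / "tessdata"
        return bundled if bundled.is_dir() else None
    # Packaged Linux installs rely on the distribution's Tesseract packages.
    for candidate in system_candidates:
        if candidate.is_dir():
            return candidate
    return None
```

Returning `None` rather than raising leaves it to the caller to decide whether missing data is fatal or just disables OCR.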

The PyMuPDF package was previously mainly used within the Dangerzone container, as well as on Qubes. With on-host conversion, PyMuPDF will be used in all supported platforms by default. For this reason, we can promote it to a main dependency.

Update .deb/.rpm specs to include PyMuPDF as a required package.

Extend the base isolation provider to immediately convert each page to a PDF, and optionally use OCR. In contrast with the way we did things previously, there are no longer two separate stages (document to pixels, pixels to PDF). We now handle each page individually, for two main reasons:
1. We don't want to buffer pixel data, either on disk or in memory, since they take a lot of space, and can potentially leave traces.
2. We can perform these operations in parallel, saving time. This is more evident when OCR is not used, where the times to convert a page to pixels and then back to a PDF are comparable.
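To make the per-page handling concrete, here is a minimal sketch of a host-side loop that consumes pages from the sandbox's output stream one at a time. The wire format (a 2-byte big-endian page count, then for each page a 2-byte width, a 2-byte height, and raw RGB bytes) and all function names are assumptions for illustration. The `page_to_pdf` callback stands in for the PyMuPDF step, which is not shown.

```python
import struct
from typing import BinaryIO, Callable, Iterator, Tuple


def read_exact(stream: BinaryIO, size: int) -> bytes:
    """Read exactly `size` bytes, or fail if the stream ends early."""
    data = stream.read(size)
    if len(data) != size:
        raise EOFError(f"Expected {size} bytes, got {len(data)}")
    return data


def iter_pages(stream: BinaryIO) -> Iterator[Tuple[int, int, bytes]]:
    """Yield (width, height, rgb_pixels) for each page, one at a time."""
    (num_pages,) = struct.unpack(">H", read_exact(stream, 2))
    for _ in range(num_pages):
        width, height = struct.unpack(">HH", read_exact(stream, 4))
        # Three bytes per pixel (RGB). Only one page is held in memory.
        yield width, height, read_exact(stream, width * height * 3)


def convert_stream(
    stream: BinaryIO, page_to_pdf: Callable[[int, int, bytes], None]
) -> int:
    """Convert pages as they arrive; return the number of pages handled."""
    count = 0
    for width, height, pixels in iter_pages(stream):
        # The sandbox can keep rendering the next page while this runs.
        page_to_pdf(width, height, pixels)
        count += 1
    return count
```

Because `iter_pages` is a generator, the pixel data of a page can be dropped as soon as `page_to_pdf` returns, which matches the first reason given above. The sketch assumes a buffered stream, where `read(size)` returns fewer bytes only at end of stream.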

Move the logic for grabbing debug logs to a new place, now that we have merged the two conversion stages (doc to pixels, pixels to PDF).

Make the Dummy isolation provider follow the rest of the isolation providers and perform the second part of the conversion on the host. The first part of the conversion is just a dummy script that reads a file from stdin and prints pixels to stdout.

almet

… and we're good to go on this one, congrats 🙌🏼

Comments have been addressed :-)

apyrgio mentioned this pull request Mar 14, 2024

Sandbox all document processing in gVisor #590

Merged

deeplow reviewed Mar 14, 2024

install/linux/dangerzone.spec (outdated, resolved)

deeplow previously requested changes Mar 14, 2024

apyrgio force-pushed the 625-host-stream branch 2 times, most recently from ae9090d to 8884cb8 Compare March 27, 2024 12:24

deeplow reviewed Mar 28, 2024

install/linux/dangerzone.spec (resolved)

apyrgio force-pushed the 625-host-stream branch 5 times, most recently from da0dd54 to 10522c2 Compare March 28, 2024 16:29

apyrgio force-pushed the 625-host-stream branch 3 times, most recently from 48eba2b to 4d70bd9 Compare March 28, 2024 17:59

apyrgio mentioned this pull request Apr 1, 2024

OSError: [Errno 39] Directory not empty: 'pixels' when aborting during doc to pixels stage #759

Closed

apyrgio mentioned this pull request Apr 9, 2024

Handle various termination scenarios of the conversion process #772

Merged

deeplow reviewed Apr 15, 2024

stdeb.cfg (outdated, resolved)

apyrgio mentioned this pull request Apr 18, 2024

pixels-to-pdf failed #781

Closed

apyrgio mentioned this pull request May 22, 2024

Catch out of RAM errors in client and server #578

Closed

apyrgio added this to the 0.7.0 milestone Jun 3, 2024

almet reviewed Jun 5, 2024

dangerzone/isolation_provider/base.py (outdated, resolved)

almet reviewed Jun 5, 2024

dangerzone/util.py (outdated, resolved)

apyrgio force-pushed the 625-host-stream branch from 4d70bd9 to c69feba Compare June 11, 2024 17:01

almet removed this from the 0.7.0 milestone Jun 12, 2024

apyrgio force-pushed the 625-host-stream branch 2 times, most recently from 8f918c8 to 3125a59 Compare June 17, 2024 16:48

apyrgio mentioned this pull request Aug 8, 2024

GUI v2: MVP #894

Open

12 tasks

eloquence mentioned this pull request Aug 19, 2024

Update "How it works" section and add some articles about Dangerzone freedomofpress/dangerzone.rocks#39

Merged

apyrgio force-pushed the 625-host-stream branch from ef45fb4 to 1302a1f Compare October 8, 2024 16:17

almet reviewed Oct 9, 2024

almet mentioned this pull request Oct 9, 2024

Put dev scripts into their own python module #946

Open

almet mentioned this pull request Oct 17, 2024

Catch installation errors and display them. #952

Merged

apyrgio added 19 commits October 17, 2024 15:33

ci: Be explicit about the Debian package we install in end-user envs

fba009a

Better way to collect tests

bc58b78

Provide sanitized version of output filename

5bba249

Update build instructions

ffcf664

ci: Add GitHub action for tessdata

477bdfc

Ignore tesseract data when building DEB/RPM packages

d1e1194

Make PyMuPDF a main Dangerzone dependency

57475b3

The PyMuPDF package was previously mainly used within the Dangerzone container, as well as on Qubes. With on-host conversion, PyMuPDF will be used in all supported platforms by default. For this reason, we can promote it to a main dependency.

Update .deb/.rpm dependencies

08f5ef6

Update .deb/.rpm specs to include PyMuPDF as a required package.

Update the way we get debug logs

f42bb23

Move the logic for grabbing debug logs to a new place, now that we have merged the two conversion stages (doc to pixels, pixels to PDF).

Remove dead code

7ea7c8a

Remove dead docs

703bb0e

tests: Remove provider_wait fixtures

1ca867c

tests: Improve test for top-level conversion errors

4398986

ci: Check OCR in Debian/Fedora tests

0ea8e71

debian: Add Tesseract languages as a dependency

03b3c9e

apyrgio force-pushed the 625-host-stream branch from 6b65881 to 03b3c9e Compare October 17, 2024 12:51

almet approved these changes Oct 17, 2024

apyrgio merged commit 03b3c9e into main Oct 17, 2024
90 checks passed

apyrgio deleted the 625-host-stream branch October 17, 2024 13:26

jkarasti mentioned this pull request Oct 29, 2024

Executables built with cx_freeze broken after On-host pixels to PDF conversion PR was merged #974

Closed
