Perform on-host conversion for the pixels to PDF stage #748
This is pretty incredible. Congrats! 🥳 A lot of work went before this and now this feels like the cherry on top. I have some minor code improvement suggestions.
What I still have to do:
- test on windows and macOS
Other observations:
- thanks for removing the dead code!
- the GUI code is crashing on me. I think the latest PySide6 version on pypi is broken.
- ubuntu focal - how do we solve lack of support? PyMuPDF in a virtualenv?
- dummy can have `pixels_to_pdf` removed
- Progress text improvements: because the conversion to PDF is now native and so fast, maybe we could replace the two log lines by one saying "converting page X". And when OCR is used we could say "making page X searchable"
ae9090d to 8884cb8
I'll reply to some of your observations as well:
In my Fedora 39 dev environment, the GUI seems to work. Can you provide the error log?
I was thinking of either reusing PyMuPDF within the container, or using Tesseract just for Ubuntu Focal. I'll let you know.
Yeap, you're right.
Yeap, you're right.
da0dd54 to 10522c2
I worked on this. The code is in the branch. On macOS it seems to be failing, but I haven't had time to investigate. If you have the chance before me, feel free to continue where I left off, @apyrgio.
48eba2b to 4d70bd9
8f918c8 to 3125a59
ef45fb4 to 1302a1f
The PR is ready for review once more. The commit messages may require a bit more ❤️ and
Awesome work Alex! I've tested the branch locally and it works (macOS m1), congrats 👍 🎉
In addition to the review comments I left inline, I believe we could check that the Tesseract data is present before asking PyMuPDF to use it, and disable OCR if it is not. Right now, the conversion fails if the data is not installed (which should not happen, but I believe this is the right time to handle it).
I see two ways of doing this:
- Show a warning next to the OCR setting, mentioning that the tesseract data is not installed (for the selected language?)
- If no tesseract data is detected, remove the OCR setting and put a warning instead.
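Either option needs a way to tell whether the data for a language are installed. A minimal sketch of such a check, assuming the data live in a single directory of `<lang>.traineddata` files (the usual Tesseract layout). The function names and the `tessdata_dir` parameter are hypothetical and not part of Dangerzone's code:

```python
from pathlib import Path


def installed_ocr_languages(tessdata_dir: Path) -> set[str]:
    """Return the language codes that have a .traineddata file in tessdata_dir."""
    if not tessdata_dir.is_dir():
        return set()
    return {p.stem for p in tessdata_dir.glob("*.traineddata")}


def can_ocr(tessdata_dir: Path, lang: str) -> bool:
    """True only if the data for the selected language are present."""
    return lang in installed_ocr_languages(tessdata_dir)
```

The GUI could call `can_ocr()` for the selected language to decide whether to show a warning, or check for an empty `installed_ocr_languages()` result to hide the OCR setting altogether.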
Add a Python script that can run on all supported platforms, and can download and extract the Tesseract language data from GitHub, while also:
1. Checking that the expected hash matches.
2. Informing the user if the language data have already been downloaded.
3. Extracting only the subset of language data that Dangerzone needs.
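The three steps in this commit message could look roughly like the sketch below. It is not the script from this PR: it assumes the language data ship as a `.tar.gz` archive containing `<lang>.traineddata` files, and the function names, the `url` and `expected_sha256` parameters are all hypothetical.

```python
import hashlib
import io
import tarfile
import urllib.request
from pathlib import Path


def sha256_matches(data: bytes, expected_hex: str) -> bool:
    """Compare the SHA-256 digest of data against the expected hex digest."""
    return hashlib.sha256(data).hexdigest() == expected_hex.lower()


def extract_languages(archive: bytes, dest: Path, langs: set[str]) -> list[str]:
    """Extract only the <lang>.traineddata members for the wanted languages."""
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
        for member in tar:
            # Keep only the basename, so archive directories (and any
            # "../" components) never influence where we write.
            name = Path(member.name).name
            path = Path(name)
            if member.isfile() and path.suffix == ".traineddata" and path.stem in langs:
                (dest / name).write_bytes(tar.extractfile(member).read())
                extracted.append(name)
    return sorted(extracted)


def fetch_tessdata(url: str, expected_sha256: str, dest: Path, langs: set[str]) -> list[str]:
    """Download, verify and extract the language data, unless already present."""
    if all((dest / f"{lang}.traineddata").exists() for lang in langs):
        print("Tesseract language data have already been downloaded")
        return []
    with urllib.request.urlopen(url) as resp:
        archive = resp.read()
    if not sha256_matches(archive, expected_sha256):
        raise ValueError("Downloaded archive does not match the expected hash")
    return extract_languages(archive, dest, langs)
```

Verifying the hash before opening the archive means an unexpected download is rejected before any of its contents are parsed.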
Add a new way to detect where the Tesseract data are stored in a user's system. On Linux, the Tesseract data should be installed via the package manager. On macOS and Windows, they should be bundled with the Dangerzone application. There is also the exception of running Dangerzone locally, where even on Linux, we should get the Tesseract data from the Dangerzone share/ folder.
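A rough sketch of this lookup, under stated assumptions: the bundled data sit in a `tessdata` directory under the application's `share/` folder, and the Linux paths listed are examples of where distributions commonly install Tesseract data, not a list taken from this PR. `find_tessdata_dir` and its parameters are hypothetical names.

```python
import platform
from pathlib import Path
from typing import Iterable, Optional

# Example system locations (assumption: these vary per distro and version).
LINUX_TESSDATA_DIRS = (
    Path("/usr/share/tessdata"),
    Path("/usr/share/tesseract-ocr/5/tessdata"),
    Path("/usr/share/tesseract-ocr/4.00/tessdata"),
)


def find_tessdata_dir(
    share_dir: Path,
    system: Optional[str] = None,
    linux_candidates: Iterable[Path] = LINUX_TESSDATA_DIRS,
) -> Path:
    """Locate the Tesseract data directory for the current environment."""
    system = system or platform.system()

    # Bundled data: macOS/Windows builds, or a local run from the source
    # tree (on any platform, including Linux).
    bundled = share_dir / "tessdata"
    if bundled.is_dir():
        return bundled

    # Linux packages rely on the data installed by the package manager.
    if system == "Linux":
        for candidate in linux_candidates:
            if Path(candidate).is_dir():
                return Path(candidate)

    raise FileNotFoundError("Could not find the Tesseract language data")
```

Checking the `share/` folder first is one way to cover the local development case on Linux. It assumes packaged Linux builds do not ship that folder.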
The PyMuPDF package was previously used mainly within the Dangerzone container, as well as on Qubes. With on-host conversion, PyMuPDF will be used on all supported platforms by default. For this reason, we can promote it to a main dependency.
Update .deb/.rpm specs to include PyMuPDF as a required package.
Extend the base isolation provider to immediately convert each page to a PDF, and optionally use OCR. In contrast with the way we did things previously, there are no longer two separate stages (document to pixels, pixels to PDF). We now handle each page individually, for two main reasons:
1. We don't want to buffer pixel data, either on disk or in memory, since they take a lot of space, and can potentially leave traces.
2. We can perform these operations in parallel, saving time. This is more evident when OCR is not used, where the times to convert a page to pixels and then back to a PDF are comparable.
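The page-at-a-time approach can be illustrated with a small, stdlib-only sketch. The wire format here is an assumption made for the example (a big-endian 16-bit page count, then per page a 16-bit width, a 16-bit height and raw RGB bytes), and `page_to_pdf` is a hypothetical callback standing in for the PyMuPDF step, which is not shown.

```python
import struct
from typing import BinaryIO, Callable, Iterator, List, Tuple


def read_exact(stream: BinaryIO, n: int) -> bytes:
    """Read exactly n bytes, since pipes may return short reads."""
    chunks = []
    remaining = n
    while remaining:
        chunk = stream.read(remaining)
        if not chunk:
            raise EOFError("Stream ended before the expected data arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_u16(stream: BinaryIO) -> int:
    """Read one big-endian unsigned 16-bit integer."""
    return struct.unpack(">H", read_exact(stream, 2))[0]


def iter_pages(stream: BinaryIO) -> Iterator[Tuple[int, int, bytes]]:
    """Yield (width, height, rgb_pixels) for one page at a time."""
    num_pages = read_u16(stream)
    for _ in range(num_pages):
        width = read_u16(stream)
        height = read_u16(stream)
        yield width, height, read_exact(stream, width * height * 3)


def convert_stream(
    stream: BinaryIO,
    page_to_pdf: Callable[[int, int, bytes], bytes],
) -> List[bytes]:
    """Convert each page as soon as it arrives, keeping only the PDF output."""
    return [page_to_pdf(w, h, pixels) for w, h, pixels in iter_pages(stream)]
```

Because `iter_pages` is a generator, at most one page of pixel data is held at a time, and the sandbox can keep producing the next page into the pipe while the host converts the current one.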
Move the logic for grabbing debug logs to a new place, now that we have merged the two conversion stages (doc to pixels, pixels to PDF).
Make the Dummy isolation provider follow the rest of the isolation providers and perform the second part of the conversion on the host. The first part of the conversion is just a dummy script that reads a file from stdin and prints pixels to stdout.
6b65881 to 03b3c9e
… and we're good to go on this one, congrats 🙌🏼
This PR introduces a fundamental change in the way Dangerzone processes documents. Instead of first grabbing all of the pixel data from the first container, storing them on disk, and then reconstructing the PDF in a second container, Dangerzone now immediately reconstructs the PDF on the host, while the doc to pixels conversion is still running in the first container. The sanitization is no less safe, since the boundaries between the sandbox and the host are still respected.
What we gain is that we no longer use mounts, and we have much faster conversions, especially on Windows and macOS.
Fixes #625
Note
This PR still has some rough edges. Off the top of my head, we need to:
- Remove `tool.poetry.group.container.dependencies` section from `pyproject.toml`, as it's duplicated info.
- `--userns keep-id` option in Podman.
- `donwload-tessdata.py` cacheable in our CI runs.
- `share/tessdata` in our .debs / .rpms.
- `ARCHITECTURE.md`, which will be the source of truth on how Dangerzone works now.

All these cannot be tackled in a single PR, but we at least need to have issues for the ones we won't tackle immediately, before merging this PR.