-
Notifications
You must be signed in to change notification settings - Fork 171
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
On-host pixels to PDF conversion #625
Comments
On PackagingWindowsThe official Tesseract site points at the Mannheim University Library for the latest installers on Tesseract. These installers do not seem to be a solution on their own, because they are provided a by third-party, and extracting a single Tesseract binary from them seems not trivial. MacOSThe official Tesseract site suggests using either MacPorts or HomeBrew. The problem with these approaches is that Dangerzone is a package that can be installed on its own, as a DMG. So, we can't expect that users will have HomeBrew or MacPorts enabled. Building from SourceIf we want to build Tesseract from source, we should have in mind that we probably need to build a standalone binary (i.e., without shared libraries). There have been attempts to create standalone binaries, but they are not straight-forward as well. |
We had a breakthrough in this front. Turns out that installing Tesseract-OCR is not a prerequisite for using the OCR capabilities of PyMuPDF. We were mistaken that this was the case because PyMuPDF specified for these functions that "... Will fail if Tesseract is not installed". Also, one more thing that threw us off is that other projects like PyTesseract require a Tesseract installation. |
I added one more topic to the list. |
Added some extra tasks:
|
Design overviewIn this section, we'll explain how we plan to implement the on-host conversion feature. As a reminder, here's in pseudocode the current conversion flow: p = start_first_container()
p.stdin.write(document)
d = create_tmp_dir()
for (width, height, pixels) in p.stdout:
# Write width, height, and pixel data as individual files in tmp dir
p = start_second_container() # Also mount the temporary directory
p.wait()
# Move safe PDF from tmp dir to the target path You can see that the container for the first phase starts, and converts the pages to pixels. Then the second container starts, find the pixels as mounted files, and produces a safe PDF We plan to change this flow as follows: p = start_first_container()
safe_pdf = create_empty_pdf()
p.stdin.write(document)
for (width, height, pixels) in p.stdout:
pixmap = create_pymupdf_pixmap(width, height, pixels)
if ocr_lang:
page = ocr_page(pixmap, ocr_lang)
else:
page = convert_pixmap_to_pdf(pixmap)
safe_pdf.insert_pdf(page)
# Write safe PDF to the target path In this case, the first container starts, and remains running for the whole duration of the conversion. For every page that we get from the container's stdout, we convert it to a PyMuPDF pixmap. Then, we convert it to a single-page PDF, either by OCRing it, or by converting the pixels to PNG and then to PDF. Finally, we build a safe PDF by inserting sequentially each page in it. |
The Debian situationWhile implementing the above solution on #748, we realized that of all platforms, we have a big problem in Debian-based ones specifically. Here's a list of issues we've encountered so far:
Approaches we consideredWe have to sidestep this issue, and to do so, we thought of and evaluated several different approaches. We'll list below the ones we dropped, and explain why: In Rust we trust?If PyMuPDF cannot solve everything for us across operating systems, then could Rust help here? We could create a Rust library, importable via PyO3, and built for our three OSes (Windows, macOS, Linux). The functionality of the Rust library would be very trivial. Just take the pixel data, convert them to PNG, compress them, and iteratively build a PDF from each page. That's super easy to do. However, the main problem here is that we didn't find any Rust library that can OCR a PNG and produce a PDF, without calling Tesseract under the hood. Statically compiling Tesseract in our Rust library is a mandatory requirement, else it's too much of a maintainance hassle to use it on Windows / macOS systems which don't have Tesseract installed (see rationale in #625 (comment)). Provide our own PyMuPDF package?If the OS-provided PyMuPDF package is not enough for our needs, we could build our own from a version that works, and provide it from our repo. That's what we've done for the The problem here is that unlike Vendor PyMuPDF from PyPI in our package?Another solution is to copy the contents of the PyMuPDF wheel from PyPI into a source dir of Dangerzone, solely for Debian-based distros, and use PyMuPDF from there. Unlike the Debian package route, this route will not affect other users, and we can maintain full control over this package. The problem with this approach is that vendoring Python packages is usually against Debian's policy for inclusion in the official repos (think the Use
|
I am curious why this is just a Debian/Ubuntu issue and not Fedora? Is their PyMuPDF package setup differently/better?
The section in Debian policy is https://www.debian.org/doc/debian-policy/ch-source.html#embedded-code-copies, which uses "should" to discourage vendoring but doesn't outright ban it. I think there needs to be a strong justification for it (✔️) and that we have a plan for handling any security/etc. updates that are needed for it, which I assume we'd be able to do if we went down this route. Also, earlier you said:
Which I think is at odds with the goal of having official Debian packages. Because every new version wouldn't get pushed to all supported Debian releases, it would just go into unstable, with stable/oldstable staying the same. So you still only need to meet the latest version of the PyMuPDF API, and not all the way back until bullseye. |
Fedora enables the OCR feature in their PyMuPDF package, whereas Debian doesn't, which is an important distinction. Furthermore, the PyMuPDF packages for Fedora 39+ are > 1.22, and therefore we haven't encountered segfaults there. On the other hand, several Debian/Ubuntu releases we support offer PyMuPDF < 1.22, and we encounter segfaults there. Maybe there's a patch that distros can backport, but I haven't managed to reach to a conclusion about the nature of these errors.
Thanks for the link, I didn't know the exact wording for this policy. I do see "should" there, but I see some other policies as well that we need to circumvent as well:
But if you think that our use case can be an exception, then I'll seriously consider this option.
Yeah, I agree in principle. My argument though in that specific section was not about what would happen if the PyMuPDF API changes again in the future. I was talking about the current situation. We ship Dangerzone to the following Debian-based distros:
We cannot build a PyMuPDF package that is both > 1.22 (e.g., like the one in Debian Trixie), and can also be used without API incompatibilities in Ubuntu Focal. In any case relying on such a package is a no-no for inclusion in Debian, so I don't think we can go down that route. |
Getting bug reports for this would be good :)
I wouldn't immediately count it out, I think it would need some level of input from the Debian security team and demonstration that the PyMuPDF package is deficient and unfixable. The former is clear and can be demonstrated via bug reports, but that it's unfixable seems up in the air.
I think the two goals of "Support focal and trixie with identical packages" and "Get official packages into Debian" are at odds with each other and we/you probably need to compromise somewhere. Maybe something like using the package for trixie and for the older versions that won't end up in Debian proper, vendoring PyMuPDF. |
Gotcha, you're right. It's best to have a reproducible example for these segfaults. I didn't do so yet, because they are fixed in newer versions, but maybe Debian devs are interested in it.
Oh wait, there's something important here that informs whatever decision we take on this subject. One of the "end goals" is to make Dangerzone eligible for landing on Debian Bookworm, and therefore Tails. So, I'm aiming for a proper Debian package not just for Debian Trixie, but for Bookworm as well. What's your opinion on the latter, do you think it's possible if we package PyMuPDF on our own? |
Getting a package into bookworm at this point isn't possible at all. Once it's in trixie, we can add it to bookworm-backports, but I'm not sure if Tails has backports enabled. |
Ok, it's good to know that. My understanding is that Tails can work with Bookworm backports (see onionshare issue), but that it can also backport a Debian package into their repos, so I guess that's a viable route as well. So, we an alternative path forward is:
This way, the PyMuPDF versions that we support are fairly recent, and their API should be consistent across all platforms. Plus, we won't have to use the container at all for the second stage of the conversion, although I must stress again that this may be a limiting factor in the long run. I have just a few concerns:
Kunal, thanks a bunch for your input. If the above plan makes sense to you (from the perspective of a Debian dev), then I'll give it a shot. |
Yep, overall that sounds good to me! Maybe explicitly "Vendor PyMuPDF for all Debian-based distros until the distro-provided package meets our needs"?
Exactly. Ubuntu will grab the latest version in testing and include it in the current development distro, and continue picking up changes until the Ubuntu release is frozen and shipped.
You're welcome :) |
Most def. Regarding package vendoring, it's quite simple in the case of Dangerzone. What we need to do is:
That's it. From there on, we can do |
I sent a Debian bug report for the lack of OCR support in PyMuPDF: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1082023 (with a typo, ugh). |
Install PyMuPDF under ./dangerzone/vendor, right before we build the .deb package. We vendor PyMuPDF just for Debian, since the provided versions don't have OCR support enabled. Currently, we don't use PyMuPDf on the host, but this will change once we fully implement the on-host conversion feature. Refs #625
Install PyMuPDF under ./dangerzone/vendor, right before we build the .deb package. We vendor PyMuPDF just for Debian, since the provided versions don't have OCR support enabled. Currently, we don't use PyMuPDf on the host, but this will change once we fully implement the on-host conversion feature. Refs #625
Install PyMuPDF under ./dangerzone/vendor, right before we build the .deb package. We vendor PyMuPDF just for Debian, since the provided versions don't have OCR support enabled. Currently, we don't use PyMuPDf on the host, but this will change once we fully implement the on-host conversion feature. Refs #625
Install PyMuPDF under ./dangerzone/vendor, right before we build the .deb package. We vendor PyMuPDF just for Debian, since the provided versions don't have OCR support enabled. Currently, we don't use PyMuPDf on the host, but this will change once we fully implement the on-host conversion feature. Refs #625
Install PyMuPDF under ./dangerzone/vendor, right before we build the .deb package. We vendor PyMuPDF just for Debian, since the provided versions don't have OCR support enabled. Currently, we don't use PyMuPDf on the host, but this will change once we fully implement the on-host conversion feature. Refs #625
Install PyMuPDF under ./dangerzone/vendor, right before we build the .deb package. We vendor PyMuPDF just for Debian, since the provided versions don't have OCR support enabled. Currently, we don't use PyMuPDf on the host, but this will change once we fully implement the on-host conversion feature. Refs #625
Background
Historically on the containers version of Dangerzone the conversion happens on a second container. This was needed since Dangerzone relied on many linux-native programs for conversion such as GraphicsMagic, ghostscript (for compression via
ps2pdf
). In Qubes the conversion from pixels to PDF already happens on the host.Suggestion
Allowing the second phase of the conversion to run in the host opens up several improvements for Dangerzone:
(partial capture, taken from #658 (comment))
Let's explain how the on-host conversion feature brings the above improvements:
Host container image independently
One of the reasons why the Dangerzone container image is not hosted in a container registry like DockerHub is because we don't have a way to verify its contents during image pull. Normally, this would not be an issue for Dangerzone, since it already restricts the capabilities and privileges of the container. Therefore, we could treat a malicious image similarly as a container that has been pwned by a malicious document.
One problem here is that the same container image is used for the second stage of the conversion. A malicious container image in that stage would cause real damage, because its output is a PDF document, not pixels.
If we performed the pixels to PDF phase on the host instead, this would mean that the container image would be used solely for the first phase, and thus we could potentially push it to an untrusted image registry.
Benefit: This could open the door for updating the container image independently from the app, which would be a massive security improvement for Dangerzone.
Note
That was the original thinking at least. Since creating the above diagram, we now run gVisor within the container image. This means that a malicious container registry can serve an image that does not contain gVisor, and significantly weaken our defenses. The resulting container will still be restricted by our security flags, but the truth is that the original point is no longer that strong.
Smaller container image
By performing the pixels to PDf phase on host, we can remove the dependency to the OCR models from our image, which are currently used by PyMuPDF. These models take a lot of space (roughly ~300MiB), and removing them can slim down our image considerably.
As for where they will go, the plan is to move them to the host instead. This means will either make them dependencies of the .deb/.rpm packages, or include them in the Windows/macOS installers.
Benefit: This would make our Debian/Fedora packages slimmer, which is important for getting included in the official repos, and in Tails (see #669)
Dynamic loading of OCR models
If we move the OCR models to the host, we can then experiment with downloading them dynamically, when a user attempts to use a specific language.
Benefit: This results in size improvements, since we would further slim down the Windows/macOS installers, and the packages the users need to download in Linux.
Besides the above benefits, this feature opens the room for some extra improvements, namely:
Tasks
share/
for inclusion on Windows / MacOS, as well as when we do development (see Perform on-host conversion for the pixels to PDF stage #748).Add OCR language data for all languages as optional dependencies in our Linux packages (.deb/.rpm)(will be tackled in a separate PR)The text was updated successfully, but these errors were encountered: