-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PI: Making pypdf as fast as pdfrw #2086
Conversation
I'm surprised by the low number of failing tests: This is encouraging! 😄 |
There are only 4 remaining tests that fail. They can be executed with: |
I've just merged the comment improvements into I also just ran https://github.com/py-pdf/benchmarks to see the impact:
|
Interesting.
Yes, that's promising! By the way, I was wondering: would you be open to add non-regression unit tests focused on performances @MartinThoma?
That have some drawbacks (those tests will fail when executed on slow computers), |
Yes! We have only two so far:
The Benchmark was intended to show changes, but it seems as if the executing machine in Github CI just behaves wildly differently. Maybe I should execute those for the releases on my machine 🤔 Or maybe there is something one can do to "normalize" the results? |
@pubpub-zz I would appreciate some help here :-) test_extra_test_iss1541This test fails:
No exception is thrown. I don't know if we would want an exception to be thrown there or if it's actually good that we no longer do it. test_writer.py::test_remove_images
That is for sure not good. test_writer.py::test_iss1601
The left-hand side starts with test_writer.py::test_new_removes
|
Can you make an example PR? I think this PR is interesting for three reasons (1) Huge performance improvement in watermarking (2) File size reductions (3) test_extra_test_iss1541 no longer throws an exception ... I'm not sure if that is good though 😄 |
Among the 4 previously failing tests, only But new unit tests are failing:
I'm going to look into them |
While on it, pre-commit was also updated Taken from #2086 Co-authored-by: Lucas Cimon <925560+Lucas-C@users.noreply.github.com>
While on it, pre-commit was also updated + several fixes for mypy. Taken from #2086 Full credit to Lucas for the property-simplification. Co-authored-by: Lucas Cimon <925560+Lucas-C@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
first comments
pypdf/generic/_data_structures.py
Outdated
@@ -818,6 +818,20 @@ def _clone( | |||
pass | |||
super()._clone(src, pdf_dest, force_duplicate, ignore_fields) | |||
|
|||
def get_data(self) -> bytes: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def get_data(self) -> bytes: | |
def get_data(self) -> bytes: |
So far from I remember _data does not have the same meaning in an encodedStream (where _data contains the compressed data) and DecodedStream(where the data are clear). raising up the set_data into content stream will leave people think set_data on an encoded stream is valid where the results are not good (some side fields needs to be set also)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So far from I remember _data does not have the same meaning in an encodedStream (where _data contains the compressed data) and DecodedStream(where the data are clear)
Interesting. Is this documented somewhere?
Accessing a private property (._data
) from outside the ContentStream
class seemed like a code smell to me,
but if there is a semantic meaning to using .data
instead of get/set_data()
,
I should take care about this.
raising up the set_data into content stream will leave people think set_data on an encoded stream is valid where the results are not good (some side fields needs to be set also)
I do not really understand: EncodedStreamObject.set_data()
existed prior to this PR,
and has been used in several places, right?
But you mention that it should not be used / is not valid?
PageObject._push_pop_gs(original_content, self.pdf) | ||
) | ||
# new_content_stream = PageObject._push_pop_gs(original_content, self.pdf) | ||
new_content_array.append(original_content) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the _push_pop_gs make the conversion to a ContentObject where the operators have been analysed. this is most propably the more consuming part of the code,however we d not need to do that : we can just append the new code at the beginning/end of the array (eventually create this array). Don't forget to add the extracontent to the object (required to be indirect ref)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the _push_pop_gs make the conversion to a ContentObject where the operators have been analysed. this is most propably the more consuming part of the code
Exactly! 😊
Don't forget to add the extracontent to the object (required to be indirect ref)
What do you mean by extracontent?
Is something missing with the current code in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo : I mean the extra content(text added to the first page content). What you have to remember is that stream objects that are composing the content must be added list of objects with the _add_object() function and then just store the indirect objects in the ArrayObject that is stored in the /Contents of the page Object.
Personnally, what I would do:
create an array object
copy in the existing streams. if the object is a content object, replace it with a Encoded Stream object(use of _replace_object() function)
insert at the beginning of the first encoded stream "q\n" using the set_data() function
insert at the end of the last stream the "Q\n"
then do what is requrired the Page2 content appending the encodedstream
e4d12ab
to
986f30c
Compare
e1329d6
to
9734bf3
Compare
I fixed In think this PR is ready for being reviewed, and potentially merged. I'm planning to also get a look at improving other parts of |
9734bf3
to
1688837
Compare
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## main #2086 +/- ##
==========================================
- Coverage 94.12% 94.10% -0.03%
==========================================
Files 41 41
Lines 7442 7460 +18
Branches 1471 1474 +3
==========================================
+ Hits 7005 7020 +15
- Misses 272 274 +2
- Partials 165 166 +1
☔ View full report in Codecov by Sentry. |
Awesome work 😲 👍 🙏 I want to give @pubpub-zz another chance to have a look at it, so I'll not merge+release it right now. But, if possible, I would love to do it next week. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What was done
- Remove unnecessary calls to
__parse_content_stream
and_push_pop_gs
. - API changes:
get_data
andset_data
(as well as the deprecated parts) were moved from DecodedStreamObject to StreamObject. Besides that, no changes
How this PR improves pypdf
The performance improvements for watermarking are dastic
Performance impact of this PR:
- Text extraction speed: 2.5s -> 2.2s
- Image extraction speed: 3.0s -> 2.6s
- Watermarking speed: 10.8s -> 0.4s ❗ 🤯 💥 🚀
It also reduces the watermarking file size in some cases:
- https://arxiv.org/pdf/2201.00214.pdf: 2.8 MB -> 2.5MB
- https://github.com/py-pdf/sample-files/raw/main/009-pdflatex-geotopo/GeoTopo.pdf : 5.8 -> 5.7MB
Risk
I don't know where the file size reduction is coming from. My suspicion is that it's related to floating point representations, e.g. 0.0
now being 0
or similar, but I didn't check. There is a slight risk in that part.
It would be ok for me to take that risk as the resulting PDF (e.g. GeoTopo-book.pdf
with watermark) looks fine and I trust our CI.
Removing commited code Co-authored-by: Martin Thoma <info@martin-thoma.de>
Thank you very much for the improvement @Lucas-C 🙏 That is an awesome first contribution to pypdf 🥳 If you want, I will add you to https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html :-) |
Next steps:
|
## What's new ### Performance Improvements (PI) - Making pypdf as fast as pdfrw (#2086) ### Maintenance (MAINT) - Relax typing_extensions version (#2104) [Full Changelog](3.15.3...3.15.4)
Interestingly, this seems to fix #2112 as well and avoids generating invalid watermarks, thus not being a performance improvement only, but a bugfix as well. I will try to verify this tomorrow again when I am in the office. For #2112 some visual watermark test should probably be implemented anyway to avoid unexpectedly breaking this again in the future, as the correct watermark display cannot be verified by a simple file size check. |
I agree. Interestingly, I do have such a test in a private repository. That test succeeded, but maybe I didn't use the/a broken version? I need to check if I can transfer that test.
👍 |
I have been able to verify that this indeed fixes the general watermarking behavior from #2112 when doing some further tests yesterday. |
This is a follow-up of sarnold/pdfrw#15 (comment)
Using the code from this branch, pypdf_watermarking is as fast as
pdfrw_watermarking()
.pypdf
was made faster by avoiding unnecessarily calls to__parse_content_stream()
.In order to achieve that, some calls to
PageObject._push_pop_gs()
were removed inPageObject._merge_page()
andPageObject._merge_page_writer()