PERF: Use bytearray instead of b"" in encode_pdfdocencoding #2325

Merged
merged 1 commit into from
Dec 4, 2023

Conversation

zuypt
Contributor

@zuypt zuypt commented Dec 4, 2023

Since b"" (a bytes object) is immutable, Python has to allocate and deallocate memory repeatedly in the for loop, which causes a hang or very long runtime when handling a very large string, for example when using add_js to add a very large piece of JavaScript code.
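
The pattern described above can be sketched in isolation. This is a minimal illustration, not pypdf's actual code: `TOY_ENCODING` and both function names are hypothetical stand-ins for a character-by-character encoder such as `encode_pdfdocencoding`.

```python
# Hypothetical lookup table (character -> byte value); a stand-in for
# whatever table a real per-character encoder would use.
TOY_ENCODING = {chr(i): i for i in range(128)}


def encode_with_bytes(text: str) -> bytes:
    """Slow pattern: bytes is immutable, so each += copies the data so far."""
    result = b""
    for ch in text:
        result += bytes([TOY_ENCODING[ch]])
    return result


def encode_with_bytearray(text: str) -> bytes:
    """Faster pattern: bytearray is mutable and grows in place."""
    result = bytearray()
    for ch in text:
        result.append(TOY_ENCODING[ch])
    return bytes(result)  # convert to immutable bytes once, at the end


# Both variants produce identical output; only the cost differs.
sample = "app.alert('hi');" * 1000
assert encode_with_bytes(sample) == encode_with_bytearray(sample)
```

The caller-facing return type stays `bytes` in both variants, so switching the accumulator is an internal change only.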


codecov bot commented Dec 4, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (40e25ec) 94.37% compared to head (e3ec6cc) 94.37%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2325   +/-   ##
=======================================
  Coverage   94.37%   94.37%           
=======================================
  Files          43       43           
  Lines        7660     7660           
  Branches     1515     1515           
=======================================
  Hits         7229     7229           
  Misses        267      267           
  Partials      164      164           


@stefan6419846
Collaborator

Could you please update the title to use the recommended naming scheme? https://pypdf.readthedocs.io/en/latest/dev/intro.html#commit-messages

@MartinThoma changed the title from "Update _base.py" to "PERF: Update _base.py" on Dec 4, 2023
@MartinThoma changed the title from "PERF: Update _base.py" to "PERF: Use bytearray instead of b"" in encode_pdfdocencoding" on Dec 4, 2023
@MartinThoma
Member

@zuypt Do you have an example that shows the difference? (It could be a toy-example - I'm just curious :-) )

@zuypt
Contributor Author

zuypt commented Dec 4, 2023

Could you please update the title to use the recommended naming scheme? https://pypdf.readthedocs.io/en/latest/dev/intro.html#commit-messages

I'm too lazy. If someone has permission, please help.

@zuypt
Contributor Author

zuypt commented Dec 4, 2023

@zuypt Do you have an example that shows the difference? (It could be a toy-example - I'm just curious :-) )

Just create a PdfWriter, then call add_js with a very large string and you will see. This is a pretty common Python programming error.

@MartinThoma
Member

I've already adjusted the title

@MartinThoma
Member

MartinThoma commented Dec 4, 2023

import timeit

def benchmark_empty_bytes_literal():
    # bytes is immutable: each += copies everything accumulated so far
    result = b""
    for _ in range(100000):
        result += b"a"

def benchmark_bytearray():
    # bytearray is mutable: += extends the existing buffer in place
    result = bytearray()
    for _ in range(100000):
        result += b"a"

if __name__ == "__main__":
    empty_bytes_literal_time = timeit.timeit(benchmark_empty_bytes_literal, number=100)
    bytearray_time = timeit.timeit(benchmark_bytearray, number=100)

    print(f"Empty Bytes Literal Time: {empty_bytes_literal_time:.1f}")
    print(f"bytearray Time: {bytearray_time:.1f}")

shows:

Empty Bytes Literal Time: 21.4
bytearray Time: 0.5
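
To see how the gap changes with input size, the benchmark can be parameterized. The sketch below is only an illustration: the function names are hypothetical, the sizes are arbitrary, and `b"".join(...)` is included as another common idiom for building bytes, not something this PR uses. Absolute timings will differ between machines and Python versions.

```python
import timeit


def build_with_bytes(n: int) -> bytes:
    # Immutable accumulator: each += copies the data built so far.
    result = b""
    for _ in range(n):
        result += b"a"
    return result


def build_with_bytearray(n: int) -> bytes:
    # Mutable accumulator: += extends the buffer in place.
    result = bytearray()
    for _ in range(n):
        result += b"a"
    return bytes(result)


def build_with_join(n: int) -> bytes:
    # Collect the pieces lazily and concatenate them in a single call.
    return b"".join(b"a" for _ in range(n))


if __name__ == "__main__":
    # Arbitrary, small sizes so the script finishes quickly.
    for n in (5_000, 10_000, 20_000):
        t_bytes = timeit.timeit(lambda: build_with_bytes(n), number=3)
        t_array = timeit.timeit(lambda: build_with_bytearray(n), number=3)
        t_join = timeit.timeit(lambda: build_with_join(n), number=3)
        print(f"n={n}: bytes={t_bytes:.4f}s bytearray={t_array:.4f}s join={t_join:.4f}s")
```

All three functions return the same bytes, so any of them can be swapped in without changing results; only the running time differs.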

@MartinThoma MartinThoma merged commit 6cb5343 into py-pdf:main Dec 4, 2023
14 checks passed
@MartinThoma
Member

@zuypt Thanks for your contribution! If you want, I can add you to https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html

@MartinThoma
Member

It will be part of the next release on Sunday.

@zuypt
Contributor Author

zuypt commented Dec 6, 2023

@zuypt Thanks for your contribution! If you want, I can add you to https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html

Sure. Thanks for the recognition.

MartinThoma added a commit that referenced this pull request Dec 10, 2023
## What's new

### Bug Fixes (BUG)
-  Cope with deflated images with CMYK Black Only (#2322) by @pubpub-zz
-  Handle indirect objects as parameters for CCITTFaxDecode (#2307) by @stefan6419846
-  check words length in _cmap type1_alternative function (#2310) by @Takher

### Robustness (ROB)
-  Relax flate decoding for too many lookup values (#2331) by @stefan6419846
-  Let _build_destination skip in case of missing /D key (#2018) by @nickryand

### Documentation (DOC)
-  Note in reading form data (#2338) by @MartinThoma
-  Pull Request prefixes and size by @MartinThoma
-  Add https://github.com/zuypt for #2325 as a contributor by @MartinThoma
-  Fix docstring for RunLengthDecode.decode (#2302) by @stefan6419846

### Maintenance (MAINT)
-  Enable `disallow_any_generics` and add missing generics (#2278) by @nilehmann

### Testing (TST)
-  Centralize file downloads (#2324) by @MartinThoma

### Code Style (STY)
-  Fix typo "steam" → "stream" (#2327) by @stefan6419846
-  Run black by @MartinThoma
-  Make Traceback in bug report template uppercase (#2304) by @stefan6419846

[Full Changelog](3.17.1...3.17.2)
3 participants