ENH: accept utf strings for metadata #2802

pubpub-zz · 2024-08-15T19:18:50Z

codecov · 2024-08-15T19:40:05Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.85%. Comparing base (454a62a) to head (a8d2155).
Report is 70 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2802      +/-   ##
==========================================
+ Coverage   95.83%   95.85%   +0.01%     
==========================================
  Files          51       51              
  Lines        8544     8579      +35     
  Branches     1692     1696       +4     
==========================================
+ Hits         8188     8223      +35     
  Misses        212      212              
  Partials      144      144


base modified because of this test tests/test_workflows.py::test_get_outline[https://corpora.tika.apache.org/base/docs/govdocs1/918/918137.pdf-tika-918137.pdf] - UnicodeDecodeError: 'utf-16-be' codec can't decode byte 0x00 in position 94: truncated data
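For readers unfamiliar with the error quoted above, here is a minimal standard-library sketch of how a `truncated data` failure arises when a UTF-16-BE byte string has an odd number of bytes. It does not use pypdf and is not the code from this PR. The helper name `decode_utf16_be_lenient` and the sample bytes are assumptions made up for illustration.

```python
import codecs


def decode_utf16_be_lenient(raw: bytes) -> str:
    """Hypothetical helper (not part of pypdf): decode UTF-16-BE bytes,
    tolerating a dangling trailing byte instead of raising."""
    # Strip the big-endian byte order mark (FE FF) if present.
    if raw.startswith(codecs.BOM_UTF16_BE):
        raw = raw[len(codecs.BOM_UTF16_BE):]
    try:
        return raw.decode("utf-16-be")
    except UnicodeDecodeError:
        # Drop an odd trailing byte, then replace anything still undecodable.
        even_length = len(raw) - (len(raw) % 2)
        return raw[:even_length].decode("utf-16-be", errors="replace")


# Illustrative input: a valid title followed by one stray byte.
truncated = codecs.BOM_UTF16_BE + "Title".encode("utf-16-be") + b"\x00"

try:
    truncated[2:].decode("utf-16-be")  # strict decoding fails
except UnicodeDecodeError as exc:
    strict_error = str(exc)

lenient_result = decode_utf16_be_lenient(truncated)
```

Strict decoding raises because the final byte cannot form a complete 16-bit code unit. The lenient variant shows one possible recovery strategy; whether dropping the byte is acceptable depends on the application.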

Lucas-C · 2024-08-26T08:58:41Z

Thank you for this fix!

It solved a problem I had when entries in my table of contents (= bookmarks / document outline) contained quotation marks.

Before this PR, using the following Python code, the second heading was not readable in the final document outline:

from pypdf import PdfReader, PdfWriter
from weasyprint import HTML

PDF_NAME = "issue_with_toc_and_quotation_mark.pdf"

HTML(string='''
  <h1>Heading OK</h1>
  <h1>Heading with &nbsp;quotation marks&nbsp;</h1>
''').write_pdf(PDF_NAME)
# -> Document outline is OK
writer = PdfWriter()
writer.append(PdfReader(PDF_NAME))
writer.write(PDF_NAME)
# -> Document outline is broken

This bug did not exist with pypdf==4.2.0 but appeared in version 4.3.0.

## Version 5.0.0, 2024-09-15

This version drops support for Python 3.7 (not maintained since July 2023), PdfMerger (use PdfWriter instead) and AnnotationBuilder (use annotations instead).

### Deprecations (DEP)
- Remove the deprecated PdfMerger and AnnotationBuilder classes and other deprecations cleanup (#2813)
- Drop Python 3.7 support (#2793)

### New Features (ENH)
- Add capability to remove /Info from PDF (#2820)
- Add incremental capability to PdfWriter (#2811)
- Add UniGB-UTF16 encodings (#2819)
- Accept utf strings for metadata (#2802)
- Report PdfReadError instead of RecursionError (#2800)
- Compress PDF files merging identical objects (#2795)

### Bug Fixes (BUG)
- Fix sheared image (#2801)

### Robustness (ROB)
- Robustify .set_data() (#2821)
- Raise PdfReadError when missing /Root in trailer (#2808)
- Fix extract_text() issues on damaged PDFs (#2760)
- Handle images with empty data when processing an image from bytes (#2786)

### Developer Experience (DEV)
- Fix coverage uploads (#2832)
- Test against Python 3.13 (#2776)

[Full Changelog](4.3.1...5.0.0)

pubpub-zz added 2 commits August 15, 2024 21:02

ENH: accept utf strings for metadata

73bd1e8

closes py-pdf#2754

oups

cfe8358

pubpub-zz marked this pull request as draft August 15, 2024 21:57

coverage+

442fae4


pubpub-zz force-pushed the iss2754 branch from 13c9927 to 442fae4 Compare August 16, 2024 08:22

pubpub-zz added 2 commits August 16, 2024 10:35

coverage

ad6e77e

missed coverage

a8d2155

pubpub-zz marked this pull request as ready for review August 16, 2024 09:14

pubpub-zz requested a review from stefan6419846 August 16, 2024 09:14

stefan6419846 approved these changes Aug 16, 2024


stefan6419846 merged commit 0c81f3c into py-pdf:main Aug 16, 2024
16 checks passed

pubpub-zz mentioned this pull request Sep 15, 2024

REL: 5.0.0 #2851

Merged
