ENH: accept utf strings for metadata #2802
Codecov Report

All modified and coverable lines are covered by tests ✅

Additional details and impacted files

| Coverage Diff | main | #2802 | +/- |
|---|---|---|---|
| Coverage | 95.83% | 95.85% | +0.01% |
| Files | 51 | 51 | |
| Lines | 8544 | 8579 | +35 |
| Branches | 1692 | 1696 | +4 |
| Hits | 8188 | 8223 | +35 |
| Misses | 212 | 212 | |
| Partials | 144 | 144 | |

☔ View full report in Codecov by Sentry.
The base was modified because of this failing test: `tests/test_workflows.py::test_get_outline[https://corpora.tika.apache.org/base/docs/govdocs1/918/918137.pdf-tika-918137.pdf]`, which raised `UnicodeDecodeError: 'utf-16-be' codec can't decode byte 0x00 in position 94: truncated data`.
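For readers unfamiliar with this error, the sketch below reproduces the "truncated data" failure with the standard library only and shows one possible tolerant decoder. `decode_text_string` is a hypothetical helper written for illustration and is not pypdf's implementation. It assumes the usual PDF text-string conventions: a `FE FF` byte order mark for UTF-16BE, an `EF BB BF` mark for UTF-8 (PDF 2.0), and otherwise a single-byte encoding, for which Latin-1 is used here as a rough stand-in for PDFDocEncoding.

```python
# UTF-16BE uses two bytes per code unit, so an odd number of bytes
# cannot be decoded strictly.
truncated = "A".encode("utf-16-be") + b"\x00"  # b"\x00A\x00"

try:
    truncated.decode("utf-16-be")
    reason = None
except UnicodeDecodeError as exc:
    reason = exc.reason  # the same reason as in the failing test


def decode_text_string(raw: bytes) -> str:
    """Hypothetical helper: decode a PDF text string based on its BOM."""
    if raw.startswith(b"\xfe\xff"):
        payload = raw[2:]
        # Drop a dangling trailing byte instead of failing on it.
        payload = payload[: len(payload) - len(payload) % 2]
        return payload.decode("utf-16-be")
    if raw.startswith(b"\xef\xbb\xbf"):
        return raw[3:].decode("utf-8")
    # Assumption: Latin-1 approximates PDFDocEncoding for this sketch.
    return raw.decode("latin-1")
```

Trimming the dangling byte is only one option. A string that ends in the middle of a surrogate pair would still fail with this helper, so a real implementation would need a more careful fallback.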
Thank you for this fix! It solved a problem I had when entries in my table of contents (= bookmarks / document outline) contained quotation marks. Before this PR, using the following Python code, the second heading was not readable in the final document outline:

```python
from pypdf import PdfReader, PdfWriter
from weasyprint import HTML

PDF_NAME = "issue_with_toc_and_quotation_mark.pdf"

# Render a two-heading document; WeasyPrint builds the outline from the headings.
HTML(string='''
<h1>Heading OK</h1>
<h1>Heading with quotation marks </h1>
''').write_pdf(PDF_NAME)
# -> Document outline is OK

# Round-trip the file through pypdf and overwrite it.
writer = PdfWriter()
writer.append(PdfReader(PDF_NAME))
writer.write(PDF_NAME)
# -> Document outline is KO
```

This bug did not exist with
## Version 5.0.0, 2024-09-15

This version drops support for Python 3.7 (not maintained since July 2023), PdfMerger (use PdfWriter instead) and AnnotationBuilder (use annotations instead).

### Deprecations (DEP)
- Remove the deprecated PdfMerger and AnnotationBuilder classes and other deprecations cleanup (#2813)
- Drop Python 3.7 support (#2793)

### New Features (ENH)
- Add capability to remove /Info from PDF (#2820)
- Add incremental capability to PdfWriter (#2811)
- Add UniGB-UTF16 encodings (#2819)
- Accept utf strings for metadata (#2802)
- Report PdfReadError instead of RecursionError (#2800)
- Compress PDF files merging identical objects (#2795)

### Bug Fixes (BUG)
- Fix sheared image (#2801)

### Robustness (ROB)
- Robustify .set_data() (#2821)
- Raise PdfReadError when missing /Root in trailer (#2808)
- Fix extract_text() issues on damaged PDFs (#2760)
- Handle images with empty data when processing an image from bytes (#2786)

### Developer Experience (DEV)
- Fix coverage uploads (#2832)
- Test against Python 3.13 (#2776)

[Full Changelog](4.3.1...5.0.0)
closes #2754