[4.3.0] Regression when decoding strings #2754

kwist-sgr · 2024-07-15T09:40:09Z

Strings in metadata that contain the symbol № aren't decoded correctly

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-6.5.0-41-generic-x86_64-with-glibc2.35

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.3.0, crypt_provider=('cryptography', '42.0.8'), PIL=10.4.0

Code + PDF

This is a minimal, complete example that shows the issue:

from io import BytesIO
from pathlib import Path

from pypdf import PdfReader, PdfWriter


def test_subject():
    src = Path(__file__).parent / 'metadata.pdf'
    source = src.read_bytes()
    with BytesIO() as stream:
        reader, writer = PdfReader(BytesIO(source)), PdfWriter(stream)
        # Copy every page from the source document into the writer.
        for page in reader.pages:
            writer.add_page(page)
        new_metadata = {
            '/Subject': 'Invoice №AI_047',
        }
        writer.add_metadata(new_metadata)
        writer.write(stream)

        # Read the written PDF back and check the round-tripped subject.
        new_reader = PdfReader(BytesIO(stream.getvalue()))
        metadata = new_reader.metadata
        assert metadata.subject is None
        assert metadata.subject_raw.decode() == 'Invoice №AI_047'

Share here the PDF file(s) that caused the issue. The smaller they are, the
better. Let us know if we may add them to our tests!
https://github.com/py-pdf/pypdf/blob/main/resources/metadata.pdf

Traceback

This is the complete traceback I see:

...
            new_reader = PdfReader(BytesIO(stream.getvalue()))
            metadata = new_reader.metadata
            assert metadata.subject is None
>           assert metadata.subject_raw.decode() == 'Invoice №AI_047'
E           AssertionError: assert '\x00I\x00n\x...000\x004\x007' == 'Invoice №AI_047'
E             
E             - Invoice №AI_047
E             + Invoice !AI_047

The previous version works as expected in this case.
I'm guessing this behavior changed with #2675.
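The symptom in the traceback can be reproduced without pypdf. The sketch below assumes the subject ended up as UTF-16BE bytes with no byte order mark and was then decoded with the default UTF-8 codec. That is an inference from the `\x00I\x00n...` pattern in the assertion message, not a confirmed description of what pypdf 4.3.0 does internally. Since № is U+2116, its UTF-16BE bytes are 0x21 0x16, and 0x21 is the ASCII code for `!`.

```python
expected = 'Invoice №AI_047'

# Assumption: the stored bytes are UTF-16BE without a byte order mark.
raw = expected.encode('utf-16-be')

# Every byte here is below 0x80, so the default UTF-8 decode succeeds
# silently instead of raising, and yields one character per byte.
misdecoded = raw.decode()

# Drop the non-printable characters (NUL and 0x16), as a terminal would hide them.
visible = ''.join(ch for ch in misdecoded if ch.isprintable())

print(repr(misdecoded[:4]))
print(visible)
```

Under that assumption, `visible` is the `Invoice !AI_047` string shown in the pytest diff, and decoding `raw` with `'utf-16-be'` instead recovers the original subject.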


stefan6419846 added the PdfReader (The PdfReader component is affected) and is-regression (Regression introduced as a side-effect of another change) labels Jul 15, 2024

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Aug 15, 2024

ENH: accept utf strings for metadata

73bd1e8

closes py-pdf#2754

pubpub-zz mentioned this issue Aug 15, 2024

ENH: accept utf strings for metadata #2802

Merged

stefan6419846 closed this as completed in #2802 Aug 16, 2024

stefan6419846 closed this as completed in 0c81f3c Aug 16, 2024
