[4.3.0] Regression when decoding strings #2754

Closed
kwist-sgr opened this issue Jul 15, 2024 · 0 comments · Fixed by #2802
Labels: is-regression (Regression introduced as a side-effect of another change), PdfReader (The PdfReader component is affected)

kwist-sgr commented Jul 15, 2024

Metadata strings containing the № symbol are not decoded correctly.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-6.5.0-41-generic-x86_64-with-glibc2.35

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.3.0, crypt_provider=('cryptography', '42.0.8'), PIL=10.4.0

Code + PDF

This is a minimal, complete example that shows the issue:

from io import BytesIO
from pathlib import Path

from pypdf import PdfReader, PdfWriter


def test_subject():
    src = Path(__file__).parent / 'metadata.pdf'
    source = src.read_bytes()
    with BytesIO() as stream:
        reader, writer = PdfReader(BytesIO(source)), PdfWriter(stream)
        list(map(writer.add_page, reader.pages))
        new_metadata = {
            '/Subject': 'Invoice №AI_047',
        }
        writer.add_metadata(new_metadata)
        writer.write(stream)

        new_reader = PdfReader(BytesIO(stream.getvalue()))
        metadata = new_reader.metadata
        assert metadata.subject is None
        assert metadata.subject_raw.decode() == 'Invoice №AI_047'

The test uses this PDF from the pypdf repository:
https://github.com/py-pdf/pypdf/blob/main/resources/metadata.pdf

Traceback

This is the complete traceback I see:

...
            new_reader = PdfReader(BytesIO(stream.getvalue()))
            metadata = new_reader.metadata
            assert metadata.subject is None
>           assert metadata.subject_raw.decode() == 'Invoice №AI_047'
E           AssertionError: assert '\x00I\x00n\x...000\x004\x007' == 'Invoice №AI_047'
E             
E             - Invoice №AI_047
E             + Invoice !AI_047

The previous version works as expected in this case.
I suspect the behavior changed with #2675.
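
The diff in the traceback looks like what you would get if UTF-16BE bytes (without a byte order mark) were decoded as UTF-8. The sketch below does not use pypdf at all; it only assumes that this is what happened, and shows that the assumption reproduces the interleaved NUL characters and the stray "!" seen in the assertion message. The variable names are illustrative only.

```python
# Assumption: the raw subject bytes are UTF-16BE without a BOM.
# This is a guess based on the traceback, not confirmed pypdf behavior.
expected = "Invoice №AI_047"
raw = expected.encode("utf-16-be")

# bytes.decode() defaults to UTF-8, as in the failing assertion.
# Every byte here is below 0x80, so the decode succeeds but yields garbage.
misread = raw.decode()

# Each ASCII character is preceded by a NUL byte.
print(repr(misread[:4]))

# "№" is U+2116, i.e. bytes 0x21 0x16 in UTF-16BE; 0x21 alone is "!".
print("!\x16" in misread)

# Decoding with the matching codec restores the original text.
print(raw.decode("utf-16-be") == expected)
```

If that assumption holds, the "!" in the diff is the high byte of "№" read as a single ASCII character, and the low byte 0x16 is an invisible control character.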

stefan6419846 added the PdfReader and is-regression labels on Jul 15, 2024
pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Aug 15, 2024