Add standard compliant default identifier #21

Merged: 2 commits, Feb 26, 2024

wolfgangwalther (Contributor) commented Feb 15, 2024

The current identifier option is not enough to create standard-compliant PDFs with an identifier. The standard reads:

File identifiers shall be defined by the optional ID entry in a PDF file’s trailer dictionary (see 7.5.5, “File Trailer”).
The ID entry is optional but should be used. The value of this entry shall be an array of two byte strings. The
first byte string shall be a permanent identifier based on the contents of the file at the time it was originally
created and shall not change when the file is incrementally updated. The second byte string shall be a
changing identifier based on the file’s contents at the time it was last updated. When a file is first written, both
identifiers shall be set to the same value. If both identifiers match when a file reference is resolved, it is very
likely that the correct and unchanged file has been found. If only the first identifier matches, a different version
of the correct file has been found.

To help ensure the uniqueness of file identifiers, they should be computed by means of a message digest
algorithm such as MD5 (described in Internet RFC 1321, The MD5 Message-Digest Algorithm; see the
Bibliography), using the following information:

(emphasis mine)

The second value of the ID array is always set to a hash of the document's objects. This is fine. But it is currently impossible to set the first value to match, because that hash is not known before the document is written.

This PR creates an identifier by default when identifier=None: the same hash is used for the first component as well, as the spec mandates for a newly written file. To create a new revision of the same document, the user can take the first ID from the original revision and pass it to the identifier argument; the new revision then gets proper IDs.
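
The ID behavior described above can be sketched in a few lines of standard-library Python. This is only an illustration of the spec's rule, not pydyf's actual implementation: the helper name `file_id_pair`, its parameters, and the sample object bytes are all hypothetical.

```python
import hashlib


def file_id_pair(serialized_objects, permanent_id=None):
    """Return (first, second) byte strings for a trailer ID array.

    Hypothetical helper, not part of pydyf. The second value is always an
    MD5 digest of the current contents; the first value is reused from an
    earlier revision when given, otherwise it equals the second value.
    """
    digest = hashlib.md5()
    for data in serialized_objects:
        digest.update(data)
    changing_id = digest.hexdigest().encode("ascii")
    first_id = changing_id if permanent_id is None else permanent_id
    return first_id, changing_id


# First write: both identifiers are the same value.
original = file_id_pair([b"1 0 obj (v1) endobj", b"2 0 obj (page) endobj"])

# New revision: keep the first identifier, recompute the second.
revision = file_id_pair(
    [b"1 0 obj (v2) endobj", b"2 0 obj (page) endobj"],
    permanent_id=original[0],
)
```

With this sketch, `original[0] == original[1]`, while `revision` keeps `original[0]` as its first value and gets a fresh second value.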

This will always create documents with identifiers, even though the ID is optional in general according to the spec. I don't think that's a bad thing though. This is in line with what is asked for here: Kozea/WeasyPrint#1661 - PDF/A compliance by default.

grewn0uille added the "feature" label (New feature that should be supported) on Feb 15, 2024

liZe (Member) commented Feb 26, 2024

Hi,

Thanks for your pull request!

> This will always create documents with identifiers, even though the ID is optional in general according to the spec. I don't think that's a bad thing though. This is in line with what is asked for here: Kozea/WeasyPrint#1661 - PDF/A compliance by default.

I’ve updated the API to include no identifier by default, and to generate/include one when needed. We’ll leave WeasyPrint (and pydyf) users the choice to include one or not.

If everything’s OK for you, it’s ready to be merged.
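
As a rough sketch of the three cases this revised behavior implies (no identifier, a generated one, or a caller-supplied one), the trailer entry could be built as below. The function `trailer_id_entry` and its `identifier` parameter are assumptions made for illustration; they are not taken from pydyf's code, and the exact signature in pydyf may differ.

```python
import hashlib


def trailer_id_entry(data, identifier=False):
    """Return the bytes of an ID trailer entry, or b"" when none is wanted.

    Hypothetical helper. `identifier` is assumed to be False (no ID),
    True (generate both values from the content), or bytes (use them as
    the permanent first value).
    """
    if not identifier:
        return b""
    changing = hashlib.md5(data).digest()
    permanent = changing if identifier is True else identifier
    # PDF hex strings (<...>) avoid any escaping of the raw bytes.
    return (
        b"/ID [<" + permanent.hex().encode("ascii")
        + b"> <" + changing.hex().encode("ascii") + b">]"
    )


no_id = trailer_id_entry(b"document bytes")
generated = trailer_id_entry(b"document bytes", identifier=True)
reused = trailer_id_entry(b"document bytes", identifier=b"abc")
```

Keeping "no identifier" as the default leaves the choice to the caller, while passing bytes covers the new-revision case from the original description.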

wolfgangwalther (Contributor, Author) commented

> I’ve updated the API to include no identifier by default, and to generate/include one when needed. We’ll leave WeasyPrint (and pydyf) users the choice to include one or not.

Makes sense!

> If everything’s OK for you, it’s ready to be merged.

Yes, looks good, thanks!

liZe merged commit 30cad31 into CourtBouillon:main on Feb 26, 2024
5 checks passed