Feature: page labels #680

0xabu · 2021-10-13T18:19:40Z

PDF files may include optional page labels, which specify how the logical name for each page is to be constructed (when it differs from the default 1-based integers 1, 2, 3, ...). These may appear in longer documents for preambles and indices (where, e.g., lowercase roman numerals are often used), and it is helpful for tools that process a PDF file to be able to display these instead of a simple 1-based page number.

This PR adds support for extracting and formatting page labels. At the top level, PDFPage.label now stores the label of each page extracted from a PDF in which page labels are present.

Fix #705

Checklist

I have added tests that prove my fix is effective or that my feature
works
I have added docstrings to newly created methods and classes
I have optimized the code at least one time after creating the initial
version
I have updated the README.md or I am verified that this
is not necessary
I have updated the readthedocs documentation or I
verified that this is not necessary
I have added a concise human-readable description of the change to
CHANGELOG.md

0xabu · 2021-10-13T18:19:51Z

Design question: if page labels are not present, PDFPage.label is currently set to None. Would a 1-based page index (which is the typical fallback when labels are absent) be more useful?

pietermarsman · 2022-01-30T12:27:42Z

I like this feature! :)

Looking at the current structure of PDFDocument, I think it becomes a bit chaotic if we keep adding new features in a similar way. I would prefer delegating parsing and storing of information from the Root (the catalog), Encrypt and Info entries of the file trailer in their own classes. Likewise, I would prefer parsing and storing information from the Pages, PageLabels, etc. from the document catalog it their own classes.

A similar issue (that is unrelated to this PR) is that PDFPage.create_pages() uses the PDFDocument to get its information. IMHO, it would make more sense if we move the create_pages() functionality to PDFDocument and allow accessing the pages by doing something like PDFDocument.catalog.pages. I.e. letting PDFDocument mimick the PDF structure better. A benefit of this approach is that it will clearly show which features we don't support yet (e.g. ViewerPreferences).

I know that at the moment the structure is different, and the PR follows the current structure. I think the right way to go is to implement this feature using the current structure, so as is, and create a new PR for refactoring the PDFDocument.

pietermarsman · 2022-01-30T12:28:33Z

Design question: if page labels are not present, PDFPage.label is currently set to None. Would a 1-based page index (which is the typical fallback when labels are absent) be more useful?

I prefer PDFDocument to return the parsed information as is.

pietermarsman

Thanks for adding this feature.

Very nice test cases!

I've added some comments on where I think the code can and should be improved.

Some general comments:

A link to the relevant sections from the PDF Reference would help future developpers. At least 3.4.4. (File Trailer), 8.3.1 (Page labels) and 3.8.6 (Number trees) should be mentioned somewhere.

pdfminer/pdfpage.py

pdfminer/pdfdocument.py

pdfminer/pdfpage.py

pdfminer/pdfdocument.py

0xabu · 2022-01-31T05:01:26Z

A link to the relevant sections from the PDF Reference would help future developpers. At least 3.4.4. (File Trailer), 8.3.1 (Page labels) and 3.8.6 (Number trees) should be mentioned somewhere.

This and the suggested refactor are still pending.

…geLabels class.

…justifies inheriting its data and behavior. And it simplifies the code a bit more.

* fix mypy errors (including tweaking code to avoid problematic dynamic types) * hoist dict_value from NumberTree (where it may not be a dict) to PageLabels (where it must be) * avoid repeated warnings by calling _parse() recursively, and checking sortedness only at the end

0xabu · 2022-02-01T05:22:46Z

@pietermarsman by the way, I see mypy & flake8 errors in files not touched by this PR. Are the automated tests still running correctly?

0xabu · 2022-02-01T05:24:24Z

Oh, I see you're working through it on another branch.

pietermarsman · 2022-02-01T09:07:42Z

Indeed. Automated tests have not been running for a while because travis.ci has put restrictions on free usage (due to people using it for bitcoin mining...). And some typing errors and test failures have crept in because of that. Proving automated testing is very valuable :)

* develop: Check blackness in github actions (pdfminer#711) Changed `log.info` to `log.debug` in six files (pdfminer#690) Update README.md batch for Continuous integration Update actions.yml so that it will run for all PR's Update development tools: travis ci to github actions, tox to nox, nose to pytest (pdfminer#704) Added feature: page labels (pdfminer#680) Remove obsolete returns (pdfminer#707) Revert "Remove obsolete returns" Remove obsolete returns Only use xref fallback if `PDFNoValidXRef` is raised and `fallback` is True (pdfminer#684) Use logger.warn instead of warnings.warn if warning cannot be prevented by user (pdfminer#673) Change log.info into log.debug to make pdfinterp.py less verbose Fix regression in page layout that sometimes returned text lines out of order (pdfminer#659) export type annotations in package (pdfminer#679) fix typos in PR template (pdfminer#681) pdf2txt: clean up construction of LAParams from arguments (pdfminer#682) Fixes jbig2 writer to write valid jb2 files Add support for JPEG2000 image encoding Added test case for CCITTFaxDecoder (pdfminer#700) Attempt to handle decompression error on some broken PDF files (pdfminer#637)

0xabu added 4 commits October 12, 2021 17:20

port page label code from pdfannots

94cf788

add tests and clean up

ada7ae2

more cleanup; harden against non-conforming input

affc1d3

one more test

114c5ef

0xabu and others added 2 commits October 13, 2021 11:22

update CHANGELOG

e6ae933

Merge branch 'develop' into page-labels

a061a0d

pietermarsman requested changes Jan 30, 2022

View reviewed changes

pdfminer/pdfpage.py Show resolved Hide resolved

pdfminer/pdfdocument.py Outdated Show resolved Hide resolved

pdfminer/pdfpage.py Outdated Show resolved Hide resolved

pietermarsman reviewed Jan 30, 2022

View reviewed changes

pdfminer/pdfdocument.py Outdated Show resolved Hide resolved

pietermarsman reviewed Jan 30, 2022

View reviewed changes

pdfminer/pdfdocument.py Outdated Show resolved Hide resolved

cleanup & respond to review feedback (incomplete)

78a4dba

pietermarsman and others added 4 commits January 31, 2022 20:57

Refactor implementation of get_page_labels() into a NumberTree and Pa…

010229a

…geLabels class.

PageLabels *is* a NumberTree and should always behave like one. This …

483a716

…justifies inheriting its data and behavior. And it simplifies the code a bit more.

Merge branch 'develop' into page-labels

8342e02

pietermarsman approved these changes Feb 1, 2022

View reviewed changes

pietermarsman merged commit 1d1602e into pdfminer:develop Feb 1, 2022

0xabu deleted the page-labels branch February 1, 2022 17:06

0xabu mentioned this pull request Mar 19, 2022

Use page labels from pdfminer 0xabu/pdfannots#57

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature: page labels #680

Feature: page labels #680

0xabu commented Oct 13, 2021 •

edited by pietermarsman

Loading

0xabu commented Oct 13, 2021

pietermarsman commented Jan 30, 2022

pietermarsman commented Jan 30, 2022

pietermarsman left a comment

0xabu commented Jan 31, 2022

0xabu commented Feb 1, 2022

0xabu commented Feb 1, 2022

pietermarsman commented Feb 1, 2022

Feature: page labels #680

Feature: page labels #680

Conversation

0xabu commented Oct 13, 2021 • edited by pietermarsman Loading

0xabu commented Oct 13, 2021

pietermarsman commented Jan 30, 2022

pietermarsman commented Jan 30, 2022

pietermarsman left a comment

Choose a reason for hiding this comment

0xabu commented Jan 31, 2022

0xabu commented Feb 1, 2022

0xabu commented Feb 1, 2022

pietermarsman commented Feb 1, 2022

0xabu commented Oct 13, 2021 •

edited by pietermarsman

Loading