Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: page labels #680

Merged
merged 11 commits into from
Feb 1, 2022
Merged

Feature: page labels #680

merged 11 commits into from
Feb 1, 2022

Conversation

0xabu
Copy link
Contributor

@0xabu 0xabu commented Oct 13, 2021

PDF files may include optional page labels, which specify how the logical name for each page is to be constructed (when it differs from the default 1-based integers 1, 2, 3, ...). These may appear in longer documents for preambles and indices (where, e.g., lowercase roman numerals are often used), and it is helpful for tools that process a PDF file to be able to display these instead of a simple 1-based page number.

This PR adds support for extracting and formatting page labels. At the top level, PDFPage.label now stores the label of each page extracted from a PDF in which page labels are present.

Fix #705

Checklist

  • I have added tests that prove my fix is effective or that my feature
    works
  • I have added docstrings to newly created methods and classes
  • I have optimized the code at least one time after creating the initial
    version
  • I have updated the README.md or I am verified that this
    is not necessary
  • I have updated the readthedocs documentation or I
    verified that this is not necessary
  • I have added a concise human-readable description of the change to
    CHANGELOG.md

@0xabu
Copy link
Contributor Author

0xabu commented Oct 13, 2021

Design question: if page labels are not present, PDFPage.label is currently set to None. Would a 1-based page index (which is the typical fallback when labels are absent) be more useful?

@pietermarsman
Copy link
Member

I like this feature! :)

Looking at the current structure of PDFDocument, I think it becomes a bit chaotic if we keep adding new features in a similar way. I would prefer delegating parsing and storing of information from the Root (the catalog), Encrypt and Info entries of the file trailer in their own classes. Likewise, I would prefer parsing and storing information from the Pages, PageLabels, etc. from the document catalog it their own classes.

A similar issue (that is unrelated to this PR) is that PDFPage.create_pages() uses the PDFDocument to get its information. IMHO, it would make more sense if we move the create_pages() functionality to PDFDocument and allow accessing the pages by doing something like PDFDocument.catalog.pages. I.e. letting PDFDocument mimick the PDF structure better. A benefit of this approach is that it will clearly show which features we don't support yet (e.g. ViewerPreferences).

I know that at the moment the structure is different, and the PR follows the current structure. I think the right way to go is to implement this feature using the current structure, so as is, and create a new PR for refactoring the PDFDocument.

@pietermarsman
Copy link
Member

Design question: if page labels are not present, PDFPage.label is currently set to None. Would a 1-based page index (which is the typical fallback when labels are absent) be more useful?

I prefer PDFDocument to return the parsed information as is.

Copy link
Member

@pietermarsman pietermarsman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this feature.

Very nice test cases!

I've added some comments on where I think the code can and should be improved.

Some general comments:

  • A link to the relevant sections from the PDF Reference would help future developpers. At least 3.4.4. (File Trailer), 8.3.1 (Page labels) and 3.8.6 (Number trees) should be mentioned somewhere.

pdfminer/pdfpage.py Show resolved Hide resolved
pdfminer/pdfdocument.py Outdated Show resolved Hide resolved
pdfminer/pdfpage.py Outdated Show resolved Hide resolved
@0xabu
Copy link
Contributor Author

0xabu commented Jan 31, 2022

  • A link to the relevant sections from the PDF Reference would help future developpers. At least 3.4.4. (File Trailer), 8.3.1 (Page labels) and 3.8.6 (Number trees) should be mentioned somewhere.

This and the suggested refactor are still pending.

pietermarsman and others added 4 commits January 31, 2022 20:57
…justifies inheriting its data and behavior. And it simplifies the code a bit more.
 * fix mypy errors (including tweaking code to avoid problematic dynamic types)
 * hoist dict_value from NumberTree (where it may not be a dict) to PageLabels (where it must be)
 * avoid repeated warnings by calling _parse() recursively, and checking sortedness only at the end
@0xabu
Copy link
Contributor Author

0xabu commented Feb 1, 2022

@pietermarsman by the way, I see mypy & flake8 errors in files not touched by this PR. Are the automated tests still running correctly?

@0xabu
Copy link
Contributor Author

0xabu commented Feb 1, 2022

Oh, I see you're working through it on another branch.

@pietermarsman
Copy link
Member

Indeed. Automated tests have not been running for a while because travis.ci has put restrictions on free usage (due to people using it for bitcoin mining...). And some typing errors and test failures have crept in because of that. Proving automated testing is very valuable :)

@pietermarsman pietermarsman merged commit 1d1602e into pdfminer:develop Feb 1, 2022
@0xabu 0xabu deleted the page-labels branch February 1, 2022 17:06
Beants added a commit to HiTalentAlgorithms/pdfminer.six that referenced this pull request Feb 14, 2022
* develop:
  Check blackness in github actions (pdfminer#711)
  Changed `log.info` to  `log.debug` in six files (pdfminer#690)
  Update README.md batch for Continuous integration
  Update actions.yml so that it will run for all PR's
  Update development tools: travis ci to github actions, tox to nox, nose to pytest (pdfminer#704)
  Added feature: page labels (pdfminer#680)
  Remove obsolete returns (pdfminer#707)
  Revert "Remove obsolete returns"
  Remove obsolete returns
  Only use xref fallback if `PDFNoValidXRef` is raised and `fallback` is True (pdfminer#684)
  Use logger.warn instead of warnings.warn if warning cannot be prevented by user (pdfminer#673)
  Change log.info into log.debug to make pdfinterp.py less verbose
  Fix regression in page layout that sometimes returned text lines out of order (pdfminer#659)
  export type annotations in package (pdfminer#679)
  fix typos in PR template (pdfminer#681)
  pdf2txt: clean up construction of LAParams from arguments (pdfminer#682)
  Fixes jbig2 writer to write valid jb2 files
  Add support for JPEG2000 image encoding
  Added test case for CCITTFaxDecoder (pdfminer#700)
  Attempt to handle decompression error on some broken PDF files (pdfminer#637)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Extract page labels
2 participants