-
Notifications
You must be signed in to change notification settings - Fork 930
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature: page labels #680
Feature: page labels #680
Conversation
Design question: if page labels are not present, |
I like this feature! :) Looking at the current structure of A similar issue (that is unrelated to this PR) is that I know that at the moment the structure is different, and the PR follows the current structure. I think the right way to go is to implement this feature using the current structure, so as is, and create a new PR for refactoring the |
I prefer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this feature.
Very nice test cases!
I've added some comments on where I think the code can and should be improved.
Some general comments:
- A link to the relevant sections from the PDF Reference would help future developpers. At least 3.4.4. (File Trailer), 8.3.1 (Page labels) and 3.8.6 (Number trees) should be mentioned somewhere.
This and the suggested refactor are still pending. |
…justifies inheriting its data and behavior. And it simplifies the code a bit more.
* fix mypy errors (including tweaking code to avoid problematic dynamic types) * hoist dict_value from NumberTree (where it may not be a dict) to PageLabels (where it must be) * avoid repeated warnings by calling _parse() recursively, and checking sortedness only at the end
@pietermarsman by the way, I see mypy & flake8 errors in files not touched by this PR. Are the automated tests still running correctly? |
Oh, I see you're working through it on another branch. |
Indeed. Automated tests have not been running for a while because travis.ci has put restrictions on free usage (due to people using it for bitcoin mining...). And some typing errors and test failures have crept in because of that. Proving automated testing is very valuable :) |
* develop: Check blackness in github actions (pdfminer#711) Changed `log.info` to `log.debug` in six files (pdfminer#690) Update README.md batch for Continuous integration Update actions.yml so that it will run for all PR's Update development tools: travis ci to github actions, tox to nox, nose to pytest (pdfminer#704) Added feature: page labels (pdfminer#680) Remove obsolete returns (pdfminer#707) Revert "Remove obsolete returns" Remove obsolete returns Only use xref fallback if `PDFNoValidXRef` is raised and `fallback` is True (pdfminer#684) Use logger.warn instead of warnings.warn if warning cannot be prevented by user (pdfminer#673) Change log.info into log.debug to make pdfinterp.py less verbose Fix regression in page layout that sometimes returned text lines out of order (pdfminer#659) export type annotations in package (pdfminer#679) fix typos in PR template (pdfminer#681) pdf2txt: clean up construction of LAParams from arguments (pdfminer#682) Fixes jbig2 writer to write valid jb2 files Add support for JPEG2000 image encoding Added test case for CCITTFaxDecoder (pdfminer#700) Attempt to handle decompression error on some broken PDF files (pdfminer#637)
PDF files may include optional page labels, which specify how the logical name for each page is to be constructed (when it differs from the default 1-based integers 1, 2, 3, ...). These may appear in longer documents for preambles and indices (where, e.g., lowercase roman numerals are often used), and it is helpful for tools that process a PDF file to be able to display these instead of a simple 1-based page number.
This PR adds support for extracting and formatting page labels. At the top level, PDFPage.label now stores the label of each page extracted from a PDF in which page labels are present.
Fix #705
Checklist
works
version
is not necessary
verified that this is not necessary
CHANGELOG.md