Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore path constructors that do not begin with m #749

Merged
merged 2 commits into from
May 6, 2022

Conversation

jsvine
Copy link
Contributor

@jsvine jsvine commented Apr 22, 2022

I have encountered a couple of PDFs (see jsvine/pdfplumber#636 (comment) and jsvine/pdfplumber#644 (comment)) that include a ("h", ) path — i.e., just the h operator and no others. This causes this second statement here to throw ValueError: not enough values to unpack (expected 2, got 1):

raw_pts = [
cast(Point, p[-2:] if p[0] != "h" else path[0][-2:]) for p in path
]
pts = [apply_matrix_pt(self.ctm, pt) for pt in raw_pts]

Pull request

Although there may be narrower solutions to this particular problem (e.g., handling the specific edge-case of ("h",)), this PR introduces what — at least to me — seems like the most generalized-yet-accurate solution, after reading the PDF Reference.

Per Section 4.4.1, "path construction operators may be invoked in any sequence, but the first one invoked must be m or re to begin a new subpath." Since pdfminer.six already converts all re (rectangle) operators to their equivelent mlllh representation, paths ingested by .paint_path(...) that do not begin with the m operator are invalid.

I am, however, open to other approaches 👍

How Has This Been Tested?

I have added a test to tests/test_converter.py. I have also examined all non-encrypted PDFs in the samples/ directory. Of the 31,116 paths and subpaths in those 33 PDFs, all start with the “m” operator (or have been translated from the “re” operator to take that form).

Checklist

  • I have formatted my code with black.
  • I have added tests that prove my fix is effective or that my feature
    works
  • [not applicable] I have added docstrings to newly created methods and classes
  • I have optimized the code at least one time after creating the initial
    version
  • I have updated the README.md or verified that this
    is not necessary
  • I have updated the readthedocs documentation or
    verified that this is not necessary
  • I have added a concise human-readable description of the change to
    CHANGELOG.md

Per PDF Reference Section 4.4.1, "path construction operators may be
invoked in any sequence, but the first one invoked must be m or re to
begin a new subpath." Since pdfminer.six already converts all `re`
(rectangle) operators to their equivelent `mlllh` representation, paths
ingested by `.paint_path(...)` that do not begin with the `m` operator
are invalid.

In addition to the advantage of hewing to the PDF Reference, this change
also avoids the `ValueError: not enough values to unpack (expected 2,
got 1)` error raised by the ` pts = [apply_matrix_pt(self.ctm, pt) for
pt in raw_pts]` line in `converter.py` when parsing PDFs that
(erroneously) include `("h",)` paths.
@Nebukadneza
Copy link

Not a review, but user-originating test: This solves the problem I initially encountered in #729!

@pietermarsman pietermarsman merged commit f2c967f into pdfminer:master May 6, 2022
@pietermarsman
Copy link
Member

Thanks @jsvine and @Nebukadneza!

@pietermarsman
Copy link
Member

Fixes #729

Beants added a commit to HiTalentAlgorithms/pdfminer.six that referenced this pull request Aug 5, 2022
* commit '8f52578e85b27831ab8a68a6d86721ea3348a553':
  Run black locally with nox (pdfminer#776)
  Install typing_extensions on Python 3.6 and 3.7 (pdfminer#775)
  Fix `TypeError` by Ignoring null characters in PSBaseParser (pdfminer#768)
  Fix `ValueError` with unencrypted metadata values (Fixes pdfminer#766). (pdfminer#774)
  Fix `TypeError` when getting default width of font (pdfminer#772)
  Deprecate usage of `if __name__ == "__main__"` in scripts that are not documented. Also deprecate usage of scripts that are only there for testing purposes. (pdfminer#756)
  Fix Sphinx warnings and error (pdfminer#760)
  Update CHANGELOG.md for pdfminer#755
  Remove upper version bounds (pdfminer#755)
  Ignore path constructors that do not begin with  m (pdfminer#749)
  Bump version 20220506 & fix small issue with types
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants