Ignore path constructors that do not begin with m #749

jsvine · 2022-04-22T16:44:19Z

I have encountered a couple of PDFs (see jsvine/pdfplumber#636 (comment) and jsvine/pdfplumber#644 (comment)) that include a ("h", ) path — i.e., just the h operator and no others. This causes this second statement here to throw ValueError: not enough values to unpack (expected 2, got 1):

pdfminer.six/pdfminer/converter.py

Lines 126 to 129 in 1bf3c42

    
           raw_pts = [ 
        
               cast(Point, p[-2:] if p[0] != "h" else path[0][-2:]) for p in path 
        
           ] 
        
           pts = [apply_matrix_pt(self.ctm, pt) for pt in raw_pts]

Pull request

Although there may be narrower solutions to this particular problem (e.g., handling the specific edge-case of ("h",)), this PR introduces what — at least to me — seems like the most generalized-yet-accurate solution, after reading the PDF Reference.

Per Section 4.4.1, "path construction operators may be invoked in any sequence, but the first one invoked must be m or re to begin a new subpath." Since pdfminer.six already converts all re (rectangle) operators to their equivelent mlllh representation, paths ingested by .paint_path(...) that do not begin with the m operator are invalid.

I am, however, open to other approaches 👍

How Has This Been Tested?

I have added a test to tests/test_converter.py. I have also examined all non-encrypted PDFs in the samples/ directory. Of the 31,116 paths and subpaths in those 33 PDFs, all start with the “m” operator (or have been translated from the “re” operator to take that form).

Checklist

I have formatted my code with black.
I have added tests that prove my fix is effective or that my feature
works
[not applicable] I have added docstrings to newly created methods and classes
I have optimized the code at least one time after creating the initial
version
I have updated the README.md or verified that this
is not necessary
I have updated the readthedocs documentation or
verified that this is not necessary
I have added a concise human-readable description of the change to
CHANGELOG.md

Per PDF Reference Section 4.4.1, "path construction operators may be invoked in any sequence, but the first one invoked must be m or re to begin a new subpath." Since pdfminer.six already converts all `re` (rectangle) operators to their equivelent `mlllh` representation, paths ingested by `.paint_path(...)` that do not begin with the `m` operator are invalid. In addition to the advantage of hewing to the PDF Reference, this change also avoids the `ValueError: not enough values to unpack (expected 2, got 1)` error raised by the ` pts = [apply_matrix_pt(self.ctm, pt) for pt in raw_pts]` line in `converter.py` when parsing PDFs that (erroneously) include `("h",)` paths.

Nebukadneza · 2022-05-04T06:26:59Z

Not a review, but user-originating test: This solves the problem I initially encountered in #729!

pietermarsman · 2022-05-06T20:15:52Z

Thanks @jsvine and @Nebukadneza!

pietermarsman · 2022-05-06T20:16:19Z

Fixes #729

* commit '8f52578e85b27831ab8a68a6d86721ea3348a553': Run black locally with nox (pdfminer#776) Install typing_extensions on Python 3.6 and 3.7 (pdfminer#775) Fix `TypeError` by Ignoring null characters in PSBaseParser (pdfminer#768) Fix `ValueError` with unencrypted metadata values (Fixes pdfminer#766). (pdfminer#774) Fix `TypeError` when getting default width of font (pdfminer#772) Deprecate usage of `if __name__ == "__main__"` in scripts that are not documented. Also deprecate usage of scripts that are only there for testing purposes. (pdfminer#756) Fix Sphinx warnings and error (pdfminer#760) Update CHANGELOG.md for pdfminer#755 Remove upper version bounds (pdfminer#755) Ignore path constructors that do not begin with m (pdfminer#749) Bump version 20220506 & fix small issue with types

jsvine mentioned this pull request May 3, 2022

ValueError: not enough values to unpack (expected 2, got 1) in pdfminer/converter.py in paint_path() function in develop branch #729

Closed

Update CHANGELOG.md

dd56335

pietermarsman merged commit f2c967f into pdfminer:master May 6, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ignore path constructors that do not begin with m #749

Ignore path constructors that do not begin with m #749

jsvine commented Apr 22, 2022 •

edited by pietermarsman

Loading

Nebukadneza commented May 4, 2022

pietermarsman commented May 6, 2022

pietermarsman commented May 6, 2022

	raw_pts = [
	cast(Point, p[-2:] if p[0] != "h" else path[0][-2:]) for p in path
	]
	pts = [apply_matrix_pt(self.ctm, pt) for pt in raw_pts]

Ignore path constructors that do not begin with m #749

Ignore path constructors that do not begin with m #749

Conversation

jsvine commented Apr 22, 2022 • edited by pietermarsman Loading

Nebukadneza commented May 4, 2022

pietermarsman commented May 6, 2022

pietermarsman commented May 6, 2022

jsvine commented Apr 22, 2022 •

edited by pietermarsman

Loading