Bug in new .paint_path logic #473
Comments
Thanks for the thorough analysis! I would really like it if you filed a PR. This method is ideal for unit testing, and we already have some tests for it. Perhaps you could add your examples to the unit tests as well; see test_converter.py.
jsvine added a commit to jsvine/pdfminer.six that referenced this issue on Sep 26, 2020
Focuses on the decomposition of complex (m.*h)* paths into subpaths, and assigning those subpaths the correct LTCurve/LTRect type.
jsvine added a commit to jsvine/pdfminer.six that referenced this issue on Sep 26, 2020
Focuses on the decomposition of complex (m.*h)* paths into subpaths, and assigning those subpaths the correct LTCurve/LTRect type. Also adds a test for cases presented in issue pdfminer#473
jsvine added a commit to jsvine/pdfminer.six that referenced this issue on Sep 26, 2020
Focuses on the handling of non-rect quadrilaterals, the decomposition of complex (m.*h)* paths into subpaths, and assigning those subpaths the correct LTCurve/LTRect type. Also adds a test for cases presented in issue pdfminer#473
Thanks! PR now submitted, including a test of the examples above.
jsvine added a commit to jsvine/pdfminer.six that referenced this issue on Sep 26, 2020
Focuses on the handling of non-rect quadrilaterals, the decomposition of complex (m.*h)* paths into subpaths, and assigning those subpaths the correct LTCurve/LTRect type. Also adds a test for cases presented in issue pdfminer#473
pietermarsman added a commit that referenced this issue on Oct 12, 2020
* Fix paint_path bug noted in issue #473
Focuses on the handling of non-rect quadrilaterals, the decomposition of complex (m.*h)* paths into subpaths, and assigning those subpaths the correct LTCurve/LTRect type. Also adds a test for cases presented in issue #473
* Tweak paint_path fix per @pietermarsman review
- Adjusts logic to adhere to if-elif-else rather than early returns.
- Shortens subpath detection/reprocessing step, using re.finditer().
* Reorder paint_path() if-else statements once more
* Fix flake8 issues
* Fix error: should select item 1 and 2 from the list, and possible items [3, 4], and so on.
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
Closed by #512
Bug report
First off, thank you for this wonderful library! This is my first time opening an issue on this project, but I've been watching (and using and appreciating) pdfminer.six for a long time.
Commit 60863cf introduced new logic (still present at develop) for PDFLayoutAnalyzer.paint_path(...), via PR #371, aimed at fixing issue #369. Unfortunately, the changes appear to introduce new bugs. Specifically:
- The new code appears to ignore/reject all non-rectangle mlllh-shape paths.
- It also appears not to fully decompose paths into their subpaths.
Examples, details, and proposal below.
For demonstration purposes, I have created the following PDF: quadrilaterals.pdf
Here's a PNG rendering of the PDF, for ease of reference:
And here's how the paths are defined in the PDF:
Black square:
Red bowtie:
Green quadrilateral:
Two blue rectangles:
Purple rectangle + pentagon:
To examine the differences in .paint_path before and after the commit, I'm using this short Python program:
On 6a9269b, the commit preceding 60863cf, this is the result:
As issue #369 correctly pointed out, the blue path is being parsed as one eight-point curve, rather than two rectangles. Likewise, the purple shape is being parsed as a nine-point curve, rather than a rectangle and a five-point curve. The other shapes (black, red, green) are being parsed correctly.
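The point counts mentioned above can be reproduced with a minimal sketch. It assumes that a path reaches paint_path as a list of operator tuples such as ('m', x, y), ('l', x, y) and ('h',); that representation, the coordinates, and the helper names shape_of and count_points are illustrative assumptions, not taken from the issue.

```python
# Hypothetical coordinates for a path made of two closed rectangles.
# Only the operator sequence matters for this illustration.
blue_path = [
    ("m", 0, 0), ("l", 10, 0), ("l", 10, 10), ("l", 0, 10), ("h",),
    ("m", 20, 0), ("l", 30, 0), ("l", 30, 10), ("l", 20, 10), ("h",),
]


def shape_of(path):
    """Concatenate the operator letters of a path, e.g. 'mlllh'."""
    return "".join(segment[0] for segment in path)


def count_points(path):
    """Count the (x, y) pairs carried by the path's segments ('h' carries none)."""
    return sum((len(segment) - 1) // 2 for segment in path)


print(shape_of(blue_path))      # mlllhmlllh
print(count_points(blue_path))  # 8
```

Because the combined shape string is not exactly mlllh, a check against that exact string would not recognise either rectangle, which is consistent with the single eight-point curve reported above.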
Commit 60863cf solves the issue with the blue path, but introduces new problems. Here is the output of that same test program:
As the output indicates, the red and green quadrilaterals are ignored entirely. The cause appears to stem from converting the if statement in the prior code to an elif statement here:
pdfminer.six/pdfminer/converter.py
Lines 91 to 105 in 60863cf
That is, if the path's shape is mlllh but the path is not a rectangle (i.e., it is a non-rectangular quadrilateral), then nothing happens.
Relatedly, the rectangle is still not extracted from the purple shape, due to the somewhat strict matching rule here, which only decomposes paths composed purely of four-point subpaths:
pdfminer.six/pdfminer/converter.py
Line 31 in 60863cf
pdfminer.six/pdfminer/converter.py
Lines 107 to 109 in 60863cf
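The two behaviours described above can be modelled with a small, self-contained sketch. This is a toy model, not the code in converter.py: the function names, the string labels returned in place of layout objects, and the ONLY_QUADS pattern are assumptions chosen to mirror the description (an elif branch for mlllh that only emits rectangles, and a rule that only splits paths made purely of mlllh subpaths). All coordinates are hypothetical.

```python
import re

# Assumed pattern: one or more back-to-back four-point closed subpaths.
ONLY_QUADS = re.compile(r"^(mlllh)+$")


def is_axis_aligned_rect(points):
    """True if four points form a rectangle with horizontal/vertical sides."""
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = points
    return (x0 == x1 and y1 == y2 and x2 == x3 and y3 == y0) or (
        y0 == y1 and x1 == x2 and y2 == y3 and x3 == x0
    )


def classify_after_change(path):
    """Toy model of the elif chain; returns labels instead of layout objects."""
    shape = "".join(seg[0] for seg in path)
    points = [seg[1:3] for seg in path if len(seg) >= 3]
    if shape == "mlllh":
        if is_axis_aligned_rect(points):
            return ["rect"]
        return []  # non-rectangular quadrilateral: nothing is emitted
    elif ONLY_QUADS.match(shape):
        labels = []
        for i in range(0, len(path), 5):  # each mlllh subpath has 5 segments
            labels += classify_after_change(path[i:i + 5])
        return labels
    else:
        return ["curve"]  # everything else becomes a single curve


rect_a = [("m", 0, 0), ("l", 10, 0), ("l", 10, 10), ("l", 0, 10), ("h",)]
rect_b = [("m", 20, 0), ("l", 30, 0), ("l", 30, 10), ("l", 20, 10), ("h",)]
bowtie = [("m", 0, 0), ("l", 10, 10), ("l", 10, 0), ("l", 0, 10), ("h",)]
pentagon = [("m", 20, 0), ("l", 30, 0), ("l", 35, 5), ("l", 25, 10),
            ("l", 20, 5), ("h",)]

print(classify_after_change(bowtie))             # []
print(classify_after_change(rect_a + rect_b))    # ['rect', 'rect']
print(classify_after_change(rect_a + pentagon))  # ['curve']
```

In this model the bowtie disappears because the mlllh branch claims it without emitting anything, and the rectangle-plus-pentagon path stays a single curve because its shape string, mlllhmllllh, does not consist purely of mlllh groups.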
I have attempted to fix these various bugs. The following changes seem to work, and appear to pass all of the repository's tests:
The result of the same test code now produces what seem, at least to me, like the correct results:
If you like this solution, I can file a PR with these changes. That said, I'm not 100% certain this handles all edge-cases, or that decomposition of all subpaths is necessarily desirable. (To me it seems so, but I understand that there may be other perspectives.)
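For illustration, one way to decompose every subpath and still keep non-rectangular quadrilaterals is sketched below. It is not the patch proposed in this issue: it only borrows the idea of splitting the shape string with re.finditer(), which the later commit message mentions, and the helper names, string labels, and coordinates are hypothetical.

```python
import re


def is_axis_aligned_rect(points):
    """True if four points form a rectangle with horizontal/vertical sides."""
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = points
    return (x0 == x1 and y1 == y2 and x2 == x3 and y3 == y0) or (
        y0 == y1 and x1 == x2 and y2 == y3 and x3 == x0
    )


def split_subpaths(path):
    """Split a path at every 'm' operator.

    Each operator is one character in the shape string, so match offsets
    in the string are also valid indices into the path list.
    """
    shape = "".join(seg[0] for seg in path)
    return [path[m.start():m.end()] for m in re.finditer(r"m[^m]+", shape)]


def classify(path):
    """Label each subpath as 'line', 'rect' or 'curve' (labels are illustrative)."""
    labels = []
    for sub in split_subpaths(path):
        shape = "".join(seg[0] for seg in sub)
        points = [seg[1:3] for seg in sub if len(seg) >= 3]
        if shape == "ml":
            labels.append("line")
        elif shape == "mlllh" and is_axis_aligned_rect(points):
            labels.append("rect")
        else:
            labels.append("curve")  # includes non-rectangular quadrilaterals
    return labels


rect_a = [("m", 0, 0), ("l", 10, 0), ("l", 10, 10), ("l", 0, 10), ("h",)]
rect_b = [("m", 20, 0), ("l", 30, 0), ("l", 30, 10), ("l", 20, 10), ("h",)]
bowtie = [("m", 0, 0), ("l", 10, 10), ("l", 10, 0), ("l", 0, 10), ("h",)]
pentagon = [("m", 20, 0), ("l", 30, 0), ("l", 35, 5), ("l", 25, 10),
            ("l", 20, 5), ("h",)]

print(classify(bowtie))             # ['curve']
print(classify(rect_a + rect_b))    # ['rect', 'rect']
print(classify(rect_a + pentagon))  # ['rect', 'curve']
```

Splitting first and classifying second means the rectangle test only decides which label a subpath gets, never whether it is emitted at all. Whether every subpath should always be decomposed remains the open design question raised above.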
Thank you for reading, and for all the work the pdfminer.six community has put into this project.