Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in new .paint_path logic #473

Closed
jsvine opened this issue Aug 20, 2020 · 3 comments
Closed

Bug in new .paint_path logic #473

jsvine opened this issue Aug 20, 2020 · 3 comments

Comments

@jsvine
Copy link
Contributor

jsvine commented Aug 20, 2020

Bug report

First off, thank you for this wonderful library! This is my first time opening an issue on this project, but I've been watching (and using and appreciating) pdfminer.six for a long time.

Commit 60863cf introduced new logic (still present at develop) for PDFLayoutAnalyzer.paint_path(...), via PR #371, aimed at fixing issue #369. Unfortunately, the changes appear to introduce new bugs. Specifically: The new code appears to ignore/reject all non-rectangle mlllh-shape paths. It also appears not to fully decompose paths into their subpaths. Examples, details, and proposal below.

For demonstration purposes, I have created the following PDF: quadrilaterals.pdf

Here's a PNG rendering of the PDF, for ease of reference:

quadrilaterals

And here's how the paths are defined in the PDF:

Black square:

0 0 0 RG
2 w
10 90 m
90 90 l
90 10 l
10 10 l
h
S

Red bowtie:

1 0 0 RG
2 w
110 90 m
190 10 l
190 90 l
110 10 l
h
S

Green quadrilateral:

0 1 0 RG
2 w
210 90 m
290 60 l
290 10 l
210 10 l
h
S

Two blue rectangles:

0 0 1 RG
2 w
310 90 m
350 90 l
350 10 l
310 10 l
h
350 90 m
390 90 l
390 10 l
350 10 l
h
S

Purple rectangle + pentagon:

1 0 1 RG
2 w
410 90 m
445 90 l
445 10 l
410 10 l
h
455 70 m
475 90 l
490 70 l
490 10 l
455 10 l
h
S

To examine the differences in .paint_path before and after the commit, I'm using this short Python program:

from pdfminer.high_level import extract_pages
COLORS = {
    (0, 0, 0): "black",
    (1, 0, 0): "red",
    (0, 1, 0): "green",
    (0, 0, 1): "blue",
    (1, 0, 1): "purple",
}

page = list(extract_pages("quadrilaterals.pdf"))[0]
for i, el in enumerate(page):
    print(f"== {i} ==")
    print("→ " + COLORS[el.stroking_color])
    print("→ " + str(type(el).__name__.split(".")[-1]))
    print("→ " + str(el.pts) + "\n")

On 6a9269b, the commit preceding 60863cf, this is the result:

== 0 ==
→ black
→ LTRect
→ [(10, 90), (90, 90), (90, 10), (10, 10)]

== 1 ==
→ red
→ LTCurve
→ [(110, 90), (190, 10), (190, 90), (110, 10)]

== 2 ==
→ green
→ LTCurve
→ [(210, 90), (290, 60), (290, 10), (210, 10)]

== 3 ==
→ blue
→ LTCurve
→ [(310, 90), (350, 90), (350, 10), (310, 10), (350, 90), (390, 90), (390, 10), (350, 10)]

== 4 ==
→ purple
→ LTCurve
→ [(410, 90), (445, 90), (445, 10), (410, 10), (455, 70), (475, 90), (490, 70), (490, 10), (455, 10)]

As issue #369 correctly pointed out, the blue path is being parsed as one eight-point curve, rather than two rectangles. Likewise, the purple shape is being parsed as a nine-point curve, rather than a rectangle and a five-point curve. The other shapes (black, red, green) are being parsed correctly.

Commit 60863cf solves the issue with the blue path, but introduces new problems. Here is the output of that same test program:

== 0 ==
→ black
→ LTRect
→ [(10, 90), (90, 90), (90, 10), (10, 10)]

== 1 ==
→ blue
→ LTRect
→ [(310, 90), (350, 90), (350, 10), (310, 10)]

== 2 ==
→ blue
→ LTRect
→ [(350, 90), (390, 90), (390, 10), (350, 10)]

== 3 ==
→ purple
→ LTCurve
→ [(410, 90), (445, 90), (445, 10), (410, 10), (455, 70), (475, 90), (490, 70), (490, 10), (455, 10)]

As the output indicates, the red and green quadrilaterals are ignored entirely. The cause appears to stem from converting the if statement in the prior code to an elif statement here:

elif shape == 'mlllh':
# rectangle
(_, x0, y0) = path[0]
(_, x1, y1) = path[1]
(_, x2, y2) = path[2]
(_, x3, y3) = path[3]
(x0, y0) = apply_matrix_pt(self.ctm, (x0, y0))
(x1, y1) = apply_matrix_pt(self.ctm, (x1, y1))
(x2, y2) = apply_matrix_pt(self.ctm, (x2, y2))
(x3, y3) = apply_matrix_pt(self.ctm, (x3, y3))
if (x0 == x1 and y1 == y2 and x2 == x3 and y3 == y0) or \
(y0 == y1 and x1 == x2 and y2 == y3 and x3 == x0):
rect = LTRect(gstate.linewidth, (x0, y0, x2, y2), stroke,
fill, evenodd, gstate.scolor, gstate.ncolor)
self.cur_item.add(rect)

That is, if the path's shape is mlllh but the path is not a rectangle — i.e., it is non-rectangular quadrilateral — then nothing happens.

Relatedly, the rectangle is still not extracted from the purple shape, due to the somewhat strict matching rule here, which only decomposes paths composed purely of four-point subpaths:

RECTS = re.compile('^(mlllh)+$')

elif self.RECTS.match(shape):
for paths in zip(*(iter(path),) * 5):
self.paint_path(gstate, stroke, fill, evenodd, list(paths))

I have attempted to fix these various bugs. The following changes seem to work, and appear to pass all of the repository's tests:

--- a/pdfminer/converter.py
+++ b/pdfminer/converter.py
@@ -26,12 +26,10 @@ from . import utils
 log = logging.getLogger(__name__)
 
 
 class PDFLayoutAnalyzer(PDFTextDevice):
 
-    RECTS = re.compile('^(mlllh)+$')
-
     def __init__(self, rsrcmgr, pageno=1, laparams=None):
         PDFTextDevice.__init__(self, rsrcmgr)
         self.pageno = pageno
         self.laparams = laparams
         self._stack = []
@@ -72,52 +70,62 @@ class PDFLayoutAnalyzer(PDFTextDevice):
                        (self.cur_item.x0, self.cur_item.y0,
                         self.cur_item.x1, self.cur_item.y1))
         self.cur_item.add(item)
         return
 
+
     def paint_path(self, gstate, stroke, fill, evenodd, path):
         """Paint paths described in section 4.4 of the PDF reference manual"""
         shape = ''.join(x[0] for x in path)
+
+        # if path contains multiple subpaths, split them up and reprocess
+        if shape.count("m") > 1:
+            m_indices = [i for i, x in enumerate(shape) if x == "m"]
+            subpaths = [path[a:b] for a, b in zip(m_indices, m_indices[1:] + [None])]
+            for sp in subpaths:
+                self.paint_path(gstate, stroke, fill, evenodd, sp)
+            return
+
         if shape == 'ml':
-            # horizontal/vertical line
+            # single line segment
             (_, x0, y0) = path[0]
             (_, x1, y1) = path[1]
             (x0, y0) = apply_matrix_pt(self.ctm, (x0, y0))
             (x1, y1) = apply_matrix_pt(self.ctm, (x1, y1))
             if x0 == x1 or y0 == y1:
                 line = LTLine(gstate.linewidth, (x0, y0), (x1, y1), stroke,
                               fill, evenodd, gstate.scolor, gstate.ncolor)
                 self.cur_item.add(line)
+                return
 
-        elif shape == 'mlllh':
-            # rectangle
+        if shape == 'mlllh':
+            # possibly a rectangle
             (_, x0, y0) = path[0]
             (_, x1, y1) = path[1]
             (_, x2, y2) = path[2]
             (_, x3, y3) = path[3]
             (x0, y0) = apply_matrix_pt(self.ctm, (x0, y0))
             (x1, y1) = apply_matrix_pt(self.ctm, (x1, y1))
             (x2, y2) = apply_matrix_pt(self.ctm, (x2, y2))
             (x3, y3) = apply_matrix_pt(self.ctm, (x3, y3))
+
+            # confirmed to be a rectangle
             if (x0 == x1 and y1 == y2 and x2 == x3 and y3 == y0) or \
                     (y0 == y1 and x1 == x2 and y2 == y3 and x3 == x0):
                 rect = LTRect(gstate.linewidth, (x0, y0, x2, y2), stroke,
                               fill, evenodd, gstate.scolor, gstate.ncolor)
                 self.cur_item.add(rect)
-
-        elif self.RECTS.match(shape):
-            for paths in zip(*(iter(path),) * 5):
-                self.paint_path(gstate, stroke, fill, evenodd, list(paths))
-
-        else:
-            pts = []
-            for p in path:
-                for i in range(1, len(p), 2):
-                    pts.append(apply_matrix_pt(self.ctm, (p[i], p[i+1])))
-            curve = LTCurve(gstate.linewidth, pts, stroke, fill, evenodd,
-                            gstate.scolor, gstate.ncolor)
-            self.cur_item.add(curve)
+                return
+
+        # if not a rectangle or a line, treat as a curve
+        pts = []
+        for p in path:
+            for i in range(1, len(p), 2):
+                pts.append(apply_matrix_pt(self.ctm, (p[i], p[i+1])))
+        curve = LTCurve(gstate.linewidth, pts, stroke, fill, evenodd,
+                        gstate.scolor, gstate.ncolor)
+        self.cur_item.add(curve)
 
     def render_char(self, matrix, font, fontsize, scaling, rise, cid, ncs,
                     graphicstate):
         try:
             text = font.to_unichr(cid)

The result of the same test code now produces what seem, at least to me, like the correct results:

== 0 ==
→ black
→ LTRect
→ [(10, 90), (90, 90), (90, 10), (10, 10)]

== 1 ==
→ red
→ LTCurve
→ [(110, 90), (190, 10), (190, 90), (110, 10)]

== 2 ==
→ green
→ LTCurve
→ [(210, 90), (290, 60), (290, 10), (210, 10)]

== 3 ==
→ blue
→ LTRect
→ [(310, 90), (350, 90), (350, 10), (310, 10)]

== 4 ==
→ blue
→ LTRect
→ [(350, 90), (390, 90), (390, 10), (350, 10)]

== 5 ==
→ purple
→ LTRect
→ [(410, 90), (445, 90), (445, 10), (410, 10)]

== 6 ==
→ purple
→ LTCurve
→ [(455, 70), (475, 90), (490, 70), (490, 10), (455, 10)]

If you like this solution, I can file a PR with these changes. That said, I'm not 100% certain this handles all edge-cases, or that decomposition of all subpaths is necessarily desirable. (To me it seems so, but I understand that there may be other perspectives.)

Thank you for reading, and for all the work the pdfminer.six community has put into this project.

@pietermarsman
Copy link
Member

Thanks for the thorough analysis! I would really like it if you file a PR.

This method is ideal for unit testing, and we have some of them. But perhaps you could add your examples to the unit test as well. See test_converter.py.

jsvine added a commit to jsvine/pdfminer.six that referenced this issue Sep 26, 2020
Focuses on the decomposition of complex (m.*h)* paths into subpaths, and
assigning those subpaths the correct LTCurve/LTRect type.
jsvine added a commit to jsvine/pdfminer.six that referenced this issue Sep 26, 2020
Focuses on the decomposition of complex (m.*h)* paths into subpaths, and
assigning those subpaths the correct LTCurve/LTRect type.

Also adds a test for cases presented in issue pdfminer#473
jsvine added a commit to jsvine/pdfminer.six that referenced this issue Sep 26, 2020
Focuses on the handling of non-rect quadrilaterals, the decomposition of
complex (m.*h)* paths into subpaths, and assigning those subpaths the
correct LTCurve/LTRect type.

Also adds a test for cases presented in issue pdfminer#473
@jsvine
Copy link
Contributor Author

jsvine commented Sep 26, 2020

Thanks! PR now submitted, including a test of the examples above.

jsvine added a commit to jsvine/pdfminer.six that referenced this issue Sep 26, 2020
Focuses on the handling of non-rect quadrilaterals, the decomposition of
complex (m.*h)* paths into subpaths, and assigning those subpaths the
correct LTCurve/LTRect type.

Also adds a test for cases presented in issue pdfminer#473
jsvine added a commit to jsvine/pdfminer.six that referenced this issue Sep 26, 2020
Focuses on the handling of non-rect quadrilaterals, the decomposition of
complex (m.*h)* paths into subpaths, and assigning those subpaths the
correct LTCurve/LTRect type.

Also adds a test for cases presented in issue pdfminer#473
pietermarsman added a commit that referenced this issue Oct 12, 2020
* Fix paint_path bug noted in issue #473

Focuses on the handling of non-rect quadrilaterals, the decomposition of
complex (m.*h)* paths into subpaths, and assigning those subpaths the
correct LTCurve/LTRect type.

Also adds a test for cases presented in issue #473

* Tweak paint_path fix per @pietermarsman review

- Adjusts logic to adhere to if-elif-else rather than early returns.

- Shortens subpath detection/reprocessing step, using re.finditer().

* Reorder paint_path() if-else statements once more

* Fix flake8 issues

* Fix error: should select item 1 and 2 from the list, and possible items [3, 4], and so on.

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
@pietermarsman
Copy link
Member

pietermarsman commented Oct 18, 2020

Closed by #512

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants