Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix .paint_path handling of single line segments #530

Merged
merged 16 commits into from
Jul 27, 2021

Conversation

jsvine
Copy link
Contributor

@jsvine jsvine commented Oct 21, 2020

Thank you for merging my previous PR (#512) re. PDFLayoutAnalyzer.paint_path(...). That PR focused on the handling of non-rectangular quadrilaterals and of subpaths. I neglected, however, to closely examine (or test) the handling of individual line segments, and largely carried over the previous logic. This PR attempts to fix that logic:

  • Fixes typo ("ml" should be "mlh"), which was causing all single-line segments to be ignored.

  • Removes the if-statement that required individual line segments to be strictly horizontal or vertical. (If the maintainers believe, however, that diagonal lines should be instances of LTCurve instead of LTLine, however, that's an easy adjustment.)

How Has This Been Tested?

I have added an assertion to tests/test_converter.py: jsvine@4ef21ec#diff-f56d97c6216a37d8ca841f31c63b704d05b7e3a42d6dbe783411d6fef8204615R98-R112

Checklist

  • I have added tests that prove my fix is effective or that my feature
    works
  • (n/a) I have added docstrings to newly created methods and classes
  • I have optimized the code at least one time after creating the initial
    version
  • (n/a) I have updated the README.md or I am verified that this
    is not necessary
  • (n/a) I have updated the readthedocs documentation or I
    verified that this is not necessary
  • I have added a consice human-readable description of the change to
    CHANGELOG.md

- Fixes typo ("ml" should have been "mlh")

- Removes if-statement that required individual line segments to be
  strictly horizontal or vertical.
pdfminer/converter.py Outdated Show resolved Hide resolved
pdfminer/converter.py Outdated Show resolved Hide resolved
Althoguh 'mlh' is the canonical implementation for a single line
segment, 'ml' is fairly common.

Adds tests and sample PDF.
@wind-chh
Copy link

I'm an user of both pdfminer.six and pdfplumber and appreciate for all your great work!
For me, a line/curve without "h" is common, hope you can support this situation.

This commit corrects the manner in which "pts" are extracted from Beziér
path commands. See Table 4.9 of PDF reference manual, and new comments
in code for details. Previously, depending on whether the command (c,
v, or y) the code was extracting some combination of control points (not
on curve) and the actual points-on-curve.

This commit also refactors .paint_path, so that apply_matrix_pt is only
called in one place, and to treat the "h" command in a manner more
consistent with other path commands.
Now that .paint_path has been refactored, adding support for
rect-forming mllll paths requires no extra code, beyond a minor tweak to
the relevant elif statement.
@jsvine
Copy link
Contributor Author

jsvine commented Feb 7, 2021

The new commits above do three main things, described below.

Fixing the point-extraction from Beziér path commands

I spent some time reading the PDF spec and .paint_path more closely, and realized that the previous implementation was not parsing the c, v, and y path commands, which describe Beziér curves, in the manner that I would have expected them to. Per Table 4.9 of the spec, the these commands specify their point-positions in as their final two arguments.

You're probably familiar with that table, but I'm pasting it here for others who might not be, and for ease of reference:

Screen Shot 2021-02-07 at 1 52 46 PM

Previously, .paint_path was instead extracting all points in the Beziér commands, including the control points, and adding them to the LTCurve.pts attribute. Since control points are not on the path itself, and LTCurve objects do not provide the information that would allow a user to distinguish between points-on-curve and control points, the proposed changes here (adding only points-on-curve to .pts) seems more correct to me. I am, however, certainly open to disagreement.

(Relatedly: Are the maintainers open to adding a .path attribute (which would contain the full path-command information that .paint_path is using) to LTCurve objects? That would allow users to reproduce the Beziér curves themselves — currently not possible because the command types themselves (m/l/h/c/v/y) are not accessible through .pts. If that sounds like a good idea, I can submit a separate PR once this one is resolved.)

Refactoring .paint_path

These new commits also refactor .paint_path so that (a) apply_matrix_pt is only called in one place, and (b) the "h" command is handled in a manner more consistent with other path commands. (I.e., It is assigned a point-position.)

Handling rect-forming mllll paths

Now that .paint_path has been refactored, adding support for rect-forming mllll paths requires no extra code, beyond a minor tweak to the relevant elif statement. Above, @pietermarsman expressed an initial reservation about this:

Its tempting to make the rect-catcher also more generic, but I prefer to implement generic stuff when we have an example PDF that show that its happening.

Here is a PDF in the wild that draws a rectangle (the outer boundary of the main table) as an mllll path. Here is the actual set of commands, copied from a decompressed version of the PDF:

21.83 311.51 m
21.83 43.64 l
592.44 43.64 l
592.44 311.51 l
21.83 311.51 l
S

@jsvine
Copy link
Contributor Author

jsvine commented Feb 7, 2021

(Note: The initial post-commit/push Travis build appears to have failed — but because of a problem installing the cryptography Python module, rather than because of the changes in this PR. Locally, tests are passing for me.)

@pietermarsman
Copy link
Member

Travis seem to be migrated and therefore not working anymore... I'm trying to figure out how to get the tests working again.

@pietermarsman pietermarsman merged commit 016239c into pdfminer:develop Jul 27, 2021
@jsvine
Copy link
Contributor Author

jsvine commented Aug 11, 2021

Many thanks for the earlier review @pietermarsman, and for merging! Much appreciated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants