Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix converting path to multiple rectangles #371

Merged
merged 10 commits into from
Jul 11, 2020

Conversation

cheungpat
Copy link
Contributor

@cheungpat cheungpat commented Feb 6, 2020

Description

For path that consists of a series of rectangles (shape is 'mlllhmlllh...'), call paint_path again with each group of 5 points. The result is multiple rects instead of a single curve.

This fixes #369 which affects PDF generated from Excel and Microsoft Print to PDF. The PDF generated by this method will have borders drawn using a single path instead of multiple rectangles. This allows tables to be drawn as expected when using pdf2txt -t html and also when using pdfplumber (which does not consider rect edges from curves).

Fixes #369

How Has This Been Tested?

Added a test case to parse such PDF and inspect the changes by calling pdf2txt -t html <pdf> before/after the change. The pdf2txt -t xml will also generate <rect> instead of <curve>

Checklist

  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the README.md and other documentation, or I am sure that this is not necessary
  • I have added a consice human-readable description of the change to CHANGELOG.md
  • I have added docstrings to newly created methods and classes
  • I have optimized the code at least one time after creating the initial version

For path that consists of a series of rectangles
(shape is 'mlllhmlllh...'), call paint_path again with each group of
5 points. The result is multiple rects instead of a single curve.

fixes pdfminer#369
Copy link
Member

@pietermarsman pietermarsman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your code changes look good!

I would like to have some unittests for this, instead of a PDF integration test.

A suggestion: could you add a reference in the docstring of paint_path() to the Section 4.4 of the PDF Reference ("Path Construction and Painting") makes it easier for everyone to understand this method.

tests/test_tools_pdf2txt.py Outdated Show resolved Hide resolved
@pietermarsman
Copy link
Member

@cheungpat this is a friendly reminder that this PR still needs some work :)

@pietermarsman pietermarsman merged commit 60863cf into pdfminer:develop Jul 11, 2020
@pietermarsman
Copy link
Member

@cheungpat Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tables drawn from single path is converted to curve instead of rects
2 participants