Tables drawn from single path is converted to curve instead of rects #369

Closed
cheungpat opened this issue Feb 5, 2020 · 1 comment · Fixed by #371

@cheungpat
Contributor

Describe the bug

When a PDF is generated with Excel's Print to PDF function, the table borders are missing from the HTML output. In the XML output, the borders appear as <curve> elements instead of <rect> elements.

It appears that all the borders are drawn as a single path, so the whole path is interpreted as one curve instead of a set of rects.

To Reproduce

Run pdf2txt.py -t html output.pdf > output.html.

output.pdf

output.html: (some borders are missing)

Screenshot 2020-02-06 at 1 39 30 AM

When converting to xml, the borders become <curve> instead of <rect>.
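
One quick way to confirm the symptom is to count the shape elements in the XML output. The following is only a sketch: the sample XML is a hypothetical, heavily trimmed stand-in for what `pdf2txt.py -t xml` produces, and `count_shapes` is an illustrative helper, not part of pdfminer.six.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, trimmed stand-in for `pdf2txt.py -t xml` output.
# Real output carries more elements and attributes.
SAMPLE_XML = """<pages>
  <page id="1">
    <curve bbox="0,0,20,5"/>
    <rect bbox="30,0,40,5"/>
  </page>
</pages>"""

SHAPE_TAGS = {"rect", "curve", "line"}


def count_shapes(xml_text):
    """Count rect, curve and line elements anywhere in the document."""
    root = ET.fromstring(xml_text)
    return Counter(el.tag for el in root.iter() if el.tag in SHAPE_TAGS)


counts = count_shapes(SAMPLE_XML)
print(counts["curve"], counts["rect"])
```

For a table affected by this issue, the curve count would be expected to be high where rects were intended.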

Expected behavior

output.html: (borders should be shown)
Screenshot 2020-02-06 at 1 39 48 AM

cheungpat added a commit to cheungpat/pdfminer.six that referenced this issue Feb 6, 2020
For path that consists of a series of rectangles
(shape is 'mlllhmlllh...'), call paint_path again with each group of
5 points. The result is multiple rects instead of a single curve.

fixes pdfminer#369
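
The idea in the commit message can be sketched roughly as follows. This is not the code from the PR: the tuple-based path representation and the helpers `shape_of`, `split_rect_subpaths` and `classify` are assumptions made for illustration, and the sketch ignores geometry (for example, whether the four corners are actually axis-aligned).

```python
import re

# Assumed path representation: one tuple per segment, whose first element
# is the PDF path operator ('m' = moveto, 'l' = lineto, 'h' = closepath).


def shape_of(path):
    """Concatenate the operators of a path into a shape string."""
    return "".join(segment[0] for segment in path)


def split_rect_subpaths(path):
    """Split a path shaped 'mlllhmlllh...' into groups of 5 segments.

    Any other path is returned unchanged as a single group.
    """
    if re.fullmatch(r"(mlllh)+", shape_of(path)):
        return [path[i:i + 5] for i in range(0, len(path), 5)]
    return [path]


def classify(path):
    """Classify a path by its shape string only (geometry is ignored)."""
    shape = shape_of(path)
    if shape == "ml":
        return "line"
    if shape == "mlllh":
        return "rect"
    return "curve"


# Two adjacent cell borders drawn as one path, as described in the issue.
table = [
    ("m", 0, 0), ("l", 10, 0), ("l", 10, 5), ("l", 0, 5), ("h",),
    ("m", 10, 0), ("l", 20, 0), ("l", 20, 5), ("l", 10, 5), ("h",),
]

print(classify(table))                                    # curve
print([classify(p) for p in split_rect_subpaths(table)])  # ['rect', 'rect']
```

Treated as a whole, the path has shape 'mlllhmlllh' and falls through to the curve case. Split into groups of 5, each group matches the rectangle shape on its own.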
@pietermarsman
Member

Thanks for raising this issue!

I'll review your PR and see if we can merge it.

pietermarsman added a commit that referenced this issue Jul 11, 2020
* Fix converting path to multiple rectangles

For path that consists of a series of rectangles
(shape is 'mlllhmlllh...'), call paint_path again with each group of
5 points. The result is multiple rects instead of a single curve.

fixes #369

* Reduce pdf size by removing font

* Add unittest for PDFLayoutAnalyzer.paint_path()

* Add line to CHANGELOG.md

* Add reference to pdf reference manual

* Cleanup function paint_path a bit

* Reduce line length of tests

* Reduce line length of tests

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>