Page.hyperlinks breaks on CroppedPages #1171

Closed
Safrone opened this issue Jul 11, 2024 · 3 comments

Safrone commented Jul 11, 2024

Describe the bug

Calling .hyperlinks on a CroppedPage raises an AttributeError.

Code to reproduce the problem

>>> import pdfplumber
>>> pdf = pdfplumber.open(pdf_path)
>>> page = pdf.pages[0]
>>> page.hyperlinks
[{'page_number': 1, 'object_type': 'annot', 'x0': 472.82489, 'y0': 38.58897000000002, 'x1': 538.5827, 'y1': 46.58897000000002,
  'doctop': 795.3008, 'top': 795.3008, 'bottom': 803.3008, 'width': 65.75781000000006, 'height': 8.0,
  'uri': 'https://www.antennahouse.com/', 'title': None, 'contents': 'Antenna House, Inc.', 
  'data': {'Type': /'Annot', 'Subtype': /'Link', 'Rect': [472.82489, 38.58897, 538.5827, 46.58897], 'Contents': b'Antenna House, Inc.',
              'M': b"D:20240123093922+09'00'", 'Border': [0, 0, 0], 'A': {'S': /'URI', 'URI': b'https://www.antennahouse.com/'}}},
 {'page_number': 1, 'object_type': 'annot', 'x0': 56.69292, 'y0': 491.54333, 'x1': 275.99957, 'y1': 501.54333, 
  'doctop': 340.34644, 'top': 340.34644, 'bottom': 350.34644, 'width': 219.30665, 'height': 10.0, 
  'uri': 'https://www.antennahouse.com/', 'title': None, 'contents': 'Linking to a website (https://www.antennahouse.com/)', 
  'data': {'Type': /'Annot', 'Subtype': /'Link', 'Rect': [56.69292, 491.54333, 275.99957, 501.54333], 'Contents': b'Linking to a website (https://www.antennahouse.com/)',
              'M': b"D:20240123093922+09'00'", 'Border': [0, 0, 0], 'A': {'S': /'URI', 'URI': b'https://www.antennahouse.com/'}}}]

>>> page.crop(page.bbox).hyperlinks
Traceback (most recent call last):
  File "./python3.9/site-packages/IPython/core/interactiveshell.py", line 3550, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-24-cc1c4894cb59>", line 1, in <module>
    page.crop(page.bbox).hyperlinks
  File "./python3.9/site-packages/pdfplumber/page.py", line 335, in hyperlinks
    return [a for a in self.annots if a["uri"] is not None]
  File "./python3.9/site-packages/pdfplumber/page.py", line 331, in annots
    return list(map(parse, raw))
  File "./python3.9/site-packages/pdfplumber/page.py", line 292, in parse
    pt0 = rotate_point((_a, _b), self.rotation)
AttributeError: 'CroppedPage' object has no attribute 'rotation'

PDF file

example pdf for testing:
https://www.antennahouse.com/hubfs/xsl-fo-sample/pdf/basic-link-1.pdf

Expected behavior

.hyperlinks should work on CroppedPage objects, and it should return only the hyperlinks contained within the cropped area.
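
To make the expected filtering concrete, here is a minimal sketch in plain Python. It is not pdfplumber's implementation; `links_within` is a hypothetical helper, and it assumes each hyperlink is a dict carrying `x0`, `top`, `x1` and `bottom` keys, as in the output above. The coordinates are copied from the two links in that output.

```python
from typing import Dict, List, Tuple

# (x0, top, x1, bottom), the same ordering used for the crop box in this issue
BBox = Tuple[float, float, float, float]


def links_within(links: List[Dict], bbox: BBox) -> List[Dict]:
    """Hypothetical helper: keep only links that lie fully inside bbox."""
    x0, top, x1, bottom = bbox
    return [
        link
        for link in links
        if link["x0"] >= x0
        and link["top"] >= top
        and link["x1"] <= x1
        and link["bottom"] <= bottom
    ]


# Trimmed-down copies of the two hyperlinks reported above.
links = [
    {"uri": "https://www.antennahouse.com/", "x0": 472.82489,
     "top": 795.3008, "x1": 538.5827, "bottom": 803.3008},
    {"uri": "https://www.antennahouse.com/", "x0": 56.69292,
     "top": 340.34644, "x1": 275.99957, "bottom": 350.34644},
]

first = links[0]
# Cropping to the first link's own box should leave only that link.
cropped = links_within(
    links, (first["x0"], first["top"], first["x1"], first["bottom"])
)
```

This sketch uses strict containment. Whether a link that only partly overlaps the crop box should be kept is a separate design choice that the sketch does not settle.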

Environment

  • pdfplumber version: 0.11.2
  • Python version: 3.9.18
  • OS: Linux

Safrone added the bug label Jul 11, 2024
jsvine added a commit that referenced this issue Jul 14, 2024
Issue was caused by missing `.initial_doctop` and `.rotation`
properties. h/t @Safrone in #1171

jsvine commented Jul 14, 2024

Many thanks for flagging this in such a clear and reproducible bug report, @Safrone. Should be fixed now on the develop branch in e5737d2, and will go in the next release.
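
For readers wondering why a missing property produces this kind of error: the commit message above attributes it to absent `.initial_doctop` and `.rotation` properties. The sketch below shows the general pattern of a derived view forwarding such attributes to its parent. It is an illustration only, not the code from e5737d2; `BasePage` and `DerivedView` are hypothetical names.

```python
class BasePage:
    """Hypothetical stand-in for a full page that owns the real values."""

    def __init__(self, rotation: int, initial_doctop: float) -> None:
        self.rotation = rotation
        self.initial_doctop = initial_doctop


class DerivedView:
    """Hypothetical stand-in for a cropped view of a page."""

    def __init__(self, parent_page: BasePage, bbox: tuple) -> None:
        self.parent_page = parent_page
        self.bbox = bbox

    # Without these two properties, any shared code that reads
    # self.rotation or self.initial_doctop raises AttributeError.
    @property
    def rotation(self) -> int:
        return self.parent_page.rotation

    @property
    def initial_doctop(self) -> float:
        return self.parent_page.initial_doctop


page = BasePage(rotation=90, initial_doctop=0.0)
view = DerivedView(page, (0, 0, 100, 100))
```

Forwarding through properties keeps the view in sync with its parent instead of copying values at construction time.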

jsvine closed this as completed Jul 14, 2024

Safrone commented Jul 18, 2024

The fix does avoid raising an exception, but the hyperlinks do not appear to be filtered by the cropped area.

A reproducible example of what I mean:

>>> import pdfplumber
>>> pdf = pdfplumber.open(pdf_path)
>>> page = pdf.pages[0]
>>> page.hyperlinks
[{'page_number': 1, 'object_type': 'annot', 'x0': 472.82489, 'y0': 38.58897000000002, 'x1': 538.5827, 'y1': 46.58897000000002,
  'doctop': 795.3008, 'top': 795.3008, 'bottom': 803.3008, 'width': 65.75781000000006, 'height': 8.0,
  'uri': 'https://www.antennahouse.com/', 'title': None, 'contents': 'Antenna House, Inc.', 
  'data': {'Type': /'Annot', 'Subtype': /'Link', 'Rect': [472.82489, 38.58897, 538.5827, 46.58897], 'Contents': b'Antenna House, Inc.',
              'M': b"D:20240123093922+09'00'", 'Border': [0, 0, 0], 'A': {'S': /'URI', 'URI': b'https://www.antennahouse.com/'}}},
 {'page_number': 1, 'object_type': 'annot', 'x0': 56.69292, 'y0': 491.54333, 'x1': 275.99957, 'y1': 501.54333, 
  'doctop': 340.34644, 'top': 340.34644, 'bottom': 350.34644, 'width': 219.30665, 'height': 10.0, 
  'uri': 'https://www.antennahouse.com/', 'title': None, 'contents': 'Linking to a website (https://www.antennahouse.com/)', 
  'data': {'Type': /'Annot', 'Subtype': /'Link', 'Rect': [56.69292, 491.54333, 275.99957, 501.54333], 'Contents': b'Linking to a website (https://www.antennahouse.com/)',
              'M': b"D:20240123093922+09'00'", 'Border': [0, 0, 0], 'A': {'S': /'URI', 'URI': b'https://www.antennahouse.com/'}}}]
>>> link = page.hyperlinks[0]
>>> links = page.crop((link['x0'], link['top'], link['x1'], link['bottom'])).hyperlinks
>>> assert len(links) == 1
>>> assert links[0] == link


jsvine commented Aug 4, 2024

Ah, good catch, thanks! Now fixed in 22494e8
