Page.hyperlinks breaks on CroppedPages #1171

Safrone · 2024-07-11T20:22:48Z

Describe the bug

Calling .hyperlinks on a CroppedPage raises an AttributeError.

Code to reproduce the problem

>>> import pdfplumber
>>> pdf = pdfplumber.open(pdf_path)
>>> page = pdf.pages[0]
>>> page.hyperlinks
[{'page_number': 1, 'object_type': 'annot', 'x0': 472.82489, 'y0': 38.58897000000002, 'x1': 538.5827, 'y1': 46.58897000000002,
  'doctop': 795.3008, 'top': 795.3008, 'bottom': 803.3008, 'width': 65.75781000000006, 'height': 8.0,
  'uri': 'https://www.antennahouse.com/', 'title': None, 'contents': 'Antenna House, Inc.', 
  'data': {'Type': /'Annot', 'Subtype': /'Link', 'Rect': [472.82489, 38.58897, 538.5827, 46.58897], 'Contents': b'Antenna House, Inc.',
              'M': b"D:20240123093922+09'00'", 'Border': [0, 0, 0], 'A': {'S': /'URI', 'URI': b'https://www.antennahouse.com/'}}},
 {'page_number': 1, 'object_type': 'annot', 'x0': 56.69292, 'y0': 491.54333, 'x1': 275.99957, 'y1': 501.54333, 
  'doctop': 340.34644, 'top': 340.34644, 'bottom': 350.34644, 'width': 219.30665, 'height': 10.0, 
  'uri': 'https://www.antennahouse.com/', 'title': None, 'contents': 'Linking to a website (https://www.antennahouse.com/)', 
  'data': {'Type': /'Annot', 'Subtype': /'Link', 'Rect': [56.69292, 491.54333, 275.99957, 501.54333], 'Contents': b'Linking to a website (https://www.antennahouse.com/)',
              'M': b"D:20240123093922+09'00'", 'Border': [0, 0, 0], 'A': {'S': /'URI', 'URI': b'https://www.antennahouse.com/'}}}]

>>> page.crop(page.bbox).hyperlinks
Traceback (most recent call last):
  File "./python3.9/site-packages/IPython/core/interactiveshell.py", line 3550, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-24-cc1c4894cb59>", line 1, in <module>
    page.crop(page.bbox).hyperlinks
  File "./python3.9/site-packages/pdfplumber/page.py", line 335, in hyperlinks
    return [a for a in self.annots if a["uri"] is not None]
  File "./python3.9/site-packages/pdfplumber/page.py", line 331, in annots
    return list(map(parse, raw))
  File "./python3.9/site-packages/pdfplumber/page.py", line 292, in parse
    pt0 = rotate_point((_a, _b), self.rotation)
AttributeError: 'CroppedPage' object has no attribute 'rotation'

PDF file

Example PDF for testing:
https://www.antennahouse.com/hubfs/xsl-fo-sample/pdf/basic-link-1.pdf

Expected behavior

.hyperlinks should work on CroppedPage objects, and it should return only the hyperlinks contained within the cropped area.
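
One possible reading of "contained within the cropped area", sketched in plain Python so it runs without a PDF. `links_within` is a hypothetical helper and not part of pdfplumber; the two dicts are trimmed copies of the hyperlinks printed above, and the crop box is a made-up region covering roughly the top half of the page. The sketch assumes strict containment; a library could just as reasonably keep links that only overlap the box.

```python
def links_within(links, bbox):
    """Return the link dicts whose rectangle lies fully inside bbox.

    bbox is (x0, top, x1, bottom), the same order used in the
    page.crop(...) call elsewhere in this report.
    """
    x0, top, x1, bottom = bbox
    return [
        link
        for link in links
        if link["x0"] >= x0
        and link["x1"] <= x1
        and link["top"] >= top
        and link["bottom"] <= bottom
    ]


# Coordinates copied from the two hyperlinks in the output above.
footer_link = {"x0": 472.82489, "top": 795.3008, "x1": 538.5827, "bottom": 803.3008}
body_link = {"x0": 56.69292, "top": 340.34644, "x1": 275.99957, "bottom": 350.34644}

# Hypothetical crop box: roughly the top half of the page.
top_half = (0, 0, 595.28, 420.94)

# Only the link in the body of the page falls inside the box.
kept = links_within([footer_link, body_link], top_half)
```

With this reading, a crop of the top half would report only the "Linking to a website" link and drop the footer link.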

Environment

pdfplumber version: 0.11.2
Python version: 3.9.18
OS: Linux

jsvine · 2024-07-14T21:30:49Z

Many thanks for flagging this in such a clear and reproducible bug report, @Safrone. Should be fixed now on the develop branch in e5737d2, and will go in the next release.
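
For readers following along, the commit message for e5737d2 attributes the error to missing `.initial_doctop` and `.rotation` properties on the cropped page. The commit itself is not shown in this thread, so the following is only a sketch of the general pattern, not pdfplumber's code: shared logic reads attributes off `self`, and a derived page forwards them to the page it was created from. All class and method names here (`PageBase`, `FullPage`, `DerivedPage`, `describe`) are hypothetical.

```python
class PageBase:
    """Hypothetical base class holding logic shared by every page type."""

    def describe(self):
        # Shared code that reads attributes off self, much like the
        # parse() frame in the traceback reads self.rotation.
        return f"rotation={self.rotation}, initial_doctop={self.initial_doctop}"


class FullPage(PageBase):
    """Hypothetical full page that stores the values directly."""

    def __init__(self, rotation, initial_doctop):
        self.rotation = rotation
        self.initial_doctop = initial_doctop


class DerivedPage(PageBase):
    """Hypothetical cropped view that forwards to its parent page."""

    def __init__(self, parent_page):
        self.parent_page = parent_page

    @property
    def rotation(self):
        # Without this property, describe() would raise AttributeError.
        return self.parent_page.rotation

    @property
    def initial_doctop(self):
        return self.parent_page.initial_doctop


full = FullPage(rotation=90, initial_doctop=842.0)
derived = DerivedPage(full)
summary = derived.describe()
```

Forwarding through properties keeps the derived page in sync with its parent instead of copying values at crop time.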

Safrone · 2024-07-18T16:57:13Z

This does avoid raising the exception, but the hyperlinks do not appear to be filtered by the cropped area. A reproducible example of what I mean (the final assertions should pass once cropping filters hyperlinks):

>>> import pdfplumber
>>> pdf = pdfplumber.open(pdf_path)
>>> page = pdf.pages[0]
>>> page.hyperlinks
[{'page_number': 1, 'object_type': 'annot', 'x0': 472.82489, 'y0': 38.58897000000002, 'x1': 538.5827, 'y1': 46.58897000000002,
  'doctop': 795.3008, 'top': 795.3008, 'bottom': 803.3008, 'width': 65.75781000000006, 'height': 8.0,
  'uri': 'https://www.antennahouse.com/', 'title': None, 'contents': 'Antenna House, Inc.', 
  'data': {'Type': /'Annot', 'Subtype': /'Link', 'Rect': [472.82489, 38.58897, 538.5827, 46.58897], 'Contents': b'Antenna House, Inc.',
              'M': b"D:20240123093922+09'00'", 'Border': [0, 0, 0], 'A': {'S': /'URI', 'URI': b'https://www.antennahouse.com/'}}},
 {'page_number': 1, 'object_type': 'annot', 'x0': 56.69292, 'y0': 491.54333, 'x1': 275.99957, 'y1': 501.54333, 
  'doctop': 340.34644, 'top': 340.34644, 'bottom': 350.34644, 'width': 219.30665, 'height': 10.0, 
  'uri': 'https://www.antennahouse.com/', 'title': None, 'contents': 'Linking to a website (https://www.antennahouse.com/)', 
  'data': {'Type': /'Annot', 'Subtype': /'Link', 'Rect': [56.69292, 491.54333, 275.99957, 501.54333], 'Contents': b'Linking to a website (https://www.antennahouse.com/)',
              'M': b"D:20240123093922+09'00'", 'Border': [0, 0, 0], 'A': {'S': /'URI', 'URI': b'https://www.antennahouse.com/'}}}]
>>> link = page.hyperlinks[0]
>>> links = page.crop((link['x0'], link['top'], link['x1'], link['bottom'])).hyperlinks
>>> assert len(links) == 1
>>> assert links[0] == link

jsvine · 2024-08-04T17:15:22Z

Ah, good catch, thanks! Now fixed in 22494e8

Safrone added the bug label Jul 11, 2024

jsvine added a commit that referenced this issue Jul 14, 2024

Fix broken CroppedPage.annots/hyperlinks

e5737d2

Issue was caused by missing `.initial_doctop` and `.rotation` properties. h/t @Safrone in #1171

jsvine closed this as completed Jul 14, 2024

jsvine added a commit that referenced this issue Aug 4, 2024

Make Page.crop(...) also crop .annots/.hyperlinks

22494e8

h/t @Safrone in #1171
