Some utility methods for logical structure #1095

Merged
merged 10 commits into from
Mar 3, 2024

dhdaines
Contributor

It's useful to be able to search the structure tree. This has to be done from the PDFStructTree object itself, since we return a dictionary from structure_tree, in keeping with the general way pdfplumber works.
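As a rough illustration of what searching such a tree involves, here is a minimal sketch over plain nested dictionaries. It is not the code from this PR: the `find_all` helper, the `"type"` and `"children"` keys, and the sample tree are all assumptions made for the example.

```python
from typing import Any, Dict, Iterator, List

Element = Dict[str, Any]


def find_all(elements: List[Element], type_name: str) -> Iterator[Element]:
    """Yield every element of the given type, depth-first, in document order."""
    for el in elements:
        if el.get("type") == type_name:
            yield el
        # Recurse into nested elements; the "children" key is an assumption.
        yield from find_all(el.get("children", []), type_name)


# Hypothetical hand-written tree, for illustration only.
tree = [
    {"type": "Document", "children": [
        {"type": "H1", "mcids": [0]},
        {"type": "Table", "children": [
            {"type": "TR", "children": [{"type": "TD", "mcids": [1]}]},
        ]},
        {"type": "Sect", "children": [{"type": "Table", "mcids": [2]}]},
    ]},
]

tables = list(find_all(tree, "Table"))     # all matches, at any depth
first_td = next(find_all(tree, "TD"), None)  # first match, or None
```

Because the helper is a generator, asking for only the first match stops the traversal early.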

It's also useful to get a BBox from an element for visual debugging. Note the FIXME: if you play games with cropped pages, this will fail. In general that's unlikely, though; you would have to do something like:

import pdfplumber
from pdfplumber.structure import PDFStructTree

pdf = pdfplumber.open(pdffile)
page = pdf.pages[0].crop(some_bbox)
stree = PDFStructTree(pdf, page)  # NO! Don't do this! Why would you do this?

and then try to get the BBox of an element where it is explicitly specified in the attributes of that element (usually this is only the case for Figure and Table).

Is there a general method to properly transform PDF BBoxes into pdfplumber ones for a page?

codecov bot commented Feb 14, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (3e74fb1) 100.00% compared to head (efeb080) 100.00%.
Report is 1 commits behind head on develop.

Additional details and impacted files
@@            Coverage Diff            @@
##           develop     #1095   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           19        19           
  Lines         1928      1996   +68     
=========================================
+ Hits          1928      1996   +68     

@jsvine jsvine changed the base branch from stable to develop February 16, 2024 17:03
@jsvine
Owner

jsvine commented Feb 16, 2024

I haven't played around with this yet, but it seems like a reasonable idea and doesn't interfere with core pdfplumber functionality, so I'm inclined to merge. Is it ready to merge?

Is there a general method to properly transform PDF BBoxes into pdfplumber ones for a page?

If I'm understanding correctly, this question pertains to flipping the vertical coordinates, so that (x0, y0, x1, y1) (i.e., bbox with origin at the bottom-left) becomes (x0, top, x1, bottom) (origin at top-left). Is that right? If so: we just calculate the top and bottom attributes once, on parsing:

if "y0" in attr:
    attr["top"] = self.height - attr["y1"]
    attr["bottom"] = self.height - attr["y0"]
    attr["doctop"] = self.initial_doctop + attr["top"]

... and then use (x0, top, x1, bottom) as the standard bbox throughout.
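The same flip can be written as a standalone function for a bare bounding box. This is only a sketch: `flip_bbox` and the 792-point page height are hypothetical, and it assumes the page's coordinate origin sits at (0, 0).

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def flip_bbox(pdf_bbox: BBox, page_height: float) -> BBox:
    """Convert (x0, y0, x1, y1), origin bottom-left, to (x0, top, x1, bottom), origin top-left."""
    x0, y0, x1, y1 = pdf_bbox
    top = page_height - y1      # the upper edge (y1) becomes the distance from the page top
    bottom = page_height - y0   # the lower edge (y0) becomes the larger value
    return (x0, top, x1, bottom)


# Hypothetical box on a 792-point-tall page (US Letter height).
flipped = flip_bbox((72.0, 600.0, 300.0, 700.0), page_height=792.0)
# flipped == (72.0, 92.0, 300.0, 192.0)
```

Note that `y1` feeds `top` and `y0` feeds `bottom`, so `top < bottom` holds whenever `y0 < y1`.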

@dhdaines
Contributor Author

I haven't played around with this yet, but it seems like a reasonable idea and doesn't interfere with core pdfplumber functionality, so I'm inclined to merge. Is it ready to merge?

I think maybe I'll add a companion / convenience method find to get just the first instance of an element, and at least minimally handle the issue below (it's a bit more complicated, basically I can use the example code above as a unit test).

Is there a general method to properly transform PDF BBoxes into pdfplumber ones for a page?

If I'm understanding correctly, this question pertains to flipping the vertical coordinates, so that (x0, y0, x1, y1) (i.e., bbox with origin at the bottom-left) becomes (x0, top, x1, bottom) (origin at top-left). Is that right? If so: we just calculate the top and bottom attributes once, on parsing:

The issue is a bit more complicated because when you crop a page, all of the object coordinates go through the crop_fn which adjusts them. So far, so good, but structure tree elements can have a BBox attribute specified on them which is not attached to any particular object. In element_bbox we will prefer this if it exists, but then we will also have to transform it, not only flipping the vertical coordinates, but also applying crop_fn.

I can probably hack something up so this minimally works but we might want to refactor it at some point.

@dhdaines
Contributor Author

Should be ready to merge now!

I didn't realize that cropping a page doesn't actually translate the coordinates of the objects, it just clips them to the new bounding box - nonetheless, this didn't work right for structure elements with BBox attributes, and now it does.
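To make the "clip, don't translate" behaviour concrete, here is a minimal sketch of clipping an already-flipped bounding box to a crop box. It is not the code merged in this PR: `clip_bbox` and the sample boxes are hypothetical, and both boxes are assumed to be in (x0, top, x1, bottom) page coordinates.

```python
from typing import Optional, Tuple

BBox = Tuple[float, float, float, float]


def clip_bbox(bbox: BBox, crop: BBox) -> Optional[BBox]:
    """Intersect bbox with crop; return None if they do not overlap."""
    x0 = max(bbox[0], crop[0])
    top = max(bbox[1], crop[1])
    x1 = min(bbox[2], crop[2])
    bottom = min(bbox[3], crop[3])
    if x0 >= x1 or top >= bottom:
        return None  # nothing of the element falls inside the crop box
    # Coordinates stay relative to the original page, not the crop origin.
    return (x0, top, x1, bottom)


element_bbox = (72.0, 92.0, 300.0, 192.0)  # hypothetical Figure BBox, already flipped
clipped = clip_bbox(element_bbox, (100.0, 0.0, 400.0, 150.0))
outside = clip_bbox(element_bbox, (400.0, 400.0, 500.0, 500.0))
```

Here `clipped` keeps its position in the original page's coordinate space, and `outside` is `None` because the element lies entirely outside the second crop box.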

@jsvine jsvine merged commit 207312e into jsvine:develop Mar 3, 2024
7 checks passed
@jsvine
Owner

jsvine commented Mar 3, 2024

Thanks, now merged! (And correct re. the non-translation of coordinates.)
