LAParams boxes_flow documentation inconsistency #395

Closed
jstockwin opened this issue Mar 16, 2020 · 1 comment · Fixed by #396

Comments

@jstockwin
Member

The documentation for boxes_flow of LAParams (here) reads:

Specifies how much a horizontal and vertical position of a text matters when determining the order of text boxes. The value should be within the range of -1.0 (only horizontal position matters) to +1.0 (only vertical position matters).

whereas the code (here) does the following

        if -1 <= laparams.boxes_flow and laparams.boxes_flow <= +1 \
                and textboxes:
            # Code to do the full layout analysis, adding groups etc
        else:
            def getkey(box):
                if isinstance(box, LTTextBoxVertical):
                    return (0, -box.x1, box.y0)
                else:
                    return (1, box.y0, box.x0)
            textboxes.sort(key=getkey)

From the code, it seems that it is possible to set boxes_flow outside the range [-1.0, 1.0]. At the moment, any value outside this range skips the full layout analysis and returns the boxes sorted purely by their position. Unfortunately, you can't currently pass boxes_flow as None, because comparing None with a number (e.g. None <= 1) raises a TypeError in Python 3. Therefore, to disable the layout analysis you essentially have to set boxes_flow to a value outside the range, e.g. 2.
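
To make the fallback branch concrete, here is a minimal sketch that reuses the sort key from the snippet quoted above on a few made-up boxes. The `Box`, `HorizontalBox` and `VerticalBox` classes are hypothetical stand-ins for illustration only, not pdfminer's own layout classes, and the coordinates are invented. It also shows why None cannot be used with the current range check.

```python
class Box:
    """Hypothetical stand-in for a text box; not a pdfminer class."""

    def __init__(self, name, x0, y0, x1, y1):
        self.name = name
        self.x0, self.y0, self.x1, self.y1 = x0, y0, x1, y1


class HorizontalBox(Box):
    pass


class VerticalBox(Box):
    pass


def getkey(box):
    # Same key as in the quoted snippet: vertical boxes sort first
    # (by descending x1, then y0), then all other boxes (by y0, then x0).
    if isinstance(box, VerticalBox):
        return (0, -box.x1, box.y0)
    return (1, box.y0, box.x0)


boxes = [
    HorizontalBox("b", x0=0, y0=700, x1=200, y1=720),
    VerticalBox("c", x0=480, y0=50, x1=500, y1=300),
    HorizontalBox("a", x0=0, y0=100, x1=200, y1=120),
]
ordered = sorted(boxes, key=getkey)
print([box.name for box in ordered])

# The existing range check cannot handle None in Python 3:
try:
    -1 <= None
    none_comparison_fails = False
except TypeError:
    none_comparison_fails = True
print(none_comparison_fails)
```

The vertical box comes first regardless of where it sits on the page, which is the behaviour the aside below refers to.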

(Aside: This order is perhaps a bit strange when there are both horizontal and vertical text boxes on the page. I'd be tempted to choose a scheme, perhaps depending on whether there are more horizontal or vertical boxes on the page, but that's a different issue and probably a breaking change.)

Suggestions:

  • The documentation should be updated to explain that this can be disabled.
  • I think it would be nicer if the value for "disabled" was None, rather than any value outside [-1, 1].
  • If the valid range is [-1, 1] perhaps this should be validated, throwing an exception if this isn't the case.
@pietermarsman
Member

You are totally right in this analysis.

Regarding the fix, I think it's best to be explicit. Either allow a value in [-1, 1] or None but not anything else.
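
As a rough illustration of what "explicit" could look like, here is a minimal sketch of a validation helper. The name `validate_boxes_flow` and the error messages are hypothetical, and this is not necessarily how the fix in #396 is implemented. It only encodes the rule stated above: None or a number in [-1, 1], nothing else.

```python
def validate_boxes_flow(boxes_flow):
    """Hypothetical helper: accept None or a number in [-1, 1].

    Returns the value unchanged (as a float) or None, and raises otherwise.
    """
    if boxes_flow is None:
        # None means "layout analysis disabled".
        return None
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(boxes_flow, bool) or not isinstance(boxes_flow, (int, float)):
        raise TypeError("boxes_flow should be None or a number")
    # Written as a negated range check so that NaN is rejected too.
    if not -1.0 <= boxes_flow <= 1.0:
        raise ValueError("boxes_flow should be None or within [-1.0, 1.0]")
    return float(boxes_flow)


print(validate_boxes_flow(None))
print(validate_boxes_flow(0.5))
try:
    validate_boxes_flow(2)
except ValueError as error:
    print(error)
```

With a check like this, the old workaround of passing e.g. 2 would fail loudly instead of silently switching off the layout analysis.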

pietermarsman pushed a commit that referenced this issue Mar 26, 2020
* Update documentation for boxes_flow, allow None

* Apply comments from code review

* Small wording changes, remove unnecessary comment

* Update boxes_flow documentation for pdf2text

* Pin version of tox to ensure python 3.4 support