
gcv2hocr doesn't rectify negative coordinates in GCV API response #39

Open
SoloSynth1 opened this issue Mar 24, 2021 · 0 comments

According to the hOCR standard (the latest version is v1.2 as of March 2021), the bbox property is specified in terms of uint values, which means all coordinates must be unsigned. (http://kba.cloud/hocr-spec/1.2/#propdef-bbox)

However, the textAnnotation API response from GCV can contain negative coordinates for boxes that extend beyond the image bounds, as in the example below:

{
  "description": "2-3/4300/62",
  "boundingPoly": {
    "vertices": [
      {
        "x": 4727,
        "y": -1
      },
      {
        "x": 4927,
        "y": 0
      },
      {
        "x": 4927,
        "y": 44
      },
      {
        "x": 4727,
        "y": 43
      }
    ],
    "normalizedVertices": []
  },
  "mid": "",
  "locale": "",
  "score": 0,
  "confidence": 0,
  "topicality": 0,
  "locations": [],
  "properties": []
}
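
For illustration, here is a minimal sketch of how the four vertices of such a `boundingPoly` could be collapsed into an hOCR-style bbox with negative values clamped to zero. The function name `vertices_to_bbox` is hypothetical and is not taken from gcv2hocr; the `.get(..., 0)` default is a defensive assumption for vertices that lack an `x` or `y` key.

```python
def vertices_to_bbox(vertices):
    """Return (x0, y0, x1, y1) for a list of GCV-style vertex dicts.

    Hypothetical helper, not part of gcv2hocr. Negative coordinates are
    clamped to 0 so the result fits the unsigned bbox values hOCR expects.
    """
    # Assumption: a missing "x" or "y" key is treated as 0.
    xs = [v.get("x", 0) for v in vertices]
    ys = [v.get("y", 0) for v in vertices]
    # Axis-aligned box enclosing all four vertices.
    box = (min(xs), min(ys), max(xs), max(ys))
    # Clamp anything that falls outside the top/left image edge.
    return tuple(max(0, c) for c in box)


vertices = [
    {"x": 4727, "y": -1},
    {"x": 4927, "y": 0},
    {"x": 4927, "y": 44},
    {"x": 4727, "y": 43},
]
print(vertices_to_bbox(vertices))  # (4727, 0, 4927, 44)
```

Clamping only handles the top and left edges; values beyond the right or bottom edge would need the image dimensions, which this sketch does not assume are available.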

The current gcv2hocr script writes such cases into the .hocr file without rectification, resulting in lines like this:

<span class='ocr_line' id='line_1_2' title="bbox 4727 -2 4927 44 ; baseline 0 -5; x_size 89; x_descenders 20; x_ascenders 21"><span class='ocrx_word' id='word_1_2' title='bbox 4727 -2 4927 44 ; x_wconf 85' lang='eng' dir='ltr'>  2-3/4300/62  </span>

This causes hocr-pdf to raise an error when it tries to parse the invalid ocr_line.
Although hocr-pdf seems to work fine once its parsing regex is altered, it would be great if the script could apply some form of rectification to negative values in order to adhere to the current hOCR standard. Thanks!
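
As a stopgap for .hocr files that have already been generated, something like the following sketch could clamp negative bbox values in a `title` attribute. The names `BBOX_RE` and `clamp_bbox_in_title` are hypothetical; this is not how gcv2hocr or hocr-pdf work internally, only one possible form of rectification.

```python
import re

# Matches "bbox" followed by exactly four (possibly negative) integers.
BBOX_RE = re.compile(r"bbox((?:\s+-?\d+){4})")


def clamp_bbox_in_title(title):
    """Clamp negative bbox coordinates in an hOCR title string to 0.

    Hypothetical helper. Other properties such as "baseline", which may
    legitimately hold negative numbers, are left untouched.
    """
    def _fix(match):
        coords = [max(0, int(c)) for c in match.group(1).split()]
        return "bbox " + " ".join(str(c) for c in coords)

    return BBOX_RE.sub(_fix, title)


title = "bbox 4727 -2 4927 44 ; baseline 0 -5; x_size 89"
print(clamp_bbox_in_title(title))
# bbox 4727 0 4927 44 ; baseline 0 -5; x_size 89
```

Fixing the values at conversion time inside the script would be preferable, since it keeps the output valid without a post-processing step.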
