fix to issue #94 #95

Merged
merged 2 commits into from
Feb 2, 2022

Conversation

kforcodeai
Contributor

Fixes #94 (comment)
#94
The issue was that every all-digit text sequence was inferred as float. With this fix, all text (numeric and non-numeric) is inferred as string, and the user can convert it to their desired data type.
The downside is that the user will have to convert the numeric columns themselves.
I could not find a better solution than this.
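
A minimal sketch of the approach described above, for illustration only and not the code in this PR: read every column as string, then cast the non-text columns back to integers. The sample TSV and its column names (`conf`, `text`) are hypothetical stand-ins for `pytesseract.image_to_data` output, and `keep_default_na=False` is an assumption added here so that empty text cells stay empty strings.

```python
import csv
import io

import pandas as pd

# Hypothetical sample standing in for pytesseract's TSV output.
sample_tsv = "conf\ttext\n95\t2\n-1\t\n90\t245\n"

df = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    quoting=csv.QUOTE_NONE,
    dtype=str,              # read every column as string
    keep_default_na=False,  # assumption: keep empty cells as "" instead of NaN
)

# Cast everything except the text column back to int.
numeric_cols = [col for col in df.columns if col != "text"]
for col in numeric_cols:
    df[col] = df[col].astype(int)

print(df["text"].tolist())
print(df["conf"].tolist())
```

The cost is the one noted above: every numeric column has to be converted explicitly after reading.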

Now all text will be inferred as string, and the user can change it to their desired data type.
@kforcodeai kforcodeai changed the title fix to https://github.com/Layout-Parser/layout-parser/issues/94#issue… fix to issue #94 Oct 31, 2021
_cols.remove('text')
for col in _cols:
    _df[col] = _df[col].astype(int)
res['data'] = _df
Member

Can you try the following code:

_data = pytesseract.image_to_data(img_content, lang=self.lang, **self.configs)
df = pd.read_csv(
   io.StringIO(_data), quoting=csv.QUOTE_NONE, encoding="utf-8", sep="\t"
)
df['text'] = df['text'].astype('str')
res["data"] = df

Contributor Author

@lolipopshock Sorry, it does not work; I have tried this.
And yeah, I get it, the for loop and all that stuff looks ugly :)

Here's the screenshot: [image: layout_parse]

Member

I see -- the issue is the trailing .0 from floating-point numbers, right?

Contributor Author

Yes
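
To illustrate the point just confirmed (a sketch with hypothetical sample data, not taken from this PR): once pandas has inferred the column as float because of an empty cell, casting to `str` afterwards comes too late and keeps the `.0`.

```python
import csv
import io

import pandas as pd

# Hypothetical sample: the text column holds digits plus one empty cell.
sample_tsv = "conf\ttext\n95\t2\n-1\t\n90\t245\n"

df = pd.read_csv(io.StringIO(sample_tsv), sep="\t", quoting=csv.QUOTE_NONE)

# The empty cell becomes NaN, which forces the whole column to float64.
inferred_dtype = str(df["text"].dtype)

# Casting after the fact stringifies the floats, so "2" comes back as "2.0".
as_str = df["text"].astype("str").tolist()

print(inferred_dtype)
print(as_str)
```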

@lolipopshock lolipopshock reopened this Feb 2, 2022
Member

lolipopshock commented Feb 2, 2022

I think the new solution can solve your issue -- see example below:

Let's say we have a csv file test.csv:

Col_A, Col_B
, 1
2, 3
245.0, 

And if we read it via:

df = pd.read_csv("test.csv", converters={"Col_A": str})

We have

Col_A   Col_B
        1
2       3
245.0

(There's no .0 appended to the 2 in the 2nd row, 1st column.)
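
The same idea can be tried without a file on disk. This is a sketch under assumptions: the tab-separated sample and its column names (`conf`, `text`) are hypothetical, chosen to resemble the `pytesseract.image_to_data` output discussed earlier in this thread. Only the column passed to `converters` is read as string; the other column is still inferred as usual.

```python
import csv
import io

import pandas as pd

# Hypothetical sample: one integer column and one text column with digits.
sample_tsv = "conf\ttext\n95\t2\n-1\t\n90\t245.0\n"

df = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    quoting=csv.QUOTE_NONE,
    converters={"text": str},  # keep the raw characters of this column
)

print(df["text"].tolist())
print(df["conf"].tolist())
```

Because the converter sees each cell as the raw string, "2" stays "2" and "245.0" stays "245.0", and no cast of the remaining columns is needed afterwards.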

@lolipopshock lolipopshock merged commit 0809fa8 into Layout-Parser:master Feb 2, 2022