Remove _fix_cid_text #60

Open
Tracked by #97
heijul opened this issue Apr 13, 2023 · 2 comments
Labels
bug (Something isn't working), garbage (This is unnecessary)

Comments

heijul (Owner) commented Apr 13, 2023

The problem is not fixable; see the pdfminer.six FAQ. Characters that cannot be read should only be used as delimiters for things like #46 or similar.

heijul added the bug and garbage labels Apr 13, 2023
heijul (Owner, Author) commented Apr 28, 2023

The current approach actually works for some CID codes, namely those for the German umlauts.

heijul (Owner, Author) commented May 20, 2023

> The current approach actually works for some CID codes, namely those for the German umlauts.

I doubt we can assume this in general.

Instead, we could do the following:

  • Store the PDF bounding box (BBox) and the page of each cid char.
  • Use the original PDF, not the preprocessed one, to extract the glyph as an image.
  • Use an OCR library to detect the char from the image.
  • If that fails, give up.
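
A rough sketch of how these steps could fit together, not the project's actual code. It assumes unreadable chars show up as `(cid:N)` text, as pdfminer.six reports them. `CidChar`, `collect_cid_chars`, `resolve_cid` and the `ocr` callback are hypothetical names; the callback stands in for "render the glyph from the original PDF and run OCR on it", which is deliberately left open here.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple

# (x0, y0, x1, y1) in PDF coordinates; the exact layout is an assumption.
BBox = Tuple[float, float, float, float]

# Matches a placeholder such as "(cid:228)".
CID_RE = re.compile(r"^\(cid:(\d+)\)$")


@dataclass(frozen=True)
class CidChar:
    """Hypothetical record for step 1: where an unreadable char sits."""
    cid: int
    page: int
    bbox: BBox


def parse_cid(text: str) -> Optional[int]:
    """Return the cid number if text is a cid placeholder, else None."""
    match = CID_RE.match(text)
    return int(match.group(1)) if match else None


def collect_cid_chars(chars: Iterable[Tuple[str, int, BBox]]) -> List[CidChar]:
    """Keep page and bbox for every (text, page, bbox) item that is a cid char."""
    found = []
    for text, page, bbox in chars:
        cid = parse_cid(text)
        if cid is not None:
            found.append(CidChar(cid, page, bbox))
    return found


def resolve_cid(
    char: CidChar, ocr: Callable[[int, BBox], Optional[str]]
) -> Optional[str]:
    """Steps 2-4: ask the (hypothetical) OCR callback; give up on any failure."""
    try:
        result = ocr(char.page, char.bbox)
    except Exception:
        return None  # OCR backend failed: give up
    result = (result or "").strip()
    # Only accept a single detected character; anything else counts as failure.
    return result if len(result) == 1 else None
```

Returning `None` instead of raising keeps the "give up" case cheap for callers, which can then fall back to treating the char as a plain delimiter.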

heijul mentioned this issue Jun 4, 2023