Is there a good way to handle the OOV (out-of-vocabulary) problem in implementation? #6912

Zaoyee · 2022-07-15T14:17:38Z

Hi,

While developing a language-specific OCR system, I ran into a problem: at recognition time the model may be fed an image containing an unknown token.

For example, say I develop a French OCR model with a French charset that contains the characters À Á Â, but the image to recognize contains a character from another language, such as the German character Ä.

At prediction time, in a good case it would be predicted as one of À Á Â, or even as a plain capital A. However, it could also be predicted as something ridiculous, such as the character E.

I was wondering whether there is a good way to handle this other than developing a model for the whole Latin script, which needs a large amount of data in different languages.
I am aware of a couple of bad options:

discard the data containing unknown characters
or
map the unknown characters to a general unknown token, like <unk>. (PS: It sounds reasonable to common sense, but it does not make much sense for a classification problem in deep learning.)
or
map Ä, or other A-like characters that do not belong to the French charset, to the general character A.
or
any great ideas?
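
The second and third options above can be combined when preparing training labels. What follows is only a minimal sketch under stated assumptions: `CHARSET` is a tiny made-up stand-in for a real French charset, and `fold_char` / `fold_label` are hypothetical helper names, not part of any OCR library. It uses Unicode NFD decomposition from the standard library to strip diacritics, and falls back to `<unk>` when the base letter is still outside the charset.

```python
import unicodedata

# Hypothetical, deliberately tiny charset for illustration only.
CHARSET = set("ABCDEÀÁÂabcde")
UNK = "<unk>"

def fold_char(ch, charset=CHARSET, unk=UNK):
    """Map one character to an in-charset symbol, or to `unk`."""
    if ch in charset:
        return ch
    # NFD splits "Ä" into "A" followed by a combining diaeresis (U+0308).
    decomposed = unicodedata.normalize("NFD", ch)
    # Drop the combining marks and keep only the base letter(s).
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    if base and all(c in charset for c in base):
        return base
    return unk

def fold_label(text, charset=CHARSET, unk=UNK):
    """Return the label as a list of symbols so `<unk>` stays one token."""
    return [fold_char(ch, charset, unk) for ch in text]
```

With this sketch, `fold_label("ÄB")` folds Ä to the plain A, while a character such as ß, which has no in-charset base letter, becomes `<unk>`. This only rewrites labels; whether a model trained on such labels behaves well on unseen glyphs would still have to be checked on real data.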

Evezerest · 2022-07-27T13:28:32Z

Evezerest
Jul 27, 2022
Collaborator

Hi, we are preparing a Colab tutorial on how to modify the language dictionary to solve the unknown-character problem.
For now, you can follow the recognition model training doc: use your own dictionary that includes the characters you want to recognize, then collect (or synthesize) some data and train a new model.
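
A minimal sketch of the dictionary step, under assumptions that should be checked against the training doc: the label file is assumed to hold one `image path<TAB>transcription` pair per line, and the dictionary is assumed to be a plain text file with one character per line. `build_char_dict` is a hypothetical helper name, not a function from the library.

```python
from pathlib import Path

def build_char_dict(label_file, dict_file, extra_chars=""):
    """Collect every character used in the labels and write one per line."""
    # Start from characters you want covered even if no label has them yet.
    chars = set(extra_chars)
    for line in Path(label_file).read_text(encoding="utf-8").splitlines():
        # Assumed layout: "<image path>\t<transcription>"
        parts = line.split("\t", 1)
        if len(parts) == 2:
            chars.update(parts[1])
    ordered = sorted(chars)
    # Assumed dictionary format: one character per line.
    Path(dict_file).write_text("\n".join(ordered) + "\n", encoding="utf-8")
    return ordered
```

Calling `build_char_dict("train_labels.txt", "my_dict.txt", extra_chars="Ä")` (both file names are placeholders) would produce a dictionary that also covers Ä. The resulting file would then be referenced from the training configuration, and the model retrained on data that actually contains the added characters.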
