Refactor langs list #128
Created a new json `models.json` updated EngineBase and TranskribusEngine Bug: T358433
updated EngineBase and TranskribusEngine Bug: T358433
…ocr into refactor-langs-json
nit: Do you have a script that generates
I don't have a script at the moment.
I guess the old `langs.json` is kept here for backwards compatibility? Could it perhaps be built programmatically from the new `models.json`? (I.e. a new route added, rather than serving the static file.)
Also, it'd be great to document the schema of `models.json` a bit more clearly: basically, its top level is engines, and in each is a list of models, each of which has at least a title and a set of languages. But are those languages given as ISO 639 codes, Wikimedia lang codes, or something else? I think they should be standardized. For example, `uzb_cyrl` isn't a language. (Uzbek is written in three scripts, I think? Latin, Cyrillic, and Arabic. So the OCR tool should be clear whether it's listing things by language or by script; I imagine the former is more common and useful, although it results in perhaps confusing states like trying to run a Cyrillic Uzbek text through an Arabic Uzbek model.)
I guess the crucial things here are to
- be able to group by language,
- have unique identifiers for models (although perhaps re-use of identifiers between engines isn't a problem?), and
- be able to keep the list up to date easily.
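To make the first point and the backwards-compatibility question concrete, here is a minimal Python sketch (not the project's own code). It assumes the shape described in this comment, engine, then model ID, then metadata; the sample entries, the function names, and the flat output format standing in for the old `langs.json` are all hypothetical.

```python
import json
from collections import defaultdict

# Hypothetical excerpt: engine -> model ID -> metadata (titles are illustrative).
MODELS_JSON = """
{
  "tesseract": {
    "deu": {"languages": ["de"], "title": "Deutsch"},
    "uzb": {"languages": ["uz"], "title": "Uzbek (Latin)"},
    "uzb_cyrl": {"languages": ["uz"], "title": "Uzbek (Cyrillic)"}
  }
}
"""


def group_by_language(models):
    """Map each language code to the sorted (engine, model ID) pairs that list it."""
    grouped = defaultdict(list)
    for engine, engine_models in models.items():
        for model_id, meta in engine_models.items():
            for lang in meta.get("languages", []):
                grouped[lang].append((engine, model_id))
    return {lang: sorted(pairs) for lang, pairs in sorted(grouped.items())}


def legacy_langs(models):
    """Derive a flat engine -> sorted model IDs mapping (an assumed legacy format)."""
    return {engine: sorted(engine_models) for engine, engine_models in models.items()}


models = json.loads(MODELS_JSON)
by_lang = group_by_language(models)
legacy = legacy_langs(models)
```

Grouping on the `languages` array rather than on the model ID keeps `uzb` and `uzb_cyrl` under the same language while leaving their IDs distinct, and deriving the legacy list at request time would avoid maintaining two static files.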
…ocr into refactor-langs-json
updated Create a new json `models.json` EngineBase and Engine Classes Bug: T358433
…ocr into refactor-langs-json
updated EngineBase and TranskribusEngine Bug: T358433
"languages": ["de"],
"title": "Deutsch"
},
"frk": {
I renamed this Tesseract model from `frk` to `deu_latf`. It will take some time until all Linux distributions notice and follow this change, but they will. Therefore "de-frk" might not be the best choice.
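One way to handle the transition period is a small alias table, sketched here in Python. Only the `frk` to `deu_latf` rename comes from the comment above; the table name, the function, and the idea of checking against a set of installed model IDs are assumptions for illustration.

```python
# Hypothetical alias table: legacy model ID -> current model ID.
# Only the frk -> deu_latf rename is taken from the review comment.
MODEL_ALIASES = {"frk": "deu_latf"}


def resolve_model_id(requested, installed):
    """Return the installed ID matching `requested`, trying the other spelling too.

    `installed` is a set of model IDs available on the host. Returns None
    when neither the requested ID nor its alias is installed.
    """
    if requested in installed:
        return requested
    # Legacy ID requested, but the host already ships the new name.
    renamed = MODEL_ALIASES.get(requested)
    if renamed in installed:
        return renamed
    # New ID requested, but the host still ships only the legacy name.
    for old, new in MODEL_ALIASES.items():
        if new == requested and old in installed:
            return old
    return None
```

For example, `resolve_model_id("frk", {"deu_latf", "eng"})` yields `"deu_latf"`, and the same call works unchanged on a distribution that still ships only `frk`.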
"langs" is not the right word nowadays. We talk about "models". In some cases these models correspond to languages, in other cases they correspond to scripts, and future models will be just AI models which can read text but correspond neither to languages nor to scripts, because they can read any text.
* Remove kur model as it's not available on prod server. * Fix az-cyrl to aze_cyrl code.
Also fix a typo in models.json for frm.
I think I've brought this up to date, and it's perhaps ready to go. I'll leave it a bit in case anyone wants to review.
Deleted a comment by mistake, but I see some places where the model titles have been replaced by ភាសាខ្មែរ (square boxes) instead of UTF characters. These should be changed to their respective legible model title names. Here is a (hopefully) exhaustive list
Indeed, I can see it properly on my phone, not on my laptop. That's weird! Rest of it lgtm.
This starts the process of changing from 'language' to 'model' when we're talking about OCR engine options. It makes the following changes:
- Switches from `/public/langs.json` to `/public/models.json`.
- `models.json` is keyed by the model ID, and can contain entries of `languages`, `title`, `htr`, and `line` (the latter two being for Transkribus only).
- The `languages` entry is an array of ISO 639 language codes, by which users will be able to browse the models (i.e. it's a rough estimation of which languages will find which models most useful; `mul` will be a reasonable value for some, it sounds like).
- `title` is needed if the model ID does not match a language that Intuition knows about; if it does match then that language's name will be used for the title.
Bug: T358433
Bug: T330061
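The title rule in the description can be sketched as follows. This is an illustrative Python sketch, not the tool's implementation: the `KNOWN_LANGUAGE_NAMES` dictionary stands in for whatever Intuition would return, and the sample entries, titles, and function names are hypothetical.

```python
# Entry keys named in the PR description.
ALLOWED_KEYS = {"languages", "title", "htr", "line"}

# Stand-in for language names that Intuition would supply (illustrative only).
KNOWN_LANGUAGE_NAMES = {"de": "Deutsch"}

# Hypothetical entries for one engine, keyed by model ID.
MODELS = {
    "de": {"languages": ["de"]},
    "deu_latf": {"languages": ["de"], "title": "Deutsch (Fraktur)"},
}


def unknown_keys(entry):
    """Return the keys of a model entry that the description does not mention."""
    return set(entry) - ALLOWED_KEYS


def model_title(model_id, entry, language_names=KNOWN_LANGUAGE_NAMES):
    """Use the language name when the model ID matches a known language,
    otherwise fall back to the entry's explicit `title`."""
    if model_id in language_names:
        return language_names[model_id]
    if "title" in entry:
        return entry["title"]
    raise ValueError(f"model {model_id!r} needs a 'title' entry")


titles = {model_id: model_title(model_id, entry) for model_id, entry in MODELS.items()}
```

Raising on a missing title, rather than silently showing the raw model ID, is one possible design choice; it would surface incomplete entries when the list is updated.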