Refactor langs list #128

Merged
merged 20 commits into from
Sep 9, 2024

Conversation

kolakachi
Collaborator

@kolakachi kolakachi commented May 7, 2024

This starts the process of changing from 'language' to 'model' when we're talking about OCR engine options. It makes the following changes:

  • Move the main config file from /public/langs.json to /public/models.json.
  • Restructure that file to have the top level keys be engine IDs (google, tesseract, and transkribus), and then 'model IDs' below those (many of these look like language codes, but not all of them).
  • Each model entry in models.json is keyed by the model ID, and can contain entries of languages, title, htr, and line (the latter two being for Transkribus only). The languages entry is an array of ISO 639 language codes, by which users will be able to browse the models (i.e. it's a rough estimation of which languages will find which models most useful; it sounds like mul will be a reasonable value for some). title is needed if the model ID does not match a language that Intuition knows about; if it does match then that language's name will be used for the title.
  • Change a few methods in EngineBase to use the new terminology (but not all; subsequent patches will change the rest).
  • Update some tests to work with the new structure.
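
As a rough illustration of the structure described in the list above (not the actual file contents), the following Python sketch parses a tiny models.json-shaped document and resolves a display title for each model. The sample entries, the placeholder htr and line values, the `KNOWN_LANGUAGE_NAMES` table standing in for Intuition, and the `resolve_title` helper are all assumptions made for this example.

```python
import json

# Hypothetical excerpt: engine ID -> model ID -> entry. Not the real file.
MODELS_JSON = """
{
  "google": {
    "de": {"languages": ["de"]}
  },
  "tesseract": {
    "example-multi": {"languages": ["mul"], "title": "Example multilingual model"}
  },
  "transkribus": {
    "example-htr": {"languages": ["en"], "title": "Example HTR model", "htr": 1, "line": 2}
  }
}
"""

# Stand-in for the language names that Intuition would provide (assumption).
KNOWN_LANGUAGE_NAMES = {"de": "Deutsch", "en": "English"}


def resolve_title(model_id, entry):
    """Return a display title for one model entry.

    Assumption: an explicit title wins; otherwise the model ID is looked up
    as a language code, and the raw model ID is the last resort.
    """
    if "title" in entry:
        return entry["title"]
    return KNOWN_LANGUAGE_NAMES.get(model_id, model_id)


models = json.loads(MODELS_JSON)
for engine, engine_models in models.items():
    for model_id, entry in engine_models.items():
        print(engine, model_id, resolve_title(model_id, entry), entry["languages"])
```

Whether an explicit title should override a known language name is a design choice that the description leaves open; the sketch simply picks one option.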

Bug: T358433
Bug: T330061

kolakachi added 5 commits May 7, 2024 07:58
Created a new json `models.json`
updated EngineBase and TranskribusEngine

Bug: T358433
Created a new json `models.json`
updated EngineBase and TranskribusEngine

Bug: T358433
updated EngineBase and TranskribusEngine

Bug: T358433
updated EngineBase and TranskribusEngine

Bug: T358433
@kolakachi kolakachi requested a review from samwilson May 8, 2024 16:52
@sohomdatta1
Collaborator

nit: Do you have a script that generates ./models.json? It might be useful to include that script as well.

@kolakachi
Collaborator Author

nit: Do you have a script that generates ./models.json? It might be useful to include that script as well.

I don't have a script at the moment.

Member

@samwilson samwilson left a comment

I guess the old langs.json is kept here for backwards compatibility? Could it perhaps be built programmatically from the new models.json? (I.e. a new route added, rather than serving the static file.)
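
One way to picture deriving the old file from the new one is the minimal Python sketch below. It assumes, purely for illustration, that the legacy layout maps each engine ID to a flat list of IDs; the real langs.json layout is not shown in this thread, and `build_legacy_langs` and the sample data are hypothetical.

```python
def build_legacy_langs(models):
    """Derive an engine -> sorted list of model IDs mapping.

    Assumption: the legacy file is a flat per-engine list of IDs.
    """
    return {engine: sorted(entries) for engine, entries in models.items()}


# Hypothetical models.json-shaped data.
sample_models = {
    "google": {"de": {"languages": ["de"]}, "af": {"languages": ["af"]}},
    "tesseract": {"deu": {"languages": ["de"], "title": "Deutsch"}},
}

legacy = build_legacy_langs(sample_models)
print(legacy)
```

A route handler could return this derived structure on request, so that only models.json has to be maintained by hand.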

Also, it'd be great to document the schema of models.json a bit more clearly: basically, its top level is engines, each engine contains a list of models, and each model has at least a title and a set of languages. But are those languages given as ISO 639 codes, Wikimedia language codes, or something else? I think they should be standardized. For example, uzb_cyrl isn't a language. Uzbek is written in three scripts, I think (Latin, Cyrillic, and Arabic), so the OCR tool should be clear whether it's listing things by language or by script. I imagine the former is more common and useful, although it can result in confusing states like trying to run a Cyrillic Uzbek text through an Arabic Uzbek model.

I guess the crucial things here are to

  • be able to group by language,
  • have unique identifiers for models (although perhaps re-use of identifiers between engines isn't a problem?), and
  • be able to keep the list up to date easily.
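
The first two points in the list above can be sketched in Python as follows. This is only an illustration under the assumption that the data has the engine -> model ID -> entry shape described in the pull request; `group_by_language`, `reused_model_ids`, and the sample entries are hypothetical names and data.

```python
from collections import defaultdict


def group_by_language(models):
    """Index (engine, model ID) pairs by each language code they list."""
    index = defaultdict(list)
    for engine, entries in models.items():
        for model_id, entry in entries.items():
            for lang in entry.get("languages", []):
                index[lang].append((engine, model_id))
    return dict(index)


def reused_model_ids(models):
    """Report model IDs that appear under more than one engine."""
    seen = defaultdict(set)
    for engine, entries in models.items():
        for model_id in entries:
            seen[model_id].add(engine)
    return {mid: sorted(engines) for mid, engines in seen.items() if len(engines) > 1}


# Hypothetical sample data.
sample = {
    "google": {"de": {"languages": ["de"]}, "uz": {"languages": ["uz"]}},
    "tesseract": {
        "de": {"languages": ["de"]},
        "uzb_cyrl": {"languages": ["uz"], "title": "Uzbek (Cyrillic)"},
    },
}

by_language = group_by_language(sample)
duplicates = reused_model_ids(sample)
print(by_language)
print(duplicates)
```

Keying lookups by the (engine, model ID) pair, as the index does, is one way to make identifier reuse between engines harmless.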

@kolakachi
Collaborator Author

I guess the old langs.json is kept here for backwards compatibility? Could it perhaps be built programmatically from the new models.json? (I.e. a new route added, rather than serving the static file.)

Also, it'd be great to document the schema of models.json a bit more clearly: basically, its top level is engines, each engine contains a list of models, and each model has at least a title and a set of languages. But are those languages given as ISO 639 codes, Wikimedia language codes, or something else? I think they should be standardized. For example, uzb_cyrl isn't a language. Uzbek is written in three scripts, I think (Latin, Cyrillic, and Arabic), so the OCR tool should be clear whether it's listing things by language or by script. I imagine the former is more common and useful, although it can result in confusing states like trying to run a Cyrillic Uzbek text through an Arabic Uzbek model.

I guess the crucial things here are to

  • be able to group by language,
  • have unique identifiers for models (although perhaps re-use of identifiers between engines isn't a problem?), and
  • be able to keep the list up to date easily.

  • The langs.json was left in place because we haven't fully transitioned to the new structure.
  • models.json wasn't built programmatically. Currently langs.json and models.json are both standalone and do not depend on each other.
  • The languages are given as Wikimedia language codes (they were all taken from the previous langs.json).
  • Thank you for pointing out uzb_cyrl. The intention was to list languages and not scripts. Do we have other scripts? The focus was on changing the structure; all langs/scripts were taken from langs.json.

updated Create a new json `models.json` EngineBase and Engine Classes

Bug: T358433
updated Create a new json `models.json` EngineBase and Engine Classes

Bug: T358433
updated EngineBase and TranskribusEngine

Bug: T358433
"languages": ["de"],
"title": "Deutsch"
},
"frk": {
Contributor

@stweil stweil Jun 25, 2024

I renamed this Tesseract model from frk to deu_latf. It will take some time until all Linux distributions notice and follow this change, but they will. Therefore "de-frk" might not be the best choice.
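
A small Python sketch of how a caller might tolerate the rename during the transition, so that either name works depending on what a given system has installed. Only the frk to deu_latf rename comes from the comment above; the `RENAMED_MODELS` table, the `pick_installed_model` helper, and the installed sets are hypothetical.

```python
# Old model name -> new model name. The single entry is the rename
# mentioned in the comment above; the table itself is an assumption.
RENAMED_MODELS = {"frk": "deu_latf"}


def pick_installed_model(requested, installed):
    """Return the requested model, or its old/new alias, if installed."""
    candidates = [requested]
    if requested in RENAMED_MODELS:
        candidates.append(RENAMED_MODELS[requested])
    for old, new in RENAMED_MODELS.items():
        if new == requested:
            candidates.append(old)
    for name in candidates:
        if name in installed:
            return name
    return None


# A newer system that only ships the new name.
print(pick_installed_model("frk", {"deu_latf", "eng"}))
# An older system that only ships the old name.
print(pick_installed_model("deu_latf", {"frk", "eng"}))
```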

@stweil
Contributor

stweil commented Jun 25, 2024

"langs" is not the right word nowadays. We talk about "models". In some cases these models correspond to languages, in other cases they correspond to scripts, and future models will be just AI models which can read text but correspond neither to languages nor to scripts because they can read any text.

@samwilson
Member

I think I've brought this up to date, and it's perhaps ready to go. I'll leave it a bit in case anyone wants to review.

@wikimedia wikimedia deleted a comment from samwilson Sep 9, 2024
@Parthiv-M
Collaborator

Deleted a comment by mistake, but I see some places where the model titles have been replaced by ភាសាខ្មែរ (square boxes) instead of UTF characters. These should be changed to their respective legible model title names

Here is a (hopefully) exhaustive list:

  • google: bo, dz, km, kn, ko, my, te
  • tesseract: bod, dzo, khm, kan, kor, mya, tel

@samwilson
Member

I see some places where the model titles have been replaced by ភាសាខ្មែរ (square boxes) instead of UTF characters. These should be changed to their respective legible model title names

Are you missing some fonts or something? It looks okay to me, including your reply:

image

@Parthiv-M
Collaborator

Indeed, I can see it properly on my phone, not on my laptop. That's weird!

Rest of it lgtm.

@samwilson samwilson merged commit c841d48 into main Sep 9, 2024
5 checks passed
@samwilson samwilson deleted the refactor-langs-json branch September 9, 2024 23:18
5 participants