Refactor langs list #128

kolakachi · 2024-05-07T07:03:12Z

This starts the process of changing from 'language' to 'model' when we're talking about OCR engine options. It makes the following changes:

Move the main config file from /public/langs.json to /public/models.json.
Restructure that file so that the top-level keys are engine IDs (google, tesseract, and transkribus), with 'model IDs' below those (many of these look like language codes, but not all of them).
Each model entry in models.json is keyed by the model ID and can contain languages, title, htr, and line (the latter two for Transkribus only). languages is an array of ISO 639 language codes by which users will be able to browse the models (i.e. it's a rough estimate of which languages will find which models most useful; it sounds like mul will be a reasonable value for some). title is needed if the model ID does not match a language that Intuition knows about; if it does match, that language's name will be used for the title.
Change a few methods in EngineBase to use the new terminology (but not all; subsequent patches will change the rest).
Update some tests to work with the new structure.
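
To make the described layout concrete, here is a minimal Python sketch (the project itself is not written in Python, so this is illustration only). The sample entries, the htr and line values, and the helper names models_by_language and model_title are all hypothetical; only the key names (languages, title, htr, line) and the engine IDs come from the description above.

```python
import json
from collections import defaultdict

# Hypothetical excerpt in the shape described above; the entries and the
# htr/line values are made up for illustration and are not the real file.
SAMPLE_MODELS_JSON = """
{
    "google": {
        "de": {"languages": ["de"]}
    },
    "tesseract": {
        "deu": {"languages": ["de"], "title": "Deutsch"},
        "frk": {"languages": ["de"], "title": "Fraktur"}
    },
    "transkribus": {
        "example-model": {"languages": ["mul"], "title": "Example", "htr": 1, "line": 2}
    }
}
"""


def models_by_language(models, engine):
    """Group one engine's model IDs by the language codes they list."""
    grouped = defaultdict(list)
    for model_id, entry in models.get(engine, {}).items():
        for code in entry.get("languages", []):
            grouped[code].append(model_id)
    return {code: sorted(ids) for code, ids in grouped.items()}


def model_title(model_id, entry, known_language_names):
    """Use the known language name if the model ID matches one, else 'title'.

    known_language_names stands in for whatever Intuition knows about;
    falling back to the bare model ID is an assumption of this sketch.
    """
    if model_id in known_language_names:
        return known_language_names[model_id]
    return entry.get("title", model_id)


models = json.loads(SAMPLE_MODELS_JSON)
print(models_by_language(models, "tesseract"))
print(model_title("frk", models["tesseract"]["frk"], {"de": "Deutsch"}))
```

Grouping is done per engine here because the same language code can point at different model IDs under each engine.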

Bug: T358433
Bug: T330061

sohomdatta1 · 2024-05-21T13:35:51Z

nit: Do you have a script that generates ./models.json? It might be useful to include that script as well.

kolakachi · 2024-05-21T17:56:28Z

I don't have a script at the moment.

samwilson

I guess the old langs.json is kept here for backwards compatibility? Could it perhaps be built programmatically from the new models.json? (I.e. a new route added, rather than serving the static file.)
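
As a rough sketch of that idea in Python (illustration only, not the project's code): the legacy layout assumed here, each engine ID mapped to a flat list of codes, is a guess, and legacy_langs_from_models is a hypothetical helper name. A route handler could serialize the result instead of serving a static file.

```python
import json


def legacy_langs_from_models(models):
    """Derive a flat per-engine list of model IDs from the nested structure.

    Assumption: the old file maps each engine ID to a list of codes. Adjust
    this if the real langs.json is shaped differently.
    """
    return {engine: sorted(entries) for engine, entries in models.items()}


# Illustrative input only; not the contents of the real models.json.
example_models = {
    "google": {"de": {"languages": ["de"]}},
    "tesseract": {
        "frk": {"languages": ["de"], "title": "Fraktur"},
        "deu": {"languages": ["de"], "title": "Deutsch"},
    },
}

print(json.dumps(legacy_langs_from_models(example_models), indent=4))
```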

Also, it'd be great to document the schema of models.json a bit more clearly: basically, its top level is engines, each of which holds a list of models, and each model has at least a title and a set of languages. But are those languages given as ISO 639 codes, Wikimedia lang codes, or something else? I think they should be standardized. For example, uzb_cyrl isn't a language. Uzbek is written in three scripts, I think (Latin, Cyrillic, and Arabic), so the OCR tool should be clear whether it's listing things by language or by script. I imagine the former is more common and useful, although it can result in confusing states such as trying to run a Cyrillic Uzbek text through an Arabic Uzbek model.

I guess the crucial things here are to

be able to group by language,
have unique identifiers for models (although perhaps re-use of identifiers between engines isn't a problem?), and
be able to keep the list up to date easily.
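
One way to keep the list easy to maintain would be a small check that runs in CI. The following Python sketch is only an illustration under assumptions: the allowed keys are taken from the PR description (languages, title, htr, line), languages is treated as required, and the helper names validate_models and ids_shared_between_engines are hypothetical.

```python
from collections import defaultdict

# Key names from the PR description; which of them are required is an assumption.
ALLOWED_KEYS = {"languages", "title", "htr", "line"}
TRANSKRIBUS_ONLY_KEYS = {"htr", "line"}


def validate_models(models):
    """Return a list of human-readable problems found in the structure."""
    problems = []
    for engine, entries in models.items():
        for model_id, entry in entries.items():
            where = f"{engine}/{model_id}"
            unknown = set(entry) - ALLOWED_KEYS
            if unknown:
                problems.append(f"{where}: unknown keys {sorted(unknown)}")
            languages = entry.get("languages")
            is_valid_list = (
                isinstance(languages, list)
                and len(languages) > 0
                and all(isinstance(code, str) for code in languages)
            )
            if not is_valid_list:
                problems.append(f"{where}: 'languages' must be a non-empty list of strings")
            if engine != "transkribus" and TRANSKRIBUS_ONLY_KEYS & set(entry):
                problems.append(f"{where}: 'htr'/'line' are only expected for transkribus")
    return problems


def ids_shared_between_engines(models):
    """Report model IDs that appear under more than one engine."""
    engines_by_id = defaultdict(list)
    for engine, entries in models.items():
        for model_id in entries:
            engines_by_id[model_id].append(engine)
    return {mid: engines for mid, engines in engines_by_id.items() if len(engines) > 1}
```

Reporting shared IDs separately, rather than treating them as errors, leaves open the question of whether re-use between engines is actually a problem.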

kolakachi · 2024-05-24T16:41:03Z

langs.json was left in place because we haven't fully transitioned to the new structure.
models.json wasn't built programmatically; currently langs.json and models.json are both standalone and don't depend on each other.
The languages are given as Wikimedia lang codes (they were all taken from the previous langs.json).
Thank you for pointing out uzb_cyrl. The intention was to list languages and not scripts; do we have other scripts? The focus was on changing the structure, and all langs/scripts were taken from langs.json.

stweil · 2024-06-25T20:28:11Z

public/models.json

+            "languages": ["de"],
+            "title": "Deutsch"
+        },
+        "frk": {


I renamed this Tesseract model from frk to deu_latf. It will take some time until all Linux distributions notice and follow this change, but they will. Therefore "de-frk" might not be the best choice.

stweil · 2024-06-25T20:35:32Z

"langs" is not the right word nowadays. We talk about "models". In some cases these models correspond to languages, in other cases they correspond to scripts, and future models will just be AI models which can read text but correspond neither to languages nor to scripts, because they can read any text.

samwilson · 2024-09-03T00:56:07Z

I think I've brought this up to date, and it's perhaps ready to go. I'll leave it a bit in case anyone wants to review.

Parthiv-M · 2024-09-09T03:22:12Z

Deleted a comment by mistake, but I see some places where the model titles have been replaced by ភាសាខ្មែរ (square boxes) instead of UTF characters. These should be changed to their respective legible model title names.

Here is a (hopefully) exhaustive list:

google: bo, dz, km, kn, ko, my, te
tesseract: bod, dzo, khm, kan, kor, mya, tel

samwilson · 2024-09-09T04:15:20Z

Are you missing some fonts or something? It looks okay to me, including your reply:

Parthiv-M · 2024-09-09T14:17:17Z

Indeed, I can see it properly on my phone, not on my laptop. That's weird!

Rest of it lgtm.

kolakachi added 5 commits May 7, 2024 07:58

Refactor langs list

fa63aee

Created a new json `models.json` updated EngineBase and TranskribusEngine Bug: T358433

Refactor langs list

7adda37

Created a new json `models.json` updated EngineBase and TranskribusEngine Bug: T358433

Created a new json models.json

4cd3790

updated EngineBase and TranskribusEngine Bug: T358433

Created a new json models.json

e72ffc6

updated EngineBase and TranskribusEngine Bug: T358433

Merge branch 'refactor-langs-json' of github.com:wikimedia/wikimedia-…

9216e47

…ocr into refactor-langs-json

kolakachi requested a review from samwilson May 8, 2024 16:52

samwilson reviewed May 22, 2024

kolakachi added 7 commits June 25, 2024 17:07

Merge branch 'refactor-langs-json' of github.com:wikimedia/wikimedia-…

1bc168b

…ocr into refactor-langs-json

Create a new json models.json

2c441af

updated Create a new json `models.json` EngineBase and Engine Classes Bug: T358433

Create a new json models.json

c3977fa

updated Create a new json `models.json` EngineBase and Engine Classes Bug: T358433

Merge branch 'refactor-langs-json' of github.com:wikimedia/wikimedia-…

7343f46

…ocr into refactor-langs-json

Merge branch 'refactor-langs-json' of github.com:wikimedia/wikimedia-…

5f40b56

…ocr into refactor-langs-json

Merge branch 'refactor-langs-json' of github.com:wikimedia/wikimedia-…

92f3d20

…ocr into refactor-langs-json

Created a new json models.json

4c800da

updated EngineBase and TranskribusEngine Bug: T358433

stweil reviewed Jun 25, 2024

samwilson added 8 commits September 2, 2024 17:10

Fix check_tesseract.sh for new models.json structure

6d91f26

* Remove kur model as it's not available on prod server. * Fix az-cyrl to aze_cyrl code.

Change getValidLangs to getValidModels and getLangName to getModelTitle

d16c4d5

Also fix a typo in models.json for frm.

Merge remote-tracking branch 'origin/main' into refactor-langs-json

361fe94

Add no line ID for jv-01

c0aed7d

Add deu_latf and skip it

9f661fd

Fix some phan errors

0a67669

remove unrelated changes

9f9651b

fix phpcs error, rm unrelated changes

53d5a6a

samwilson added the Ready for review label Sep 3, 2024

Parthiv-M self-requested a review September 5, 2024 20:20

wikimedia deleted a comment from samwilson Sep 9, 2024

Parthiv-M approved these changes Sep 9, 2024

samwilson merged commit c841d48 into main Sep 9, 2024
5 checks passed

samwilson deleted the refactor-langs-json branch September 9, 2024 23:18
