Skip to content

defdefred/languages-of-the-world

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 

Repository files navigation

Languages of the world

Written languages

ISO 15924 tells us that there is only 210 codes to covers all writing systems.

https://unicode.org/iso15924/iso15924-codes.html

$ wc -l iso15924.txt
210 iso15924.txt

A full byte is enough and allow new one to be added (256-210=46).

So why using:

  • this 4 letters code (26x26x26x26=456976)
  • and numerical range 000->999 ?

Let's convert the 4 letters code to pure numerical value:

https://github.com/defdefred/languages-of-the-world/blob/main/iso15924.csv

The code used is:

cat iso15924.txt | while IFS=\; read CODE NUM TXT ; do
        A1=$(printf '%d' "'$(echo $CODE |cut -c 1)")
        A2=$(printf '%d' "'$(echo $CODE |cut -c 2)")
        A3=$(printf '%d' "'$(echo $CODE |cut -c 3)")
        A4=$(printf '%d' "'$(echo $CODE |cut -c 4)")
        VAL=$(echo "($A1 - 65) * ( 26*26*26 ) + ( $A2 - 97 ) * ( 26 * 26 ) + ( $A3 - 97 ) *  26 + ( $A4 - 97 )" | bc)
        echo "$VAL;$CODE;$NUM;$TXT"
done

We need 3 bytes to covert all written languages, with huge space remaining...

Maybe using the numerical code that fit in 2 bytes with huge space remaining is better...

https://unicode.org/iso15924/iso15924-num.html

Latin alphabet is 215, so:

expr 0 \* 256 + 215
215

Spoken languages

ISO 639-3 tells us that there is 7893 codes to covers all referenced languages.

https://iso639-3.sil.org/code_tables/download_tables#Complete%20Code%20Tables

$ wc -l iso639-3.txt
7893 iso639-3.txt

A full byte is enough for all most used language and another optional byte is more than enough for the others (65536-7893=57643).

So why using a 3 letters code (26x26x26=17576)

Let's convert the 3 letters code to full numerical value:

https://github.com/defdefred/languages-of-the-world/blob/main/iso639-3.csv

The code used is:

cat iso639-3.txt | while read CODE rest ; do
        A2=$(printf '%d' "'$(echo $CODE |cut -c 1)")
        A3=$(printf '%d' "'$(echo $CODE |cut -c 2)")
        A4=$(printf '%d' "'$(echo $CODE |cut -c 3)")
        VAL=$(echo "( $A2 - 97 ) * ( 26 * 26 ) + ( $A3 - 97 ) *  26 + ( $A4 - 97 )" | bc)
        echo "$VAL;$CODE;$rest"
done

We need 2 bytes to covert all spoken language, with huge space remaining...

English is 3048, so:

expr 11 \* 256 + 232
3048

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published