Languages
ConceptNet is built from multilingual data, and covers hundreds of languages.
Languages in ConceptNet are identified by their BCP 47 language code, a two- or three-letter code that's been standardized by IANA (and formerly by ISO).
When languages overlap and sources disagree on how to distinguish them, we usually represent them only as the broadest language code. This allows us to recognize as many terms as possible, though as a result the terms they are linked to may not all be understood by all speakers of those languages.
- Traditional and Simplified Chinese, with vocabulary from Mandarin, Cantonese, and other Chinese languages, all appear under the language code `zh`.
- Dialects of Arabic all appear under the language code `ar`.
- Serbian, Croatian, and Bosnian appear under the language code `sh`.
- Bahasa Indonesia and Bahasa Malaysia both appear under the language code `ms`.
- Norwegian Bokmål and Nynorsk both appear under the language code `no`.
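As a rough illustration of this merging, the sketch below folds a few narrower codes into the broad codes named above. Only the target codes (`zh`, `ar`, `sh`, `ms`, `no`) come from this page. The narrower source codes, the `BROAD_CODE` table, and the `broadest_code` function are hypothetical examples, not ConceptNet's actual normalization table or API.

```python
# Hypothetical mapping from narrower language codes to the broad code that
# ConceptNet files them under. The keys are illustrative assumptions; only
# the target codes (zh, ar, sh, ms, no) are taken from the list above.
BROAD_CODE = {
    "cmn": "zh",  # Mandarin
    "yue": "zh",  # Cantonese
    "arz": "ar",  # Egyptian Arabic, as one example of an Arabic dialect
    "sr": "sh",   # Serbian
    "hr": "sh",   # Croatian
    "bs": "sh",   # Bosnian
    "id": "ms",   # Bahasa Indonesia
    "nb": "no",   # Norwegian Bokmål
    "nn": "no",   # Norwegian Nynorsk
}


def broadest_code(code: str) -> str:
    """Return the broad code for `code`, dropping any region or script subtag."""
    # Keep only the primary language subtag, e.g. "zh-Hant" -> "zh".
    primary = code.replace("_", "-").split("-")[0].lower()
    # Codes without an entry in the table are returned unchanged.
    return BROAD_CODE.get(primary, primary)
```

With this table, `broadest_code("nn")` and `broadest_code("nb")` both return `"no"`, so terms from either written standard would be looked up under the same code.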
There are 10 core languages, where we believe ConceptNet supports the language well, with a very large vocabulary, lots of assertions expressed within that language, and varied sources of knowledge. In these 10 languages, we provide all API features, including word vectors.
Code | Language | Autonym | Vocabulary size |
---|---|---|---|
en | English | English | 1803873 |
fr | French | français | 3023144 |
it | Italian | italiano | 1078629 |
de | German | Deutsch | 825741 |
es | Spanish | español | 782760 |
ru | Russian | русский | 680205 |
pt | Portuguese | português | 473709 |
ja | Japanese | 日本語 | 363663 |
nl | Dutch | Nederlands | 267641 |
zh | Chinese | 中文 | 242746 |
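Because the core set is a fixed list rather than a size threshold, client code may want to check it before asking for word vectors. The following is a minimal sketch that uses the ten codes from the table above. The names `CORE_LANGUAGES` and `api_serves_vectors` are hypothetical and are not part of the ConceptNet API.

```python
# The ten core language codes, copied from the table above.
CORE_LANGUAGES = frozenset(
    {"en", "fr", "it", "de", "es", "ru", "pt", "ja", "nl", "zh"}
)


def api_serves_vectors(language_code: str) -> bool:
    """Return True if the code is a core language, where all API features apply."""
    return language_code.lower() in CORE_LANGUAGES
```

Membership is checked against the list, not the vocabulary size: Swedish (`sv`) has a slightly larger vocabulary than Dutch (`nl`), yet only Dutch is in the core set.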
There are 68 more common languages, with vocabularies of at least 10,000 terms. We generate word vectors for these languages, but don't provide them through the API (that would be too many vectors to keep around in memory). Some terms in these languages may only be connected to the rest of the graph via languages in the core set.
About 90% of the world's population natively speaks one of these languages or the core languages. This list also includes some languages that are historical (such as Latin) or constructed (such as Esperanto).
Code | Language | Autonym | Vocabulary size |
---|---|---|---|
af | Afrikaans | Afrikaans | 19804 |
ang | Old English | Ænglisc | 14898 |
ar | Arabic | العربية | 134311 |
ast | Asturian | asturianu | 52118 |
az | Azerbaijani | azərbaycan dili | 15465 |
be | Belarusian | беларуская | 20199 |
bg | Bulgarian | български | 337071 |
ca | Catalan | català | 123780 |
cs | Czech | čeština | 129183 |
cy | Welsh | Cymraeg | 18184 |
da | Danish | dansk | 67915 |
el | Greek | Ελληνικά | 71970 |
eo | Esperanto | esperanto | 171527 |
et | Estonian | eesti | 29968 |
eu | Basque | euskara | 52340 |
fa | Persian | فارسی | 61883 |
fi | Finnish | suomi | 381278 |
fil | Filipino | Filipino | 17620 |
fo | Faroese | føroyskt | 18081 |
fro | Old French | | 33797 |
ga | Irish | Gaeilge | 42963 |
gd | Scottish Gaelic | Gàidhlig | 27313 |
gl | Galician | galego | 76598 |
grc | Ancient Greek | | 37717 |
gv | Manx | Gaelg | 14404 |
he | Hebrew | עברית | 40906 |
hi | Hindi | हिन्दी | 21163 |
hsb | Upper Sorbian | hornjoserbšćina | 52975 |
hu | Hungarian | magyar | 81638 |
hy | Armenian | հայերեն | 34969 |
io | Ido | Ido | 39078 |
is | Icelandic | íslenska | 55767 |
ka | Georgian | ქართული | 41801 |
kk | Kazakh | қазақ тілі | 20779 |
ko | Korean | 한국어 | 47268 |
ku | Kurdish | Kürtçe | 15680 |
la | Latin | Lingua latina | 1334135 |
lt | Lithuanian | lietuvių | 30523 |
lv | Latvian | latviešu | 48870 |
mg | Malagasy | Malagasy | 53264 |
mk | Macedonian | македонски | 33270 |
ms | Malay | Bahasa Melayu | 124646 |
mul | Multilingual | | 20214 |
no | Norwegian | norsk bokmål | 125633 |
non | Old Norse | | 7868 |
nrf | Norman French | Jèrriais / Guernésiais | 19687 |
nv | Navajo | Diné bizaad | 12432 |
oc | Occitan | occitan | 42122 |
pl | Polish | polski | 191190 |
ro | Romanian | română | 66260 |
rup | Aromanian | armãneashti | 7212 |
sa | Sanskrit | संस्कृत | 9584 |
se | Northern Sami | davvisámegiella | 134734 |
sh | Croatian / Bosnian / Serbian | srpskohrvatski | 148819 |
sk | Slovak | slovenčina | 29768 |
sl | Slovenian | slovenščina | 160496 |
sq | Albanian | shqip | 24692 |
sv | Swedish | svenska | 268402 |
sw | Swahili | Kiswahili | 12648 |
ta | Tamil | தமிழ் | 11785 |
te | Telugu | తెలుగు | 26434 |
th | Thai | ไทย | 103096 |
tr | Turkish | Türkçe | 65892 |
uk | Ukrainian | українська | 46284 |
ur | Urdu | اردو | 11832 |
vi | Vietnamese | Tiếng Việt | 54774 |
vo | Volapük | Volapük | 14678 |
xcl | Classical Armenian | | 26356 |
The vocabulary of ConceptNet supports a total of 304 languages. To be represented in ConceptNet, a language must have a written orthography, and we must be able to extract a vocabulary of at least 300 words from ConceptNet's data sources.
We exclude languages below this cutoff because their data is too likely to be unrepresentative, unhelpful, or erroneous. (One language, Yi, meets this cutoff at one stage of building, even though its vocabulary in the final ConceptNet graph is too small.)
This table shows all the supported languages, alphabetically by their language code:
Code | Language | Vocabulary size |
---|---|---|
aa | Afar | 1451 |
ab | Abkhazian | 744 |
abe | Western Abenaki | 394 |
adx | Amdo Tibetan | 1107 |
ady | Adyghe | 7202 |
ae | Avestan | 371 |
af | Afrikaans | 19804 |
aii | Assyrian Neo-Aramaic | 720 |
ain | Ainu | 744 |
akk | Akkadian | 734 |
akz | Alabama | 371 |
alt | Southern Altai | 843 |
am | Amharic | 2396 |
an | Aragonese | 5173 |
ang | Old English | 14898 |
ar | Arabic | 134311 |
arc | Aramaic | 3871 |
arn | Mapuche | 2431 |
ast | Asturian | 52118 |
av | Avar | 479 |
axm | Middle Armenian | 596 |
az | Azerbaijani | 15465 |
ba | Bashkir | 5793 |
bal | Baluchi | 765 |
be | Belarusian | 20199 |
bg | Bulgarian | 337071 |
bi | Bislama | 257 |
bm | Bambara | 5211 |
bn | Bengali | 9907 |
bo | Tibetan | 1620 |
br | Breton | 18069 |
ca | Catalan | 123780 |
ccc | Chamicuro | 883 |
ce | Chechen | 3372 |
ceb | Cebuano | 11346 |
ch | Chamorro | 419 |
chk | Chuukese | 1356 |
chl | Cahuilla | 1021 |
cho | Choctaw | 364 |
chr | Cherokee | 2105 |
cic | Chickasaw | 1451 |
cim | Cimbrian | 738 |
cjs | Shor | 751 |
co | Corsican | 4936 |
cop | Coptic | 383 |
crh | Crimean Turkish | 5550 |
cs | Czech | 129183 |
csb | Kashubian | 1336 |
cu | Church Slavic | 15262 |
cv | Chuvash | 3818 |
cy | Welsh | 18184 |
da | Danish | 67915 |
dak | Dakota | 1067 |
de | German | 825741 |
dje | Zarma | 727 |
dlm | Dalmatian | 1796 |
dsb | Lower Sorbian | 8526 |
dua | Duala | 679 |
dum | Middle Dutch | 2351 |
dv | Divehi | 519 |
ee | Ewe | 1207 |
egl | Emilian | 849 |
egx | Egyptian languages | 1047 |
egy | Ancient Egyptian | 402 |
el | Greek | 71970 |
en | English | 1803873 |
enm | Middle English | 13013 |
eo | Esperanto | 171527 |
es | Spanish | 782760 |
esu | Central Yupik | 553 |
et | Estonian | 29968 |
eu | Basque | 52340 |
fa | Persian | 61883 |
ff | Fula | 315 |
fi | Finnish | 381278 |
fil | Filipino | 17620 |
fj | Fijian | 463 |
fo | Faroese | 18081 |
fon | Fon | 1642 |
fr | French | 3023144 |
frk | Frankish | 620 |
frm | Middle French | 10931 |
fro | Old French | 33797 |
frp | Franco-Provençal | 5568 |
frr | Northern Frisian | 1085 |
fur | Friulian | 5101 |
fy | Western Frisian | 12009 |
ga | Irish | 42963 |
gag | Gagauz | 1118 |
gd | Scottish Gaelic | 27313 |
gl | Galician | 76598 |
gmh | Middle High German | 2368 |
gml | Middle Low German | 2155 |
gn | Guarani | 624 |
goh | Old High German | 6385 |
got | Gothic | 3726 |
grc | Ancient Greek | 37717 |
gsw | Swiss German | 1752 |
gu | Gujarati | 3361 |
gv | Manx | 14404 |
ha | Hausa | 1164 |
hak | Hakka | 788 |
haw | Hawaiian | 3104 |
hbo | Ancient Hebrew | 5864 |
he | Hebrew | 40906 |
hi | Hindi | 21163 |
hil | Hiligaynon | 2914 |
hit | Hittite | 379 |
hke | Hunde | 492 |
hsb | Upper Sorbian | 52975 |
ht | Haitian Creole | 3799 |
hu | Hungarian | 81638 |
hy | Armenian | 34969 |
ia | Interlingua | 8632 |
ie | Interlingue | 698 |
ii | Yi | 105 |
ilo | Ilocano | 470 |
io | Ido | 39078 |
is | Icelandic | 55767 |
ist | Istriot | 848 |
it | Italian | 1078629 |
iu | Inuktitut | 3739 |
ja | Japanese | 363663 |
jbo | Lojban | 2043 |
jv | Javanese | 5287 |
ka | Georgian | 41801 |
kbd | Kabardian | 1422 |
khb | Tai Lü | 335 |
ki | Kikuyu | 473 |
kim | Tofa | 799 |
kjh | Khakas | 1009 |
kk | Kazakh | 20779 |
kl | Kalaallisut | 2127 |
km | Khmer | 5971 |
kn | Kannada | 4278 |
ko | Korean | 47268 |
koy | Koyukon | 422 |
krc | Karachay-Balkar | 1108 |
krl | Karelian | 943 |
ku | Kurdish | 15680 |
kum | Kumyk | 1213 |
kw | Cornish | 3707 |
ky | Kyrgyz | 6349 |
la | Latin | 1334135 |
lad | Ladino | 3108 |
lb | Luxembourgish | 16743 |
li | Limburgish | 1627 |
lij | Ligurian | 769 |
liv | Livonian | 883 |
lkt | Lakota | 1443 |
lld | Ladin | 9924 |
lmo | Lombard | 2332 |
ln | Lingala | 8624 |
lo | Lao | 3892 |
lt | Lithuanian | 30523 |
ltg | Latgalian | 1428 |
lv | Latvian | 48870 |
lzz | Laz | 376 |
mch | Maquiritari | 791 |
mdf | Moksha | 3828 |
mg | Malagasy | 53264 |
mga | Middle Irish | 425 |
mh | Marshallese | 377 |
mi | Maori | 8786 |
mk | Macedonian | 33270 |
ml | Malayalam | 7748 |
mn | Mongolian | 9821 |
mr | Marathi | 6879 |
ms | Malay | 124646 |
mt | Maltese | 5921 |
mul | Multilingual | 20214 |
mwl | Mirandese | 2593 |
my | Burmese | 5238 |
myv | Erzya | 1443 |
na | Nauru | 538 |
nah | Nahuatl languages | 2587 |
nan | Min Nan Chinese | 3963 |
nap | Neapolitan | 2406 |
nci | Classical Nahuatl | 5619 |
nds | Low German | 8668 |
ne | Nepali | 6041 |
nhn | Central Nahuatl | 437 |
nl | Dutch | 267641 |
nmn | !Xóõ | 594 |
no | Norwegian | 125633 |
nog | Nogai | 1064 |
non | Old Norse | 7868 |
nov | Novial | 1585 |
nrf | Jèrriais | 19687 |
nv | Navajo | 12432 |
oc | Occitan | 42122 |
odt | Old Dutch | 838 |
ofs | Old Frisian | 1848 |
oge | Old Georgian | 778 |
oj | Ojibwa | 1388 |
oma | Omaha-Ponca | 1502 |
or | Oriya | 538 |
orv | Old Russian | 597 |
os | Ossetic | 9608 |
osp | Old Spanish | 1186 |
osx | Old Saxon | 4375 |
ota | Ottoman Turkish | 1993 |
pa | Punjabi | 4648 |
pal | Pahlavi | 1235 |
pap | Papiamento | 7222 |
pcd | Picard | 3146 |
peo | Old Persian | 324 |
pi | Pali | 2027 |
pjt | Pitjantjatjara | 479 |
pl | Polish | 191190 |
pms | Piedmontese | 3157 |
ppl | Pipil | 491 |
prg | Prussian | 1298 |
pro | Old Provençal | 11884 |
ps | Pashto | 2314 |
pt | Portuguese | 473709 |
qu | Quechua | 7029 |
qya | Quenya | 370 |
raj | Rajasthani | 393 |
rap | Rapa Nui | 661 |
rm | Romansh | 9773 |
ro | Romanian | 66260 |
roa-opt | Old Portuguese | 2959 |
rom | Romany | 1370 |
ru | Russian | 680205 |
rue | Rusyn | 319 |
rup | Aromanian | 7212 |
rw | Kinyarwanda | 1016 |
sa | Sanskrit | 9584 |
sah | Sakha | 3551 |
sc | Sardinian | 3617 |
scn | Sicilian | 7215 |
sco | Scots | 11789 |
sd | Sindhi | 439 |
se | Northern Sami | 134734 |
ses | Koyraboro Senni | 6451 |
sga | Old Irish | 5976 |
sh | Serbo-Croatian | 148819 |
shh | Shoshoni | 366 |
si | Sinhala | 2877 |
sk | Slovak | 29768 |
sl | Slovenian | 160496 |
sm | Samoan | 1378 |
smn | Inari Sami | 632 |
sms | Skolt Sami | 599 |
so | Somali | 1514 |
sq | Albanian | 24692 |
srn | Sranan Tongo | 2344 |
st | Sotho | 458 |
stq | Saterland Frisian | 2402 |
su | Sundanese | 3044 |
sux | Sumerian | 1012 |
sv | Swedish | 268402 |
sw | Swahili | 12648 |
swb | Comorian | 1958 |
syc | Classical Syriac | 5540 |
szl | Silesian | 340 |
ta | Tamil | 11785 |
te | Telugu | 26434 |
tet | Tetum | 798 |
tg | Tajik | 5683 |
th | Thai | 103096 |
ti | Tigrinya | 653 |
tk | Turkmen | 3061 |
tpi | Tok Pisin | 1761 |
tpw | Tupi | 573 |
tr | Turkish | 65892 |
tt | Tatar | 6440 |
twf | Northern Tiwa | 941 |
txb | Tokharian B | 391 |
ty | Tahitian | 684 |
tyv | Tuvinian | 872 |
udm | Udmurt | 489 |
ug | Uyghur | 2089 |
uga | Ugaritic | 712 |
uk | Ukrainian | 46284 |
ur | Urdu | 11832 |
uz | Uzbek | 7519 |
vec | Venetian | 10830 |
vep | Veps | 3691 |
vi | Vietnamese | 54774 |
vo | Volapük | 14678 |
vot | Votic | 970 |
wa | Walloon | 4797 |
wae | Walser | 879 |
war | Waray | 12607 |
wau | Waurá | 380 |
wo | Wolof | 2549 |
wym | Wymysorys | 2621 |
xcl | Classical Armenian | 26356 |
xh | Xhosa | 407 |
xmf | Mingrelian | 484 |
xno | Anglo-Norman | 559 |
xpr | Parthian | 384 |
xto | Tokharian A | 280 |
xwo | Oirat | 1408 |
yi | Yiddish | 13591 |
yo | Yoruba | 2672 |
yua | Yucateco | 1546 |
za | Zhuang | 315 |
zdj | Ngazidja Comorian | 1908 |
zh | Chinese | 242746 |
zu | Zulu | 2217 |
zza | Zaza | 1006 |