conlanger

An experiment in automatic Conlang creation.

I am a novice Conlanger, currently enjoying the view from the peak of Mount Stupid, so this may go nowhere useful. I'm mostly hoping it goes somewhere dumb and ridiculous.

Data Preparation

Language phoneme data from phoible.org was used to create a dataset suitable for ML. One dialect phoneme inventory from each language was selected and prepared as a 4D NumPy array.

Data on morphology and grammar from WALS was prepared in a similar way.

Phoible data preparation notebook: prepare_phoible_data.ipynb
WALS data preparation notebook: prepare_wals_data.ipynb
Language phoneme data npz file: language_phonemes.npz
WALS data npz file: language_parameters.npz
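
As a rough illustration of the kind of encoding described above, the sketch below turns a couple of toy phoneme inventories into a binary NumPy array and round-trips it through an npz file. The phoneme list, the toy languages and the array keys (`inventories`, `languages`) are all made up for this example. The real layout of language_phonemes.npz is defined in the notebook and will differ (the real array is 4D, this toy one is 2D).

```python
import io

import numpy as np

# Hypothetical, tiny phoneme vocabulary; the real one comes from phoible.org.
PHONEMES = ["p", "t", "k", "m", "n", "a", "i", "u"]

# Toy inventories keyed by made-up language names (not real data).
TOY_INVENTORIES = {
    "lang_a": ["p", "t", "k", "a", "i", "u"],
    "lang_b": ["t", "m", "n", "a"],
}


def encode_inventory(inventory, phonemes=PHONEMES):
    """Return a 0/1 vector marking which phonemes are present."""
    present = set(inventory)
    return np.array([1 if p in present else 0 for p in phonemes], dtype=np.uint8)


names = sorted(TOY_INVENTORIES)
data = np.stack([encode_inventory(TOY_INVENTORIES[n]) for n in names])

# Round-trip through an in-memory npz; on disk you would pass a filename instead.
buffer = io.BytesIO()
np.savez_compressed(buffer, inventories=data, languages=np.array(names))
buffer.seek(0)
loaded = np.load(buffer)

print(loaded["inventories"].shape)
print(loaded["languages"])
```

To inspect a real npz file without knowing its keys in advance, `np.load(path).files` lists the array names it contains.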

Language prediction

Before using a GAN (generative adversarial network) to generate new language phoneme inventories, I wanted to check that it was possible to predict languages by their phonemes.

Language prediction notebook: predict_languages.ipynb

Overall the accuracy is very poor, but the number of classes is very high relative to the number of training samples (approx 80%). The model tends to just pick languages with the most samples in the training data. However, it does perform better than random chance and better than just picking one of the 5 most common languages in the training set.
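
For context, the baselines mentioned above can be computed along these lines. This is a minimal sketch rather than the code in the notebook: the function name and the toy labels are hypothetical, and the chance figure assumes every test label also appears in the training data.

```python
from collections import Counter


def baseline_accuracies(train_labels, test_labels, k=5):
    """Return (uniform-chance, majority-class, top-k coverage) baselines."""
    counts = Counter(train_labels)
    # Expected accuracy of guessing uniformly among the training classes.
    chance = 1.0 / len(counts)
    # Accuracy of always predicting the single most common training class.
    majority_label = counts.most_common(1)[0][0]
    majority = sum(y == majority_label for y in test_labels) / len(test_labels)
    # Share of test samples whose true label is among the k most common classes:
    # an upper bound for any strategy that only ever picks from those k classes.
    top_k = {label for label, _ in counts.most_common(k)}
    top_k_coverage = sum(y in top_k for y in test_labels) / len(test_labels)
    return chance, majority, top_k_coverage


# Toy labels (made up) just to show the call.
train = ["lang_a", "lang_a", "lang_a", "lang_b", "lang_b", "lang_c", "lang_d"]
test = ["lang_a", "lang_b", "lang_e", "lang_a"]
chance, majority, coverage = baseline_accuracies(train, test, k=2)
print(chance, majority, coverage)
```

A model is only doing something useful if its test accuracy clearly beats the majority figure, not just the uniform-chance one.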

Language phoneme inventory generation

Here's where the fun begins. I've previously experimented with building GANs to generate fake images (Joan Miró and Mark Rothko paintings), with varying degrees of success. It's relatively easy to do using Conv2DTranspose layers in Keras, though it requires a lot of trial and error to avoid overfitting (or sometimes just to produce anything at all). I figured that if I could represent the features of a language in a 3D array, I could use the same GAN architecture to generate fake language phoneme inventories.

For phoneme inventory generation, I barely bothered tuning the GAN architecture that I used for Rothko paintings. It needed a few tweaks to prevent it from overfitting and memorising samples: I removed some layers from the generator, reduced the number of epochs and increased the learning rate. Essentially, I just needed to make it a bit worse at generating fakes. This makes a lot of sense considering the difference in complexity between these simple pixelated phoneme inventory images and the far more complex Miró and Rothko paintings.

Phoneme inventory generation notebook: phoneme_gan.ipynb
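
One simple way to check whether a generator is memorising samples is to compare its output against the training set. The sketch below is not taken from the notebook. It assumes each inventory has already been binarised and flattened into a sequence of 0s and 1s, and both function names are hypothetical.

```python
def memorised_fraction(generated, training):
    """Fraction of generated samples that exactly match a training sample."""
    seen = {tuple(sample) for sample in training}
    copies = sum(tuple(sample) in seen for sample in generated)
    return copies / len(generated)


def hamming_to_nearest(sample, training):
    """Smallest number of differing positions between sample and any training row."""
    return min(sum(a != b for a, b in zip(sample, row)) for row in training)


# Toy data (made up): two training inventories, three generated ones.
training = [(1, 0, 1, 0), (0, 1, 1, 0)]
generated = [(1, 0, 1, 0), (1, 1, 1, 0), (0, 0, 0, 1)]

print(memorised_fraction(generated, training))
print([hamming_to_nearest(g, training) for g in generated])
```

A high exact-match fraction, or nearest distances that are almost all zero, would suggest the generator is copying rather than generalising.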

Morphology and grammar rule generation

Morphology and grammar rules were generated in a similar way, though it took a lot more experimentation to produce realistic rulesets. This is probably because of the way each value is represented in the data: as an ordinal number rather than binary. The results aren't ideal, as some important values can be missing from the generated data. I might need to try a different approach.

Morphology and grammar rule generation notebook: wals_parameters_gan.ipynb

Lexicon generation

Data from the Universal Language Dictionary (obtained from web.archive.org) was used to create a basic wordlist for translation. I got a few thousand sets of phonotactic rules from ChatGPT and used their relative frequency to rate each by "weirdness". I've written a very naive lexicon generation tool that accepts a basic syllable structure and phoneme inventory, and generates a lexicon. The idea is that the phoneme inventory can be generated by the GAN and supplied to the lexicon builder. The lexicon builder produces a lot of unrealistic words, but my plan is to apply a series of sound change rules to the lexicon. I'm hoping this will result in a set of proto-language root words that seem naturalistic.

Word list creation notebook: word_list.ipynb
Lexicon generation notebook: generate_lexicon.ipynb
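
A naive generator of the kind described above might look something like this. It is a sketch, not the code in generate_lexicon.ipynb: the function names, the inventory layout (a dict with "C" and "V" lists), the 50% chance of filling an optional slot and the toy phonemes are all assumptions for illustration.

```python
import random


def generate_word(inventory, structure="(C)V", n_syllables=2, rng=random):
    """Build one word by filling a syllable template such as "(C)V" or "CVC"."""
    word = []
    for _ in range(n_syllables):
        optional = False
        for symbol in structure:
            if symbol == "(":
                optional = True
            elif symbol == ")":
                optional = False
            elif symbol in inventory:
                # Optional slots are skipped half of the time (arbitrary choice).
                if optional and rng.random() < 0.5:
                    continue
                word.append(rng.choice(inventory[symbol]))
            else:
                raise ValueError(f"Unknown symbol in structure: {symbol!r}")
    return "".join(word)


def generate_lexicon(inventory, size, structure="(C)V", max_syllables=3, seed=0):
    """Generate up to `size` distinct words, reproducibly for a given seed."""
    rng = random.Random(seed)
    lexicon = set()
    attempts = 0
    # Cap the attempts so a tiny inventory cannot cause an endless loop.
    while len(lexicon) < size and attempts < size * 100:
        attempts += 1
        n_syllables = rng.randint(1, max_syllables)
        lexicon.add(generate_word(inventory, structure, n_syllables, rng))
    return sorted(lexicon)


# Toy inventory (made up); in the real pipeline this would come from the GAN.
TOY_INVENTORY = {"C": ["p", "t", "k", "m", "n"], "V": ["a", "i", "u"]}
lexicon = generate_lexicon(TOY_INVENTORY, size=20)
print(lexicon)
```

Every word is equally likely here, which is part of why the output looks unrealistic. Weighting phonemes by frequency would be one obvious refinement.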

Sound change rules

I am compiling all sound change rules from the Searchable Index Diachronica into Brassica format. This is painfully slow going, and I've needed to simplify some of the rules. The result won't be an accurate representation of all of the Index Diachronica rules in Brassica format, but I think it will be close enough to generate plausible sequences of rules (again using a GAN) that can be used for the proto-language root word generation mentioned above, and for further evolution later.

WIP sound change rules: sound_change.txt
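
To show what applying a sequence of sound changes to a lexicon means in practice, here is a minimal sketch using plain Python regular expressions. It deliberately does not use or imitate Brassica syntax; the three rules, the vowel set and the example words are invented for illustration.

```python
import re

VOWELS = "aeiou"

# Each rule is (regex pattern, replacement, human-readable description).
RULES = [
    (rf"(?<=[{VOWELS}])p(?=[{VOWELS}])", "b", "p > b between vowels"),
    (rf"(?<=[{VOWELS}])t(?=[{VOWELS}])", "d", "t > d between vowels"),
    (r"u$", "", "word-final u is lost"),
]


def apply_rules(word, rules=RULES):
    """Apply each rule once, in order, to a single word."""
    for pattern, replacement, _description in rules:
        word = re.sub(pattern, replacement, word)
    return word


def evolve(lexicon, rules=RULES):
    """Map every word in the lexicon to its form after all rules have applied."""
    return {word: apply_rules(word, rules) for word in lexicon}


print(evolve(["kapatu", "pita", "tapu"]))
# {'kapatu': 'kabad', 'pita': 'pida', 'tapu': 'tab'}
```

Rule order matters: the final u is lost only after it has conditioned the voicing of the preceding consonant, which is why "tapu" ends up as "tab" rather than "tap".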

Next steps

Determine phonotactics (probably just default to (C)V to begin with)
Generate root words using the phoneme inventory
Determine basic grammar
Create proto-language lexicon
Apply selection of sound change rules
- Update phonology, phonotactics, grammar and lexicon after each iteration
- Is it possible to determine and update language morphology here?
Generate translations
- At any historic period in the language evolution
- I think this may require a set of "canned" English sentences that are annotated in some way, so that the grammar rules at the current historical period can be applied
Generate HTML/PDF language grammar document
Generate sample audio wav files
