Pappa/conlanger

conlanger

An experiment in automatic Conlang creation.

I am a novice Conlanger, currently enjoying the view from the peak of Mount Stupid, so this may go nowhere useful. I'm mostly hoping it goes somewhere dumb and ridiculous.

Peak of Mount Stupid

Data Preparation

Language phoneme data from phoible.org was used to create a dataset suitable for ML. One dialect phoneme inventory from each language was selected and prepared as a 4D NumPy array.

Data on morphology and grammar from WALS was prepared in a similar way.

Language prediction

Before using a GAN (generative adversarial network) to generate new language phoneme inventories, I wanted to check that it was possible to predict languages by their phonemes.

Overall the accuracy is very poor, but the number of classes is very high relative to the number of training samples (approx 80%). The model tends to just pick languages with the most samples in the training data. However, it does perform better than random chance and better than just picking one of the 5 most common languages in the training set.
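The two baselines mentioned above can be sketched as follows. This is a minimal illustration, not the code used in this repo: the function names are hypothetical, and the "top 5" baseline is assumed to mean guessing uniformly at random among the 5 most common training languages.

```python
from collections import Counter


def random_chance_accuracy(train_labels):
    """Expected accuracy of guessing uniformly among all known classes."""
    return 1 / len(set(train_labels))


def top_k_baseline_accuracy(train_labels, test_labels, k=5):
    """Expected accuracy of guessing uniformly among the k most common
    training labels (an assumed reading of the 'top 5' baseline)."""
    if not test_labels:
        return 0.0
    top_k = [label for label, _ in Counter(train_labels).most_common(k)]
    # A test item can only be hit if its label is in top_k,
    # and then only with probability 1 / len(top_k).
    hits = sum(1 for label in test_labels if label in top_k)
    return hits / (len(top_k) * len(test_labels))
```

A classifier is only interesting if its test accuracy beats both numbers, which is the comparison described above.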

Language phoneme inventory generation

Here's where the fun begins. I've previously experimented with building GANs to generate fake images (Joan Miro and Mark Rothko paintings), with varying degrees of success. It's relatively easy to do using Conv2DTranspose layers in Keras, though it requires a lot of trial and error to avoid overfitting (or sometimes just to produce anything at all). I figured that if I could represent the features of a language in a 3D array, I could use the same GAN architecture to generate fake language phoneme inventories.

For phoneme inventory generation, I barely bothered tuning the GAN architecture that I used for Rothko paintings. It needed a few tweaks to prevent it from overfitting and memorising samples. I removed some layers from the generator, reduced the number of epochs and increased the learning rate. Essentially, I just needed to make it a bit worse at generating fakes. This makes a lot of sense considering the difference in complexity between these simple pixelated phoneme inventory images and the far more complex Miro and Rothko paintings.

Morphology and grammar rule generation

Morphology and grammar rules were generated in a similar way, though it took a lot more experimentation to produce realistic rulesets. This is probably because of the way each value is represented in the data: as an ordinal number rather than a binary one. The results aren't ideal, as some important values can be missing from the generated data. I might need to try a different approach.
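One possible alternative, sketched here purely as an assumption and not something this repo does, is to turn each ordinal feature value into a binary (one-hot) vector, so that the morphology data looks more like the binary phoneme data. The helper names and the 7-category feature in the usage note are hypothetical.

```python
def one_hot(value, num_values):
    """Encode a category numbered 1..num_values as a binary vector."""
    if not 1 <= value <= num_values:
        raise ValueError(f"value {value} is outside 1..{num_values}")
    return [1 if i == value else 0 for i in range(1, num_values + 1)]


def decode(scores):
    """Map generated scores back to a category number (1-based)
    by taking the position of the highest score."""
    return max(range(len(scores)), key=scores.__getitem__) + 1
```

For a hypothetical feature with 7 categories, `one_hot(2, 7)` gives `[0, 1, 0, 0, 0, 0, 0]`, and `decode` always returns exactly one category per feature, so a feature can never be missing from the decoded output.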

Lexicon generation

Data from the Universal Language Dictionary (obtained from web.archive.org) was used to create a basic wordlist for translation. I got a few thousand sets of phonotactic rules from ChatGPT and used their relative frequency to rate each by "weirdness". I've written a very naive lexicon generation tool that accepts a basic syllable structure and phoneme inventory, and generates a lexicon. The idea is that the phoneme inventory can be generated by the GAN and supplied to the lexicon builder. The lexicon builder produces a lot of unrealistic words, but my plan is to apply a series of sound change rules to the lexicon. I'm hoping this will result in a set of proto-language root words that seem naturalistic.
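To give a rough idea of what a naive lexicon generator looks like, here is a minimal sketch. It is not the tool in this repo: the function names, the `"(C)V"`-style pattern syntax, the 50% chance of filling an optional slot and the limit of 3 syllables are all assumptions made for illustration.

```python
import random


def generate_word(pattern, consonants, vowels, rng, max_syllables=3):
    """Build one word from a syllable pattern such as "(C)V".

    C = consonant slot, V = vowel slot; slots inside parentheses are
    optional and each is filled with probability 0.5 (an assumption).
    """
    word = []
    for _ in range(rng.randint(1, max_syllables)):
        optional = False
        for ch in pattern:
            if ch == "(":
                optional = True
            elif ch == ")":
                optional = False
            elif ch in "CV":
                if optional and rng.random() < 0.5:
                    continue  # skip this optional slot
                pool = consonants if ch == "C" else vowels
                word.append(rng.choice(pool))
    return "".join(word)


def generate_lexicon(glosses, pattern, consonants, vowels, seed=0, max_tries=1000):
    """Assign a distinct generated word to every gloss in the wordlist."""
    rng = random.Random(seed)
    lexicon, used = {}, set()
    for gloss in glosses:
        for _ in range(max_tries):
            word = generate_word(pattern, consonants, vowels, rng)
            if word not in used:
                break
        else:
            raise RuntimeError("could not find an unused word; inventory too small?")
        used.add(word)
        lexicon[gloss] = word
    return lexicon
```

Calling `generate_lexicon(["water", "fire", "stone"], "(C)V", list("ptkmn"), list("aiu"))` returns a dict mapping each gloss to a distinct word. In the planned pipeline the consonant and vowel lists would come from a GAN-generated inventory instead of being typed in by hand.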

Sound change rules

I am compiling all sound change rules from the Searchable Index Diachronica into Brassica format. This is painfully slow going, and I've needed to simplify some of the rules. The result won't be an accurate representation of all of the Index Diachronica rules in Brassica format, but I think it will be close enough to generate plausible sequences of rules (again using a GAN) that can be used for the proto-language root word generation mentioned above, and for further evolution later.
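As a rough illustration of what applying a sequence of sound change rules to a lexicon involves, here is a minimal sketch using Python regular expressions. It does not use Brassica or its rule syntax: the `(target, replacement, before, after)` tuple format and the function names are assumptions made for this example.

```python
import re


def apply_rule(word, target, replacement, before="", after=""):
    """Replace `target` with `replacement` where it is preceded by
    `before` and followed by `after` (both regex fragments; `before`
    must be fixed-width because it is used in a lookbehind)."""
    pattern = re.escape(target)
    if before:
        pattern = f"(?<={before})" + pattern
    if after:
        pattern = pattern + f"(?={after})"
    # A function is used so the replacement is always taken literally.
    return re.sub(pattern, lambda match: replacement, word)


def evolve(lexicon, rules):
    """Apply every rule, in order, to every word in the lexicon."""
    evolved = {}
    for gloss, word in lexicon.items():
        for rule in rules:
            word = apply_rule(word, *rule)
        evolved[gloss] = word
    return evolved


# Hypothetical rules: voice p between vowels, then drop word-final a.
RULES = [
    ("p", "b", "[aiu]", "[aiu]"),
    ("a", "", "", "$"),
]
```

With these two rules, `evolve({"water": "apa", "fire": "pata"}, RULES)` gives `{"water": "ab", "fire": "pat"}`. Rule order matters: each rule sees the output of the one before it.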

Next steps

  • Determine phonotactics (probably just default to (C)V to begin with)
  • Generate root words using the phoneme inventory
  • Determine basic grammar
  • Create proto-language lexicon
  • Apply selection of sound change rules
    • Update phonology, phonotactics, grammar and lexicon after each iteration
    • Is it possible to determine and update language morphology here?
  • Generate translations
    • At any historic period in the language evolution
    • I think this may require a set of "canned" English sentences that are annotated in some way, so that the grammar rules at the current historical period can be applied
  • Generate HTML/PDF language grammar document
  • Generate sample audio wav files
