romanesco

A vanilla recurrent neural network (RNN) language model. Supports model training, text scoring, and text generation.

Installation

Make sure you have an NVIDIA GPU at your disposal, with all drivers and CUDA installed. Make sure you also have python >= 3.5, pip and git installed, and run

git clone https://github.com/zurichnlp/romanesco.git
cd romanesco
pip install --user -e .

If you have sudo privileges and prefer to install romanesco for all users on your system, omit the --user flag. The -e flag installs the app in “editable mode”, meaning you can change source files (such as romanesco/const.py) at any time.

Model training

Models are trained from a single plaintext file with one sentence per line. Symbols – e.g., words or characters – are delimited by blanks.

Example input (word-level):

I love the people of Iowa .
So that &apos;s the way it is .
Very simple .

Example input (character-level):

I <blank> l o v e <blank> t h e <blank> p e o p l e <blank> o f <blank> I o w a .
S o <blank> t h a t &apos; s <blank> t h e <blank> w a y <blank> i t <blank> i s .
V e r y <blank> s i m p l e .

romanesco doesn't preprocess training data. If you want to train a model on lowercased input, for example, you'll need to lowercase the training data yourself.

To train a model from corpus.train.txt using GPU 0, run

CUDA_VISIBLE_DEVICES=0 romanesco train corpus.train.txt

By default, the trained model and vocabulary will be stored in a directory called model, and logs (for monitoring with Tensorboard) in logs. You can use custom destinations through the -m and -l command line arguments, respectively. Folders will be created if they don't exist.

Some hyperparameters can be adjusted from the command line; run romanesco train -h for details. Other hyperparameters are currently hardcoded in romanesco/const.py.

Scoring

Once you've trained a model, you can use it to score texts. romanesco will calculate the perplexity of a text given a trained model. Lower is better: if you've trained a model on TV subtitles, it will typically assign lower scores to other TV subtitles than, say, an article from the New York Times.

To score my-article.txt using GPU 0, run

CUDA_VISIBLE_DEVICES=0 romanesco score my-article.txt

This assumes there is a folder called model in your current working directory, containing a model trained with romanesco (see above). If your model is stored somewhere else, use the -m command line argument.

For further options, run romanesco score -h.

Sampling

A trained model can be used to generate new text resembling the original training data. To generate a text with length 200 (number of symbols), run

CUDA_VISIBLE_DEVICES=0 romanesco sample 200

This assumes there is a folder called model in your current working directory, containing a model trained with romanesco (see above). If your model is stored somewhere else, use the -m command line argument.

For further options, run romanesco sample -h.

Name		Name	Last commit message	Last commit date
Latest commit History 63 Commits
bin		bin
romanesco		romanesco
utils		utils
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
austen.dev.txt		austen.dev.txt
austen.train.txt		austen.train.txt
austen_sample.txt		austen_sample.txt
austen_sample_1.txt		austen_sample_1.txt
austen_sample_2.txt		austen_sample_2.txt
austen_sample_4.txt		austen_sample_4.txt
loss_function_training_1.JPG		loss_function_training_1.JPG
loss_function_training_2.JPG		loss_function_training_2.JPG
loss_function_training_3.JPG		loss_function_training_3.JPG
loss_function_training_4.JPG		loss_function_training_4.JPG
modell_schema.jpg		modell_schema.jpg
preprocessing_romanesco.py		preprocessing_romanesco.py
sarahkiener_mt19_uebung4_bericht.pdf		sarahkiener_mt19_uebung4_bericht.pdf
setup.py		setup.py
test		test
test.txt		test.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

romanesco

Installation

Model training

Scoring

Sampling

About

Releases

Packages

Languages

License

sarahkiener/romanesco

Folders and files

Latest commit

History

Repository files navigation

romanesco

Installation

Model training

Scoring

Sampling

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages