This repository contains usable code from the paper:
G. Marra, A. Zugarini, S. Melacci, and M. Maggini, “An unsupervised character-aware neural approach to word and context representation learning,” in Proceedings of the 27th International Conference on Artificial Neural Networks – ICANN 2018
The project is structured as follows:
- A `data` folder, containing the txt files from which to learn the embeddings.
- A `log` folder, where the model is saved.
- The `char2word.py` script, which is the main routine.
- The `encoder.py` script, which contains some utility functions.
For a standard training procedure, do the following:
- Be sure both the `data` and `log` folders are present.
- Put your training data into the `data` folder, with files named `data(something).txt`, e.g. `data01.txt`.
- Simply run `char2word.py`.

The script will create a `vocabulary.txt` file inside the `data` folder to be used during training.
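
The folder setup above can be checked before training with a small helper. This is a minimal sketch and not part of the repository: the function name `prepare_folders` is hypothetical, and it assumes that `data(something).txt` can be read as the glob pattern `data*.txt`.

```python
import glob
import os


def prepare_folders(root="."):
    """Create data/ and log/ under root if missing and list the training files.

    Hypothetical helper, not part of char2word.py.
    """
    data_dir = os.path.join(root, "data")
    log_dir = os.path.join(root, "log")
    for folder in (data_dir, log_dir):
        if not os.path.isdir(folder):
            os.makedirs(folder)
    # Assumption: "data(something).txt" is interpreted as data*.txt,
    # so vocabulary.txt is not picked up as a training file.
    return sorted(glob.glob(os.path.join(data_dir, "data*.txt")))
```

Calling `prepare_folders()` from the repository root returns the sorted list of matching training files. An empty list means no file such as `data01.txt` has been placed in `data` yet.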
All settings default to the values used in the paper. The script does not yet provide command-line configuration options, apart from the folder settings (run it with `--help` for info).
To use a custom configuration, modify the `Config` class in the `char2word.py` script.
We will provide a more user-friendly command-line interface as soon as possible, together with more details about the training procedure and how to incorporate the model inside bigger models.
For a quick way to use already trained embeddings, look at the `SentenceEncoder` class together with the `test` function.
This code has been tested with `tensorflow==1.4` and `python2.7`. It also depends on the Python library `nltk`.