# str2vec
This work was done while Peng Li was a Ph.D. student at Tsinghua University.
str2vec is a toolkit for computing vector-space representations of variable-length phrases using recursive autoencoders (RAEs; the following figure shows an example RAE).
In this document, we demonstrate:
- How to train RAEs in an unsupervised and parallelized manner;
- How to compute vector-space representations for phrases once the RAE is trained.
For more information about recursive autoencoders, please refer to:
Richard Socher, Jeffrey Pennington, Eric Huang, Andrew Y. Ng, and Christopher D. Manning. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions. Proc. of Conference on Empirical Methods in Natural Language Processing (EMNLP 2011), pp. 151-161.
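To give a rough idea of what an RAE computes: two child vectors are composed into a parent vector of the same size, and the parent is then asked to reconstruct its children; the reconstruction error is what training minimizes. Below is a minimal numpy sketch of one such composition step. The weight names (`W1`, `b1`, `W2`, `b2`) and the `tanh` nonlinearity are illustrative assumptions, not the toolkit's exact internals:

```python
import numpy as np

n = 3  # word embedding size, matching the demo word vectors below
rng = np.random.RandomState(0)

# Encoder and decoder parameters (random here; in str2vec they are trained with L-BFGS).
W1, b1 = 0.01 * rng.randn(n, 2 * n), np.zeros(n)       # children -> parent
W2, b2 = 0.01 * rng.randn(2 * n, n), np.zeros(2 * n)   # parent -> reconstructed children

c1, c2 = rng.randn(n), rng.randn(n)                    # two child vectors, e.g. word vectors
children = np.concatenate([c1, c2])
parent = np.tanh(W1.dot(children) + b1)                # compose children into a parent vector
recon = np.tanh(W2.dot(parent) + b2)                   # try to reconstruct the children
rec_error = 0.5 * np.sum((children - recon) ** 2)      # reconstruction error to be minimized
```

For a full phrase, this step is applied recursively over a binary tree until a single vector represents the whole phrase.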
This toolkit has been tested on Ubuntu 14.04 and Mac OS 10.9, but it should also work on other platforms supported by the following software:
- Python 2.7.8 or later (Python 3.x is not supported)
- open-mpi 1.8.1 or later (other MPI implementation supported by mpi4py should also be OK)
- Numpy 1.8.1 or later
- Scipy 0.14.0 or later
- mpi4py 1.3.1 or later
Note: some APIs of mpi4py and open-mpi have changed in recent versions. If you encounter errors such as "TypeError: bcast() takes at least 1 positional argument (0 given)", please fall back to the versions listed above.
Python is easy to install, and open-mpi is available on most Linux platforms. Numpy, Scipy, and mpi4py are available from pip. Alternatively, you can install all of this software from source code if you prefer, or if you care about efficiency.
We use the following commands to install the above software on Ubuntu 14.04:
```bash
# install python, python development files and pip
sudo apt-get install python python-dev python-pip

# install open-mpi runtime, development files and tools
sudo apt-get install openmpi-bin libopenmpi-dev

# install mpi4py
sudo pip install mpi4py

# install numpy and scipy
sudo apt-get install g++ gfortran build-essential
sudo apt-get install libopenblas-dev liblapack-dev
sudo pip install numpy scipy
```
A toy demo is placed under the folder `demo-data`. We assume you have already unpacked the demo data, and we use `$DEMODIR` to refer to the root directory of the demo.
You need to provide two input files:
- word vectors file
- training phrases file
The word vectors file should look like this:

```
10 3
word1 -0.049524 0.033159 0.008865
word2 -0.049524 0.033159 0.008865
word3 -0.049524 0.033159 0.008865
...
word10 -0.049524 0.033159 0.008865
```

The first line is `vocabulary_size word_embedding_size` and is optional. The rest of the file consists of `vocabulary_size` lines of the form `word value1 value2 ... valueN`, where `N` equals `word_embedding_size`. These word vectors can be trained using any toolkit, e.g. word2vec.
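For reference, this layout is simple enough to read with a few lines of Python. The sketch below only illustrates the format described above; it is not the toolkit's own loader:

```python
import numpy as np

def load_word_vectors(path):
    """Read a word vectors file in the layout shown above; the header line is optional."""
    vectors = {}
    with open(path) as f:
        lines = f.read().splitlines()
    # A "vocabulary_size word_embedding_size" header has exactly two fields,
    # while word lines have the word plus word_embedding_size values.
    if lines and len(lines[0].split()) == 2:
        lines = lines[1:]
    for line in lines:
        parts = line.split()
        vectors[parts[0]] = np.array([float(v) for v in parts[1:]])
    return vectors
```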
The training phrases file should look like this:

```
first phrase ||| 10
second phrase ||| 1
...
nth phrase ||| 8
a phrase without frequency
```

`|||` is a separator between the phrase and its frequency (an integer). The frequency is optional; if it is omitted, it is assumed to be 1. As we use L-BFGS, a batch-mode algorithm, to train the recursive autoencoders, the frequency helps save computation time (otherwise we would need to repeat the same computation N times for a phrase that occurs N times).
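The phrase file can be parsed in the same spirit. Again, the following is only a sketch of the format, not the toolkit's actual parser:

```python
def load_phrases(path):
    """Read (phrase, frequency) pairs; a missing frequency defaults to 1."""
    phrases = []
    with open(path) as f:
        for line in f:
            line = line.rstrip('\n')
            if '|||' in line:
                phrase, freq = line.rsplit('|||', 1)
                phrases.append((phrase.strip(), int(freq)))
            else:
                phrases.append((line.strip(), 1))
    return phrases
```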
We have provided a demo training script `mpi-train.sh`, shown below:
```bash
#!/bin/bash

if [ $# -ne 1 ]
then
    echo "Usage: $0 coreNum"
    exit -1
fi

N=$1
DEMODIR=`pwd -P`
export PYTHONPATH=$DEMODIR/bin/str2vec/src

mpirun -n $1 python $PYTHONPATH/nn/lbfgstrainer.py \
    -instances $DEMODIR/input/sample-training-file.txt \
    -model $DEMODIR/output/sample-training-file.mpi-$N.model.gz \
    -word_vector $DEMODIR/input/sample-word-vectors-trained-by-word2vec.txt \
    -lambda_reg 0.15 \
    -m 200
```
You can run `./mpi-train.sh 2` to start 2 processes to train the RAE. The output looks like:
```
Instances file: /Users/lpeng/exp/str2vec/str2vec-demo/input/sample-training-file.txt
Model file: /Users/lpeng/exp/str2vec/str2vec-demo/output/sample-training-file.mpi-2.model.gz
Word vector file: /Users/lpeng/exp/str2vec/str2vec-demo/input/sample-word-vectors-trained-by-word2vec.txt
lambda_reg: 0.149999999999999994
Max iterations: 200
load word vectors...
preparing data...
init. RAE parameters...
seed: None
shape of theta0 430
optimizing...
saving parameters to /Users/lpeng/exp/str2vec/str2vec-demo/output/sample-training-file.mpi-2.model.gz
Init. theta0 : 0.00 s
Optimizing : 0.87 s
Saving theta : 0.01 s
Done!
```
There are 5 parameters for the python program `$PYTHONPATH/nn/lbfgstrainer.py`:

- `-instances`: the training phrases file
- `-model`: the output model file
- `-word_vector`: the word vectors file
- `-lambda_reg`: the training objective function of our toolkit is $$J(\theta) = \mathrm{AverageReconstructionError} + \frac{\lambda}{2}\|\theta\|^2$$ and `-lambda_reg` is the value of $\lambda$. Its default value is 0.15.
- `-m`: the maximum number of iterations
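In other words, `-lambda_reg` only controls the weight of the L2 penalty on the parameter vector. Schematically (with hypothetical names, not the trainer's actual code), the quantity being minimized is:

```python
import numpy as np

def objective(theta, avg_reconstruction_error, lambda_reg=0.15):
    """J(theta) = AverageReconstructionError + (lambda_reg / 2) * ||theta||^2 (schematic)."""
    return avg_reconstruction_error + 0.5 * lambda_reg * np.dot(theta, theta)
```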
If you use the above script, the following model files will be written to `$DEMODIR/output`:

- `sample-training-file.mpi-2.model.gz`: the binary model file used by the toolkit
- `sample-training-file.mpi-2.model.gz.txt`: a plain-text version for human reading
We have provided a demo script `$DEMODIR/compute-vector.sh` for computing vector-space representations for the phrases in `$DEMODIR/input/sample-training-file.txt`, shown below:
```bash
#!/bin/bash

DEMODIR=`pwd -P`
export PYTHONPATH=$DEMODIR/bin/str2vec/src

python $PYTHONPATH/nn/rae.py \
    $DEMODIR/input/sample-training-file.txt \
    $DEMODIR/input/sample-word-vectors-trained-by-word2vec.txt \
    $DEMODIR/output/sample-training-file.mpi-2.model.gz \
    $DEMODIR/output/sample-training-file.vec.txt
```
There are 4 positional parameters for the python program `$PYTHONPATH/nn/rae.py`:
- 1st: the input file containing phrases
- 2nd: the word vectors file
- 3rd: the binary model file
- 4th: the output file
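Assuming the demo layout above, `compute-vector.sh` itself takes no arguments, so you can simply run `./compute-vector.sh` from the demo root.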
The phrase vectors will be written to the output file, and you will see the following command output, which reports reconstruction errors:
```
load word vectors...
load RAE parameters...
===============================================================
all          avg/node     internal node
---------------------------------------------------------------
0.00000000,  0.00000000,  0
0.12480469,  0.12480469,  1
2.46327700,  0.61581925,  4
0.00000000,  0.00000000,  0
0.00000000,  0.00000000,  0
14.34905919, 1.30445993,  11
6.79240564,  2.26413521,  3
3.26325339,  0.81581335,  4
3.92313449,  3.92313449,  1
1.54129705,  0.51376568,  3
6.65650955,  3.32825477,  2
2.22864149,  0.55716037,  4
---------------------------------------------------------------
average reconstruction error per instance: 3.44519854
average reconstruction error per node: 1.25279947
===============================================================
```
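As a quick sanity check, the two summary numbers can be recovered from the per-phrase rows above: the per-instance average divides the total error by the number of phrases (12), and the per-node average divides it by the total number of internal nodes (33):

```python
# Values copied from the demo output above: per-phrase reconstruction
# errors (first column) and internal node counts (third column).
errors = [0.0, 0.12480469, 2.46327700, 0.0, 0.0, 14.34905919,
          6.79240564, 3.26325339, 3.92313449, 1.54129705,
          6.65650955, 2.22864149]
nodes = [0, 1, 4, 0, 0, 11, 3, 4, 1, 3, 2, 4]

print(sum(errors) / len(errors))  # ~3.44519854 (per instance)
print(sum(errors) / sum(nodes))   # ~1.25279947 (per node)
```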
Peng Li, Yang Liu, Maosong Sun. Recursive Autoencoders for ITG-based Translation. Proc. of EMNLP 2013: Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, 2013, pp. 567-577.