Trance parser is an implementation of transition-based neural constituent parsing [1]: a transition-based parser that uses neural networks to score all derivation histories.
Currently, we support the following neural networks:
- Model1: no feedback from stacks or contexts (tree model [1])
- Model2: feedback from stacks for shift actions
- Model3: Model2 + queue contexts
- Model4: Model2 + feedback from stacks for reduce/unary actions (+stack model [1])
- Model5: Model4 + queue contexts (+queue model [1])
Various training objectives are supported:
- {max,early,late}-violation with expected/Viterbi mistakes
- expected evalb
- structured hinge loss
together with online optimizers: SGD, AdaGrad, AdaDec, and AdaDelta.
The latest code is available from github.com.
We follow a standard practice of configure/make/make install. For details, see BUILD.rst.
./autogen.sh (required when you get the code by git clone)
./configure
make
make install (optional)
We provide models for two languages, English (WSJ) and Chinese (CTB). Both are Model5, which performs best in our settings. The following is an example of running our models, using STDIN/STDOUT as input/output (assuming UTF-8 encoding):
progs/trance_parse \
--grammar models/{WSJ,CTB}-grammar.gz \
--model models/{WSJ,CTB}-model \
--unary {3,4} \
--signature {English,Chinese} \
--precompute \
--simple
where --unary specifies the number of consecutive unaries (3 for WSJ, 4 for CTB), --signature is used to represent OOVs based on each word's signature, --precompute precomputes word representations for faster parsing, and --simple specifies a Penn Treebank-style output format.
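For instance, a concrete invocation with the English (WSJ) model looks roughly as follows; the input sentence is an arbitrary PTB-tokenized example, and the model paths assume the distributed WSJ files:

echo "No , it was n't Black Monday ." | \
progs/trance_parse \
--grammar models/WSJ-grammar.gz \
--model models/WSJ-model \
--unary 3 \
--signature English \
--precompute \
--simple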
Input sentences are assumed to be tokenized according to the corresponding treebank standards: for English, it is recommended to use the tokenizer from the Stanford Parser. For Chinese, the Stanford Word Segmenter is a good choice.
Sample scripts are available in samples/train-{wsj,ctb}.sh for training WSJ and CTB, respectively, using publicly available tools for preprocessing.
In brief, first, we need to obtain treebank trees in a normalized form:
cat [treebank files] | \
progs/trance_treebank \
--output [output normalized treebank] \
--normalize \
--remove-none \
--remove-cycle
Here, trees are normalized by adding a ROOT label, removing -NONE-, removing X-over-X unaries, and stripping off tags in each label. If you add the --leaf flag, it will output only the leaves, i.e., sentences. The --pos option replaces each POS tag in the trees with tags from a POS file, which consists of a sequence of POS tags for each word.
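For example, to extract only the tokenized sentences from the normalized trees (e.g., as input for pretraining word embeddings), the same command can be rerun with the --leaf flag added; the file names are placeholders:

cat [treebank files] | \
progs/trance_treebank \
--output [output sentence file] \
--normalize \
--remove-none \
--remove-cycle \
--leaf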
Second, we need to compute a grammar from the treebank:
progs/trance_grammar \
--input [treebank file] \
--output [grammar file] \
--cutoff 3 \
--debug
By default, the cutoff threshold is set to 3 (--cutoff 3), meaning that words which occur twice or less are mapped to the special token <unk>. For English or Chinese, it is better to map OOVs via word signatures by adding the --signature {English,Chinese} option. The --debug option is recommended, since it outputs various statistics, most notably the maximum unary size, which is used during learning and testing via the --unary [maximum unary size] option.
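For instance, a grammar extraction step for the English (WSJ) setting might look like the following; the file names are placeholders:

progs/trance_grammar \
--input wsj-train.treebank \
--output wsj-grammar.gz \
--signature English \
--cutoff 3 \
--debug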
Third, learn a model:
progs/trance_learn \
--input [treebank file] \
--test [treebank development file] \
--output [model file] \
--grammar [grammar file] \
--unary [maximum unary size] \
--hidden [hidden dimension size] \
--embedding [word embedding dimension size] \
--beam 32 \
--kbest 128 \
--randomize \
--learn all:opt=adadec,violation=max,margin-all=true,batch=4,iteration=100,eta=1e-2,gamma=0.9,epsilon=1,lambda=1e-5 \
--mix-select \
--averaging \
--debug
Here, the --input option specifies the training data and --test the development data. The --output option will write the model with the best evalb score on the development data. By default, we train Model5, but you can choose a different model with the --model[1-5] options. The grammar file is the one learned by trance_grammar, and if you specified the --signature option there, you have to use the same one here. The --unary option should match the maximum unary size output by trance_grammar with the --debug option.
By default, we use a hidden size of 64 and an embedding size of 1024, and the model parameters are initialized randomly (--randomize). You can precompute word embeddings with word2vec or rnnlm and use them as the initial parameters for the word representation via the --word-embedding [embedding file] option. The format is as follows:
word1 param1 param2 ... param[embedding size]
word2 param1 param2 ... param[embedding size]
word3 param1 param2 ... param[embedding size]
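For example, assuming a precomputed file named embeddings.txt with 1024-dimensional vectors (the file name is a placeholder; the remaining options follow the earlier example), the embeddings can be supplied roughly as follows:

progs/trance_learn \
--input [treebank file] \
--test [treebank development file] \
--output [model file] \
--grammar [grammar file] \
--unary [maximum unary size] \
--hidden 64 \
--embedding 1024 \
--word-embedding embeddings.txt \
--beam 32 \
--kbest 128 \
--learn all:opt=adadec,violation=max,margin-all=true,batch=4,iteration=100,eta=1e-2,gamma=0.9,epsilon=1,lambda=1e-5 \
--mix-select \
--averaging \
--debug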
The parameter estimation is performed by AdaDec with max-violation over expected mistakes (margin-all=true), with the hyperparameters eta=1e-2, gamma=0.9, epsilon=1, and lambda=1e-5. The maximum number of iterations is set to 100, with a mini-batch size of 4, a beam size of 32, and a kbest size of 128, i.e., the beam size in the final bin. In each iteration, we select the best model with respect to the L1 norm (--mix-select) and perform averaging for the output model (--averaging). This is the recommended setting employed in [1].
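Putting the three steps together, a minimal end-to-end run for WSJ might look like the sketch below; the file names are placeholders, the unary size of 3 assumes the value reported by trance_grammar for WSJ (as with the distributed models), and samples/train-wsj.sh covers the same pipeline together with the full preprocessing:

cat [WSJ training section files] | \
progs/trance_treebank \
--output wsj-train.treebank \
--normalize \
--remove-none \
--remove-cycle

cat [WSJ development section files] | \
progs/trance_treebank \
--output wsj-dev.treebank \
--normalize \
--remove-none \
--remove-cycle

progs/trance_grammar \
--input wsj-train.treebank \
--output wsj-grammar.gz \
--signature English \
--cutoff 3 \
--debug

progs/trance_learn \
--input wsj-train.treebank \
--test wsj-dev.treebank \
--output wsj-model \
--grammar wsj-grammar.gz \
--signature English \
--unary 3 \
--beam 32 \
--kbest 128 \
--randomize \
--learn all:opt=adadec,violation=max,margin-all=true,batch=4,iteration=100,eta=1e-2,gamma=0.9,epsilon=1,lambda=1e-5 \
--mix-select \
--averaging \
--debug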
[1] Taro Watanabe and Eiichiro Sumita. Transition-based Neural Constituent Parsing. In Proc. of ACL 2015 (to appear).