How to efficiently implement a decoder with character-level embedding? #3631
I think you basically want to get an embedding matrix for your vocabulary from the character CNN. This means getting an indexed representation of your entire vocabulary, then passing it through your character CNN. Then you can do a typical softmax over your vocabulary to get a distribution over the next word. Pretty messy if you have to do this at training time, but not bad if you just do it once for decoding. It turns out we do exactly this for a subset of the vocabulary in allennlp/allennlp/interpret/attackers/hotflip.py (lines 83 to 114 at commit 4b4d8be).
Something like that should work for you, though that exact code is not what you want. Note that there are no for loops necessary for any of this, except for in the …
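[Editor's note] A minimal PyTorch sketch of the idea described above: index the whole output vocabulary at the character level, run it through the character CNN in one batched call to get an output embedding matrix, and score decoder hidden states against that matrix. All names (`CharCnnWordEncoder`, `vocab_char_ids`, the dimensions) are illustrative assumptions, not AllenNLP APIs, and this is not the hotflip code referenced above.

```python
import torch
import torch.nn as nn


class CharCnnWordEncoder(nn.Module):
    """Embeds a word from its character ids: embed -> 1-d conv -> max-pool."""

    def __init__(self, num_chars: int, char_dim: int = 16, embed_dim: int = 128):
        super().__init__()
        self.char_embedding = nn.Embedding(num_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, embed_dim, kernel_size=3, padding=1)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (num_words, max_word_length)
        chars = self.char_embedding(char_ids)         # (num_words, max_len, char_dim)
        chars = chars.transpose(1, 2)                 # (num_words, char_dim, max_len)
        return torch.relu(self.conv(chars)).max(dim=2).values  # (num_words, embed_dim)


# vocab_char_ids holds the character ids of every token in the output vocabulary,
# built once from the vocabulary and a character mapping (random here for the demo).
vocab_size, max_word_length, num_chars = 10_000, 12, 262
vocab_char_ids = torch.randint(1, num_chars, (vocab_size, max_word_length))

encoder = CharCnnWordEncoder(num_chars)
output_embedding = encoder(vocab_char_ids)            # (vocab_size, embed_dim)

# Decoder step: score a batch of hidden states against every vocabulary token.
hidden = torch.randn(32, 128)                         # (batch, embed_dim)
logits = hidden @ output_embedding.t()                # (batch, vocab_size)
next_token = logits.softmax(dim=-1).argmax(dim=-1)
```

There is no per-token loop here: the only "loop" is the batched forward pass of the character CNN over the indexed vocabulary.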
Thanks a lot! But I don't quite get what you meant by "Pretty messy if you have to do this at training time, but not bad if you just do it once for decoding." Apparently I am going to need this embedding matrix during training, but I don't see why computing it during training is messy?
You could do it in the dataset reader, and pass a reference to the whole vocab object in every …
But we need to re-compute the entire embedding matrix anyway, right? So this is an inherent problem of using the character CNN for decoding.
Depends on what you mean by "decoding". You just have to recompute it when it changes. After the model is trained, it's not going to change, and you can just compute it once.
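[Editor's note] A hedged sketch of the "recompute only when it changes" point: during training the character CNN's weights change every step, so the matrix has to be rebuilt inside `forward()` (and gradients flow through it); once the model is frozen for decoding, it can be built once and cached. It reuses a word-from-characters encoder like the `CharCnnWordEncoder` sketched earlier; all names are illustrative, not library APIs.

```python
import torch
import torch.nn as nn


class CharLevelDecoderHead(nn.Module):
    def __init__(self, encoder: nn.Module, vocab_char_ids: torch.Tensor):
        super().__init__()
        self.encoder = encoder
        # Character ids for every token of the output vocabulary, fixed up front.
        self.register_buffer("vocab_char_ids", vocab_char_ids)
        self._cached_matrix = None

    def output_embedding(self) -> torch.Tensor:
        if self.training:
            # Weights are still moving: recompute every step so the char CNN trains.
            return self.encoder(self.vocab_char_ids)
        if self._cached_matrix is None:
            # Frozen model: compute once, then reuse for every decoding step.
            with torch.no_grad():
                self._cached_matrix = self.encoder(self.vocab_char_ids)
        return self._cached_matrix

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, embed_dim) -> logits over the whole output vocabulary.
        return hidden @ self.output_embedding().t()
```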
Consider a simple seq2seq model where, on the decoding side, I want to compute both the input embedding and the output embedding from character-level representations (e.g., to embed a word we can apply a 1-d CNN filter over its characters, or, even simpler, directly use the average of its character embeddings). It's convenient to use allennlp to implement the embedding layer, i.e., we can use a character-level indexer to convert each word into a list of ids, and then define an embedding layer over that list of ids. However, even though words are embedded at the character level, during decoding I don't want to make predictions character by character. In other words, the prediction granularity should still be token level. So in order to compute the logits for different tokens, we still need an output embedding layer of size `(hidden_state, num_of_classes)`. Here `num_of_classes` means the total number of distinct tokens (not characters) in the output vocabulary. So it seems like we still need a `SingleIdTokenIndexer` to tell us which tokens are in our output vocabulary, and then each token has two conversions, i.e., a single id from `SingleIdTokenIndexer` and a list of ids from `TokenCharactersIndexer`.

Here the problem comes: if I want to compute the output embedding (which is used to compute the logits), I need to compute it column by column (there are `num_of_classes` columns in total). To compute the `i`th column, I first need to find the `i`th token in the vocabulary associated with `SingleIdTokenIndexer`, and then get the character-level conversion of this token so as to compute its embedding. This seems very inefficient to me. First, you don't want too many for loops in your neural models, but here I need a for loop to compute the output embedding. Second, to get the `i`th column of the output embedding, I first convert `i` back into a token (`get_token_from_index`), and then convert the token into its corresponding character-level indices, which also looks awkward.

I guess there must be more efficient ways to implement this, since this is a fairly common need in NLP. Any suggestions will be greatly appreciated!