
How to efficiently implement a decoder with character-level embedding? #3631

Closed
entslscheia opened this issue Jan 16, 2020 · 5 comments

Comments

@entslscheia
Contributor

Consider a simple seq2seq model where, on the decoding side, I want to compute both the input embedding and the output embedding from character-level representations (e.g., to embed a word we can apply a 1-d CNN filter over its characters, or, even simpler, just average the embeddings of its characters). It's convenient to implement the embedding layer with AllenNLP: we can use a character-level indexer to convert each word into a list of ids and then define an embedding layer over that list of ids. However, even though words are embedded at the character level, during decoding I don't want to make predictions character by character. In other words, the prediction granularity should still be the token level. So in order to compute the logits for different tokens, we still need an output embedding layer of size (hidden_state, num_of_classes), where num_of_classes is the total number of distinct tokens (not characters) in the output vocabulary. It seems we therefore still need a SingleIdTokenIndexer to tell us which tokens are in the output vocabulary, and each token ends up with two conversions: a single id from SingleIdTokenIndexer and a list of character ids from TokenCharactersIndexer.
Here is where the problem comes in: if I want to compute the output embedding matrix (which is used to compute the logits), I have to compute it column by column (there are num_of_classes columns in total). To compute the i-th column, I first find the i-th token in the vocabulary associated with SingleIdTokenIndexer, then get that token's character-level conversion, and finally compute its embedding. This seems very inefficient. First, you don't want many for loops in a neural model, but here I need one to compute the output embedding. Second, to get the i-th column I convert i back into a token (get_token_from_index) and then convert that token into its character-level indices, which also feels awkward.
I assume there must be a more efficient way to implement this, since it's a fairly common need in NLP. Any suggestions would be greatly appreciated!
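To make this concrete, the column-by-column computation I'm describing (and would like to avoid) looks roughly like this; vocab, char_embedding, char_cnn, and the namespace strings are placeholders for whatever the model actually uses:

import torch

num_classes = vocab.get_vocab_size("target_tokens")
columns = []
for i in range(num_classes):
    # id -> token via the SingleIdTokenIndexer namespace
    token = vocab.get_token_from_index(i, namespace="target_tokens")
    # token -> character ids
    char_ids = torch.tensor(
        [vocab.get_token_index(c, namespace="token_characters") for c in token]
    )
    # run the character CNN on one token at a time (this is the inefficient part)
    char_vectors = char_embedding(char_ids.unsqueeze(0))   # (1, num_chars, char_dim)
    columns.append(char_cnn(char_vectors).squeeze(0))      # (hidden_state,)
output_embedding = torch.stack(columns, dim=1)             # (hidden_state, num_of_classes)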

@matt-gardner
Contributor

I think you basically want to get an embedding matrix for your vocabulary from the character CNN. This means getting an indexed representation of your entire vocabulary, then passing it through your character CNN. Then you can do a typical softmax over your vocabulary to get a distribution over the next word. Pretty messy if you have to do this at training time, but not bad if you just do it once for decoding.

It turns out we do exactly this for a subset of the vocabulary for Hotflip. You can see the code here:

def _construct_embedding_matrix(self) -> Embedding:
    """
    For HotFlip, we need a word embedding matrix to search over. The below is necessary for
    models such as ELMo, character-level models, or for models that use a projection layer
    after their word embeddings.

    We run all of the tokens from the vocabulary through the TextFieldEmbedder, and save the
    final output embedding. We then group all of those output embeddings into an "embedding
    matrix".
    """
    embedding_layer = util.find_embedding_layer(self.predictor._model)
    self.embedding_layer = embedding_layer
    if isinstance(embedding_layer, (Embedding, torch.nn.modules.sparse.Embedding)):
        # If we're using something that already has an only embedding matrix, we can just use
        # that and bypass this method.
        return embedding_layer.weight

    # We take the top `self.max_tokens` as candidates for hotflip. Because we have to
    # construct a new vector for each of these, we can't always afford to use the whole vocab,
    # for both runtime and memory considerations.
    all_tokens = list(self.vocab._token_to_index[self.namespace])[: self.max_tokens]
    max_index = self.vocab.get_token_index(all_tokens[-1], self.namespace)
    self.invalid_replacement_indices = [
        i for i in self.invalid_replacement_indices if i < max_index
    ]

    inputs = self._make_embedder_input(all_tokens)

    # pass all tokens through the fake matrix and create an embedding out of it.
    embedding_matrix = embedding_layer(inputs).squeeze()
    return embedding_matrix

Something like that should work for you, though that exact code is not what you want. Note that there are no for loops necessary for any of this, except inside the _make_embedder_input function, where we loop over the vocabulary in a list comprehension.
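And just to spell out the last step: once you have an embedding_matrix of shape (vocab_size, hidden_dim) from code like the above, scoring the next token needs no loop at all; decoder_hidden below stands in for whatever your decoder produces at each step:

import torch

# (batch, hidden_dim) x (hidden_dim, vocab_size) -> (batch, vocab_size)
logits = decoder_hidden @ embedding_matrix.t()
log_probs = torch.nn.functional.log_softmax(logits, dim=-1)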

@entslscheia
Contributor Author

entslscheia commented Jan 16, 2020

Thanks a lot! But I don't quite get what you meant by "Pretty messy if you have to do this at training time, but not bad if you just do it once for decoding." Apparently I am going to need this embedding matrix during training, but I don't see why computing it during training is messy.
Also, is the job of _make_embedder_input just to convert every token in the vocabulary into the corresponding input to the character CNN? If so, it's essentially the same conversion of the i-th token into a character-CNN input that I mentioned in my original question; the only difference is that instead of computing the embedding matrix column by column, your code computes the entire matrix from the input for the entire vocabulary. I think I can do something similar in my implementation; it's not hard to group the inputs together. What I really don't like is that, to build the embedding matrix, we need to do the index conversion (i.e., convert a token into an input to the character CNN) inside the model, while I was hoping this kind of thing could be done entirely within the dataset reader/iterator. But it looks like we have to do it anyway, so maybe it's not a problem.
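For reference, the grouped version I have in mind looks roughly like this (char_embedding and char_cnn are placeholders for my actual modules, and the namespaces are just examples); the character-id tensor for the whole vocabulary only needs to be built once:

import torch

tokens = [vocab.get_token_from_index(i, namespace="target_tokens")
          for i in range(vocab.get_vocab_size("target_tokens"))]
max_len = max(len(t) for t in tokens)
char_ids = torch.zeros(len(tokens), max_len, dtype=torch.long)   # 0 is used as padding
for row, token in enumerate(tokens):
    for col, ch in enumerate(token):
        char_ids[row, col] = vocab.get_token_index(ch, namespace="token_characters")

# One forward pass over the whole vocabulary instead of one per column.
mask = char_ids != 0
embedding_matrix = char_cnn(char_embedding(char_ids), mask)      # (vocab_size, hidden_state)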

@matt-gardner
Contributor

You could do it in the dataset reader, and pass a reference to the whole vocab object in every Instance. The annoying thing is that it's really a global data object, and we don't have a good mechanism for passing this through to the model (see #1809). But adding the same object as a Field to every Instance would work, and you just embed it once per batch. What I meant by "messy" is "expensive", because you have to re-compute the entire embedding matrix every time your weights are updated.
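To sketch the dataset-reader side of that (field and namespace names here are just examples, not something the library fixes for you): build one TextField over the whole output vocabulary and attach the same field to every Instance, so the model gets an already-indexed character representation of the vocabulary in each batch and can embed its first row once per batch:

from allennlp.data import Instance
from allennlp.data.fields import TextField
from allennlp.data.token_indexers import TokenCharactersIndexer
from allennlp.data.tokenizers import Token

all_target_tokens = ["the", "cat", "sat"]   # stand-in for the real output vocabulary
indexers = {"token_characters": TokenCharactersIndexer(min_padding_length=3)}
vocab_field = TextField([Token(t) for t in all_target_tokens], indexers)

# In text_to_instance, alongside the usual source/target fields:
instance = Instance({"output_vocab": vocab_field})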

@entslscheia
Contributor Author

You could do it in the dataset reader, and pass a reference to the whole vocab object in every Instance. The annoying thing is that it's really a global data object, and we don't have a good mechanism for passing this through to the model (see #1809). But adding the same object as a Field to every Instance would work, and you just embed it once per batch. What I meant by "messy" is "expensive", because you have to re-compute the entire embedding matrix every time your weights are updated.

But we need to re-compute the entire embedding matrix anyway, right? So this is an inherent problem of using a character CNN for decoding.

@matt-gardner
Contributor

Depends on what you mean by "decoding". You just have to recompute it when it changes. After the model is trained, it's not going to change, and you can just compute it once.
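In code, that amounts to something like this (the method names here are hypothetical):

def _get_output_embedding_matrix(self):
    if self.training:
        # Weights change after every update, so the matrix has to be rebuilt each forward pass.
        return self._build_embedding_matrix()
    if self._cached_matrix is None:
        # After training the weights are frozen, so computing the matrix once is enough.
        self._cached_matrix = self._build_embedding_matrix()
    return self._cached_matrix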
