
How to efficiently implement a decoder with character-level embedding? #3631

Closed
entslscheia opened this issue Jan 16, 2020 · 5 comments

Comments

@entslscheia
Contributor

Consider a simple seq2seq model where, on the decoding side, I want to compute both the input embedding and the output embedding from character-level representations (e.g., to embed a word we can apply a 1-d CNN filter over its characters, or, even simpler, just average the embeddings of its characters). It's convenient to implement the embedding layer with AllenNLP: we can use a character-level indexer to convert each word into a list of ids and then define an embedding layer over that list of ids. However, even though words are embedded at the character level, during decoding I don't want to make predictions character by character. In other words, the prediction granularity should still be the token level. So in order to compute the logits for different tokens, we still need an output embedding layer of size (hidden_state, num_of_classes), where num_of_classes is the total number of distinct tokens (not characters) in the output vocabulary. It seems we therefore still need a SingleIdTokenIndexer to tell us which tokens are in the output vocabulary, and each token ends up with two conversions: a single id from SingleIdTokenIndexer and a list of character ids from TokenCharactersIndexer.
Here is where the problem comes in: if I want to compute the output embedding matrix (which is used to compute the logits), I have to compute it column by column (there are num_of_classes columns in total). To compute the i-th column, I first find the i-th token in the vocabulary associated with SingleIdTokenIndexer, then get that token's character-level conversion, and finally compute its embedding. This seems very inefficient. First, you don't want many for loops in a neural model, but here I need one to compute the output embedding. Second, to get the i-th column I convert i back into a token (get_token_from_index) and then convert that token into its character-level indices, which also feels awkward.
I assume there must be a more efficient way to implement this, since it's a fairly common need in NLP. Any suggestions would be greatly appreciated!
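To make this concrete, the column-by-column computation I'm describing (and would like to avoid) looks roughly like this; vocab, char_embedding, char_cnn, and the namespace strings are placeholders for whatever the model actually uses:

import torch

num_classes = vocab.get_vocab_size("target_tokens")
columns = []
for i in range(num_classes):
    # id -> token via the SingleIdTokenIndexer namespace
    token = vocab.get_token_from_index(i, namespace="target_tokens")
    # token -> character ids
    char_ids = torch.tensor(
        [vocab.get_token_index(c, namespace="token_characters") for c in token]
    )
    # run the character CNN on one token at a time (this is the inefficient part)
    char_vectors = char_embedding(char_ids.unsqueeze(0))   # (1, num_chars, char_dim)
    columns.append(char_cnn(char_vectors).squeeze(0))      # (hidden_state,)
output_embedding = torch.stack(columns, dim=1)             # (hidden_state, num_of_classes)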

@matt-gardner
Contributor

I think you basically want to get an embedding matrix for your vocabulary from the character CNN. This means getting an indexed representation of your entire vocabulary, then passing it through your character CNN. Then you can do a typical softmax over your vocabulary to get a distribution over the next word. Pretty messy if you have to do this at training time, but not bad if you just do it once for decoding.

It turns out we do exactly this for a subset of the vocabulary for Hotflip. You can see the code here:

def _construct_embedding_matrix(self) -> Embedding:
    """
    For HotFlip, we need a word embedding matrix to search over. The below is necessary for
    models such as ELMo, character-level models, or for models that use a projection layer
    after their word embeddings.

    We run all of the tokens from the vocabulary through the TextFieldEmbedder, and save the
    final output embedding. We then group all of those output embeddings into an "embedding
    matrix".
    """
    embedding_layer = util.find_embedding_layer(self.predictor._model)
    self.embedding_layer = embedding_layer
    if isinstance(embedding_layer, (Embedding, torch.nn.modules.sparse.Embedding)):
        # If we're using something that already has an only embedding matrix, we can just use
        # that and bypass this method.
        return embedding_layer.weight

    # We take the top `self.max_tokens` as candidates for hotflip. Because we have to
    # construct a new vector for each of these, we can't always afford to use the whole vocab,
    # for both runtime and memory considerations.
    all_tokens = list(self.vocab._token_to_index[self.namespace])[: self.max_tokens]
    max_index = self.vocab.get_token_index(all_tokens[-1], self.namespace)
    self.invalid_replacement_indices = [
        i for i in self.invalid_replacement_indices if i < max_index
    ]

    inputs = self._make_embedder_input(all_tokens)

    # pass all tokens through the fake matrix and create an embedding out of it.
    embedding_matrix = embedding_layer(inputs).squeeze()
    return embedding_matrix

Something like that should work for you, though that exact code is not what you want. Note that there are no for loops necessary for any of this, except inside the _make_embedder_input function, where we loop over the vocabulary in a list comprehension.
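And just to spell out the last step: once you have an embedding_matrix of shape (vocab_size, hidden_dim) from code like the above, scoring the next token needs no loop at all; decoder_hidden below stands in for whatever your decoder produces at each step:

import torch

# (batch, hidden_dim) x (hidden_dim, vocab_size) -> (batch, vocab_size)
logits = decoder_hidden @ embedding_matrix.t()
log_probs = torch.nn.functional.log_softmax(logits, dim=-1)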

@entslscheia
Contributor Author

entslscheia commented Jan 16, 2020

Thanks a lot! But I don't quite get what you meant by "Pretty messy if you have to do this at training time, but not bad if you just do it once for decoding." Apparently I am going to need this embedding matrix during training, but I don't see why computing it during training is messy.
Also, is the job of _make_embedder_input just to convert every token in the vocabulary into the corresponding input to the character CNN? If so, it's essentially the same conversion of the i-th token into a character-CNN input that I mentioned in my original question; the only difference is that instead of computing the embedding matrix column by column, your code computes the entire matrix from the input for the entire vocabulary. I think I can do something similar in my implementation; it's not hard to group the inputs together. What I really don't like is that, to build the embedding matrix, we need to do the index conversion (i.e., convert a token into an input to the character CNN) inside the model, while I was hoping this kind of thing could be done entirely within the dataset reader/iterator. But it looks like we have to do it anyway, so maybe it's not a problem.
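For reference, the grouped version I have in mind looks roughly like this (char_embedding and char_cnn are placeholders for my actual modules, and the namespaces are just examples); the character-id tensor for the whole vocabulary only needs to be built once:

import torch

tokens = [vocab.get_token_from_index(i, namespace="target_tokens")
          for i in range(vocab.get_vocab_size("target_tokens"))]
max_len = max(len(t) for t in tokens)
char_ids = torch.zeros(len(tokens), max_len, dtype=torch.long)   # 0 is used as padding
for row, token in enumerate(tokens):
    for col, ch in enumerate(token):
        char_ids[row, col] = vocab.get_token_index(ch, namespace="token_characters")

# One forward pass over the whole vocabulary instead of one per column.
mask = char_ids != 0
embedding_matrix = char_cnn(char_embedding(char_ids), mask)      # (vocab_size, hidden_state)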

@matt-gardner
Contributor

You could do it in the dataset reader, and pass a reference to the whole vocab object in every Instance. The annoying thing is that it's really a global data object, and we don't have a good mechanism for passing this through to the model (see #1809). But adding the same object as a Field to every Instance would work, and you just embed it once per batch. What I meant by "messy" is "expensive", because you have to re-compute the entire embedding matrix every time your weights are updated.
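To sketch the dataset-reader side of that (field and namespace names here are just examples, not something the library fixes for you): build one TextField over the whole output vocabulary and attach the same field to every Instance, so the model gets an already-indexed character representation of the vocabulary in each batch and can embed its first row once per batch:

from allennlp.data import Instance
from allennlp.data.fields import TextField
from allennlp.data.token_indexers import TokenCharactersIndexer
from allennlp.data.tokenizers import Token

all_target_tokens = ["the", "cat", "sat"]   # stand-in for the real output vocabulary
indexers = {"token_characters": TokenCharactersIndexer(min_padding_length=3)}
vocab_field = TextField([Token(t) for t in all_target_tokens], indexers)

# In text_to_instance, alongside the usual source/target fields:
instance = Instance({"output_vocab": vocab_field})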

@entslscheia
Contributor Author

You could do it in the dataset reader, and pass a reference to the whole vocab object in every Instance. The annoying thing is that it's really a global data object, and we don't have a good mechanism for passing this through to the model (see #1809). But adding the same object as a Field to every Instance would work, and you just embed it once per batch. What I meant by "messy" is "expensive", because you have to re-compute the entire embedding matrix every time your weights are updated.

But we need to re-compute the entire embedding matrix anyway, right? So this is an inherent problem of using a character CNN for decoding.

@matt-gardner
Contributor

Depends on what you mean by "decoding". You just have to recompute it when it changes. After the model is trained, it's not going to change, and you can just compute it once.
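In code, that amounts to something like this (the method names here are hypothetical):

def _get_output_embedding_matrix(self):
    if self.training:
        # Weights change after every update, so the matrix has to be rebuilt each forward pass.
        return self._build_embedding_matrix()
    if self._cached_matrix is None:
        # After training the weights are frozen, so computing the matrix once is enough.
        self._cached_matrix = self._build_embedding_matrix()
    return self._cached_matrix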
