- teacher-student paradigm. Distillation loss $L_{ce} = -\sum_c t_c \log(s_c)$ (sketched in code below)
  - cross-entropy loss with soft labels: here $t_c$ is soft (the teacher's probability), compare with MSE loss.
  - softmax temperature: it acts like label smoothing.
- supervised training loss $L_{mlm}$
- cosine embedding loss $L_{cos}$
- use RoBERTa training tricks: very large batch (implemented via gradient accumulation), dynamic masking (my impl in the BERT code), and no NSP task.
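A minimal sketch of the soft-label cross entropy with a softmax temperature, assuming raw logits from a teacher and a student; the function name, the value of `T`, and the `T*T` scaling (a common convention in the distillation literature) are my own additions:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """L_ce = -sum_c t_c * log(s_c), where t_c / s_c are teacher / student
    probabilities at temperature T. Scaling by T*T keeps gradient magnitudes
    comparable across temperatures (optional, assumed here)."""
    t = F.softmax(teacher_logits / T, dim=-1)           # soft targets t_c
    log_s = F.log_softmax(student_logits / T, dim=-1)   # log s_c
    return -(t * log_s).sum(dim=-1).mean() * (T * T)

# usage on random logits: batch of 4, 10 classes
teacher_logits = torch.randn(4, 10)
student_logits = torch.randn(4, 10)
loss = distillation_loss(student_logits, teacher_logits)
```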
- Student Arch: only reduce `num_layers`, keeping `d_model` unchanged, because (as the paper argues): "Most of the operations ... are highly optimized in modern linear algebra frameworks and ... variations on the last dimension of the tensor (hidden size dimension) have a smaller impact on computation efficiency. Thus we focus on reducing the number of layers."
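A sketch of what "fewer layers, same `d_model`" looks like with `nn.TransformerEncoder`; the 12-vs-6 layer counts and the head count here are illustrative, not taken from the notes:

```python
import torch.nn as nn

d_model, nhead = 768, 12
make_layer = lambda: nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                                dim_feedforward=4 * d_model)
teacher_encoder = nn.TransformerEncoder(make_layer(), num_layers=12)
# student: num_layers halved, hidden size dimension untouched
student_encoder = nn.TransformerEncoder(make_layer(), num_layers=6)
```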
code/nlp/bert/
Quite a few implementation details
- Dataset
  - Dataset size: nowhere near as large as imagined; BookCorpus is 1.1G compressed and 4.6G decompressed. Both the raw txt and the tokenized ints fit entirely in memory.
  - Vocab build speed: essentially instant, in pure python with no optimization (a sketch below).
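A sketch of the kind of pure-python vocab build meant above; the function name, `min_freq`, and the special-token list are assumptions:

```python
from collections import Counter

def build_vocab(corpus_path, min_freq=5,
                specials=("[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]")):
    """Count whitespace tokens in one pass over the corpus; pure python is
    fast enough at BookCorpus scale."""
    counter = Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            counter.update(line.split())
    itos = list(specials) + [w for w, c in counter.most_common() if c >= min_freq]
    stoi = {w: i for i, w in enumerate(itos)}
    return stoi, itos
```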
- Model Parallel: implemented pytorch model parallelism for base and large. On the efficiency question, to summarize: pytorch model parallelism simply cannot saturate the GPUs, and pipeline parallelism is too much hassle (naive split sketched below).
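A naive model-parallel sketch of the kind discussed above: split the layer stack across two devices and move activations between them; the class name and device ids are assumptions, and the blocking copy between halves is exactly why utilization stays low without pipelining:

```python
import torch.nn as nn

class TwoGPUEncoder(nn.Module):
    """First half of the layers on cuda:0, second half on cuda:1.
    Only one GPU is busy at a time."""
    def __init__(self, layers):
        super().__init__()
        half = len(layers) // 2
        self.part0 = nn.Sequential(*layers[:half]).to("cuda:0")
        self.part1 = nn.Sequential(*layers[half:]).to("cuda:1")

    def forward(self, x):
        x = self.part0(x.to("cuda:0"))
        x = self.part1(x.to("cuda:1"))   # blocking device-to-device copy
        return x
```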
- model struct:
def forward(self, batch):
    is_next, sx, sy, msk, seg = batch
    # sx, sy, msk, seg: (seq_len, bz); is_next: (bz,)
    src = self.emb(sx) + self.seg_emb(seg) + self.__pos_emb(sx.shape[0])
    mem = self.encoder(src, src_key_padding_mask=self.__key_padding_mask(sx))
    # next sentence prediction
    nsp_yh = self.nsp_module(mem[0, :, :]).view(-1)  # (bz,)
    nsp_loss = self.nsp_loss_fn(nsp_yh, is_next.float())
    # masked language model
    mlm_yh = self.mlm_module(mem)  # (seq_len, bz, VOCAB_SIZE)
    mlm_loss_all = self.mlm_loss_fn(mlm_yh.flatten(0, 1), sy.flatten(0, 1))
    mlm_loss = torch.mean(mlm_loss_all[msk.flatten(0, 1)])
    return mlm_loss + nsp_loss
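The modules and loss functions used in `forward` are not shown above; a sketch of plausible definitions, inferred from the shapes in the comments (layer sizes are placeholders, and the real heads may well be deeper MLPs):

```python
import torch.nn as nn

VOCAB_SIZE, D_MODEL = 30000, 768  # placeholder sizes

nsp_module = nn.Linear(D_MODEL, 1)                   # [CLS] position -> is_next logit
nsp_loss_fn = nn.BCEWithLogitsLoss()                 # expects a float 0/1 target
mlm_module = nn.Linear(D_MODEL, VOCAB_SIZE)          # every position -> vocab logits
mlm_loss_fn = nn.CrossEntropyLoss(reduction="none")  # per-token loss, so masked positions can be selected afterwards
```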
code/nlp/seq2seq_tfm.py
- src -> Encoder -> memory -> (+ tgt) Decoder -> yh && shifted_y -> loss && backward
- evaluation: predict yh step-by-step
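A sketch of step-by-step (greedy) decoding at evaluation time; `model.encode` / `model.decode` are hypothetical helpers that split the usual encoder-decoder forward pass, and `bos_idx` / `eos_idx` are assumed special-token ids:

```python
import torch

@torch.no_grad()
def greedy_decode(model, src, bos_idx, eos_idx, max_len=50):
    """Decode one sequence token by token, always taking the argmax."""
    memory = model.encode(src)                                   # (src_len, 1, d_model)
    ys = torch.full((1, 1), bos_idx, dtype=torch.long, device=src.device)
    for _ in range(max_len - 1):
        logits = model.decode(ys, memory)                        # (tgt_len, 1, vocab)
        next_tok = logits[-1, 0].argmax().view(1, 1)             # last step's prediction
        ys = torch.cat([ys, next_tok], dim=0)
        if next_tok.item() == eos_idx:
            break
    return ys.squeeze(1)
```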
code/nlp/seq2seq_lstm.py
- padding: `BucketIterator` minimizes the total amount of padding by batching records with similar seq_len together.
- the decoder runs step-by-step; use teacher_forcing, feeding the gold token with some probability (a sketch follows below).
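A sketch of per-step decoding with probabilistic teacher forcing; the `decoder(input_tok, hidden) -> (logits, hidden)` interface and the 0.5 ratio are assumptions:

```python
import random
import torch

def decode_with_teacher_forcing(decoder, hidden, tgt, teacher_forcing_ratio=0.5):
    """With probability `teacher_forcing_ratio` feed the gold token at each step,
    otherwise feed the model's own previous prediction.
    tgt: (tgt_len, bz) with a <bos> row at index 0."""
    tgt_len, bz = tgt.shape
    outputs = []
    input_tok = tgt[0]                                  # (bz,)
    for t in range(1, tgt_len):
        logits, hidden = decoder(input_tok, hidden)     # logits: (bz, vocab)
        outputs.append(logits)
        use_teacher = random.random() < teacher_forcing_ratio
        input_tok = tgt[t] if use_teacher else logits.argmax(dim=-1)
    return torch.stack(outputs), hidden
```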
I originally implemented this because, while writing a GNN myself, I hit the problem of "different numbers of nodes within the same batch" and wanted to see how NLP handles the analogous problem; here padding is the answer. Back to the GNN problem, there are generally two approaches:
- flatten, i.e. batch the small graphs into one large unconnected graph.
  - This works for the most common case where the graph structure is invariant across layers, and is `torch_geometric`'s way (sketched below).
  - But if we need to change the graph structure across layers, e.g. GraphPooling, we must take care not to generate edges across originally different graphs; maybe use something like a batch mask? Yes! See `torch_geometric`'s impl of `dense_diff_pool` (uses `mask: BoolTensor (bz, max_num_nodes)`) and `topk` (uses `batch: LongTensor [0,0,1,1,1,2,2...9]`).
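A sketch of the flatten approach: concatenate node features, shift edge indices so no cross-graph edges appear, and build a `batch` vector in the `torch_geometric` style; the helper name is my own:

```python
import torch

def flatten_graphs(node_feats, edge_indices):
    """node_feats: list of (n_i, d) tensors; edge_indices: list of (2, e_i) LongTensors.
    Returns concatenated x, shifted edge_index, and a `batch` vector like
    [0,0,...,1,1,...] mapping each node back to its original graph."""
    xs, eis, batches, offset = [], [], [], 0
    for g_id, (x, ei) in enumerate(zip(node_feats, edge_indices)):
        xs.append(x)
        eis.append(ei + offset)  # shift node ids for this graph
        batches.append(torch.full((x.size(0),), g_id, dtype=torch.long))
        offset += x.size(0)
    return torch.cat(xs), torch.cat(eis, dim=1), torch.cat(batches)
```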
- padding && pass a `padding_mask`
  - `LSTM` does not take anything like a padding_mask; instead handle it in the loss function: `CrossEntropyLoss(ignore_index=PADDING_IDX)` (a sketch below).
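A sketch of handling padding purely in the loss, as described above; `PADDING_IDX = 0` and the toy shapes are assumptions:

```python
import torch
import torch.nn as nn

PADDING_IDX = 0
loss_fn = nn.CrossEntropyLoss(ignore_index=PADDING_IDX)

# logits: (seq_len, bz, vocab) -> (seq_len*bz, vocab); targets: (seq_len, bz) -> (seq_len*bz,)
logits = torch.randn(7, 4, 100)
targets = torch.randint(1, 100, (7, 4))
targets[-2:] = PADDING_IDX                               # pretend the tail is padding
loss = loss_fn(logits.flatten(0, 1), targets.flatten())  # padded positions contribute nothing
```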
  - (TODO) `TransformerEncoder` accepts a padding mask: `src_key_padding_mask`.
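A sketch of building and passing `src_key_padding_mask`; the mask convention (True at padded positions, shape `(bz, seq_len)`) is what `nn.TransformerEncoder` expects, while `PADDING_IDX` and the toy sizes are assumptions:

```python
import torch
import torch.nn as nn

PADDING_IDX = 0

def key_padding_mask(sx):
    """sx: (seq_len, bz) token ids -> (bz, seq_len) BoolTensor, True where padded."""
    return (sx == PADDING_IDX).transpose(0, 1)

encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model=16, nhead=4), num_layers=2)
sx = torch.randint(1, 50, (10, 3))
sx[7:, 0] = PADDING_IDX                      # pad the tail of the first sequence
src = torch.randn(10, 3, 16)                 # stand-in for the embedded sx
mem = encoder(src, src_key_padding_mask=key_padding_mask(sx))
```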