Model differences between PLATO-2 and PLATO #144

kiseliu · 2022-06-09T14:14:16Z

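Besides the pre-norm vs. post-norm difference and the tokenizer difference mentioned in the paper,

I compared the network structure of PLATO with that of PLATO-2 (the stage 2.1 PLATO model) and found a few other subtle differences: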

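1. When predicting the latent variable, PLATO-1 passes the final hidden state of the mask token through post_network. In PLATO-2, my understanding is that the recognition_fc layer is there to extract the mask token's final hidden state, and post_network is replaced by (latent_embedding, recognition_bias).

2. In PLATO-1, when computing the NLL loss (generation network), the final hidden states of all response tokens have no classifier on top; they share parameters with the word embedding instead. In PLATO-2, the final hidden states of all response tokens additionally pass through a mask_lm_trans_fc layer and a layer norm, and when sharing parameters with the word embedding there is also an extra bias, mask_lm_out_fc.b_0.

3. The BOW loss computation is changed in the same way as the NLL loss: there is an extra bow_trans_fc layer and a layer norm, plus the bias bow_out_fc.b_0.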
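To make the comparison concrete, here is a minimal NumPy sketch of the heads as described in items 1 and 2. It follows my reading above, not the repository code. The function names, the tanh activation in the transform layer, and the layer norm without learnable scale/shift are all assumptions for illustration. The recognition_fc step is left out, so the mask token's hidden state is assumed to be already extracted.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the hidden axis; learnable scale/shift omitted (simplification).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def plato1_token_logits(hidden, word_emb):
    # Item 2, PLATO-1: no extra classifier, the output projection
    # reuses the word embedding matrix (tied weights).
    return hidden @ word_emb.T

def plato2_token_logits(hidden, word_emb, trans_w, trans_b, out_b):
    # Item 2, PLATO-2: transform layer (mask_lm_trans_fc) + layer norm,
    # then tied word embedding plus an output bias (mask_lm_out_fc.b_0).
    h = np.tanh(hidden @ trans_w + trans_b)  # activation choice is an assumption
    h = layer_norm(h)
    return h @ word_emb.T + out_b

def plato1_latent_logits(mask_hidden, post_w):
    # Item 1, PLATO-1: a dedicated projection (post_network) on the
    # mask token's final hidden state.
    return mask_hidden @ post_w

def plato2_latent_logits(mask_hidden, latent_emb, recognition_bias):
    # Item 1, PLATO-2 (as I understand it): the projection reuses the
    # latent embedding table and adds recognition_bias.
    return mask_hidden @ latent_emb.T + recognition_bias
```

Under this reading, the BOW head in item 3 would have the same shape as plato2_token_logits, with bow_trans_fc and bow_out_fc.b_0 in place of the mask_lm parameters.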

I'm not sure whether my understanding above is correct. Also, what is the design rationale behind these changes?

sserdoubleh · 2022-06-10T13:25:18Z

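These network changes make little difference to model quality. They were made mainly to align with BERT's model structure and to share more parameters.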
