• XLNet and RoBERTa

XLNet


But the AE language model also has its disadvantages. It uses [MASK] tokens during pretraining, but such artificial symbols are absent from the real data at finetuning time, resulting in a pretrain-finetune discrepancy. Another disadvantage of [MASK] is that it assumes the predicted (masked) tokens are independent of each other given the unmasked tokens. For example, take the sentence "It shows that the housing crisis was turned into a banking crisis" and mask "banking" and "crisis". Note that the masked words "banking" and "crisis" carry an implicit relation to each other. But the AE model tries to predict "banking" given the unmasked tokens and, separately, to predict "crisis" given the unmasked tokens, ignoring the relation between "banking" and "crisis". In other words, it assumes the predicted (masked) tokens are independent of each other, even though the model ought to learn such correlations among the predicted (masked) tokens in order to predict any one of them.
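As a sketch, the two pretraining objectives can be written side by side, following the notation used in the XLNet paper: x is the original sequence, x-hat the corrupted (masked) input, x-bar the set of masked tokens, and m_t = 1 marks a masked position. The approximation sign is exactly the independence assumption described above.

```latex
% Autoregressive (AR) objective: exact left-to-right factorization,
% no independence assumption among the predicted tokens.
\[
  \max_{\theta}\; \log p_{\theta}(\mathbf{x})
    = \sum_{t=1}^{T} \log p_{\theta}\!\left(x_t \mid \mathbf{x}_{<t}\right)
\]

% Autoencoding (AE, BERT-style) objective: reconstruct the masked tokens
% from the corrupted input; the \approx is the independence assumption,
% e.g. p(banking, crisis | context) is factored into
% p(banking | context) * p(crisis | context).
\[
  \max_{\theta}\; \log p_{\theta}(\bar{\mathbf{x}} \mid \hat{\mathbf{x}})
    \approx \sum_{t=1}^{T} m_t \, \log p_{\theta}\!\left(x_t \mid \hat{\mathbf{x}}\right)
\]
```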


    A traditional language model would predict the tokens in the order

    "I", "like", "cats", "more", "than", "dogs"

    where each token uses all previous tokens as context.

    In permutation language modeling, the order of prediction is not necessarily left to right and is sampled randomly instead. For instance, it could be

    "cats", "than", "I", "more", "dogs", "like"

where "than" would be conditioned on seeing "cats", "I" would be conditioned on seeing "cats" and "than", and so on, as the sketch below illustrates.
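Here is a minimal sketch of how those conditioning sets are formed under a sampled factorization order. The token list and the permutation mirror the example above; the variable names and the fixed `order` are just for illustration.

```python
import random

tokens = ["I", "like", "cats", "more", "than", "dogs"]

# A sampled factorization order; fixed here to match the example above.
# In general it would be drawn at random, e.g. random.sample(range(6), 6).
order = [2, 4, 0, 3, 5, 1]   # "cats", "than", "I", "more", "dogs", "like"

# The i-th token in the sampled order is predicted from the tokens that come
# earlier in that order -- not from the tokens to its left in the sentence.
# The sentence itself is never reordered; only the prediction order changes.
for step, pos in enumerate(order):
    context = [tokens[p] for p in sorted(order[:step])]  # shown in original positions
    print(f"predict {tokens[pos]!r} given {context}")
```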


The input still appears in its original order x1, x2, x3, x4; different attention-mask matrices are used so that the current token xi can only see the tokens that precede it in the permuted order x3 -> x2 -> x4 -> x1.
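As a rough illustration of how such a mask matrix could be built (a sketch only, not XLNet's actual two-stream attention implementation), the following constructs the mask for the permuted order x3 -> x2 -> x4 -> x1:

```python
import numpy as np

# Factorization order x3 -> x2 -> x4 -> x1, as 0-based indices into the
# unchanged input sequence [x1, x2, x3, x4].
order = [2, 1, 3, 0]
rank = {pos: r for r, pos in enumerate(order)}   # place of each position in the permuted order

n = len(order)
mask = np.zeros((n, n), dtype=int)   # mask[i, j] = 1 means token i may attend to token j
for i in range(n):
    for j in range(n):
        # Token i only sees tokens that come before it in the factorization order.
        # (XLNet's content stream would additionally allow j == i.)
        if rank[j] < rank[i]:
            mask[i, j] = 1

print(mask)
# [[0 1 1 1]    x1 (last in the order)  sees x2, x3, x4
#  [0 0 1 0]    x2                      sees x3
#  [0 0 0 0]    x3 (first in the order) sees nothing
#  [0 1 1 0]]   x4                      sees x2, x3
```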


RoBERTa

       

Differences from BERT

Facebook researchers found that hyperparameter choices have a significant impact on BERT's final results, so they revisited BERT's pretraining and measured the effect of different hyperparameters and training-set sizes. They found that BERT was significantly undertrained.

With these adjustments, BERT can match or exceed the performance of every model published after it. These results highlight the importance of previously overlooked design choices.

RoBERTa differs from BERT in that it relies on pretraining with much more data and on changing the masking pattern of the training data, and it removes the next sentence prediction (NSP) objective.

RoBERTa's modifications are simple and include:

1. Training longer, with larger batches and more data;
2. Removing the next sentence prediction (NSP) objective;
3. Training on longer sequences;
4. Dynamically changing the masking pattern applied to the training data. (The original BERT implementation performed masking once during data preprocessing, resulting in a single static mask. To avoid using the same mask for each training instance in every epoch, the training data was duplicated 10 times so that each sequence was masked in 10 different ways over the 40 epochs of training. Thus, each training sequence was seen with the same mask four times during training.) A simplified sketch of static versus dynamic masking is given below.
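The sketch is an assumption-laden simplification, not RoBERTa's actual data pipeline: it masks whole words at a flat 15% rate and always substitutes [MASK], whereas BERT/RoBERTa operate on subword tokens with an 80/10/10 replacement scheme. It only illustrates when the mask is drawn.

```python
import random

def random_mask(tokens, mask_prob=0.15, rng=None):
    """Replace roughly mask_prob of the tokens with [MASK] (simplified)."""
    rng = rng or random.Random()
    return [tok if rng.random() > mask_prob else "[MASK]" for tok in tokens]

sentence = "it shows that the housing crisis was turned into a banking crisis".split()

# Static masking (original BERT): the mask is drawn once at preprocessing time,
# so every epoch reuses the same corrupted copy. (BERT softened this by storing
# 10 differently-masked duplicates of the data, as described above.)
static_copy = random_mask(sentence, rng=random.Random(0))
for epoch in range(3):
    example = static_copy                      # identical mask every epoch

# Dynamic masking (RoBERTa): the mask is re-sampled each time a sequence is
# fed to the model, so every epoch sees a different corruption pattern.
for epoch in range(3):
    example = random_mask(sentence)            # fresh mask every epoch
```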

       

References:

       

    https://zhuanlan.zhihu.com/p/75856238

    https://towardsdatascience.com/what-is-xlnet-and-why-it-outperforms-bert-8d8fce710335

    https://zhuanlan.zhihu.com/p/70257427

    https://mlexplained.com/2019/06/30/paper-dissected-xlnet-generalized-autoregressive-pretraining-for-language-understanding-explained/
