  • Knowledge Distillation: Basics and an Introduction to Implementation Libraries

    1 Preface




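    2 The Pioneering Work on Knowledge Distillation

    Hinton proposed knowledge distillation in the paper "Distilling the Knowledge in a Neural Network". There is already a great deal of material about it online, so I will only summarize it briefly.
    Loss function: $$Loss = aL_{soft} + (1-a)L_{hard}$$

    Here \(L_{hard}\) is the loss against the ground-truth labels and \(L_{soft}\) is the loss against the teacher's softened output. The softened probabilities are obtained from the logits \(z_i\) by a softmax with temperature \(T\):

    $$q_i = \frac{\exp(z_i/T)}{\sum_{j}\exp(z_j/T)}$$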
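The two formulas above can be sketched in plain Python for a single training example. This is only an illustration under stated assumptions: the function names, the toy logits, and the default values `a=0.5` and `T=2.0` are made up for the example, and the \(T^2\) scaling of the soft loss recommended in Hinton's paper is left out for brevity.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Return q_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, pred_probs):
    """H(p, q) = -sum_i p_i * log(q_i)."""
    return -sum(p * math.log(q) for p, q in zip(target_probs, pred_probs) if p > 0)

def distillation_loss(student_logits, teacher_logits, label, a=0.5, T=2.0):
    """Loss = a * L_soft + (1 - a) * L_hard for one example (illustrative defaults)."""
    # L_soft: student vs. teacher, both softened with the same temperature T
    soft_teacher = softmax_with_temperature(teacher_logits, T)
    soft_student = softmax_with_temperature(student_logits, T)
    l_soft = cross_entropy(soft_teacher, soft_student)
    # L_hard: student at T = 1 vs. the one-hot ground-truth label
    hard_student = softmax_with_temperature(student_logits, 1.0)
    l_hard = -math.log(hard_student[label])
    return a * l_soft + (1 - a) * l_hard

# Toy logits, invented for the example
teacher_logits = [5.0, 2.0, 0.5]
student_logits = [3.0, 2.5, 0.1]
print(softmax_with_temperature(teacher_logits, T=1.0))
print(softmax_with_temperature(teacher_logits, T=4.0))  # a larger T gives a flatter distribution
print(distillation_loss(student_logits, teacher_logits, label=0))
```

A larger \(T\) spreads probability mass over the non-argmax classes, which is what lets the student see how the teacher ranks the wrong answers.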



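    3 TinyBERT

    3.1 Basic Idea

    When it comes to distilling BERT, the first approach that comes to mind is to use a fine-tuned BERT as the TeacherModel to train a StudentModel, and that is exactly what TinyBERT does. The next question is which model to use as the StudentModel. There have been several attempts: some people use a BiLSTM, but most still use BERT, only a smaller one than the original. In TinyBERT, the StudentModel is a small BERT with a reduced embedding size, hidden size, and number of hidden layers.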



    1. Distill a pretrained TinyBERT from a pretrained BERT
    2. Distill a fine-tuned TinyBERT from a fine-tuned BERT (its parameters are initialized from the pretrained TinyBERT of step 1)

    3.2 Loss Function




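    • \(m\): an integer between 0 and the number of StudentModel layers
    • \(S_m\): the output of the m-th layer of the StudentModel
    • \(g(m)\): the mapping function; in practice it makes the m-th layer of the StudentModel learn from the output of the g(m)-th layer of the TeacherModel
    • \(T_{g(m)}\): the output of the g(m)-th layer of the TeacherModel
    • \(M\): the number of hidden layers in the StudentModel, so layer M+1 of the StudentModel is the output of the prediction layer (logits)
    • \(L_{embd}(S_0,T_0)\): the loss for the word embedding layer, which is MSE
    • \(L_{hidden}\) and \(L_{attn}\): the losses for the hidden layers and attention layers, both MSE
    • \(L_{pred}\): the loss for the prediction layer, which is cross entropy. Other papers use KL divergence instead. The two differ only by the entropy of the teacher's distribution, which does not depend on the student, so the gradients in backpropagation are the same.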
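To make the roles of these symbols concrete, here is a minimal plain-Python sketch of a layer-wise loss built from them. It is not the TinyBERT implementation: the uniform mapping `g(m) = m * N / M`, the helper names, the equal weighting of all layers, and the toy data are assumptions made for illustration, and the linear projection TinyBERT uses when student and teacher hidden sizes differ is left out.

```python
import math

def uniform_layer_map(m, M, N):
    """g(m): map student layer m (0..M+1) to a teacher layer index (0..N+1).

    Layer 0 is the embedding layer, 1..M are hidden layers, M+1 is the prediction layer.
    Assumes a uniform mapping, which is only one possible choice of g.
    """
    if m == 0:
        return 0
    if m == M + 1:
        return N + 1
    return m * N // M

def mse(xs, ys):
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(student_logits, teacher_logits):
    p, q = softmax(teacher_logits), softmax(student_logits)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def layerwise_loss(student, teacher, M, N):
    """Sum the per-layer losses of S_m against T_g(m) for m = 0..M+1 (equal weights)."""
    total = 0.0
    for m in range(M + 2):
        s, t = student[m], teacher[uniform_layer_map(m, M, N)]
        if m == 0:                      # L_embd
            total += mse(s["embedding"], t["embedding"])
        elif m <= M:                    # L_hidden + L_attn
            total += mse(s["hidden"], t["hidden"]) + mse(s["attention"], t["attention"])
        else:                           # L_pred
            total += soft_cross_entropy(s["logits"], t["logits"])
    return total

def toy_model(num_layers, value):
    """Hypothetical stand-in for real model outputs: flat lists of floats per layer."""
    layers = [{"embedding": [value] * 4}]
    layers += [{"hidden": [value] * 4, "attention": [0.25] * 4} for _ in range(num_layers)]
    layers.append({"logits": [value, 0.0]})
    return layers

teacher = toy_model(12, 1.0)
student = toy_model(3, 0.5)
print([uniform_layer_map(m, 3, 12) for m in range(5)])
print(layerwise_loss(student, teacher, M=3, N=12))
```

With a 3-layer student and a 12-layer teacher, this mapping pairs student hidden layers 1, 2, 3 with teacher layers 4, 8, 12.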


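    3.3 Practical Experience

    1. With limited hardware and data it is hard to distill a pretrained model, but we can borrow TinyBERT's idea and do task-specific distillation directly. As for how to initialize the model, I have two suggestions: either initialize it directly from the original Teacher model, or initialize it from a small model that someone else has already pretrained. I initialized directly from the RBT3 model, and it worked well.
    2. After distilling the StudentModel, be sure to test its generalization ability.
    3. Stay flexible. There is not yet a single standard method for distillation, so there are many places where you can make your own changes and experiment.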

    4 DistilBERT

    4.1 Basic Idea


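    4.2 Loss Function

    DistilBERT's loss function: \(L_{ce} + L_{mlm} + L_{cos}\)

    • \(L_{ce}\): the cross entropy between the StudentModel's and TeacherModel's logits
    • \(L_{mlm}\): the masked language model loss of the StudentModel
    • \(L_{cos}\): the cosine loss between the StudentModel's and TeacherModel's hidden states. Note that TinyBERT uses MSE here, and also adds an MSE on the attention.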
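The difference between a cosine loss and an MSE loss on hidden states can be seen on a toy pair of vectors. The sketch below is plain Python with illustrative function names and values; it is not DistilBERT's training code, and it assumes the cosine loss takes the common form `1 - cos(s, t)`.

```python
import math

def cosine_loss(student_vec, teacher_vec):
    """1 - cos(s, t): 0 when the vectors point the same way, 2 when they are opposite."""
    dot = sum(s * t for s, t in zip(student_vec, teacher_vec))
    norm_s = math.sqrt(sum(s * s for s in student_vec))
    norm_t = math.sqrt(sum(t * t for t in teacher_vec))
    return 1.0 - dot / (norm_s * norm_t)

def mse_loss(student_vec, teacher_vec):
    """Mean squared error between the two vectors."""
    return sum((s - t) ** 2 for s, t in zip(student_vec, teacher_vec)) / len(student_vec)

teacher_hidden = [1.0, 2.0, 3.0]
student_hidden = [2.0, 4.0, 6.0]  # same direction as the teacher, twice the magnitude

print(cosine_loss(student_hidden, teacher_hidden))  # close to 0: direction already matches
print(mse_loss(student_hidden, teacher_hidden))     # clearly positive: magnitudes differ
```

The cosine loss only aligns the direction of the hidden states, while MSE also forces their magnitudes to match.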

    5 BERT-of-Theseus



    6 MiniLM

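    A recently published paper that is also about BERT distillation. I will briefly summarize its three innovations:

    1. First distill a medium-sized model from the TeacherModel, then distill a smaller StudentModel from the medium-sized model. This is only done when the StudentModel is very small.
    2. Only the last hidden layer is distilled. The authors argue that this gives the StudentModel more freedom, and it also relaxes the requirements on the StudentModel's architecture.
    3. For the last hidden layer, what is learned is mainly the attention weights; see the paper for details.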
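As a rough illustration of point 3, learning the attention weights of the last layer can be expressed as a KL divergence between the teacher's and the student's attention distributions. The plain-Python sketch below is an assumption-laden toy: the function names and numbers are invented, both models are assumed to have the same number of heads and the same sequence length, and anything else the paper transfers is not covered.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def attention_transfer_loss(teacher_attn, student_attn):
    """Average KL(teacher || student) over heads and query positions.

    Both arguments have shape [num_heads][seq_len][seq_len]; each innermost row
    is an attention distribution over key positions and sums to 1.
    """
    total, count = 0.0, 0
    for t_head, s_head in zip(teacher_attn, student_attn):
        for t_row, s_row in zip(t_head, s_head):
            total += kl_divergence(t_row, s_row)
            count += 1
    return total / count

# Toy last-layer attention maps: 1 head, sequence length 2 (values invented)
teacher_attn = [[[0.7, 0.3], [0.2, 0.8]]]
student_attn = [[[0.6, 0.4], [0.3, 0.7]]]

print(attention_transfer_loss(teacher_attn, student_attn))  # positive: distributions differ
print(attention_transfer_loss(teacher_attn, teacher_attn))  # zero: identical distributions
```

Because the attention maps have shape sequence length by sequence length, this loss does not depend on the hidden size, which is consistent with point 2 about looser requirements on the StudentModel's architecture.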


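    7 General-Purpose Knowledge Distillation Frameworks

    7.1 The KnowledgeDistillation Library


    1. A knowledge distillation framework based on multi-layer models: easy for beginners to read the source code and get started (no longer maintained)
    2. examples: example code for various new knowledge distillation algorithms (still maintained)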



    # import packages  
    import torch  
    import logging  
    import numpy as np  
    from transformers import BertModel, BertConfig  
    from torch.utils.data import DataLoader, RandomSampler, TensorDataset  
    from knowledge_distillation import KnowledgeDistiller, MultiLayerBasedDistillationLoss  
    from knowledge_distillation import MultiLayerBasedDistillationEvaluator  
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')  
    # Some global variables  
    train_batch_size = 40  
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  
    learning_rate = 1e-5  
    num_epoch = 10  
    # define student and teacher model
    # output_attentions=True is required because output_adaptor below reads the attention maps
    # Teacher Model
    bert_config = BertConfig(num_hidden_layers=12, hidden_size=60, intermediate_size=60, output_hidden_states=True,
                             output_attentions=True)
    teacher_model = BertModel(bert_config)
    # Student Model
    bert_config = BertConfig(num_hidden_layers=3, hidden_size=60, intermediate_size=60, output_hidden_states=True,
                             output_attentions=True)
    student_model = BertModel(bert_config)
    ### Train data loader  
    input_ids = torch.LongTensor(np.random.randint(100, 1000, (100000, 50)))  
    attention_mask = torch.LongTensor(np.ones((100000, 50)))  
    token_type_ids = torch.LongTensor(np.zeros((100000, 50)))  
    train_data = TensorDataset(input_ids, attention_mask, token_type_ids)  
    train_sampler = RandomSampler(train_data)  
    train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=train_batch_size)  
    ### Train data adaptor
    ### A function that turns batch_data (from train_dataloader) into the inputs of teacher_model and student_model.
    ### You can define your own train_data_adaptor. Remember that its inputs must be device and batch_data.
    ### The output is either a dict or a tuple, but it must be consistent with your model's input.
    def train_data_adaptor(device, batch_data):
        batch_data = tuple(t.to(device) for t in batch_data)
        batch_data_dict = {"input_ids": batch_data[0],
                           "attention_mask": batch_data[1],
                           "token_type_ids": batch_data[2], }
        # In this case, the teacher and student use the same input
        return batch_data_dict, batch_data_dict
    ### The loss model is the key part of the distillation.
    ### We already provide a general loss model for distilling multiple BERT layers.
    ### In most cases, you can use this model directly.
    #### First, define a distill_config that indicates how to compute the loss between teacher and student.
    #### distill_config is a list; each item describes one loss term,
    #### including which output of which layer it is computed on.
    #### It should be consistent with your output_adaptor.
    distill_config = [
        # compute a loss between the embeddings of the two embedding layers
        {"teacher_layer_name": "embedding_layer", "teacher_layer_output_name": "embedding",
         "student_layer_name": "embedding_layer", "student_layer_output_name": "embedding",
         "loss": {"loss_function": "mse_with_mask", "args": {}}, "weight": 1.0
         },
        # compute a loss between the teacher's bert_layer12 hidden_states and the student's bert_layer3 hidden_states
        {"teacher_layer_name": "bert_layer12", "teacher_layer_output_name": "hidden_states",
         "student_layer_name": "bert_layer3", "student_layer_output_name": "hidden_states",
         "loss": {"loss_function": "mse_with_mask", "args": {}}, "weight": 1.0
         },
        # the same pair of layers, compared on their attention maps
        {"teacher_layer_name": "bert_layer12", "teacher_layer_output_name": "attention",
         "student_layer_name": "bert_layer3", "student_layer_output_name": "attention",
         "loss": {"loss_function": "attention_mse_with_mask", "args": {}}, "weight": 1.0
         },
        # compare the pooled outputs of the two models
        {"teacher_layer_name": "pred_layer", "teacher_layer_output_name": "pooler_output",
         "student_layer_name": "pred_layer", "student_layer_output_name": "pooler_output",
         "loss": {"loss_function": "mse", "args": {}}, "weight": 1.0
         },
    ]
    ### teacher_output_adaptor and student_output_adaptor
    ### In most cases a model's output is a tuple. However, this package needs the output to be a dict,
    ### like: { "layer_name":{"output_name":value} .... }
    ### Hence, the output adaptor turns your model's output into such a dict.
    ### In my case, teacher and student can share one adaptor.
    def output_adaptor(model_output):
        # Index access works for plain tuples and for transformers' ModelOutput objects
        pooler_output, hidden_states, attentions = model_output[1], model_output[2], model_output[3]
        output = {"embedding_layer": {"embedding": hidden_states[0]}}
        for idx in range(len(attentions)):
            output["bert_layer" + str(idx + 1)] = {"hidden_states": hidden_states[idx + 1],
                                                   "attention": attentions[idx]}
        output["pred_layer"] = {"pooler_output": pooler_output}
        return output
    # loss_model
    loss_model = MultiLayerBasedDistillationLoss(distill_config=distill_config,
                                                 teacher_output_adaptor=output_adaptor,
                                                 student_output_adaptor=output_adaptor)
    # optimizer
    # no weight decay for biases and LayerNorm parameters
    param_optimizer = list(student_model.named_parameters())
    no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
    optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
    ]
    optimizer = torch.optim.Adam(params=optimizer_grouped_parameters, lr=learning_rate)
    # evaluator
    # this is a basic evaluator; it can output the loss value and save models
    # You can define your own evaluator class that implements the interface IEvaluator
    evaluator = MultiLayerBasedDistillationEvaluator(save_dir="save_model", save_step=1000, print_loss_step=20)
    # Get a KnowledgeDistiller
    distiller = KnowledgeDistiller(teacher_model=teacher_model, student_model=student_model,
                                   train_dataloader=train_dataloader, dev_dataloader=None,
                                   train_data_adaptor=train_data_adaptor, dev_data_adaptor=None,
                                   device=device, loss_model=loss_model, optimizer=optimizer,
                                   evaluator=evaluator, num_epoch=num_epoch)
    # start distillate
    # (this call is reconstructed from the comment above; check the library's README for the exact method name)
    distiller.distillate()

    7.2 The TextBrewer Library



    import torch
    import numpy as np
    import pickle
    import textbrewer
    from textbrewer import GeneralDistiller
    from textbrewer import TrainingConfig, DistillationConfig
    from transformers import BertConfig, BertModel
    from torch.utils.data import DataLoader, RandomSampler, TensorDataset
    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
    ## Define the models
    bert_config = BertConfig(num_hidden_layers=12, output_hidden_states=True, output_attentions=True)
    teacher_model = BertModel(bert_config).to(device)
    bert_config = BertConfig(num_hidden_layers=3, output_hidden_states=True, output_attentions=True)
    student_model = BertModel(bert_config).to(device)
    # optimizer
    # no weight decay for biases and LayerNorm parameters
    param_optimizer = list(student_model.named_parameters())
    no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
    optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
    ]
    optimizer = torch.optim.Adam(params=optimizer_grouped_parameters, lr=2e-5)
    ### data
    input_ids = torch.LongTensor(np.random.randint(100, 1000, (100000, 64)))
    attention_mask = torch.LongTensor(np.ones((100000, 64)))
    token_type_ids = torch.LongTensor(np.zeros((100000, 64)))
    train_data = TensorDataset(input_ids, attention_mask, token_type_ids)
    train_sampler = RandomSampler(train_data)
    train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=16)
    # Define an adaptor for translating the model inputs and outputs
    # It packs them into the data format that the distiller expects
    # The dict keys are predefined by TextBrewer (here: 'inputs_mask', 'attention', 'hidden')
    def bert_adaptor(batch, model_outputs):
        # Index access works for plain tuples and for transformers' ModelOutput objects
        hidden_states, attentions = model_outputs[2], model_outputs[3]
        output = {"inputs_mask": batch[1],
                  "attention": attentions,
                  "hidden": list(hidden_states)}
        return output
    # Training configuration
    # (the remaining arguments of the original call were lost; TextBrewer's defaults are used for them)
    train_config = TrainingConfig(gradient_accumulation_steps=1, device=device)
    # Distillation configuration
    # Matching different layers of the student and the teacher
    # Important: this defines how the distillation is done
    # Custom loss functions are not supported
    # There is no dedicated CLS loss, but it can be forced into a hidden loss
    distill_config = DistillationConfig(
        intermediate_matches=[
            {'layer_T': 0, 'layer_S': 0, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1},  # embedding loss
            {'layer_T': 4, 'layer_S': 1, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1},  # hidden loss
            {'layer_T': 8, 'layer_S': 2, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1},
            {'layer_T': 12, 'layer_S': 3, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1},
            {'layer_T': 3, 'layer_S': 0, 'feature': 'attention', 'loss': 'attention_mse', 'weight': 1},  # attention loss
            {'layer_T': 7, 'layer_S': 1, 'feature': 'attention', 'loss': 'attention_mse', 'weight': 1},
            {'layer_T': 11, 'layer_S': 2, 'feature': 'attention', 'loss': 'attention_mse', 'weight': 1},
            {'layer_T': 12, 'layer_S': 3, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1},  # effectively the CLS loss
        ])
    # Build distiller
    distiller = GeneralDistiller(
        train_config=train_config, distill_config=distill_config,
        model_T=teacher_model, model_S=student_model,
        adaptor_T=bert_adaptor, adaptor_S=bert_adaptor)
    # Start!
    # A callback can be used to evaluate on the dev set
    # Note that what gets saved is the state_dict
    with distiller:
        distiller.train(optimizer=optimizer, scheduler=None, dataloader=train_dataloader, num_epochs=10, callback=None)

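    8 Other Ways to Speed Up BERT


    1. Upgrade the hardware. At the time of writing, the RTX 30-series GPUs look like the best value for money.
    2. Upgrading the deep learning framework can often improve training and inference speed. For example, newer versions of TensorFlow support mkldnn and the AVX instruction set.
    3. ONNXRuntime (mainly used for inference)
    4. Quantization of BERT
    5. Structured Dropout is worth a look, but it is best applied during pretraining, otherwise the results are poor. See the ICLR 2020 paper: Reducing Transformer Depth on Demand with Structured Dropout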

    This article may be reposted, but please credit the source:

  • Original article: https://www.cnblogs.com/infgrad/p/13767918.html