  • 22. Google's MMoE Multi-Task Learning Model (repost)

    The paper appeared in the KDD 2018 Research Track: Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts

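    I. Abstract

    Multi-task learning is used in many applications, such as recommender systems. In movie recommendation, for example, a user may both purchase and rate the movies they like to watch, so one model can predict purchases and ratings at the same time.

    Multi-task learning is often sensitive to how related the tasks are, so it is important to balance the task-specific objectives against the relationships between tasks.

    MMoE learns the relationships between tasks. The paper adapts the Mixture-of-Experts (MoE) structure to multi-task learning by sharing the expert subnetworks across all tasks, with a gating structure trained to optimize each task.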

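    II. Introduction

    • Many DNN-based multi-task models are sensitive to factors such as imbalanced data distributions and task relatedness; inherent conflicts between tasks can hurt the predictions of some of them.
    • Some papers propose new modeling techniques to handle task differences in multi-task learning, but these techniques usually add many more parameters per task, which increases the computational cost.
    • MMoE: learns the relationships between tasks and the task-specific functions, automatically allocating parameters to capture either shared or task-specific information, without adding many new parameters for each task.

    By learning both what the tasks have in common and how they differ, a multi-task model can improve the learning efficiency and quality of every task.

    (1) Multi-task learning frameworks widely adopt the shared-bottom structure, in which the tasks share the bottom hidden layers.

    This structure reduces the risk of overfitting, but its performance can suffer from task differences and from differences in data distribution.

    (2) Other structures exist as well. In one, two tasks do not share parameters, but an L2-norm constraint is placed between the tasks' parameters; in another, each task learns its own set of hidden layers and the model then learns a combination of all of them.

    Compared with the shared-bottom structure, these models add task-specific parameters, which improves the final results when task differences would otherwise interfere with the shared parameters.

    The drawback is the larger parameter count: these models need more training data, and their added complexity makes them harder to deploy in real production environments.

    The paper therefore proposes a multi-task learning structure called Multi-gate Mixture-of-Experts (MMoE). MMoE models task relatedness explicitly and learns task-specific functions on top of shared representations, without a significant increase in parameters.

    The MMoE structure (figure (c) below) builds on the widely used Shared-Bottom structure (figure (a) below) and the MoE structure; figure (b) is a special case of figure (c).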

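    III. The General Multi-Task Learning Model

    1. Framework:

    As figure (a) above shows, the shared-bottom network (denoted as a function f) sits at the bottom and is shared by all tasks. Above it, each of the K tasks has its own tower network (denoted h^{k}, k = 1, ..., K), and the output of task k is y_{k} = h^{k}(f(x)).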
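    The framework above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the paper's implementation: the layer sizes, the single-layer towers, and the names W_shared, W_towers, and shared_bottom_forward are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)


def relu(z):
    return np.maximum(z, 0.0)


# Hypothetical sizes, chosen only for illustration.
d, hidden, K, batch = 6, 4, 2, 3

# Shared-bottom network f: a single layer used by every task.
W_shared = rng.normal(size=(d, hidden))
# One tower network h^k per task; here each tower is one linear layer.
W_towers = [rng.normal(size=(hidden, 1)) for _ in range(K)]


def shared_bottom_forward(x):
    """Return [y_1, ..., y_K] with y_k = h^k(f(x))."""
    shared = relu(x @ W_shared)               # f(x), shape (batch, hidden)
    return [shared @ W_k for W_k in W_towers]  # each y_k has shape (batch, 1)


x = rng.normal(size=(batch, d))
outputs = shared_bottom_forward(x)
print([y.shape for y in outputs])  # [(3, 1), (3, 1)]
```

    Every task reads the same f(x), which is why conflicting tasks compete for the same shared parameters in this structure.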

     

     

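    2. Task-Relatedness Experiment

    Next, an experiment explores how task relatedness affects multi-task learning.

    Suppose the model has two regression tasks. The data is generated synthetically, with the same input but different labels for the two tasks. Task relatedness is then measured by the Pearson correlation coefficient between the labels: the larger the coefficient, the more related the tasks. The data is generated as follows:

    First, generate two orthogonal unit vectors u1 and u2, and derive the model coefficients w1 and w2 from them, as in step 2 of the figure above. The cosine similarity between w1 and w2 is then p, which can be checked with the cosine formula.

    Then the input x is sampled from a normal distribution, and y is obtained from the following two equations:

    Note that x and y are not linearly related here, because the generating function includes a sum of several sine terms. The Pearson correlation between the labels therefore does not equal the cosine similarity between w1 and w2, but the two are positively correlated, as the figure below shows:

    The paper therefore uses the cosine similarity of the parameters as a proxy for task relatedness.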
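    The generation procedure can be sketched as follows. The figures and equations are missing from this post, so the formulas here follow my reading of the paper's procedure and should be treated as assumptions: w1 = c * u1, w2 = c * (p * u1 + sqrt(1 - p^2) * u2), and each label is a linear term plus a sum of sine terms plus noise. The values of c, m, the sine coefficients (alphas, betas), and the noise scale are arbitrary choices for illustration.

```python
import numpy as np


def make_synthetic_tasks(n, d, p, c=1.0, m=3, seed=0):
    """Generate two regression tasks whose weight vectors have cosine similarity p."""
    rng = np.random.default_rng(seed)
    # Step 1: two orthogonal unit vectors, taken from a QR decomposition.
    q, _ = np.linalg.qr(rng.normal(size=(d, 2)))
    u1, u2 = q[:, 0], q[:, 1]
    # Step 2: weight vectors with cosine similarity exactly p.
    w1 = c * u1
    w2 = c * (p * u1 + np.sqrt(1.0 - p ** 2) * u2)
    # Step 3: inputs sampled from a standard normal distribution.
    x = rng.normal(size=(n, d))
    # Step 4: labels = linear term + sum of sine terms + small noise.
    alphas = rng.uniform(0.5, 2.0, size=m)  # assumed sine frequencies
    betas = rng.uniform(0.0, 1.0, size=m)   # assumed sine phases

    def label(w):
        z = x @ w
        sines = sum(np.sin(a * z + b) for a, b in zip(alphas, betas))
        return z + sines + rng.normal(scale=0.1, size=n)

    return x, label(w1), label(w2), w1, w2


x, y1, y2, w1, w2 = make_synthetic_tasks(n=10000, d=20, p=0.5)
cosine = w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2))
pearson = np.corrcoef(y1, y2)[0, 1]
print('cosine similarity of weights:', round(float(cosine), 3))
print('Pearson correlation of labels:', round(float(pearson), 3))
```

    The cosine similarity equals p by construction, while the Pearson correlation of the labels generally differs from it because of the sine terms.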

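    3. Results

    Using the data generation process and the relatedness measure above, the multi-task model is tested at task correlations of 0.5, 0.9, and 1, as shown in the figure below:

    As task relatedness increases, the model's loss decreases and its performance improves, which confirms the earlier hypothesis.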

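    IV. The MMoE Model

    1. The MoE Model

    First, consider the Mixture-of-Experts (MoE) model (referred to later in the paper as One-gate Mixture-of-Experts, OMoE), shown in the figure below:

    Compared with the general multi-task framework, the shared bottom is split into multiple experts, and a gate is added so that different inputs can use the shared layer in different ways. The output of the shared layer can then be written as y = sum_{i=1}^{n}(g(x)_{i} * f_{i}(x)).

    Here f_{i}, i = 1, ..., n, are the n expert networks (each can be regarded as a neural network), and f_{i}(x) is the output of the i-th expert. g(x)_{i} is the weight of the i-th expert; it is computed from the input as g(x) = softmax(W_{g} * x), where W_{g} is a trainable n × d matrix and d is the input dimension. g is the gating network that combines the experts' results: it produces a probability distribution over the n experts, and the final output is the weighted sum of all expert outputs. MoE can thus be viewed as an ensemble method built on several independent models.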
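    A minimal NumPy sketch of this one-gate forward pass follows. The sizes, the single-layer ReLU experts, and the names W_experts, W_g, and moe_forward are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


d, units, n, batch = 5, 3, 4, 2  # hypothetical sizes

W_experts = rng.normal(size=(n, d, units))  # one weight matrix per expert
W_g = rng.normal(size=(n, d))               # gating weights, an n x d matrix


def moe_forward(x):
    """y = sum_i g(x)_i * f_i(x), with g(x) = softmax(W_g x)."""
    gate = softmax(x @ W_g.T)                                    # (batch, n)
    experts = np.stack(
        [np.maximum(x @ W, 0.0) for W in W_experts], axis=1)     # (batch, n, units)
    y = (gate[:, :, None] * experts).sum(axis=1)                 # (batch, units)
    return y, gate


x = rng.normal(size=(batch, d))
y, gate = moe_forward(x)
print(y.shape, gate.sum(axis=1))
```

    Each row of gate sums to 1, so the output is a convex combination of the expert outputs for that input.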

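    Later work uses MoE as a basic building block and stacks multiple MoE layers in one large network: an MoE layer can take the output of the previous MoE layer as its input and pass its own output on to the next layer.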

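    2. The MMoE Model

    The proposed model (MMoE for short) aims to capture task differences without requiring significantly more parameters than the shared-bottom structure. Its core idea is to replace the function f of the shared-bottom network with an MoE layer.

    Unlike the MoE model, Multi-gate Mixture-of-Experts (MMoE) gives every task its own gate, so that different tasks and different inputs can use the shared layer in different ways. The model structure is as follows:

    The shared-layer output now differs per task. For task k it is f^{k}(x) = sum_{i=1}^{n}(g^{k}(x)_{i} * f_{i}(x)), with g^{k}(x) = softmax(W_{gk} * x).

    Each gating network takes the input features as its input and outputs weights over all experts. Because the gating networks are usually lightweight and the expert networks are shared by all tasks, MMoE has an advantage in computation and parameter count over some of the baseline methods discussed in the paper.

    Each task's shared-layer output then passes through a multi-layer fully connected network (the tower) to produce the task's output: y_{k} = h^{k}(f^{k}(x)).

    Intuitively, if two tasks are not closely related, their gates will produce quite different weights, so each task draws on only some of the experts' outputs and the model behaves like several single-task models. If two tasks are closely related, their gate weight distributions should be similar, and the model behaves like the general multi-task framework.

    In contrast to sharing one gating network across all tasks (the One-gate MoE model, figure (b) above), each task in MMoE (figure (c) above) has its own gating network. Each task's gate uses the experts selectively by outputting different weights. Because the gates of different tasks can learn different ways of combining the experts, the model captures both the relatedness of the tasks and their differences.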

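    The experts are sub-networks obtained by partitioning the shared layer; in the implementation they can be stored as a single three-dimensional tensor of shape:

    dim of input feature * number of units per expert * number of experts

    Training updates this three-dimensional tensor.

    The gates have shape:
    dim of input feature * number of experts * number of tasks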
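    These shapes make it easy to count the layer's weights. The helper below is a hypothetical illustration (the name mmoe_parameter_counts is not from the original code). The bias shapes follow the layer code later in this post, and the example values (499 input features, 4 units, 8 experts, 2 tasks) are the settings used in the census-income code.

```python
def mmoe_parameter_counts(input_dim, units, num_experts, num_tasks):
    """Count the trainable weights of one MMoE layer (kernels plus biases)."""
    counts = {
        'expert_kernel': input_dim * units * num_experts,
        'expert_bias': units * num_experts,
        'gate_kernel': input_dim * num_experts * num_tasks,
        'gate_bias': num_experts * num_tasks,
    }
    counts['total'] = sum(counts.values())
    return counts


counts = mmoe_parameter_counts(input_dim=499, units=4, num_experts=8, num_tasks=2)
print(counts['expert_kernel'], counts['gate_kernel'], counts['total'])  # 15968 7984 24000
```

    Adding a task adds only one more gate (input_dim * num_experts weights plus its bias); the expert tensor is unchanged.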

    Finally, a few small details of the network are pasted here for reference:

    f_{i}(x) = activation(W_{i} * x + b), where activation is ReLU according to the paper

    g^{k}(x) = activation(W_{gk} * x + b), where activation is softmax according to the paper
    f^{k}(x) = sum_{i=1}^{n}(g^{k}(x)_{i} * f_{i}(x))
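    The three formulas above translate almost line by line into NumPy. The sketch below is an illustration under assumed sizes, not the Keras layer itself; it uses single-layer experts and keeps the tensor layouts described above (experts stacked on the last axis, one gate matrix per task).

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


d, units, n_experts, n_tasks, batch = 5, 3, 4, 2, 6  # hypothetical sizes

W = rng.normal(size=(d, units, n_experts))      # expert weights W_i, stacked on the last axis
b = np.zeros((units, n_experts))                # expert biases
W_g = rng.normal(size=(n_tasks, d, n_experts))  # one gate matrix W_gk per task
b_g = np.zeros((n_tasks, n_experts))            # gate biases


def mmoe_forward(x):
    # f_i(x) = ReLU(W_i x + b_i) for every expert i
    f = [np.maximum(x @ W[:, :, i] + b[:, i], 0.0) for i in range(n_experts)]
    outputs = []
    for k in range(n_tasks):
        # g^k(x) = softmax(W_gk x + b_gk), shape (batch, n_experts)
        g_k = softmax(x @ W_g[k] + b_g[k])
        # f^k(x) = sum_i g^k(x)_i * f_i(x), shape (batch, units)
        f_k = sum(g_k[:, [i]] * f[i] for i in range(n_experts))
        outputs.append(f_k)
    return outputs


x = rng.normal(size=(batch, d))
task_inputs = mmoe_forward(x)
print([t.shape for t in task_inputs])  # [(6, 3), (6, 3)]
```

    Each element of task_inputs would then be fed to that task's tower network.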

     

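    V. Experimental Results

    1. Synthetic Dataset

    The figure below shows the results; OMoE is the one-gate MoE. On highly correlated data OMoE and MMoE perform similarly, but on weakly correlated data MMoE outperforms the other two methods by a wide margin.

    2. UCI census-income dataset

    3. Large-scale Content Recommendation

    VI. Main Code

    1. Imports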

    import pandas as pd
    from keras.utils import to_categorical
    from keras import backend as K
    from keras.optimizers import Adam
    from keras.initializers import VarianceScaling
    from keras.layers import Input, Dense
    from keras.models import Model
    from keras.callbacks import Callback
    from sklearn.metrics import roc_auc_score
    
    import numpy as np
    import random
    
    import tensorflow as tf
    from mmoe import MMoE  # model code (the MMoE layer class)
    
    SEED = 1
    
    # Fix numpy seed for reproducibility
    np.random.seed(SEED)
    
    # Fix random seed for reproducibility
    random.seed(SEED)
    
    # Fix TensorFlow graph-level seed for reproducibility
    tf.set_random_seed(SEED)
    
    
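    # Set up the TensorFlow session (TF 1.x API, matching tf.set_random_seed above)
    tf_session = tf.Session(graph=tf.get_default_graph())
    K.set_session(tf_session)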

    2. Loading the Data: 1994 Income Data

    column_names = ['age', 'class_worker', 'det_ind_code', 'det_occ_code', 'education', 'wage_per_hour', 'hs_college',
                    'marital_stat', 'major_ind_code', 'major_occ_code', 'race', 'hisp_origin', 'sex', 'union_member',
                    'unemp_reason', 'full_or_part_emp', 'capital_gains', 'capital_losses', 'stock_dividends',
                    'tax_filer_stat', 'region_prev_res', 'state_prev_res', 'det_hh_fam_stat', 'det_hh_summ',
                    'instance_weight', 'mig_chg_msa', 'mig_chg_reg', 'mig_move_reg', 'mig_same', 'mig_prev_sunbelt',
                    'num_emp', 'fam_under_18', 'country_father', 'country_mother', 'country_self', 'citizenship',
                    'own_or_self', 'vet_question', 'vet_benefits', 'weeks_worked', 'year', 'income_50k']
    
    # Load the dataset in Pandas
    train_df = pd.read_csv(
        'data/census-income.data.gz',
        delimiter=',',
        header=None,
        index_col=None,
        names=column_names
    )
    other_df = pd.read_csv(
        'data/census-income.test.gz',
        delimiter=',',
        header=None,
        index_col=None,
        names=column_names
    )

    Split features and labels

    label_columns = ['income_50k', 'marital_stat']
    
    # One-hot encoding categorical columns
    categorical_columns = ['class_worker', 'det_ind_code', 'det_occ_code', 'education', 'hs_college', 'major_ind_code',
                           'major_occ_code', 'race', 'hisp_origin', 'sex', 'union_member', 'unemp_reason',
                           'full_or_part_emp', 'tax_filer_stat', 'region_prev_res', 'state_prev_res', 'det_hh_fam_stat',
                           'det_hh_summ', 'mig_chg_msa', 'mig_chg_reg', 'mig_move_reg', 'mig_same', 'mig_prev_sunbelt',
                           'fam_under_18', 'country_father', 'country_mother', 'country_self', 'citizenship',
                           'vet_question']
    train_raw_labels = train_df[label_columns]
    other_raw_labels = other_df[label_columns]
    transformed_train = pd.get_dummies(train_df.drop(label_columns, axis=1), columns=categorical_columns)
    transformed_other = pd.get_dummies(other_df.drop(label_columns, axis=1), columns=categorical_columns)
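    One-hot encoding the two files separately can yield different column sets when a category value occurs in only one of them. A small sketch of one way to align them, using toy data (the frames and the 'race' values here are made up for illustration):

```python
import pandas as pd

train = pd.DataFrame({'age': [30, 41, 25], 'race': ['A', 'B', 'C']})
other = pd.DataFrame({'age': [52, 19], 'race': ['A', 'B']})

train_enc = pd.get_dummies(train, columns=['race'])
other_enc = pd.get_dummies(other, columns=['race'])

# Indicator columns present in the training set but absent from the other set.
missing = [c for c in train_enc.columns if c not in other_enc.columns]
print(missing)  # ['race_C']

# Reindex to the training columns: adds the missing indicators as 0
# and puts every column in the same position as in the training set.
other_aligned = other_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(other_aligned.columns) == list(train_enc.columns))  # True
```

    Column order matters because the model consumes the features by position, not by name.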

    Build the labels

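    # Fill in the column that is missing from the other set
    transformed_other['det_hh_fam_stat_ Grandchild <18 ever marr not in subfamily'] = 0
    # The assignment above appends the column at the end, so restore the training column order
    transformed_other = transformed_other.reindex(columns=transformed_train.columns, fill_value=0)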
    
    # One-hot encoding categorical labels
    train_income = to_categorical((train_raw_labels.income_50k == ' 50000+.').astype(int), num_classes=2)  # 1 if income is above 50,000, else 0
    train_marital = to_categorical((train_raw_labels.marital_stat == ' Never married').astype(int), num_classes=2)  # 1 if never married, else 0
    other_income = to_categorical((other_raw_labels.income_50k == ' 50000+.').astype(int), num_classes=2)
    other_marital = to_categorical((other_raw_labels.marital_stat == ' Never married').astype(int), num_classes=2)
    dict_outputs = {
        'income': train_income.shape[1],
        'marital': train_marital.shape[1]
    }  # dict_outputs = {'income': 2, 'marital': 2}
    dict_train_labels = {'income': train_income, 'marital': train_marital}
    dict_other_labels = {'income': other_income, 'marital': other_marital}
    output_info = [(dict_outputs[key], key) for key in sorted(dict_outputs.keys())]  # output_info = [(2, 'income'), (2, 'marital')]

    Split into training, validation, and test sets

    # Split the other dataset into 1:1 validation to test according to the paper
    validation_indices = transformed_other.sample(frac=0.5, replace=False, random_state=SEED).index
    test_indices = list(set(transformed_other.index) - set(validation_indices))
    validation_data = transformed_other.iloc[validation_indices]
    validation_label = [dict_other_labels[key][validation_indices] for key in sorted(dict_other_labels.keys())]
    test_data = transformed_other.iloc[test_indices]
    test_label = [dict_other_labels[key][test_indices] for key in sorted(dict_other_labels.keys())]
    train_data = transformed_train
    train_label = [dict_train_labels[key] for key in sorted(dict_train_labels.keys())]
    
    num_features = train_data.shape[1]
    print('Training data shape = {}'.format(train_data.shape))
    print('Validation data shape = {}'.format(validation_data.shape))
    print('Test data shape = {}'.format(test_data.shape))
    
    
    ############
    # Training data shape = (199523, 499)
    # Validation data shape = (49881, 499)
    # Test data shape = (49881, 499)

    3. Building the Model

    Input layer

    input_layer = Input(shape=(num_features,))

    MMoE layer

    mmoe_layers = MMoE(
        units=4,
        num_experts=8,
        num_tasks=2
    )(input_layer)

    output_layers = []

    The MMoE layer class:

    from keras import backend as K
    from keras import activations, initializers, regularizers, constraints
    from keras.engine.topology import Layer, InputSpec
    
    
    class MMoE(Layer):
        """
        Multi-gate Mixture-of-Experts model.
        """
    
        def __init__(self,
                     units,
                     num_experts,
                     num_tasks,
                     use_expert_bias=True,
                     use_gate_bias=True,
                     expert_activation='relu',
                     gate_activation='softmax',
                     expert_bias_initializer='zeros',
                     gate_bias_initializer='zeros',
                     expert_bias_regularizer=None,
                     gate_bias_regularizer=None,
                     expert_bias_constraint=None,
                     gate_bias_constraint=None,
                     expert_kernel_initializer='VarianceScaling',
                     gate_kernel_initializer='VarianceScaling',
                     expert_kernel_regularizer=None,
                     gate_kernel_regularizer=None,
                     expert_kernel_constraint=None,
                     gate_kernel_constraint=None,
                     activity_regularizer=None,
                     **kwargs):
            """
             Method for instantiating MMoE layer.
    
            :param units: Number of hidden units
            :param num_experts: Number of experts
            :param num_tasks: Number of tasks
            :param use_expert_bias: Boolean to indicate the usage of bias in the expert weights
            :param use_gate_bias: Boolean to indicate the usage of bias in the gate weights
            :param expert_activation: Activation function of the expert weights
            :param gate_activation: Activation function of the gate weights
            :param expert_bias_initializer: Initializer for the expert bias
            :param gate_bias_initializer: Initializer for the gate bias
            :param expert_bias_regularizer: Regularizer for the expert bias
            :param gate_bias_regularizer: Regularizer for the gate bias
            :param expert_bias_constraint: Constraint for the expert bias
            :param gate_bias_constraint: Constraint for the gate bias
            :param expert_kernel_initializer: Initializer for the expert weights
            :param gate_kernel_initializer: Initializer for the gate weights
            :param expert_kernel_regularizer: Regularizer for the expert weights
            :param gate_kernel_regularizer: Regularizer for the gate weights
            :param expert_kernel_constraint: Constraint for the expert weights
            :param gate_kernel_constraint: Constraint for the gate weights
            :param activity_regularizer: Regularizer for the activity
            :param kwargs: Additional keyword arguments for the Layer class
            """
            # Hidden nodes parameter
            self.units = units
            self.num_experts = num_experts
            self.num_tasks = num_tasks
    
            # Weight parameter
            self.expert_kernels = None
            self.gate_kernels = None
            self.expert_kernel_initializer = initializers.get(expert_kernel_initializer)
            self.gate_kernel_initializer = initializers.get(gate_kernel_initializer)
            self.expert_kernel_regularizer = regularizers.get(expert_kernel_regularizer)
            self.gate_kernel_regularizer = regularizers.get(gate_kernel_regularizer)
            self.expert_kernel_constraint = constraints.get(expert_kernel_constraint)
            self.gate_kernel_constraint = constraints.get(gate_kernel_constraint)
    
            # Activation parameter
            self.expert_activation = activations.get(expert_activation)
            self.gate_activation = activations.get(gate_activation)
    
            # Bias parameter
            self.expert_bias = None
            self.gate_bias = None
            self.use_expert_bias = use_expert_bias
            self.use_gate_bias = use_gate_bias
            self.expert_bias_initializer = initializers.get(expert_bias_initializer)
            self.gate_bias_initializer = initializers.get(gate_bias_initializer)
            self.expert_bias_regularizer = regularizers.get(expert_bias_regularizer)
            self.gate_bias_regularizer = regularizers.get(gate_bias_regularizer)
            self.expert_bias_constraint = constraints.get(expert_bias_constraint)
            self.gate_bias_constraint = constraints.get(gate_bias_constraint)
    
            # Activity parameter
            self.activity_regularizer = regularizers.get(activity_regularizer)
    
            # Keras parameter
            self.input_spec = InputSpec(min_ndim=2)
            self.supports_masking = True
    
            super(MMoE, self).__init__(**kwargs)
    
        def build(self, input_shape):
            """
            Method for creating the layer weights.
    
            :param input_shape: Keras tensor (future input to layer)
                                or list/tuple of Keras tensors to reference
                                for weight shape computations
            """
            assert input_shape is not None and len(input_shape) >= 2
    
            input_dimension = input_shape[-1]
    
            # Initialize expert weights (number of input features * number of units per expert * number of experts)
            self.expert_kernels = self.add_weight(
                name='expert_kernel',
                shape=(input_dimension, self.units, self.num_experts),
                initializer=self.expert_kernel_initializer,
                regularizer=self.expert_kernel_regularizer,
                constraint=self.expert_kernel_constraint,
            )
    
            # Initialize expert bias (number of units per expert * number of experts)
            if self.use_expert_bias:
                self.expert_bias = self.add_weight(
                    name='expert_bias',
                    shape=(self.units, self.num_experts),
                    initializer=self.expert_bias_initializer,
                    regularizer=self.expert_bias_regularizer,
                    constraint=self.expert_bias_constraint,
                )
    
            # Initialize gate weights (number of input features * number of experts * number of tasks)
            self.gate_kernels = [self.add_weight(
                name='gate_kernel_task_{}'.format(i),
                shape=(input_dimension, self.num_experts),
                initializer=self.gate_kernel_initializer,
                regularizer=self.gate_kernel_regularizer,
                constraint=self.gate_kernel_constraint
            ) for i in range(self.num_tasks)]
    
            # Initialize gate bias (number of experts * number of tasks)
            if self.use_gate_bias:
                self.gate_bias = [self.add_weight(
                    name='gate_bias_task_{}'.format(i),
                    shape=(self.num_experts,),
                    initializer=self.gate_bias_initializer,
                    regularizer=self.gate_bias_regularizer,
                    constraint=self.gate_bias_constraint
                ) for i in range(self.num_tasks)]
    
            self.input_spec = InputSpec(min_ndim=2, axes={-1: input_dimension})
    
            super(MMoE, self).build(input_shape)
    
        def call(self, inputs, **kwargs):
            """
            Method for the forward function of the layer.
    
            :param inputs: Input tensor
            :param kwargs: Additional keyword arguments for the base method
            :return: A tensor
            """
            gate_outputs = []
            final_outputs = []
    
            # f_{i}(x) = activation(W_{i} * x + b), where activation is ReLU according to the paper; expert_outputs has shape (batch_size, units per expert, number of experts)
            expert_outputs = K.tf.tensordot(a=inputs, b=self.expert_kernels, axes=1)
            # Add the bias term to the expert weights if necessary
            if self.use_expert_bias:
                expert_outputs = K.bias_add(x=expert_outputs, bias=self.expert_bias)
            expert_outputs = self.expert_activation(expert_outputs)
    
            # g^{k}(x) = activation(W_{gk} * x + b), where activation is softmax according to the paper; each gate_output has shape (batch_size, number of experts)
            for index, gate_kernel in enumerate(self.gate_kernels):
                gate_output = K.dot(x=inputs, y=gate_kernel)
                # Add the bias term to the gate weights if necessary
                if self.use_gate_bias:
                    gate_output = K.bias_add(x=gate_output, bias=self.gate_bias[index])
                gate_output = self.gate_activation(gate_output)
                gate_outputs.append(gate_output)
    
            # f^{k}(x) = sum_{i=1}^{n}(g^{k}(x)_{i} * f_{i}(x))
            for gate_output in gate_outputs:
                expanded_gate_output = K.expand_dims(gate_output, axis=1)
                weighted_expert_output = expert_outputs * K.repeat_elements(expanded_gate_output, self.units, axis=1)
                final_outputs.append(K.sum(weighted_expert_output, axis=2))
    
            return final_outputs
    
        def compute_output_shape(self, input_shape):
            """
            Method for computing the output shape of the MMoE layer.
    
            :param input_shape: Shape tuple (tuple of integers)
            :return: List of output shape tuples, one per task
            """
            assert input_shape is not None and len(input_shape) >= 2
    
            output_shape = list(input_shape)
            output_shape[-1] = self.units
            output_shape = tuple(output_shape)
    
            return [output_shape for _ in range(self.num_tasks)]
    
        def get_config(self):
            """
            Method for returning the configuration of the MMoE layer.
    
            :return: Config dictionary
            """
            config = {
                'units': self.units,
                'num_experts': self.num_experts,
                'num_tasks': self.num_tasks,
                'use_expert_bias': self.use_expert_bias,
                'use_gate_bias': self.use_gate_bias,
                'expert_activation': activations.serialize(self.expert_activation),
                'gate_activation': activations.serialize(self.gate_activation),
                'expert_bias_initializer': initializers.serialize(self.expert_bias_initializer),
                'gate_bias_initializer': initializers.serialize(self.gate_bias_initializer),
                'expert_bias_regularizer': regularizers.serialize(self.expert_bias_regularizer),
                'gate_bias_regularizer': regularizers.serialize(self.gate_bias_regularizer),
                'expert_bias_constraint': constraints.serialize(self.expert_bias_constraint),
                'gate_bias_constraint': constraints.serialize(self.gate_bias_constraint),
                'expert_kernel_initializer': initializers.serialize(self.expert_kernel_initializer),
                'gate_kernel_initializer': initializers.serialize(self.gate_kernel_initializer),
                'expert_kernel_regularizer': regularizers.serialize(self.expert_kernel_regularizer),
                'gate_kernel_regularizer': regularizers.serialize(self.gate_kernel_regularizer),
                'expert_kernel_constraint': constraints.serialize(self.expert_kernel_constraint),
                'gate_kernel_constraint': constraints.serialize(self.gate_kernel_constraint),
                'activity_regularizer': regularizers.serialize(self.activity_regularizer)
            }
            base_config = super(MMoE, self).get_config()
    
            return dict(list(base_config.items()) + list(config.items()))
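    The tensor operations in call() can be checked in plain NumPy. The sketch below mirrors the same steps (tensordot over the feature axis, expand, repeat, multiply, sum over experts) on random data; the sizes are arbitrary and a Dirichlet sample stands in for one softmax gate output. It is an illustration of the shape flow, not a replacement for the layer.

```python
import numpy as np

rng = np.random.default_rng(1)
batch, d, units, n_experts = 4, 6, 3, 5  # hypothetical sizes

x = rng.normal(size=(batch, d))
expert_kernels = rng.normal(size=(d, units, n_experts))
# Stand-in for one gate's softmax output: each row is a distribution over experts.
gate = rng.dirichlet(np.ones(n_experts), size=batch)          # (batch, n_experts)

# tensordot with axes=1 contracts the feature axis:
# (batch, d) . (d, units, n_experts) -> (batch, units, n_experts)
expert_outputs = np.maximum(np.tensordot(x, expert_kernels, axes=1), 0.0)

# Same broadcasting steps as in call().
expanded = np.expand_dims(gate, axis=1)                       # (batch, 1, n_experts)
repeated = np.repeat(expanded, units, axis=1)                 # (batch, units, n_experts)
mixed = (expert_outputs * repeated).sum(axis=2)               # (batch, units)

# Equivalent single contraction, for comparison.
mixed_einsum = np.einsum('bue,be->bu', expert_outputs, gate)
print(mixed.shape, np.allclose(mixed, mixed_einsum))  # (4, 3) True
```

    The explicit repeat is not strictly needed, since broadcasting the (batch, 1, n_experts) gate against the expert outputs gives the same product.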

    Output layer (tower layer)

    # Build tower layer from MMoE layer
    for index, task_layer in enumerate(mmoe_layers):
        tower_layer = Dense(
            units=8,
            activation='relu',
            kernel_initializer=VarianceScaling())(task_layer)
        output_layer = Dense(
            units=output_info[index][0],
            name=output_info[index][1],
            activation='softmax',
            kernel_initializer=VarianceScaling())(tower_layer)
        output_layers.append(output_layer)

    4. Training the Model

    model = Model(inputs=[input_layer], outputs=output_layers)
    adam_optimizer = Adam()
    model.compile(
        loss={'income': 'binary_crossentropy', 'marital': 'binary_crossentropy'},
        optimizer=adam_optimizer,
        metrics=['accuracy']
    )
    # Print out model architecture summary
    model.summary()
    
    # Train the model
    model.fit(
        x=train_data,
        y=train_label,
        validation_data=(validation_data, validation_label),
        callbacks=[
            ROCCallback(
                training_data=(train_data, train_label),
                validation_data=(validation_data, validation_label),
                test_data=(test_data, test_label)
            )
        ],
        epochs=100
    )
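    ROCCallback is not defined in this post. Judging by its name and the roc_auc_score import, it presumably reports the ROC AUC of each task on the training, validation, and test sets after every epoch. The sketch below only illustrates that per-task metric in plain NumPy; roc_auc and auc_per_task are hypothetical helpers, and the pairwise computation is meant for small inputs (in practice sklearn's roc_auc_score would be the natural choice).

```python
import numpy as np


def roc_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg)))


def auc_per_task(labels, predictions, task_names):
    """labels and predictions hold one (n_samples, 2) array per task; column 1 is the positive class."""
    return {
        name: roc_auc(y[:, 1], p[:, 1])
        for name, y, p in zip(task_names, labels, predictions)
    }


# Toy one-hot labels and softmax-like predictions for a single task.
labels = [np.array([[1, 0], [0, 1], [0, 1], [1, 0]])]
preds = [np.array([[0.9, 0.1], [0.2, 0.8], [0.8, 0.2], [0.7, 0.3]])]
print(auc_per_task(labels, preds, ['income']))  # {'income': 0.75}
```

    With the model above, predictions would be the list returned by model.predict, in the same task order as the label list (sorted task names).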

    References:

    https://zhuanlan.zhihu.com/p/55752344

    https://zhuanlan.zhihu.com/p/96796043

    Multi-task learning models explained: Multi-gate Mixture-of-Experts (MMoE, Google, KDD 2018)

    MMoE paper notes (the notes walk through the dimensions)

     
  • Original article: https://www.cnblogs.com/Lee-yl/p/13274401.html