  • The transformer: basic architecture and code implementation

    Ever since Google introduced BERT in 2018, the transformer architecture has dominated NLP. BERT, which is built on the transformer, swept 11 NLP tasks at the time and set new SOTA results, and later models such as XLNet and RoBERTa also use the transformer as their core. On the well-known SOTA machine-translation leaderboards, nearly all of the top-ranked models are transformer-based. Before the transformer appeared, the field was dominated by models such as LSTMs and GRUs; compared with them, the transformer has two notable advantages:

    1. The transformer can be trained on distributed GPUs, which improves training efficiency.

    2. When analyzing and predicting long text sequences, the transformer is better at capturing semantic dependencies between tokens that are far apart.

    Given the transformer's huge success in NLP, researchers naturally wondered what would happen if it were applied to computer vision, a field whose models have long been dominated by CNNs: if the transformer could be adapted to CV, might it open a new path for vision models? Sure enough, the transformer has recently taken CV by storm as well: transformer-based vision models keep appearing at the major conferences, and quite a few of them outperform CNNs.

    In this post we start from the original transformer model and walk through it in detail.

    The transformer architecture

    The overall architecture of the transformer is shown in the figure below:

    As the figure shows, the transformer can be divided into four parts: input, output, encoder, and decoder. Taking machine translation as the example task, each part is composed as follows:

    The input part (orange region) contains:

    1. The source-text embedding layer and its position encoder

    2. The target-text embedding layer and its position encoder

    The output part (blue region) contains:

    1. A linear layer

    2. A softmax layer

    The encoder part (red region):

    1. A stack of N identical encoder layers

    2. Each encoder layer consists of two sub-layer connection structures

    3. The first sub-layer connection structure contains a multi-head self-attention layer, a normalization layer, and a residual connection

    4. The second sub-layer connection structure contains a position-wise feed-forward sub-layer, a normalization layer, and a residual connection

    The decoder part (purple region):

    1. A stack of N identical decoder layers

    2. Each decoder layer consists of three sub-layer connection structures

    3. The first sub-layer connection structure contains a (masked) multi-head self-attention sub-layer, a normalization layer, and a residual connection

    4. The second sub-layer connection structure contains a multi-head (encoder-decoder) attention sub-layer, a normalization layer, and a residual connection

    5. The third sub-layer connection structure contains a position-wise feed-forward sub-layer, a normalization layer, and a residual connection (a code-level sketch of this "Add & Norm" wiring follows right after this list)
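    To make the "Add & Norm" wiring concrete, here is a small conceptual sketch of a single encoder layer, written with the layers implemented later in this post (MultiHeadAttention, LayerNormalization, PositionWiseFeedForward). It is added here purely for illustration and is not part of the original code:

    def encoder_layer(x, masks, n_heads, model_dim, ff_dim):
        # Sub-layer 1: multi-head self-attention, then residual connection + layer norm
        attn_out = MultiHeadAttention(n_heads, model_dim // n_heads)([x, x, x, masks])
        x = LayerNormalization()(x + attn_out)
        # Sub-layer 2: position-wise feed-forward, then residual connection + layer norm
        ff_out = PositionWiseFeedForward(model_dim, ff_dim)(x)
        x = LayerNormalization()(x + ff_out)
        return x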

    The input part:

    The text embedding layer (Input Embedding): whether for the source text or the target text, its job is to turn the numeric token ids of the text into vector representations, so that relationships between tokens can be captured in this high-dimensional space.

    Embedding implementation:

    # Text embedding layer (the imports below are shared by all the code in this post)
    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import backend as K
    from tensorflow.keras.layers import Layer


    class Embedding(Layer):
        '''
        :param vocab: vocabulary size
        :param dim_model: embedding dimension
        '''
        def __init__(self, vocab, dim_model, **kwargs):
            self._vocab = vocab
            self._dim_model = dim_model
            super(Embedding, self).__init__(**kwargs)

        def build(self, input_shape, **kwargs):
            self.embeddings = self.add_weight(
                shape=(self._vocab, self._dim_model),
                initializer='glorot_uniform',
                name='embeddings'
            )
            super(Embedding, self).build(input_shape)

        def call(self, inputs):
            if K.dtype(inputs) != 'int32':
                inputs = K.cast(inputs, 'int32')
            embeddings = K.gather(self.embeddings, inputs)
            # scale the embeddings by sqrt(d_model), as in the original paper
            embeddings *= self._dim_model ** 0.5
            return embeddings

        def compute_output_shape(self, input_shape):
            return input_shape + (self._dim_model,)
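
    A quick usage sketch (my addition, assuming TensorFlow 2.x and the imports above; the toy sizes are arbitrary):

    token_ids = tf.constant(np.random.randint(0, 1000, size=(2, 10)))   # (batch, seq_len)
    embedded = Embedding(vocab=1000, dim_model=512)(token_ids)          # (2, 10, 512), scaled by sqrt(512)
    print(embedded.shape)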

    The position encoding layer (Position Encoding): nothing in the transformer encoder is sensitive to token order by itself, so a position encoder is added after the Embedding layer. It injects position information into the embedding tensor (the same word at different positions can carry different meanings), compensating for the otherwise missing order information.

    The PE formulas:

    $PE_{(pos,\,2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right)$

    $PE_{(pos,\,2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right)$

    Position Encoding implementation:

    # Position encoding layer
    class PositionEncoding(Layer):
        '''
        :param model_dim: embedding dimension
        '''
        def __init__(self, model_dim, **kwargs):
            self._model_dim = model_dim
            super(PositionEncoding, self).__init__(**kwargs)

        def call(self, inputs, **kwargs):
            seq_length = inputs.shape[1]
            position_encodings = np.zeros((seq_length, self._model_dim))
            for pos in range(seq_length):
                for i in range(self._model_dim):
                    position_encodings[pos, i] = pos / np.power(10000, (i - i % 2) / self._model_dim)
            position_encodings[:, 0::2] = np.sin(position_encodings[:, 0::2])  # 2i
            position_encodings[:, 1::2] = np.cos(position_encodings[:, 1::2])  # 2i+1
            position_encodings = K.cast(position_encodings, 'float32')
            # shape (seq_length, model_dim); broadcast over the batch when added to the embeddings
            return position_encodings

        def compute_output_shape(self, input_shape):
            return input_shape

    Implementation of the layer that adds the Embeddings and the Position Encodings:

    # Layer that adds the embeddings and the position encodings
    class Add(Layer):
        def __init__(self, **kwargs):
            super(Add, self).__init__(**kwargs)

        def call(self, inputs, **kwargs):
            embeddings, position_encodings = inputs
            return embeddings + position_encodings

        def compute_output_shape(self, input_shape):
            return input_shape[0]

    Implementing the encoder and decoder components

    Key concepts:

      - Mask tensor: "mask" means to cover up, and the tensor carries the mask values. Its shape can vary, and it generally contains only 0s and 1s, indicating whether a position is masked or not. Its purpose is to hide (in effect, replace) certain values in another tensor, and it is itself represented as a tensor.

      - What the mask tensor is for: in the transformer, masks are mainly used inside attention. Some entries of the attention score tensor may be computed using information from future positions. This can happen because, during training, the whole target sequence is embedded and fed in at once, whereas in principle the decoder does not produce its final output in one shot but generates it step by step, each step conditioned on the previous results. Future information could therefore be exploited ahead of time, so it has to be masked out.

      - Multi-Head Attention is composed of several self-attention "heads". From the multi-head attention diagram it may look as if "multi-head" means multiple sets of linear projection layers, but that is not the case: only one set of projections is used, i.e. three weight matrices that linearly transform Q, K, and V. These projections do not change the tensor's dimensionality, so each projection matrix is square. The multi-head behaviour only appears after the projections: each head takes its own slice of the projected tensor along the last dimension (the word-embedding dimension), so every token's representation is split across the heads at the feature level.

      - The structures of self-attention and multi-head attention are shown in the figure below. The computation uses the matrices Q (query), K (key), and V (value); the actual input is the matrix X of token representations (or the output of the previous encoder layer), and Q, K, V are obtained by applying linear transformations to this input. The attention formula itself is given right after this list.
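
    For reference, the scaled dot-product attention implemented below computes the standard formula from the original transformer paper, where $d_k$ is the dimension of the keys (here, the per-head dimension):

    $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$

    Masked positions are handled by adding a large negative number to the corresponding scores before the softmax, which drives their attention weights to practically zero.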

    Self-Attention layer implementation:

    # Scaled dot-product (self-)attention layer
    class ScaledDotProductAttention(Layer):
        def __init__(self, masking=True, future=False, dropout_rate=0., **kwargs):
            self._masking = masking
            self._future = future
            self._dropout_rate = dropout_rate
            # large negative value added to masked scores so that softmax maps them to ~0
            self._masking_num = -2 ** 32 + 1
            super(ScaledDotProductAttention, self).__init__(**kwargs)

        def mask(self, inputs, masks):
            # padding mask: masks is 1/True at padding positions
            masks = K.cast(masks, 'float32')
            masks = K.tile(masks, [K.shape(inputs)[0] // K.shape(masks)[0], 1])
            masks = K.expand_dims(masks, 1)
            outputs = inputs + masks * self._masking_num
            return outputs

        def future_mask(self, inputs):
            # look-ahead mask: each position may only attend to itself and earlier positions
            diag_vals = tf.ones_like(inputs[0, :, :])
            tril = tf.linalg.LinearOperatorLowerTriangular(diag_vals).to_dense()
            future_masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(inputs)[0], 1, 1])
            paddings = tf.ones_like(future_masks) * self._masking_num
            outputs = tf.where(tf.equal(future_masks, 0), paddings, inputs)
            return outputs

        def call(self, inputs, **kwargs):
            if self._masking:
                assert len(inputs) == 4, "inputs should be set [queries, keys, values, masks]."
                queries, keys, values, masks = inputs
            else:
                assert len(inputs) == 3, "inputs should be set [queries, keys, values]."
                queries, keys, values = inputs

            if K.dtype(queries) != 'float32':  queries = K.cast(queries, 'float32')
            if K.dtype(keys) != 'float32':  keys = K.cast(keys, 'float32')
            if K.dtype(values) != 'float32':  values = K.cast(values, 'float32')

            matmul = K.batch_dot(queries, tf.transpose(keys, [0, 2, 1]))  # MatMul
            scaled_matmul = matmul / int(queries.shape[-1]) ** 0.5  # Scale
            if self._masking:
                scaled_matmul = self.mask(scaled_matmul, masks)  # Mask (opt.)

            if self._future:
                scaled_matmul = self.future_mask(scaled_matmul)

            softmax_out = K.softmax(scaled_matmul)  # SoftMax
            # Dropout (note: applied unconditionally, i.e. also at inference time)
            out = K.dropout(softmax_out, self._dropout_rate)

            outputs = K.batch_dot(out, values)

            return outputs

        def compute_output_shape(self, input_shape):
            return input_shape
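
    A small sanity check of the look-ahead mask (my own sketch with toy tensors): with future=True, an output position should not depend on values at later positions.

    q = k = v = tf.random.normal((1, 4, 8))                       # (batch, seq_len, depth)
    attn = ScaledDotProductAttention(masking=False, future=True, dropout_rate=0.)
    out = attn([q, k, v])                                         # (1, 4, 8)

    # Perturb only the last position of V: the first three output positions are unchanged.
    v2 = tf.concat([v[:, :3, :], v[:, 3:, :] + 1.0], axis=1)
    out2 = attn([q, k, v2])
    print(np.allclose(out[:, :3, :], out2[:, :3, :]))             # True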

    Multi-Head Attention layer implementation:

    # Multi-head attention layer
    class MultiHeadAttention(Layer):

        def __init__(self, n_heads, head_dim, dropout_rate=.1, masking=True, future=False, trainable=True, **kwargs):
            self._n_heads = n_heads
            self._head_dim = head_dim
            self._dropout_rate = dropout_rate
            self._masking = masking
            self._future = future
            self._trainable = trainable
            super(MultiHeadAttention, self).__init__(**kwargs)

        # Linear projections for Q, K, V; with n_heads * head_dim equal to the model
        # dimension these weight matrices are square and the dimensionality is unchanged
        def build(self, input_shape):
            self._weights_queries = self.add_weight(
                shape=(input_shape[0][-1], self._n_heads * self._head_dim),
                initializer='glorot_uniform',
                trainable=self._trainable,
                name='weights_queries')
            self._weights_keys = self.add_weight(
                shape=(input_shape[1][-1], self._n_heads * self._head_dim),
                initializer='glorot_uniform',
                trainable=self._trainable,
                name='weights_keys')
            self._weights_values = self.add_weight(
                shape=(input_shape[2][-1], self._n_heads * self._head_dim),
                initializer='glorot_uniform',
                trainable=self._trainable,
                name='weights_values')
            super(MultiHeadAttention, self).build(input_shape)

        def call(self, inputs, **kwargs):
            if self._masking:
                assert len(inputs) == 4, "inputs should be set [queries, keys, values, masks]."
                queries, keys, values, masks = inputs
            else:
                assert len(inputs) == 3, "inputs should be set [queries, keys, values]."
                queries, keys, values = inputs

            queries_linear = K.dot(queries, self._weights_queries)
            keys_linear = K.dot(keys, self._weights_keys)
            values_linear = K.dot(values, self._weights_values)

            # Split the projected Q, K, V along the embedding dimension into n_heads heads
            # and stack the heads along the batch axis
            queries_multi_heads = tf.concat(tf.split(queries_linear, self._n_heads, axis=2), axis=0)
            keys_multi_heads = tf.concat(tf.split(keys_linear, self._n_heads, axis=2), axis=0)
            values_multi_heads = tf.concat(tf.split(values_linear, self._n_heads, axis=2), axis=0)

            if self._masking:
                att_inputs = [queries_multi_heads, keys_multi_heads, values_multi_heads, masks]
            else:
                att_inputs = [queries_multi_heads, keys_multi_heads, values_multi_heads]

            attention = ScaledDotProductAttention(
                masking=self._masking, future=self._future, dropout_rate=self._dropout_rate)
            att_out = attention(att_inputs)

            # Merge the heads back: undo the batch-axis stacking and concatenate along the feature axis
            outputs = tf.concat(tf.split(att_out, self._n_heads, axis=0), axis=2)

            return outputs

        def compute_output_shape(self, input_shape):
            return input_shape
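
    Shape-wise, multi-head attention maps (batch, seq_len, model_dim) back to the same shape, provided n_heads * head_dim equals model_dim. A minimal sketch with assumed toy sizes:

    x = tf.random.normal((2, 10, 512))
    padding_masks = tf.zeros((2, 10), dtype=tf.bool)               # no padding positions in this toy batch
    mha = MultiHeadAttention(n_heads=8, head_dim=64)               # 8 * 64 == 512
    y = mha([x, x, x, padding_masks])                              # self-attention, output shape (2, 10, 512)
    print(y.shape)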

    Position-wise Feed Forward implementation:

    # Position-wise feed-forward layer
    # out = relu(x·W1 + b1)·W2 + b2
    class PositionWiseFeedForward(Layer):

        def __init__(self, model_dim, inner_dim, trainable=True, **kwargs):
            self._model_dim = model_dim
            self._inner_dim = inner_dim
            self._trainable = trainable
            super(PositionWiseFeedForward, self).__init__(**kwargs)

        def build(self, input_shape):
            self.weights_inner = self.add_weight(
                shape=(input_shape[-1], self._inner_dim),
                initializer='glorot_uniform',
                trainable=self._trainable,
                name="weights_inner")
            self.weights_out = self.add_weight(
                shape=(self._inner_dim, self._model_dim),
                initializer='glorot_uniform',
                trainable=self._trainable,
                name="weights_out")
            self.bias_inner = self.add_weight(
                shape=(self._inner_dim,),
                initializer='uniform',
                trainable=self._trainable,
                name="bias_inner")
            self.bias_out = self.add_weight(
                shape=(self._model_dim,),
                initializer='uniform',
                trainable=self._trainable,
                name="bias_out")
            super(PositionWiseFeedForward, self).build(input_shape)

        def call(self, inputs, **kwargs):
            if K.dtype(inputs) != 'float32':
                inputs = K.cast(inputs, 'float32')
            inner_out = K.relu(K.dot(inputs, self.weights_inner) + self.bias_inner)
            outputs = K.dot(inner_out, self.weights_out) + self.bias_out
            return outputs

        def compute_output_shape(self, input_shape):
            return input_shape[:-1] + (self._model_dim,)

    Normalization implementation:

    # Layer normalization
    class LayerNormalization(Layer):

        def __init__(self, epsilon=1e-8, **kwargs):
            self._epsilon = epsilon
            super(LayerNormalization, self).__init__(**kwargs)

        def build(self, input_shape):
            # learnable shift (beta) and scale (gamma), applied per feature
            self.beta = self.add_weight(
                shape=(input_shape[-1],),
                initializer='zeros',
                name='beta')
            self.gamma = self.add_weight(
                shape=(input_shape[-1],),
                initializer='ones',
                name='gamma')
            super(LayerNormalization, self).build(input_shape)

        def call(self, inputs, **kwargs):
            # normalize over the last (feature) dimension
            mean, variance = tf.nn.moments(inputs, [-1], keepdims=True)
            normalized = (inputs - mean) / ((variance + self._epsilon) ** 0.5)
            outputs = self.gamma * normalized + self.beta
            return outputs

        def compute_output_shape(self, input_shape):
            return input_shape
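
    A quick numerical check (my own sketch): since gamma and beta are initialized to 1 and 0, the features of each position should come out with roughly zero mean and unit variance.

    x = tf.random.normal((2, 5, 8)) * 3.0 + 4.0
    y = LayerNormalization()(x)
    mean, var = tf.nn.moments(y, axes=[-1])
    print(float(tf.reduce_mean(mean)), float(tf.reduce_mean(var)))   # ~0.0 and ~1.0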

    Full Transformer implementation:

    class Transformer(Layer):
        def __init__(self, vocab_size, model_dim, n_heads=8, encoder_stack=6, decoder_stack=6, feed_forward_size=2048, dropout=0.1, **kwargs):
            self._vocab_size = vocab_size
            self._model_dim = model_dim
            self._n_heads = n_heads
            self._encoder_stack = encoder_stack
            self._decoder_stack = decoder_stack
            self._feed_forward_size = feed_forward_size
            self._dropout_rate = dropout
            super(Transformer, self).__init__(**kwargs)

        def build(self, input_shape):
            self.embeddings = self.add_weight(
                shape=(self._vocab_size, self._model_dim),
                initializer='glorot_uniform',
                trainable=True,
                name="embeddings")
            super(Transformer, self).build(input_shape)

        def encoder(self, inputs):
            if K.dtype(inputs) != 'int32':
                inputs = K.cast(inputs, 'int32')

            # padding mask: True where the input token id is 0 (padding)
            masks = K.equal(inputs, 0)
            # Embeddings
            embeddings = Embedding(self._vocab_size, self._model_dim)(inputs)
            # Position Encodings
            position_encodings = PositionEncoding(self._model_dim)(embeddings)
            # Embeddings + Position Encodings
            encodings = embeddings + position_encodings
            # Dropout
            encodings = K.dropout(encodings, self._dropout_rate)

            # Encoder stack
            for i in range(self._encoder_stack):
                # Multi-head Attention
                attention = MultiHeadAttention(self._n_heads, self._model_dim // self._n_heads)
                attention_input = [encodings, encodings, encodings, masks]
                attention_out = attention(attention_input)
                # Add & Norm
                attention_out += encodings
                attention_out = LayerNormalization()(attention_out)
                # Feed-Forward
                pwff = PositionWiseFeedForward(self._model_dim, self._feed_forward_size)
                pwff_out = pwff(attention_out)
                # Add & Norm
                pwff_out += attention_out
                encodings = LayerNormalization()(pwff_out)

            return encodings, masks

        def decoder(self, inputs):
            decoder_inputs, encoder_encodings, encoder_masks = inputs
            if K.dtype(decoder_inputs) != 'int32':
                decoder_inputs = K.cast(decoder_inputs, 'int32')
            decoder_masks = K.equal(decoder_inputs, 0)
            # Embeddings
            embeddings = Embedding(self._vocab_size, self._model_dim)(decoder_inputs)
            # Position Encodings
            position_encodings = PositionEncoding(self._model_dim)(embeddings)
            # Embeddings + Position Encodings
            encodings = embeddings + position_encodings
            # Dropout
            encodings = K.dropout(encodings, self._dropout_rate)

            # Decoder stack
            for i in range(self._decoder_stack):
                # Masked Multi-head Attention (look-ahead mask on the target sequence)
                masked_attention = MultiHeadAttention(self._n_heads, self._model_dim // self._n_heads, future=True)
                masked_attention_input = [encodings, encodings, encodings, decoder_masks]
                masked_attention_out = masked_attention(masked_attention_input)
                # Add & Norm
                masked_attention_out += encodings
                masked_attention_out = LayerNormalization()(masked_attention_out)

                # Multi-head Attention over the encoder output (encoder-decoder attention)
                attention = MultiHeadAttention(self._n_heads, self._model_dim // self._n_heads)
                attention_input = [masked_attention_out, encoder_encodings, encoder_encodings, encoder_masks]
                attention_out = attention(attention_input)
                # Add & Norm
                attention_out += masked_attention_out
                attention_out = LayerNormalization()(attention_out)

                # Feed-Forward
                pwff = PositionWiseFeedForward(self._model_dim, self._feed_forward_size)
                pwff_out = pwff(attention_out)
                # Add & Norm
                pwff_out += attention_out
                encodings = LayerNormalization()(pwff_out)

            # Pre-softmax projection with the transposed embedding matrix
            # (note: as written, the Embedding layers above create their own weights,
            # so only this projection actually uses self.embeddings)
            linear_projection = K.dot(encodings, K.transpose(self.embeddings))
            outputs = K.softmax(linear_projection)
            return outputs

        def call(self, inputs, **kwargs):
            encoder_inputs, decoder_inputs = inputs
            encoder_encodings, encoder_masks = self.encoder(encoder_inputs)
            decoder_outputs = self.decoder([decoder_inputs, encoder_encodings, encoder_masks])
            return decoder_outputs

        def compute_output_shape(self, input_shape):
            return (input_shape[0][0], input_shape[0][1], self._vocab_size)
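
    Finally, a single forward pass through the whole model (a sketch with assumed toy sizes; note that, as written, the sub-layers are instantiated inside call, so this only demonstrates one forward pass rather than a full training setup):

    vocab_size, model_dim, seq_len = 100, 64, 10
    src = tf.constant(np.random.randint(1, vocab_size, size=(2, seq_len)))   # source token ids
    tgt = tf.constant(np.random.randint(1, vocab_size, size=(2, seq_len)))   # target token ids
    transformer = Transformer(vocab_size, model_dim, n_heads=8,
                              encoder_stack=2, decoder_stack=2, feed_forward_size=128)
    probs = transformer([src, tgt])    # (2, seq_len, vocab_size): per-token distribution over the vocabulary
    print(probs.shape)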

    In the next post we will use the transformer to build a BERT network for a hands-on text sentiment classification task.
