
    Attention is all you need

the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.

The Transformer relies solely on the attention mechanism, completely discarding the RNN and CNN structures used before.

Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.

Training is very compute-intensive, which is why many later researchers have worked on model compression.

Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states $h_t$, as a function of the previous hidden state $h_{t-1}$ and the input for position $t$. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.

In an RNN, $h_t$ is influenced by both $x_t$ and $h_{t-1}$.

But RNN-style models can only be computed step by step, from left to right or right to left, so they lack a direct way to model global dependencies.
They also mainly capture short-range dependencies and do not fully solve the vanishing-gradient and long-range dependency problems,
which is why LSTM and GRU appeared (see the small recurrence sketch below).
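A minimal NumPy sketch of the recurrence described above, with made-up shapes and names, just to show why the loop over time steps cannot be parallelized within one example: each $h_t$ needs $h_{t-1}$ first.

```python
import numpy as np

def rnn_forward(x, W_x, W_h, b):
    """Vanilla RNN: h_t = tanh(x_t W_x + h_{t-1} W_h + b).

    x: (seq_len, d_in) input sequence; returns (seq_len, d_h) hidden states.
    The loop is inherently sequential: h_t cannot be computed before h_{t-1}.
    """
    seq_len = x.shape[0]
    d_h = W_h.shape[0]
    h = np.zeros(d_h)
    states = []
    for t in range(seq_len):  # sequential dependency, no parallelism over t
        h = np.tanh(x[t] @ W_x + h @ W_h + b)
        states.append(h)
    return np.stack(states)

# toy usage with hypothetical sizes
rng = np.random.default_rng(0)
x = rng.normal(size=(10, 16))            # 10 time steps, 16-dim inputs
W_x = rng.normal(size=(16, 32)) * 0.1
W_h = rng.normal(size=(32, 32)) * 0.1
b = np.zeros(32)
print(rnn_forward(x, W_x, W_h, b).shape)  # (10, 32)
```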

Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences.

In sequence transduction models, attention allows dependencies to be modeled without regard to their distance in the input or output sequences.

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [16], ByteNet [18] and ConvS2S [9], all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions [12]. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention.

Earlier work used convolutional neural networks as the basic building block for sequence modeling, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows with the distance between them, linearly for ConvS2S and logarithmically for ByteNet, which makes it harder to learn dependencies between distant positions.

In the Transformer this is reduced to a constant number of operations, although averaging the attention-weighted positions reduces the effective resolution; multi-head attention is used to counteract that effect.

So is multi-head attention meant to offset this growth? Before tackling that, it is worth understanding how the CNN-based models cause this cost to grow in the first place. (Strictly, the paper is talking about the number of operations growing with distance, not the parameter count.)

Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.

Self-attention (sometimes called intra-attention) is an attention mechanism that relates different positions of a single sequence in order to compute a representation of that sequence.

See the earlier write-up on attention mechanisms for background.

Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

The Transformer is the first transduction model that relies entirely on self-attention to compute representations of its input and output, without using sequence-aligned RNNs or convolutions.

The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.

The Transformer described below is built on self-attention and point-wise, fully connected layers (point-wise: attention is computed with dot products, and the same transformation is applied to every output position independently, like a convolution with kernel size 1).

The Transformer's architecture likewise consists of an encoder and a decoder.

encoder: The encoder is composed of a stack of $N = 6$ identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$, where $\mathrm{Sublayer}(x)$ is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension $d_{\text{model}} = 512$.

The encoder consists of a stack of 6 identical layers, each with two sub-layers: the first is a multi-head self-attention mechanism, and the second is a position-wise fully connected feed-forward network (explained further below).

A residual connection is applied around each of the two sub-layers, followed by layer normalization. (Worth reviewing LayerNorm: within a layer, the neurons of the same sample share one mean and variance, and different samples have different statistics; BatchNorm instead computes the mean and variance per neuron across a batch, so all samples in the batch share the same statistics. LayerNorm therefore does not depend on the batch size or on the input sequence length, so it can be used with batch size 1 and to normalize variable-length sequences in RNNs.) Figure 1 illustrates this quite clearly.

So the output of each sub-layer is $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$; the sub-layers and the embedding layers all produce outputs of dimension 512, which makes the residual connections straightforward (worth reviewing residual connections; a small sketch of the pattern follows).
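A minimal NumPy sketch of that sub-layer pattern, $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$, normalizing over the feature (last) dimension; the learned gain/bias of a full LayerNorm are omitted and all names here are illustrative, not from the paper's code.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize each position over its feature dimension (last axis)
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer):
    # the encoder/decoder pattern: LayerNorm(x + Sublayer(x))
    return layer_norm(x + sublayer(x))

# toy usage on a (seq_len, d_model) tensor with a dummy sub-layer
x = np.random.default_rng(0).normal(size=(5, 512))
print(residual_sublayer(x, lambda t: 0.1 * t).shape)  # (5, 512)
```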

Decoder: The decoder is also composed of a stack of $N = 6$ identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization.
The decoder is likewise made of 6 identical blocks; besides the two sub-layers found in each encoder layer, the decoder inserts a third sub-layer that performs multi-head attention over the output of the encoder stack, and residual connections plus layer normalization are again used around each sub-layer.

We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position $i$ can depend only on the known outputs at positions less than $i$.

The self-attention sub-layer in the decoder stack is modified to prevent positions from attending to subsequent positions. This masking, combined with the output embeddings being offset by one position, ensures that the prediction for position $i$ depends only on the known outputs at positions less than $i$. (It feels a bit like the first-order Markov assumption in HMMs? Not quite: position $i$ may depend on all earlier positions, not only $i-1$.)
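A tiny sketch of such a mask (my own illustration, not the paper's code): entries for positions $j > i$ are set to $-\infty$ so that, once added to the attention scores before the softmax, their weights become zero.

```python
import numpy as np

def causal_mask(seq_len):
    # 0 where attending is allowed (j <= i), -inf for future positions (j > i)
    allowed = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    return np.where(allowed, 0.0, -np.inf)

print(causal_mask(4))
# the mask is added to Q K^T / sqrt(d_k) before the softmax,
# so each position i can only attend to positions <= i
```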

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

Attention can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values and output are all vectors. The output is a weighted sum of the values, and the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

We call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of queries and keys of dimension $d_k$, and values of dimension $d_v$. We compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$, and apply a softmax function to obtain the weights on the values.

Scaled Dot-Product Attention: the input consists of queries and keys of dimension $d_k$ and values of dimension $d_v$; the dot products of the query with all keys are computed, each is divided by $\sqrt{d_k}$, and a softmax is applied to obtain the weights on the values.

In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix $Q$. The keys and values are also packed together into matrices $K$ and $V$. We compute the matrix of outputs as:

The queries, keys and values correspond to the three matrices Q, K and V, so attention becomes the following matrix computation (a NumPy sketch follows the formula):

\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V \]
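A direct NumPy transcription of this formula (a sketch; the shapes and names are illustrative, and the optional mask argument is the causal mask sketched earlier):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n_q, n_k)
    if mask is not None:              # e.g. the causal mask above
        scores = scores + mask
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # (n_q, d_v)

# toy usage
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 64))   # 4 queries of dimension d_k = 64
K = rng.normal(size=(6, 64))   # 6 keys
V = rng.normal(size=(6, 32))   # 6 values of dimension d_v = 32
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 32)
```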

The Transformer uses multi-head attention in three places:

1. encoder-decoder attention: the inputs are the encoder output and the decoder's self-attention output; the encoder output serves as the keys and values, and the decoder's self-attention output serves as the queries.
2. encoder self-attention: Q, K and V are all the encoder's input embedding plus positional embedding.
3. decoder self-attention: in the decoder's self-attention layer, each position can attend only to the preceding positions; Q, K and V are all the decoder's input embedding plus positional embedding.
  **Note:** In an ordinary attention model, Q is the decoder hidden state while K and V are the encoder hidden states. Self-attention simply takes Q, K and V to be the same, namely the encoder's or decoder's input embedding plus positional embedding; concretely, "the network input is three identical vectors q, k and v, obtained by adding the word embedding and the position embedding" (CSDN).

The two most commonly used attention functions are additive attention [2], and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of $\frac{1}{\sqrt{d_k}}$. Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.

There are two common ways to compute attention: the dot-product form and the additive form (see reference [2]). The Transformer uses the dot-product form, with an extra scaling factor of $\frac{1}{\sqrt{d_k}}$.
Additive attention computes the compatibility function with a feed-forward network that has a single hidden layer; in practice the dot-product form is faster and more memory-efficient.

While for small values of $d_k$ the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of $d_k$ [3]. We suspect that for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients (see footnote 4 of the paper). To counteract this effect, we scale the dot products by $\frac{1}{\sqrt{d_k}}$.

While the two mechanisms perform similarly for small values of $d_k$, for larger $d_k$ additive attention outperforms dot-product attention without scaling. The suspicion is that for large $d_k$ the dot products grow large in magnitude, pushing the softmax into regions with extremely small gradients, so to counteract this effect the dot products are scaled by $\frac{1}{\sqrt{d_k}}$.

I found this opaque at first, but the paper's footnote spells it out: if the components of $q$ and $k$ are independent random variables with mean 0 and variance 1, then $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$ has mean 0 and variance $d_k$, so without scaling the logits grow with $d_k$ and the softmax saturates.
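A quick numerical check of that argument (plain NumPy; the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (4, 64, 512):
    q = rng.normal(size=(10_000, d_k))      # components: mean 0, variance 1
    k = rng.normal(size=(10_000, d_k))
    dots = (q * k).sum(axis=1)              # 10,000 sampled dot products
    print(d_k, dots.var(), (dots / np.sqrt(d_k)).var())
    # the variance of q.k grows like d_k; dividing by sqrt(d_k) keeps it near 1,
    # so the softmax is not pushed into its saturated, tiny-gradient region
```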

Multi-Head Attention: Instead of performing a single attention function with $d_{\text{model}}$-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values $h$ times with different, learned linear projections to $d_k$, $d_k$ and $d_v$ dimensions, respectively. On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding $d_v$-dimensional output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2.

Compared with performing a single attention function on $d_{\text{model}}$-dimensional keys, values and queries, multi-head attention projects Q, K and V with $h$ different learned linear transformations, runs attention on each projection, and finally concatenates the different attention results and projects them once more — a bit like having multiple CNN filters, isn't it.

    Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions; with a single attention head, averaging inhibits this.

\[ \begin{aligned} \mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}\left(\mathrm{head}_{1}, \ldots, \mathrm{head}_{h}\right) W^{O} \\ \text{where } \mathrm{head}_{i} &= \mathrm{Attention}\left(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}\right) \end{aligned} \]

Where the projections are parameter matrices $W_{i}^{Q} \in \mathbb{R}^{d_{\text{model}} \times d_{k}}$, $W_{i}^{K} \in \mathbb{R}^{d_{\text{model}} \times d_{k}}$, $W_{i}^{V} \in \mathbb{R}^{d_{\text{model}} \times d_{v}}$ and $W^{O} \in \mathbb{R}^{h d_{v} \times d_{\text{model}}}$.

In this work we employ $h = 8$ parallel attention layers, or heads. For each of these we use $d_{k} = d_{v} = d_{\text{model}} / h = 64$. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.

Here the number of heads is $h = 8$, with $d_k = d_v = 64$ per head, so the total cost stays close to single-head attention at full dimensionality.
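A minimal NumPy sketch of the multi-head formula with $h = 8$ and $d_k = d_v = 64$ (the weight shapes follow the paper; the initialization, variable names and the tiny inline attention helper are only illustrative):

```python
import numpy as np

def attention(q, k, v):
    # softmax(q k^T / sqrt(d_k)) v
    s = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    # head_i = Attention(Q W_q[i], K W_k[i], V W_v[i]); concat heads, then project with W_o
    heads = [attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i]) for i in range(W_q.shape[0])]
    return np.concatenate(heads, axis=-1) @ W_o

# toy self-attention usage: d_model = 512, h = 8, d_k = d_v = 64
rng = np.random.default_rng(0)
d_model, h, d_k = 512, 8, 64
x = rng.normal(size=(10, d_model))                 # Q = K = V = x for self-attention
W_q = rng.normal(size=(h, d_model, d_k)) * 0.02
W_k = rng.normal(size=(h, d_model, d_k)) * 0.02
W_v = rng.normal(size=(h, d_model, d_k)) * 0.02
W_o = rng.normal(size=(h * d_k, d_model)) * 0.02
print(multi_head_attention(x, x, x, W_q, W_k, W_v, W_o).shape)  # (10, 512)
```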

    Position-wise Feed-Forward Networks:In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.

Position-wise Feed-Forward Networks: a fully connected feed-forward network applied to each position separately and identically, consisting of two linear transformations with a ReLU in between.

\[ \mathrm{FFN}(x) = \max\left(0, x W_{1} + b_{1}\right) W_{2} + b_{2} \]

While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is $d_{\text{model}} = 512$, and the inner-layer has dimensionality $d_{ff} = 2048$.

The position-wise fully connected feed-forward network is essentially a variant of an MLP. It is position-wise (input and output have the same dimensionality) because it processes the attention output of each position $i$ separately. The hidden size changes as 768 → 3072 → 768 (or 512 → 2048 → 512).

The position-wise feed-forward network is really just an MLP (multi-layer perceptron): at each position, the $d_{\text{model}}$-dimensional vector $x$ is first mapped by $x W_1 + b_1$ to a $d_{ff}$-dimensional $x'$, and then $\max(0, x') W_2 + b_2$ maps it back to $d_{\text{model}}$ dimensions.
The feed-forward network has two dense layers: the first uses a ReLU activation (or its smoother variant, the Gaussian Error Linear Unit, GELU), the second is linear. If the multi-head attention output is denoted $Z$, the FFN can be written as:

\[ \mathrm{FFN}(Z) = \max\left(0, Z W_{1} + b_{1}\right) W_{2} + b_{2} \]

Afterwards, dropout is applied to the hidden layer, followed by a residual connection and normalization (over the last dimension of the tensor, i.e. the feature dimension).
The Transformer repeatedly alternates such attention layers and ordinary non-linear layers over the input text to obtain the final text representation. (CSDN)

So now it is clear: the input goes through an attention layer and then an ordinary fully connected layer with a ReLU activation; a small sketch follows.
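A NumPy sketch of the position-wise feed-forward network above, with $d_{\text{model}} = 512$ and $d_{ff} = 2048$ as in the paper (the random initialization is just for illustration):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048
x = rng.normal(size=(10, d_model))                 # 10 positions
W1 = rng.normal(size=(d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)
print(position_wise_ffn(x, W1, b1, W2, b2).shape)  # (10, 512)
```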

Embeddings and Softmax: Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension $d_{\text{model}}$. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to [30]. In the embedding layers, we multiply those weights by $\sqrt{d_{\text{model}}}$.

So there is an embedding layer; going from the decoder output to the final output uses a linear transformation plus a softmax, and the two embedding layers share their weight matrix with the pre-softmax linear transformation (with the embedding weights multiplied by $\sqrt{d_{\text{model}}}$).
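A small sketch of that weight sharing (the shared matrix E and the vocabulary size here are hypothetical; only the sharing and the $\sqrt{d_{\text{model}}}$ scaling come from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 1000, 512
E = rng.normal(size=(vocab, d_model)) * 0.02   # shared embedding matrix

tokens = np.array([3, 17, 42])
x = E[tokens] * np.sqrt(d_model)               # embedding lookup, scaled by sqrt(d_model)

h = x                                          # stand-in for the decoder output
logits = h @ E.T                               # pre-softmax linear layer reuses E (tied weights)
print(logits.shape)                            # (3, 1000)
```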

Positional Encoding: Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension $d_{\text{model}}$ as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed.

The Transformer drops the RNN, whose biggest strength is modeling the temporal order of the data, so the authors propose two kinds of Positional Encoding; the encoding is summed with the input embedding, injecting relative position information.

Two positional encoding options:

• Compute it directly with sine and cosine functions of different frequencies.
• Learn a positional embedding. Note that the positional embedding is the same for every example in a batch, i.e. it is broadcast along the batch dimension.
  Experiments showed the two give essentially the same results, so the first method was chosen; the formula is:

\[ \begin{aligned} PE_{(pos, 2i)} &= \sin\left(pos / 10000^{2i/d_{\text{model}}}\right) \\ PE_{(pos, 2i+1)} &= \cos\left(pos / 10000^{2i/d_{\text{model}}}\right) \end{aligned} \]

The encoding at any position, $PE_{pos+k}$, can be expressed as a linear function of $PE_{pos}$. In NLP tasks, besides a word's absolute position, its relative position is also very important. By the identities $\sin(\alpha+\beta) = \sin\alpha\cos\beta + \cos\alpha\sin\beta$ and $\cos(\alpha+\beta) = \cos\alpha\cos\beta - \sin\alpha\sin\beta$, the position vector at $pos+k$ is a linear transformation of the position vector at $pos$, which makes it much easier for the model to capture relative positions between words.
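Spelling that out with the formulas above (writing $\omega_i = 1/10000^{2i/d_{\text{model}}}$ for the frequency of dimension pair $i$):

\[ \begin{aligned} PE_{(pos+k,\,2i)} &= \sin(\omega_i(pos+k)) = PE_{(pos,2i)}\cos(\omega_i k) + PE_{(pos,2i+1)}\sin(\omega_i k) \\ PE_{(pos+k,\,2i+1)} &= \cos(\omega_i(pos+k)) = PE_{(pos,2i+1)}\cos(\omega_i k) - PE_{(pos,2i)}\sin(\omega_i k) \end{aligned} \]

so for a fixed offset $k$, each pair of dimensions of $PE_{pos+k}$ is a fixed rotation of the corresponding pair in $PE_{pos}$.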

1. A learned positional embedding would, like word vectors, be limited by its table size, i.e. it could only learn representations such as "position 2 maps to (1,1,1,2)" for positions it has seen. The trigonometric formula is clearly not limited by sequence length, so it can represent sequences longer than any encountered during training. A NumPy sketch of the encoding is below.
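A small NumPy sketch that builds the sinusoidal table from the formula above (the maximum length and $d_{\text{model}}$ here are arbitrary):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(max_len)[:, None]             # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]         # even feature indices 0, 2, 4, ...
    angle = pos / np.power(10000, i / d_model)    # (max_len, d_model / 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                   # even dimensions
    pe[:, 1::2] = np.cos(angle)                   # odd dimensions
    return pe

pe = positional_encoding(50, 512)
print(pe.shape)   # (50, 512) -- same d_model as the embeddings, so the two can be summed
```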

A good explanation of why the Transformer's attention works: the core idea is to compute, for every word in a sentence, its relation to all the other words in that sentence; these pairwise relations reflect, to some degree, the relevance and relative importance of the words. Using these relations to re-weight each word produces a new representation for it, one that encodes not only the word itself but also its relations to every other word, and is therefore a more global representation than a plain word vector. With the attention mechanism, the distance between any two positions in the sequence is reduced to a constant.
After attention there is another linear dense layer: the multi-head attention output passes through a dense layer of hidden size 768, dropout is applied to the hidden layer, and finally a residual connection and normalization (over the last, feature dimension of the tensor) are added.

A very well-put summary.

• Original article: https://www.cnblogs.com/gaowenxingxing/p/13501614.html