Sinusoidal Position Encoding
\[
\begin{aligned}
PE_{(pos, 2i)} &= \sin\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right) \\
PE_{(pos, 2i+1)} &= \cos\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right)
\end{aligned}
\]
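The formulas above map directly to a few lines of code. Below is a minimal sketch in NumPy; the function name and arguments are illustrative, not from any library:

import numpy as np

def sinusoidal_position_encoding(max_len, d_model):
    # positions pos: (max_len, 1); dimension indices i: (1, d_model/2)
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angle = pos / np.power(10000, 2 * i / d_model)  # pos * w_i
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)  # even dimensions use sin
    pe[:, 1::2] = np.cos(angle)  # odd dimensions use cos
    return pe

pe = sinusoidal_position_encoding(max_len=512, d_model=768)
print(pe.shape)  # (512, 768)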
The encoding at position pos + k can be expressed as a linear function of the encoding at position pos. The relationship follows from the trigonometric angle-addition identities:
\[
\begin{aligned}
\sin(\alpha+\beta) &= \sin\alpha \cos\beta + \cos\alpha \sin\beta \\
\cos(\alpha+\beta) &= \cos\alpha \cos\beta - \sin\alpha \sin\beta
\end{aligned}
\]
The positional encoding at position pos + k can therefore be written as:
\[
\begin{aligned}
PE_{(pos+k, 2i)} &= \sin\left(w_i \cdot (pos+k)\right) = \sin\left(w_i\, pos\right) \cos\left(w_i k\right) + \cos\left(w_i\, pos\right) \sin\left(w_i k\right) \\
PE_{(pos+k, 2i+1)} &= \cos\left(w_i \cdot (pos+k)\right) = \cos\left(w_i\, pos\right) \cos\left(w_i k\right) - \sin\left(w_i\, pos\right) \sin\left(w_i k\right)
\end{aligned}
\]
where
\[
w_i = \frac{1}{10000^{2i/d_{\text{model}}}}
\]
Simplifying:
\[
\begin{aligned}
PE_{(pos+k, 2i)} &= \cos\left(w_i k\right) PE_{(pos, 2i)} + \sin\left(w_i k\right) PE_{(pos, 2i+1)} \\
PE_{(pos+k, 2i+1)} &= \cos\left(w_i k\right) PE_{(pos, 2i+1)} - \sin\left(w_i k\right) PE_{(pos, 2i)}
\end{aligned}
\]
Since all the terms involving k are constants for a fixed offset k, \(PE_{pos+k}\) can be expressed as a linear combination of \(PE_{pos}\).
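This linear relation can be verified numerically. The sketch below reuses the hypothetical sinusoidal_position_encoding helper from above and checks the two simplified identities:

import numpy as np

d_model, pos, k = 768, 10, 5
pe = sinusoidal_position_encoding(max_len=64, d_model=d_model)
i = np.arange(d_model // 2)
w = 1.0 / np.power(10000, 2 * i / d_model)  # w_i

# PE(pos+k) reconstructed from PE(pos) using only constants that depend on k
even = np.cos(w * k) * pe[pos, 0::2] + np.sin(w * k) * pe[pos, 1::2]
odd  = np.cos(w * k) * pe[pos, 1::2] - np.sin(w * k) * pe[pos, 0::2]
print(np.allclose(even, pe[pos + k, 0::2]))  # True
print(np.allclose(odd,  pe[pos + k, 1::2]))  # True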
Since
\[
PE_{(pos, 2i)} = \sin\left(pos \cdot \frac{1}{10000^{\frac{2i}{d_{\text{model}}}}}\right)
\]
the period of dimension \(2i\) is
\[
T = 2\pi \cdot 10000^{\frac{2i}{d_{\text{model}}}}
\]
Therefore the larger i is, the longer the period; the periods range from \(2\pi\) to \(2\pi \cdot 10000\).
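A quick illustrative computation of the period for a few dimension indices (d_model = 768 assumed here):

import numpy as np

d_model = 768
for i in [0, 96, 192, 383]:
    T = 2 * np.pi * 10000 ** (2 * i / d_model)
    print(f"i={i:3d}  period={T:.1f}")
# i=0 gives 2*pi ~= 6.28; i near d_model/2 approaches 2*pi*10000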
Positional encoding in BERT
Source code:
import torch.nn as nn

class BertEmbeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size,
                                            padding_idx=config.pad_token_id)  # (vocab_size, hidden_size)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)  # (512, hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)  # (2, hidden_size)
        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
        # any TensorFlow checkpoint file
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)  # alias of nn.LayerNorm in older transformers versions
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
The embedding in BERT is the sum of three embeddings, and its positional encoding does not use the Transformer's sinusoidal functions; instead, it is learned through an Embedding layer.
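The corresponding forward pass sums the three embeddings and then applies LayerNorm and dropout. Below is a simplified sketch of that method based on the same source; argument handling in the real code is more involved:

import torch

def forward(self, input_ids, token_type_ids):
    seq_length = input_ids.size(1)
    # learned position ids: 0, 1, ..., seq_length - 1 for every sequence
    position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device)
    position_ids = position_ids.unsqueeze(0).expand_as(input_ids)

    # sum of the three embeddings, then LayerNorm and dropout
    embeddings = (self.word_embeddings(input_ids)
                  + self.position_embeddings(position_ids)      # learned, not sinusoidal
                  + self.token_type_embeddings(token_type_ids))
    embeddings = self.LayerNorm(embeddings)
    return self.dropout(embeddings)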