Google BERT Pre-training Source Code Analysis (Part 2): Model Construction

Contents
Preface
Source Code Walkthrough
Model configuration parameters
BertModel
word embedding
embedding_postprocessor
Transformer
self_attention
Model usage
Preface
The BERT model is built mainly on the Transformer architecture (paper: Attention Is All You Need). It drops recurrent structures such as RNNs and handles sequence-to-sequence problems with attention alone, a nice case of "less is more". There are plenty of write-ups of this model online, but most of them repeat the same points. I recommend a Zhihu explanation of "Attention Is All You Need", which I think introduces the Transformer very well.
The most troublesome part of the model is keeping the tensor dimensions straight; once the dimensions are clear, the model is easy to understand, so in the source code I annotate the shape of the tensor after every operation.
Below we look at how the BERT model in modeling.py is built. I still believe that reading the code and its comments is the fastest way to understand it, so when some of the official comments are hard to follow, rely on my added comments and the shape annotations.

Source Code Walkthrough
Model configuration parameters
    " attention_probs_dropout_prob": 0.1, #乘法attention时,softmax后dropout概率
    "hidden_act": "gelu", #激活函数
    "hidden_dropout_prob": 0.1, #隐藏层dropout概率
    "hidden_size": 768, #隐藏单元数
    "initializer_range": 0.02, #初始化范围
    "intermediate_size": 3072, #升维维度
    "max_position_embeddings": 512, #一个大于seq_length的参数,用于生成position_embedding
    "num_attention_heads": 12, #每个隐藏层中的attention head数
    "num_hidden_layers": 12, #隐藏层数
    "type_vocab_size": 2, #segment_ids类别 [0,1]
    "vocab_size": 30522 #词典中词数
The input arguments input_ids, input_mask, and token_type_ids here correspond to the input_ids, input_mask, and segment_ids produced in the previous article.
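For illustration, here is a minimal sketch (my own addition, not from the original post) of how these hyperparameters map onto the BertConfig object used below, assuming the open-source modeling.py is importable as modeling; in practice the same object is usually built with modeling.BertConfig.from_json_file.

import modeling  # modeling.py from the open-source BERT repository

# Build a config object directly from the hyperparameters listed above.
bert_config = modeling.BertConfig(
    vocab_size=30522,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=512,
    type_vocab_size=2,
    initializer_range=0.02)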

    BertModel
This part is the overall pipeline. The whole modeling.py script has more than 900 lines of code, so I will walk through it step by step. The overall flow is: first, embedding is applied to input_ids and token_type_ids; the embedding result is fed into the Transformer; finally we obtain the encoded output.


    def __init__(self,
    config,
    is_training,
    input_ids,
    input_mask=None,
    token_type_ids=None,
    use_one_hot_embeddings=True,
    scope=None):
    """Constructor for BertModel.
    Args:
    config: `BertConfig` instance.
is_training: bool. true for training model, false for eval model. Controls
    whether dropout will be applied.
    input_ids: int32 Tensor of shape [batch_size, seq_length].
    input_mask: (optional) int32 Tensor of shape [batch_size, seq_length].
    token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
    use_one_hot_embeddings: (optional) bool. Whether to use one-hot word
    embeddings or tf.embedding_lookup() for the word embeddings. On the TPU,
it is much faster if this is True, on the CPU or GPU, it is faster if
    this is False.
    scope: (optional) variable scope. Defaults to "bert".
    Raises:
    ValueError: The config is invalid or one of the input tensor shapes
    is invalid.
    """
    config = copy.deepcopy(config)
    if not is_training:
    config.hidden_dropout_prob = 0.0
    config.attention_probs_dropout_prob = 0.0

    input_shape = get_shape_list(input_ids, expected_rank=2)
    batch_size = input_shape[0]
    seq_length = input_shape[1]

    if input_mask is None:
    input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)

    if token_type_ids is None:
    token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)

    with tf.variable_scope(scope, default_name="bert"):
    with tf.variable_scope("embeddings"):
    # Perform embedding lookup on the word ids.

    #[batch_size,seq_length,embedding_size] [vocab_size,embedding_size]
    (self.embedding_output, self.embedding_table) = embedding_lookup( #word_embedding
    input_ids=input_ids, #[batch_size,seq_length]
    vocab_size=config.vocab_size,
    embedding_size=config.hidden_size,
    initializer_range=config.initializer_range,
    word_embedding_name="word_embeddings",
    use_one_hot_embeddings=use_one_hot_embeddings)

    # Add positional embeddings and token type embeddings, then layer
    # normalize and perform dropout.
self.embedding_output = embedding_postprocessor( # adds token_type_embedding and position_embedding, [batch_size,seq_length,embedding_size]
    input_tensor=self.embedding_output,
    use_token_type=True,
    token_type_ids=token_type_ids,
    token_type_vocab_size=config.type_vocab_size,
    token_type_embedding_name="token_type_embeddings",
    use_position_embeddings=True,
    position_embedding_name="position_embeddings",
    initializer_range=config.initializer_range,
    max_position_embeddings=config.max_position_embeddings,
    dropout_prob=config.hidden_dropout_prob)

    with tf.variable_scope("encoder"):
    # This converts a 2D mask of shape [batch_size, seq_length] to a 3D
    # mask of shape [batch_size, seq_length, seq_length] which is used
    # for the attention scores.
    attention_mask = create_attention_mask_from_input_mask(
    input_ids, input_mask)

    # Run the stacked transformer.
    # `sequence_output` shape = [batch_size, seq_length, hidden_size].
    self.all_encoder_layers = transformer_model( #transformer_model list(#[batch_size,seq_length,embedding_size])
    input_tensor=self.embedding_output,
    attention_mask=attention_mask,
    hidden_size=config.hidden_size,
    num_hidden_layers=config.num_hidden_layers,
    num_attention_heads=config.num_attention_heads,
    intermediate_size=config.intermediate_size,
    intermediate_act_fn=get_activation(config.hidden_act),
    hidden_dropout_prob=config.hidden_dropout_prob,
    attention_probs_dropout_prob=config.attention_probs_dropout_prob,
    initializer_range=config.initializer_range,
    do_return_all_layers=True)

self.sequence_output = self.all_encoder_layers[-1] # take the output of the last layer
    # The "pooler" converts the encoded sequence tensor of shape
    # [batch_size, seq_length, hidden_size] to a tensor of shape
    # [batch_size, hidden_size]. This is necessary for segment-level
    # (or segment-pair-level) classification tasks where we need a fixed
    # dimensional representation of the segment.
    with tf.variable_scope("pooler"):
    # We "pool" the model by simply taking the hidden state corresponding
    # to the first token. We assume that this has been pre-trained
first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1) # take the encoding of the first token [CLS] of each example, which carries the encoding of the whole sequence, [batch_size, hidden_size]
self.pooled_output = tf.layers.dense( # output through a fully connected layer, [batch_size, hidden_size]
    first_token_tensor,
    config.hidden_size,
    activation=tf.tanh,
    kernel_initializer=create_initializer(config.initializer_range))
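As a usage note, here is my own sketch of how the constructor above is typically called, assuming modeling.py is importable as modeling; the placeholder shapes and config file path are hypothetical.

import tensorflow as tf
import modeling  # modeling.py from the open-source BERT repository

bert_config = modeling.BertConfig.from_json_file("bert_config.json")  # hypothetical path

# Placeholders with shape [batch_size, seq_length]; 128 is an arbitrary example length.
input_ids = tf.placeholder(tf.int32, shape=[None, 128], name="input_ids")
input_mask = tf.placeholder(tf.int32, shape=[None, 128], name="input_mask")
segment_ids = tf.placeholder(tf.int32, shape=[None, 128], name="segment_ids")

model = modeling.BertModel(
    config=bert_config,
    is_training=True,
    input_ids=input_ids,
    input_mask=input_mask,
    token_type_ids=segment_ids,
    use_one_hot_embeddings=False)  # False is usually faster on GPU/CPU

sequence_output = model.get_sequence_output()  # [batch_size, seq_length, hidden_size]
pooled_output = model.get_pooled_output()      # [batch_size, hidden_size]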
    word embedding
First, look at the word_embedding part. It takes input_ids and, when use_one_hot_embeddings is set, uses a one-hot matrix as an intermediate step to return the embedding result.
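As a side illustration (my own example, not from the original code), a tiny numpy sketch of why multiplying a one-hot matrix by the embedding table is equivalent to a direct row lookup:

import numpy as np

vocab_size, embedding_size = 6, 4
embedding_table = np.random.randn(vocab_size, embedding_size)

flat_input_ids = np.array([2, 0, 5])              # [batch_size*seq_length]
one_hot = np.eye(vocab_size)[flat_input_ids]      # [batch_size*seq_length, vocab_size]
output_matmul = one_hot @ embedding_table         # [batch_size*seq_length, embedding_size]
output_lookup = embedding_table[flat_input_ids]   # direct row gather, same result

assert np.allclose(output_matmul, output_lookup)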

    def embedding_lookup(input_ids,
    vocab_size,
    embedding_size=128,
    initializer_range=0.02,
    word_embedding_name="word_embeddings",
    use_one_hot_embeddings=False):
    """Looks up words embeddings for id tensor.
    Args:
    input_ids: int32 Tensor of shape [batch_size, seq_length] containing word
    ids.
    vocab_size: int. Size of the embedding vocabulary.
    embedding_size: int. Width of the word embeddings.
    initializer_range: float. Embedding initialization range.
    word_embedding_name: string. Name of the embedding table.
    use_one_hot_embeddings: bool. If True, use one-hot method for word
    embeddings. If False, use `tf.nn.embedding_lookup()`. One hot is better
    for TPUs.
    Returns:
    float Tensor of shape [batch_size, seq_length, embedding_size].
    """
    # This function assumes that the input is of shape [batch_size, seq_length,
    # num_inputs].
    #
    # If the input is a 2D tensor of shape [batch_size, seq_length], we
    # reshape to [batch_size, seq_length, 1].
    if input_ids.shape.ndims == 2:
input_ids = tf.expand_dims(input_ids, axis=[-1]) # add a last dimension, [batch_size,seq_length,1]

    embedding_table = tf.get_variable(
    name=word_embedding_name,
    shape=[vocab_size, embedding_size],
    initializer=create_initializer(initializer_range))

    if use_one_hot_embeddings:
    flat_input_ids = tf.reshape(input_ids, [-1]) #[batch_size*seq_length]
    one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size) #[batch_size*seq_length,vocab_size]
    output = tf.matmul(one_hot_input_ids, embedding_table) #[batch_size*seq_length,embedding_size]
    else:
    output = tf.nn.embedding_lookup(embedding_table, input_ids)

    input_shape = get_shape_list(input_ids)

    output = tf.reshape(output,
    input_shape[0:-1] + [input_shape[-1] * embedding_size]) #[batch_size,seq_length,embedding_size]
    return (output, embedding_table)
    embedding_postprocessor
Next, look at embedding_postprocessor. It covers token_type_embedding and position_embedding, i.e. the Segment Embeddings and Position Embeddings of BERT's input representation.

However, the Position Embeddings in this code differ from the originally proposed Transformer: here the position embeddings are learned, whereas the original Transformer uses fixed sinusoidal values (sketched below).
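Since the figure with the formula is not reproduced here, for reference the fixed encoding from "Attention Is All You Need" is PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); a minimal numpy sketch of it (my own illustration, not part of BERT's code):

import numpy as np

def sinusoidal_position_encoding(seq_length, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    positions = np.arange(seq_length)[:, None]          # [seq_length, 1]
    dims = np.arange(d_model)[None, :]                  # [1, d_model]
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                    # [seq_length, d_model]
    pe = np.zeros((seq_length, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe                                           # [seq_length, d_model]

print(sinusoidal_position_encoding(seq_length=128, d_model=768).shape)  # (128, 768)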


    def embedding_postprocessor(input_tensor, #[batch_size,seq_length,embedding_size]
    use_token_type=False,
    token_type_ids=None, #[batch_size,seq_length]
    token_type_vocab_size=16,
    token_type_embedding_name="token_type_embeddings",
    use_position_embeddings=True,
    position_embedding_name="position_embeddings",
    initializer_range=0.02,
    max_position_embeddings=512,
    dropout_prob=0.1):
    """Performs various post-processing on a word embedding tensor.
    Args:
    input_tensor: float Tensor of shape [batch_size, seq_length,
    embedding_size].
    use_token_type: bool. Whether to add embeddings for `token_type_ids`.
    token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
    Must be specified if `use_token_type` is True.
    token_type_vocab_size: int. The vocabulary size of `token_type_ids`.
    token_type_embedding_name: string. The name of the embedding table variable
    for token type ids.
    use_position_embeddings: bool. Whether to add position embeddings for the
    position of each token in the sequence.
    position_embedding_name: string. The name of the embedding table variable
    for positional embeddings.
    initializer_range: float. Range of the weight initialization.
    max_position_embeddings: int. Maximum sequence length that might ever be
    used with this model. This can be longer than the sequence length of
    input_tensor, but cannot be shorter.
    dropout_prob: float. Dropout probability applied to the final output tensor.
    Returns:
    float tensor with same shape as `input_tensor`.
    Raises:
    ValueError: One of the tensor shapes or input values is invalid.
    """
    input_shape = get_shape_list(input_tensor, expected_rank=3)
    batch_size = input_shape[0]
    seq_length = input_shape[1]
    width = input_shape[2]

    output = input_tensor

if use_token_type: # Segment Embeddings part
    if token_type_ids is None:
    raise ValueError("`token_type_ids` must be specified if"
    "`use_token_type` is True.")
    token_type_table = tf.get_variable(
    name=token_type_embedding_name,
    shape=[token_type_vocab_size, width],
    initializer=create_initializer(initializer_range))
    # This vocab will be small so we always do one-hot here, since it is always
    # faster for a small vocabulary.
    flat_token_type_ids = tf.reshape(token_type_ids, [-1]) #[batch_size*seq_length]
one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size) #[batch_size*seq_length,2]; token_type ids are only 0 or 1
    token_type_embeddings = tf.matmul(one_hot_ids, token_type_table) #[batch_size*seq_length,embedding_size]
    token_type_embeddings = tf.reshape(token_type_embeddings,
    [batch_size, seq_length, width]) #[batch_size, seq_length, width=embedding_size]
    output += token_type_embeddings #[batch_size, seq_length, embedding_size]

if use_position_embeddings: # Position Embeddings part
assert_op = tf.assert_less_equal(seq_length, max_position_embeddings) # make sure seq_length <= max_position_embeddings
    with tf.control_dependencies([assert_op]):
    full_position_embeddings = tf.get_variable(
    name=position_embedding_name,
    shape=[max_position_embeddings, width],
    initializer=create_initializer(initializer_range))
    # Since the position embedding table is a learned variable, we create it
    # using a (long) sequence length `max_position_embeddings`. The actual
    # sequence length might be shorter than this, for faster training of
    # tasks that do not have long sequences.
    #
    # So `full_position_embeddings` is effectively an embedding table
    # for position [0, 1, 2, ..., max_position_embeddings-1], and the current
    # sequence has positions [0, 1, 2, ... seq_length-1], so we can just
    # perform a slice.
    position_embeddings = tf.slice(full_position_embeddings, [0, 0], #[seq_length,embedding_size]
    [seq_length, -1])
    num_dims = len(output.shape.as_list())

    # Only the last two dimensions are relevant (`seq_length` and `width`), so
    # we broadcast among the first dimensions, which is typically just
    # the batch size.
    position_broadcast_shape = []
    for _ in range(num_dims - 2):
    position_broadcast_shape.append(1)
    position_broadcast_shape.extend([seq_length, width]) #[1,seq_length,embedding_size]
    position_embeddings = tf.reshape(position_embeddings, #[1,seq_length,embedding_size]
    position_broadcast_shape)
output += position_embeddings # [batch_size, seq_length, embedding_size] plus [1,seq_length,embedding_size]
# the position_embedding at a given position is identical for every example in the batch, so this broadcasts the position embeddings across the batch dimension of output

    output = layer_norm_and_dropout(output, dropout_prob)
    return output
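The broadcast addition in the position-embedding step can be illustrated with a tiny numpy sketch (my own example):

import numpy as np

batch_size, seq_length, width = 2, 3, 4
output = np.random.randn(batch_size, seq_length, width)
position_embeddings = np.random.randn(1, seq_length, width)  # identical for every example

result = output + position_embeddings                         # broadcast over the batch dimension
assert result.shape == (batch_size, seq_length, width)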
    Transformer
After the embeddings, an attention_mask is built first. Its meaning is to expand the original input_mask from [batch_size, seq_length] to [batch_size, from_seq_length, to_seq_length], so that every "from" position has its own copy of the input_mask, as sketched just below. The embeddings and the mask are then passed into the Transformer model.
The overall Transformer architecture is as in the original paper (figure not reproduced here).
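A minimal numpy sketch of that mask expansion (my own illustration of the idea behind create_attention_mask_from_input_mask, not the exact source):

import numpy as np

batch_size, seq_length = 2, 4
# 1 marks real tokens, 0 marks padding.
input_mask = np.array([[1, 1, 1, 0],
                       [1, 1, 0, 0]])                                # [batch_size, to_seq_length]

# Broadcast so each "from" position gets the same row of the mask.
attention_mask = np.ones((batch_size, seq_length, 1)) * input_mask[:, None, :]
assert attention_mask.shape == (batch_size, seq_length, seq_length)  # [B, from_seq, to_seq]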

Now let's look at transformer_model. It first applies multi-head attention to the embeddings, then adds a residual connection with the input and applies layer_norm. The result goes through the feed-forward sub-layer, followed again by a residual connection and layer_norm.
One point where this code differs from the original paper: after the multi-head attention, a fully connected layer is applied first, and only then the residual connection and layer_norm; the paper does not seem to spell out that extra fully connected layer. The code follows, with comments added at the key parts.

    def transformer_model(input_tensor,
attention_mask=None, #[batch_size,from_seq_length,to_seq_length]
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    intermediate_act_fn=gelu,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    initializer_range=0.02,
    do_return_all_layers=False):
    """Multi-headed, multi-layer Transformer from "Attention is All You Need".
    This is almost an exact implementation of the original Transformer encoder.
    See the original paper:
    https://arxiv.org/abs/1706.03762
    Also see:
    https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py
    Args:
    input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size].
    attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length,
    seq_length], with 1 for positions that can be attended to and 0 in
    positions that should not be.
    hidden_size: int. Hidden size of the Transformer.
    num_hidden_layers: int. Number of layers (blocks) in the Transformer.
    num_attention_heads: int. Number of attention heads in the Transformer.
    intermediate_size: int. The size of the "intermediate" (a.k.a., feed
    forward) layer.
    intermediate_act_fn: function. The non-linear activation function to apply
    to the output of the intermediate/feed-forward layer.
    hidden_dropout_prob: float. Dropout probability for the hidden layers.
    attention_probs_dropout_prob: float. Dropout probability of the attention
    probabilities.
    initializer_range: float. Range of the initializer (stddev of truncated
    normal).
    do_return_all_layers: Whether to also return all layers or just the final
    layer.
    Returns:
    float Tensor of shape [batch_size, seq_length, hidden_size], the final
    hidden layer of the Transformer.
    Raises:
    ValueError: A Tensor shape or parameter is invalid.
    """
    if hidden_size % num_attention_heads != 0:
    raise ValueError(
    "The hidden size (%d) is not a multiple of the number of attention "
    "heads (%d)" % (hidden_size, num_attention_heads))

    attention_head_size = int(hidden_size / num_attention_heads)
    input_shape = get_shape_list(input_tensor, expected_rank=3)
    batch_size = input_shape[0]
    seq_length = input_shape[1]
    input_width = input_shape[2]

    # The Transformer performs sum residuals on all layers so the input needs
    # to be the same as the hidden size.
    if input_width != hidden_size:
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
    (input_width, hidden_size))

    # We keep the representation as a 2D tensor to avoid re-shaping it back and
    # forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
    # the GPU/CPU but may not be free on the TPU, so we want to minimize them to
    # help the optimizer.
prev_output = reshape_to_matrix(input_tensor) # per the official comment, to avoid repeatedly reshaping between 2D and 3D, the tensor is flattened to 2D here and restored to 3D at the end, [batch_size*seq_length,hidden_size]

    all_layer_outputs = []
    for layer_idx in range(num_hidden_layers):
    with tf.variable_scope("layer_%d" % layer_idx):
    layer_input = prev_output

    with tf.variable_scope("attention"):
    attention_heads = []
    with tf.variable_scope("self"):
attention_head = attention_layer( # self-attention, i.e. multi-head attention
    from_tensor=layer_input, #[batch_size*seq_length,hidden_size]
    to_tensor=layer_input, #[batch_size*seq_length,hidden_size]
    attention_mask=attention_mask,
    num_attention_heads=num_attention_heads,
    size_per_head=attention_head_size,
    attention_probs_dropout_prob=attention_probs_dropout_prob,
    initializer_range=initializer_range,
    do_return_2d_tensor=True,
    batch_size=batch_size,
    from_seq_length=seq_length,
    to_seq_length=seq_length)
    attention_heads.append(attention_head)

    attention_output = None
    if len(attention_heads) == 1:
    attention_output = attention_heads[0]
    else:
    # In the case where we have other sequences, we just concatenate
    # them to the self-attention head before the projection.
    attention_output = tf.concat(attention_heads, axis=-1)

    # Run a linear projection of `hidden_size` then add a residual
    # with `layer_input`.
    with tf.variable_scope("output"):
attention_output = tf.layers.dense( # fully connected projection on the attention output
    attention_output,
    hidden_size,
    kernel_initializer=create_initializer(initializer_range))
    attention_output = dropout(attention_output, hidden_dropout_prob)
attention_output = layer_norm(attention_output + layer_input) # residual connection and layer_norm
#Feed-forward part: first project the output up to intermediate_size, then project back down to hidden_size
    # The activation is only applied to the "intermediate" hidden layer.
    with tf.variable_scope("intermediate"):
intermediate_output = tf.layers.dense( # up-projection
    attention_output,
    intermediate_size,
    activation=intermediate_act_fn,
    kernel_initializer=create_initializer(initializer_range))

    # Down-project back to `hidden_size` then add the residual.
    with tf.variable_scope("output"): #降维
    layer_output = tf.layers.dense(
    intermediate_output,
    hidden_size,
    kernel_initializer=create_initializer(initializer_range))
    layer_output = dropout(layer_output, hidden_dropout_prob)
layer_output = layer_norm(layer_output + attention_output) # residual connection and layer_norm
prev_output = layer_output # this layer's output becomes the next layer's input
all_layer_outputs.append(layer_output) # list of every layer's output

    if do_return_all_layers:
    final_outputs = []
    for layer_output in all_layer_outputs:
    final_output = reshape_from_matrix(layer_output, input_shape)
    final_outputs.append(final_output)
    return final_outputs
    else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output
    self_attention
Next comes the self-attention mechanism. It uses multiplicative (dot-product) attention of the sequence with itself, so that every token absorbs global contextual information. It also uses multi-head attention: hidden_size is split evenly into several parts (heads), each head performs self-attention, and different heads learn semantics of different subspaces.

The code is below, with comments added at the key parts. The query, key, and value are first reshaped to [batch_size, num_heads, seq_length, size_per_head]; dot-product attention is computed over these heads; after the softmax the result is multiplied by value; finally a tensor of shape [batch_size*seq_length, hidden_size] is returned. A condensed sketch of the core computation comes right after this paragraph.
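As a conceptual companion to the full function below, a minimal numpy sketch of scaled dot-product attention for already-split heads (my own simplification; it omits the linear projections, attention mask, and dropout):

import numpy as np

def scaled_dot_product_attention(q, k, v):
    # q, k, v: [batch_size, num_heads, seq_length, size_per_head]
    size_per_head = q.shape[-1]
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(size_per_head)  # [B, N, F, T]
    scores = scores - scores.max(axis=-1, keepdims=True)           # numerical stability
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=-1, keepdims=True)              # softmax over T
    return probs @ v                                               # [B, N, F, H]

B, N, F, H = 2, 12, 8, 64
q = np.random.randn(B, N, F, H)
k = np.random.randn(B, N, F, H)
v = np.random.randn(B, N, F, H)
assert scaled_dot_product_attention(q, k, v).shape == (B, N, F, H)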

def attention_layer(from_tensor, # from_tensor and to_tensor are both the input embeddings, [batch_size*seq_length,hidden_size]
    to_tensor,
attention_mask=None, #[batch_size,from_seq_length,to_seq_length]
    num_attention_heads=1,
    size_per_head=512,
    query_act=None,
    key_act=None,
    value_act=None,
    attention_probs_dropout_prob=0.0,
    initializer_range=0.02,
    do_return_2d_tensor=False,
    batch_size=None,
    from_seq_length=None,
    to_seq_length=None):
    """Performs multi-headed attention from `from_tensor` to `to_tensor`.
    This is an implementation of multi-headed attention based on "Attention
    is all you Need". If `from_tensor` and `to_tensor` are the same, then
    this is self-attention. Each timestep in `from_tensor` attends to the
corresponding sequence in `to_tensor`, and returns a fixed-width vector.
    This function first projects `from_tensor` into a "query" tensor and
    `to_tensor` into "key" and "value" tensors. These are (effectively) a list
    of tensors of length `num_attention_heads`, where each tensor is of shape
    [batch_size, seq_length, size_per_head].
    Then, the query and key tensors are dot-producted and scaled. These are
    softmaxed to obtain attention probabilities. The value tensors are then
    interpolated by these probabilities, then concatenated back to a single
    tensor and returned.
    In practice, the multi-headed attention are done with transposes and
    reshapes rather than actual separate tensors.
    Args:
    from_tensor: float Tensor of shape [batch_size, from_seq_length,
    from_width].
    to_tensor: float Tensor of shape [batch_size, to_seq_length, to_width].
    attention_mask: (optional) int32 Tensor of shape [batch_size,
    from_seq_length, to_seq_length]. The values should be 1 or 0. The
    attention scores will effectively be set to -infinity for any positions in
    the mask that are 0, and will be unchanged for positions that are 1.
    num_attention_heads: int. Number of attention heads.
    size_per_head: int. Size of each attention head.
    query_act: (optional) Activation function for the query transform.
    key_act: (optional) Activation function for the key transform.
    value_act: (optional) Activation function for the value transform.
    attention_probs_dropout_prob: (optional) float. Dropout probability of the
    attention probabilities.
    initializer_range: float. Range of the weight initializer.
    do_return_2d_tensor: bool. If True, the output will be of shape [batch_size
    * from_seq_length, num_attention_heads * size_per_head]. If False, the
    output will be of shape [batch_size, from_seq_length, num_attention_heads
    * size_per_head].
    batch_size: (Optional) int. If the input is 2D, this might be the batch size
    of the 3D version of the `from_tensor` and `to_tensor`.
    from_seq_length: (Optional) If the input is 2D, this might be the seq length
    of the 3D version of the `from_tensor`.
    to_seq_length: (Optional) If the input is 2D, this might be the seq length
    of the 3D version of the `to_tensor`.
    Returns:
    float Tensor of shape [batch_size, from_seq_length,
    num_attention_heads * size_per_head]. (If `do_return_2d_tensor` is
    true, this will be of shape [batch_size * from_seq_length,
    num_attention_heads * size_per_head]).
    Raises:
    ValueError: Any of the arguments or tensor shapes are invalid.
    """

    def transpose_for_scores(input_tensor, batch_size, num_attention_heads,
    seq_length, width):
    output_tensor = tf.reshape(
    input_tensor, [batch_size, seq_length, num_attention_heads, width])

    output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])
    return output_tensor

    from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
    to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])

    if len(from_shape) != len(to_shape):
    raise ValueError(
    "The rank of `from_tensor` must match the rank of `to_tensor`.")

    if len(from_shape) == 3:
    batch_size = from_shape[0]
    from_seq_length = from_shape[1]
    to_seq_length = to_shape[1]
    elif len(from_shape) == 2:
    if (batch_size is None or from_seq_length is None or to_seq_length is None):
    raise ValueError(
    "When passing in rank 2 tensors to attention_layer, the values "
    "for `batch_size`, `from_seq_length`, and `to_seq_length` "
    "must all be specified.")

    # Scalar dimensions referenced here:
    # B = batch size (number of sequences)
    # F = `from_tensor` sequence length
    # T = `to_tensor` sequence length
    # N = `num_attention_heads`
    # H = `size_per_head`

    from_tensor_2d = reshape_to_matrix(from_tensor) #[batch_size*seq_length,hidden_size]
    to_tensor_2d = reshape_to_matrix(to_tensor) #[batch_size*seq_length,hidden_size]
#first feed query, key, and value through dense layers with activation=None; these are the usual linear Q/K/V projections of the Transformer
    # `query_layer` = [B*F, N*H]
    query_layer = tf.layers.dense(
    from_tensor_2d,
    num_attention_heads * size_per_head,
    activation=query_act, #None
    name="query",
kernel_initializer=create_initializer(initializer_range)) # [batch_size*seq_length,hidden_size], where hidden_size = num_attention_heads*size_per_head

    # `key_layer` = [B*T, N*H]
    key_layer = tf.layers.dense(
    to_tensor_2d,
    num_attention_heads * size_per_head,
    activation=key_act, #None
    name="key",
    kernel_initializer=create_initializer(initializer_range))

    # `value_layer` = [B*T, N*H]
    value_layer = tf.layers.dense(
    to_tensor_2d,
    num_attention_heads * size_per_head,
    activation=value_act, #None
    name="value",
    kernel_initializer=create_initializer(initializer_range))
#reshape to 4D for the attention matrix multiplication
    # `query_layer` = [B, N, F, H]
query_layer = transpose_for_scores(query_layer, batch_size, # move num_attention_heads to the second dimension: each batch has N heads, each head sees F tokens, each token is an H-dimensional vector; different heads learn features of different subspaces
    num_attention_heads, from_seq_length,
    size_per_head)

    # `key_layer` = [B, N, T, H]
    key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
    to_seq_length, size_per_head)

    # Take the dot product between "query" and "key" to get the raw
# attention scores (dot-product attention).
    # `attention_scores` = [B, N, F, T]
    attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
    attention_scores = tf.multiply(attention_scores,
    1.0 / math.sqrt(float(size_per_head)))

    if attention_mask is not None:
    # `attention_mask` = [B, 1, F, T]
    attention_mask = tf.expand_dims(attention_mask, axis=[1])

#this turns the padded positions at the end of each example into a large negative value, while positions with real tokens get 0
    # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
    # masked positions, this operation will create a tensor which is 0.0 for
    # positions we want to attend and -10000.0 for masked positions.
    adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0

    # Since we are adding it to the raw scores before the softmax, this is
    # effectively the same as removing these entirely.
#after the addition, positions with real tokens are unchanged (0 is added), while padded positions become a large negative value
    attention_scores += adder

    # Normalize the attention scores to probabilities.
    # `attention_probs` = [B, N, F, T]
    attention_probs = tf.nn.softmax(attention_scores)

    # This is actually dropping out entire tokens to attend to, which might
    # seem a bit unusual, but is taken from the original Transformer paper.
    attention_probs = dropout(attention_probs, attention_probs_dropout_prob)

    # `value_layer` = [B, T, N, H]
    value_layer = tf.reshape(
    value_layer,
    [batch_size, to_seq_length, num_attention_heads, size_per_head])

    # `value_layer` = [B, N, T, H]
    value_layer = tf.transpose(value_layer, [0, 2, 1, 3])

    # `context_layer` = [B, N, F, H]
# multiply the attention probabilities by value
    context_layer = tf.matmul(attention_probs, value_layer)

    # `context_layer` = [B, F, N, H]
    context_layer = tf.transpose(context_layer, [0, 2, 1, 3])

    if do_return_2d_tensor:
# return a 2D result
    # `context_layer` = [B*F, N*V]
    context_layer = tf.reshape(
    context_layer,
    [batch_size * from_seq_length, num_attention_heads * size_per_head])
    else:
    # `context_layer` = [B, F, N*V]
    context_layer = tf.reshape(
    context_layer,
    [batch_size, from_seq_length, num_attention_heads * size_per_head])

    return context_layer
Model usage
How is the model used? The BertModel class provides two accessor methods. get_pooled_output returns the representation of the first token [CLS] of each example in the batch; BERT treats this token as carrying the information of the whole example, so it is suitable for sentence-level classification tasks. get_sequence_output returns BERT's final output with shape [batch_size, seq_length, hidden_size]; intuitively it is the final representation of every token in each example, suitable for token-level and seq2seq-style tasks.


    def get_pooled_output(self):
return self.pooled_output #[batch_size, hidden_size]
    def get_sequence_output(self):
    """Gets final hidden layer of encoder.
    Returns:
    float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
    to the final hidden of the transformer encoder.
    """
    return self.sequence_output
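For example, a hedged sketch of putting a simple classification head on top of the pooled output for a sentence-level task (my own addition; num_labels and the head itself are hypothetical, and `model` / `bert_config` are the objects built earlier):

import tensorflow as tf

pooled_output = model.get_pooled_output()   # [batch_size, hidden_size]
num_labels = 2                              # hypothetical number of classes

logits = tf.layers.dense(                   # simple classification head on top of [CLS]
    pooled_output,
    num_labels,
    kernel_initializer=tf.truncated_normal_initializer(stddev=bert_config.initializer_range))
probabilities = tf.nn.softmax(logits, axis=-1)  # [batch_size, num_labels]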
The next post will cover the training process. Two things have suddenly come up recently, so it may be delayed for a few days.
---------------------
Author: 保持一份率性
Source: CSDN
Original: https://blog.csdn.net/weixin_39470744/article/details/84401339
Copyright notice: this is the blogger's original article; please include a link to the original post when reposting.
