  • Google BERT Pre-training Source Code Walkthrough (1): Training Data Generation

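    Contents
    Overview of the pre-training source code
    Input and output
    Source code walkthrough
    Parameters
    Main function
    Creating training instances
    Next sentence prediction & instance generation
    Random masking
    Output
    A look at the results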
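    Overview of the pre-training source code
    BERT, in short, is a pre-trained NLP model built on the Transformer architecture that combines masked-word prediction with next-sentence prediction. As for its results: it set new best scores on 11 different NLP tasks.
    I have read a number of articles introducing BERT; in my view the most comprehensive one is the write-up by 机器之心.
    Here is also the link to Google's official source code: BERT official source code
    Before reading this post, you should be familiar with:
    1. The Transformer architecture
    2. What is new in the BERT model
    3. Python and the TensorFlow framework
    I point out the corresponding principles directly in the code, so diving into the code without knowing the architecture may be hard going.
    BERT pre-training consists of three main parts:
    1. Preprocessing the pre-training data (create_pretraining_data.py)
    2. Building the core model (modeling.py)
    3. The training procedure (run_pretraining.py)
    I will cover the source code of these three parts in three separate posts. This one covers the training data generation script, create_pretraining_data.py.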

    Input and output
    The input and output can be read straight off the command line given in the official documentation:

    python create_pretraining_data.py \
      --input_file=./sample_text.txt \
      --output_file=/tmp/tf_examples.tfrecord \
      --vocab_file=$BERT_BASE_DIR/vocab.txt \
      --do_lower_case=True \
      --max_seq_length=128 \
      --max_predictions_per_seq=20 \
      --masked_lm_prob=0.15 \
      --random_seed=12345 \
      --dupe_factor=5
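    As you can see, Google provides a small training sample, sample_text.txt (the input), which is processed and written to tf_examples.tfrecord (the output). In sample_text.txt, blank lines separate documents, and every sentence of a document sits on its own line (so two consecutive lines within a document are consecutive sentences). vocab_file is the vocabulary shipped with the official model.
    sample_text.txt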
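
    To make the input format concrete, here is a minimal sketch that splits such a text into documents and sentences. It uses plain whitespace splitting as a stand-in for the real tokenizer, and the helper name split_documents is hypothetical; it is not part of the BERT code.

```python
def split_documents(text):
    """Split raw text into documents (blank-line separated) of sentences (one per line)."""
    documents = [[]]
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line starts a new document.
            documents.append([])
            continue
        # Whitespace splitting stands in for WordPiece tokenization here.
        documents[-1].append(line.split())
    # Drop empty documents produced by leading or repeated blank lines.
    return [doc for doc in documents if doc]


sample = (
    "First sentence of doc one.\n"
    "Second sentence of doc one.\n"
    "\n"
    "Only sentence of doc two.\n"
)
docs = split_documents(sample)
print(len(docs), [len(d) for d in docs])  # 2 [2, 1]
```

    The result is the same two-level list (document, then sentence) that the script builds before generating training instances.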


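    Source code walkthrough
    Parameters
    input_file: path to the input text file(s)
    output_file: path to the output file(s)
    vocab_file: path to the vocabulary file (shipped by Google with the pre-trained models)
    do_lower_case: if True, the input text is lower-cased (True for uncased models, False for cased models)
    max_seq_length: maximum total length of one training example (both sentences combined, including [CLS] and the two [SEP] tokens)
    max_predictions_per_seq: maximum number of masked positions per training example
    random_seed: the random seed
    dupe_factor: number of times the input is duplicated, each pass producing different random masks
    masked_lm_prob: fraction of tokens to mask; each example gets min(max_predictions_per_seq, max(1, round(len(tokens) × masked_lm_prob))) masked positions
    short_seq_prob: probability of generating an example shorter than max_seq_length, to narrow the gap between pre-training and fine-tuning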
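
    The interaction between masked_lm_prob and max_predictions_per_seq is easiest to see with a few numbers. The sketch below reuses the formula from the script; the function name num_masked_positions is only illustrative. Note that in the script the count is taken over len(tokens), which already includes [CLS] and [SEP].

```python
def num_masked_positions(num_tokens, masked_lm_prob=0.15, max_predictions_per_seq=20):
    """Number of positions selected for masking, using the script's formula."""
    return min(max_predictions_per_seq,
               max(1, int(round(num_tokens * masked_lm_prob))))


for n in (5, 40, 128, 512):
    print(n, num_masked_positions(n))
```

    Short sequences still get at least one masked position, and long ones are capped at max_predictions_per_seq.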

    from __future__ import absolute_import
    from __future__ import division
    from __future__ import print_function

    import collections
    import random

    import tokenization
    import tensorflow as tf

    flags = tf.flags

    FLAGS = flags.FLAGS

    flags.DEFINE_string("input_file", None,
                        "Input raw text file (or comma-separated list of files).")

    flags.DEFINE_string(
        "output_file", None,
        "Output TF example file (or comma-separated list of files).")

    flags.DEFINE_string("vocab_file", None,
                        "The vocabulary file that the BERT model was trained on.")

    flags.DEFINE_bool(
        "do_lower_case", True,
        "Whether to lower case the input text. Should be True for uncased "
        "models and False for cased models.")

    flags.DEFINE_integer("max_seq_length", 128, "Maximum sequence length.")

    flags.DEFINE_integer("max_predictions_per_seq", 20,
                         "Maximum number of masked LM predictions per sequence.")

    flags.DEFINE_integer("random_seed", 12345, "Random seed for data generation.")

    flags.DEFINE_integer(
        "dupe_factor", 10,
        "Number of times to duplicate the input data (with different masks).")

    flags.DEFINE_float("masked_lm_prob", 0.15, "Masked LM probability.")

    flags.DEFINE_float(
        "short_seq_prob", 0.1,
        "Probability of creating sequences which are shorter than the "
        "maximum length.")
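
    Main function
    The main function first collects the list of input files, creates training instances from the input text, and then writes them out.
    A brief note on the FullTokenizer class: it uses vocab_file as its vocabulary and splits text into tokens that can be mapped to their ids. A word that is not in the vocabulary as a whole, such as johanson, is split greedily by longest match, and each piece is looked up in vocab_file. vocab_file does not contain "johanson", but it does contain "johan" and "##son", so "johanson" is split into these two tokens (## marks a piece that does not start a word).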
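
    The splitting described above can be sketched as a greedy longest-match-first loop. This is a simplified illustration, not the tokenizer from tokenization.py: the function name wordpiece_tokenize and the three-entry toy vocabulary are assumptions made for the example.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # non-initial pieces carry the ## prefix
            if candidate in vocab:
                current = candidate
                break
            end -= 1
        if current is None:
            return [unk_token]  # no piece matches, so the whole word is unknown
        pieces.append(current)
        start = end
    return pieces


toy_vocab = {"johan", "##son", "the"}  # toy vocabulary (assumption)
print(wordpiece_tokenize("johanson", toy_vocab))  # ['johan', '##son']
```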

    def main(_):
      tf.logging.set_verbosity(tf.logging.INFO)

      tokenizer = tokenization.FullTokenizer(
          vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)

      input_files = []
      for input_pattern in FLAGS.input_file.split(","):
        input_files.extend(tf.gfile.Glob(input_pattern))  # collect the list of input files

      tf.logging.info("*** Reading from input files ***")
      for input_file in input_files:
        tf.logging.info(" %s", input_file)

      rng = random.Random(FLAGS.random_seed)
      instances = create_training_instances(  # create the training instances
          input_files, tokenizer, FLAGS.max_seq_length, FLAGS.dupe_factor,
          FLAGS.short_seq_prob, FLAGS.masked_lm_prob, FLAGS.max_predictions_per_seq,
          rng)

      output_files = FLAGS.output_file.split(",")
      tf.logging.info("*** Writing to output files ***")
      for output_file in output_files:
        tf.logging.info(" %s", output_file)

      write_instance_to_example_files(instances, tokenizer, FLAGS.max_seq_length,  # write the output
                                      FLAGS.max_predictions_per_seq, output_files)
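
    Creating training instances
    This part first loads every document and each of its sentences into a two-dimensional list, then passes that list to create_instances_from_document to generate the training instances.
    Return value: instances, a list of TrainingInstance objects, one per example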

    def create_training_instances(input_files, tokenizer, max_seq_length,
                                  dupe_factor, short_seq_prob, masked_lm_prob,
                                  max_predictions_per_seq, rng):
      """Create `TrainingInstance`s from raw text."""
      all_documents = [[]]

      # Input file format:
      # (1) One sentence per line. These should ideally be actual sentences, not
      # entire paragraphs or arbitrary spans of text. (Because we use the
      # sentence boundaries for the "next sentence prediction" task).
      # (2) Blank lines between documents. Document boundaries are needed so
      # that the "next sentence prediction" task doesn't span between documents.
      for input_file in input_files:
        with tf.gfile.GFile(input_file, "r") as reader:
          while True:
            line = tokenization.convert_to_unicode(reader.readline())
            if not line:
              break
            line = line.strip()

            # Empty lines are used as document delimiters
            if not line:
              all_documents.append([])
            tokens = tokenizer.tokenize(line)
            if tokens:
              all_documents[-1].append(tokens)  # two-dimensional list: [document][sentence]

      # Remove empty documents
      all_documents = [x for x in all_documents if x]  # drop empty lists
      rng.shuffle(all_documents)  # shuffle the documents

      vocab_words = list(tokenizer.vocab.keys())
      instances = []
      for _ in range(dupe_factor):
        for document_index in range(len(all_documents)):
          instances.extend(
              create_instances_from_document(
                  all_documents, document_index, max_seq_length, short_seq_prob,
                  masked_lm_prob, max_predictions_per_seq, vocab_words, rng))

      rng.shuffle(instances)
      return instances
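
    Next sentence prediction & instance generation
    This part is where the training data is actually generated: a TrainingInstance is built for each example. Each example in fact holds the information of two sentences. A TrainingInstance contains tokens, segment_ids, is_random_next, masked_lm_positions and masked_lm_labels, with the following meanings:
    tokens: the tokens
    segment_ids: sentence ids, 0 for the first sentence and 1 for the second
    is_random_next: whether the second sentence was picked at random or actually follows the first
    masked_lm_positions: the positions in tokens that were masked
    masked_lm_labels: the original tokens at the masked positions
    This part contains one of BERT's innovations: generating the labels for next sentence prediction.
    Return value: instances
    The key lines of code are annotated below.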
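
    Before the full listing, here is a small sketch of how the two sentences are assembled into one example. The namedtuple is a hypothetical stand-in for the script's TrainingInstance class, and build_pair is an illustrative helper, not a function from the script.

```python
import collections

# Hypothetical stand-in for the script's TrainingInstance class.
TrainingInstance = collections.namedtuple(
    "TrainingInstance",
    ["tokens", "segment_ids", "is_random_next",
     "masked_lm_positions", "masked_lm_labels"])


def build_pair(tokens_a, tokens_b):
    """Assemble [CLS] A [SEP] B [SEP] and the matching segment ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # [CLS], sentence A and the first [SEP] get 0; sentence B and the last [SEP] get 1.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


tokens, segment_ids = build_pair(["hello", "world"], ["good", "##bye"])
instance = TrainingInstance(
    tokens=tokens,
    segment_ids=segment_ids,
    is_random_next=False,
    masked_lm_positions=[],  # filled in by the masking step
    masked_lm_labels=[])
print(instance.tokens)
print(instance.segment_ids)
```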

    def create_instances_from_document(
        all_documents, document_index, max_seq_length, short_seq_prob,
        masked_lm_prob, max_predictions_per_seq, vocab_words, rng):
      """Creates `TrainingInstance`s for a single document."""
      document = all_documents[document_index]

      # Account for [CLS], [SEP], [SEP]
      max_num_tokens = max_seq_length - 3

      # We *usually* want to fill up the entire sequence since we are padding
      # to `max_seq_length` anyways, so short sequences are generally wasted
      # computation. However, we *sometimes*
      # (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter
      # sequences to minimize the mismatch between pre-training and fine-tuning.
      # The `target_seq_length` is just a rough target however, whereas
      # `max_seq_length` is a hard limit.
      target_seq_length = max_num_tokens
      if rng.random() < short_seq_prob:  # with probability short_seq_prob, aim for a shorter sequence
        target_seq_length = rng.randint(2, max_num_tokens)

      # We DON'T just concatenate all of the tokens from a document into a long
      # sequence and choose an arbitrary split point because this would make the
      # next sentence prediction task too easy. Instead, we split the input into
      # segments "A" and "B" based on the actual "sentences" provided by the user
      # input.
      instances = []
      current_chunk = []  # candidate sentences for the next example
      current_length = 0
      i = 0
      while i < len(document):
        segment = document[i]
        current_chunk.append(segment)
        current_length += len(segment)
        if i == len(document) - 1 or current_length >= target_seq_length:
          if current_chunk:
            # `a_end` is how many segments from `current_chunk` go into the `A`
            # (first) sentence.
            a_end = 1
            if len(current_chunk) >= 2:
              a_end = rng.randint(1, len(current_chunk) - 1)  # random split point: sentences before it form A

            tokens_a = []
            for j in range(a_end):
              tokens_a.extend(current_chunk[j])  # add every sentence before the split point to tokens_a

            tokens_b = []
            # Random next
            is_random_next = False
            if len(current_chunk) == 1 or rng.random() < 0.5:  # only one candidate sentence, or with probability 0.5: pick B at random
              is_random_next = True
              target_b_length = target_seq_length - len(tokens_a)

              # This should rarely go for more than one iteration for large
              # corpora. However, just to be careful, we try to make sure that
              # the random document is not the same as the document
              # we're processing.
              for _ in range(10):
                random_document_index = rng.randint(0, len(all_documents) - 1)
                if random_document_index != document_index:
                  break

              random_document = all_documents[random_document_index]  # pick a random document
              random_start = rng.randint(0, len(random_document) - 1)  # pick a random starting sentence in it
              for j in range(random_start, len(random_document)):
                tokens_b.extend(random_document[j])  # add the random sentences to tokens_b
                if len(tokens_b) >= target_b_length:
                  break
              # We didn't actually use these segments so we "put them back" so
              # they don't go to waste.
              num_unused_segments = len(current_chunk) - a_end
              i -= num_unused_segments
            # Actual next
            else:
              is_random_next = False  # use the sentences that actually follow A as B
              for j in range(a_end, len(current_chunk)):
                tokens_b.extend(current_chunk[j])
            truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng)  # trim both sentences to fit the length limit

            assert len(tokens_a) >= 1
            assert len(tokens_b) >= 1

            tokens = []
            segment_ids = []
            tokens.append("[CLS]")
            segment_ids.append(0)
            for token in tokens_a:
              tokens.append(token)
              segment_ids.append(0)

            tokens.append("[SEP]")
            segment_ids.append(0)

            for token in tokens_b:
              tokens.append(token)
              segment_ids.append(1)
            tokens.append("[SEP]")
            segment_ids.append(1)

            (tokens, masked_lm_positions,
             masked_lm_labels) = create_masked_lm_predictions(  # apply masking to the tokens
                 tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng)
            instance = TrainingInstance(
                tokens=tokens,
                segment_ids=segment_ids,
                is_random_next=is_random_next,
                masked_lm_positions=masked_lm_positions,
                masked_lm_labels=masked_lm_labels)
            instances.append(instance)
          current_chunk = []
          current_length = 0
        i += 1

      return instances
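
    The listing above calls truncate_seq_pair without showing it. Below is a minimal sketch of what such a helper can look like: it removes one token at a time from the longer of the two lists, randomly from the front or the back, until the pair fits. Treat it as an illustration of the idea rather than the exact code of the official script.

```python
import random


def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng):
    """Trim the pair in place until its combined length fits max_num_tokens."""
    while len(tokens_a) + len(tokens_b) > max_num_tokens:
        # Always shorten the longer of the two lists.
        trunc_tokens = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        # Drop from the front or the back at random to avoid positional bias.
        if rng.random() < 0.5:
            del trunc_tokens[0]
        else:
            trunc_tokens.pop()


rng = random.Random(12345)
a = list("abcdefgh")
b = list("xyz")
truncate_seq_pair(a, b, 6, rng)
print(len(a), len(b))  # 3 3
```

    Because only the longer list is ever shortened, the two sentences end up roughly balanced instead of one being cut away entirely.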
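
    Random masking
    This part randomly masks tokens. It is BERT's second innovation: random masking, which keeps a bidirectional model from "seeing itself" after several layers. A portion of the tokens is selected at random and predicted during pre-training. Each selected token is handled as follows:
    1. With 80% probability it is replaced by [MASK]
    2. With 10% probability the original token is kept
    3. With 10% probability it is replaced by a random word from the vocabulary
    Return value: (tokens after masking, masked positions, original tokens at those positions)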
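
    The script implements the 80/10/10 split with two nested random draws: the first draw decides [MASK] versus "something else", and the second splits the remaining 20% in half. The simulation below checks that this yields the stated proportions. The helper choose_replacement and the toy vocabulary are assumptions for the example, and rng.choice is used in place of the script's randint indexing.

```python
import collections
import random


def choose_replacement(token, vocab_words, rng):
    """Return the token that replaces a position selected for masking."""
    if rng.random() < 0.8:
        return "[MASK]"  # 80%: mask token
    if rng.random() < 0.5:
        return token  # 0.2 * 0.5 = 10%: keep the original token
    return rng.choice(vocab_words)  # 0.2 * 0.5 = 10%: random vocabulary word


rng = random.Random(0)
vocab_words = ["apple", "banana", "cherry"]  # toy vocabulary (assumption)
trials = 100000
counts = collections.Counter()
for _ in range(trials):
    out = choose_replacement("dog", vocab_words, rng)
    if out == "[MASK]":
        counts["mask"] += 1
    elif out == "dog":
        counts["keep"] += 1
    else:
        counts["random"] += 1
fractions = {name: count / trials for name, count in counts.items()}
print(fractions)
```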

    def create_masked_lm_predictions(tokens, masked_lm_prob,
                                     max_predictions_per_seq, vocab_words, rng):
      """Creates the predictions for the masked LM objective."""

      cand_indexes = []
      for (i, token) in enumerate(tokens):
        if token == "[CLS]" or token == "[SEP]":
          continue
        cand_indexes.append(i)

      rng.shuffle(cand_indexes)  # shuffle the candidate positions

      output_tokens = list(tokens)

      masked_lm = collections.namedtuple("masked_lm", ["index", "label"])  # named tuple with two fields: index and label

      num_to_predict = min(max_predictions_per_seq,
                           max(1, int(round(len(tokens) * masked_lm_prob))))  # number of positions to mask: at least 1, at most max_predictions_per_seq

      masked_lms = []
      covered_indexes = set()
      for index in cand_indexes:
        if len(masked_lms) >= num_to_predict:
          break
        if index in covered_indexes:
          continue
        covered_indexes.add(index)  # index of a token selected for masking

        masked_token = None
        # 80% of the time, replace with [MASK]
        if rng.random() < 0.8:
          masked_token = "[MASK]"
        else:
          # 10% of the time, keep original
          if rng.random() < 0.5:
            masked_token = tokens[index]
          # 10% of the time, replace with random word
          else:
            masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]

        output_tokens[index] = masked_token  # replace the original token with masked_token

        masked_lms.append(masked_lm(index=index, label=tokens[index]))

      masked_lms = sorted(masked_lms, key=lambda x: x.index)

      masked_lm_positions = []
      masked_lm_labels = []
      for p in masked_lms:
        masked_lm_positions.append(p.index)  # masked position
        masked_lm_labels.append(p.label)  # label for that position (the original token)

      return (output_tokens, masked_lm_positions, masked_lm_labels)
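
    Output
    Finally, the processed data is saved as a TFRecord file. The tokens are first converted to ids, and input_mask is added to record the real sequence length. Everything short of the maximum length is then padded with 0.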
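
    The padding logic can be tried out without TensorFlow. The sketch below uses two illustrative helpers (pad_features and pad_masked_lm are not names from the script) and made-up token ids; it shows how input_mask and masked_lm_weights mark which slots are real and which are padding.

```python
def pad_features(input_ids, segment_ids, max_seq_length):
    """Pad ids to max_seq_length and build the matching input_mask."""
    assert len(input_ids) <= max_seq_length
    input_mask = [1] * len(input_ids)  # 1 marks a real token
    pad = max_seq_length - len(input_ids)
    return (list(input_ids) + [0] * pad,
            input_mask + [0] * pad,  # 0 marks padding
            list(segment_ids) + [0] * pad)


def pad_masked_lm(positions, label_ids, max_predictions_per_seq):
    """Pad masked-LM targets; weights of 0.0 mark the padded slots."""
    weights = [1.0] * len(label_ids)
    pad = max_predictions_per_seq - len(positions)
    return (list(positions) + [0] * pad,
            list(label_ids) + [0] * pad,
            weights + [0.0] * pad)


# Illustrative ids only; real ids come from the vocabulary file.
ids, mask, segs = pad_features([101, 7592, 102, 2088, 102], [0, 0, 0, 1, 1], 8)
positions, labels, weights = pad_masked_lm([1, 3], [7592, 2088], 4)
print(ids, mask, segs)
print(positions, labels, weights)
```

    The padded masked-LM slots point at position 0 with label 0, which would look like real targets; the 0.0 weights are what allows the loss to ignore them.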

    def write_instance_to_example_files(instances, tokenizer, max_seq_length,
                                        max_predictions_per_seq, output_files):
      """Create TF example files from `TrainingInstance`s."""
      writers = []
      for output_file in output_files:
        writers.append(tf.python_io.TFRecordWriter(output_file))

      writer_index = 0

      total_written = 0
      for (inst_index, instance) in enumerate(instances):
        input_ids = tokenizer.convert_tokens_to_ids(instance.tokens)  # tokens to ids
        input_mask = [1] * len(input_ids)
        segment_ids = list(instance.segment_ids)
        assert len(input_ids) <= max_seq_length

        while len(input_ids) < max_seq_length:  # pad with 0 up to the maximum length
          input_ids.append(0)
          input_mask.append(0)
          segment_ids.append(0)

        assert len(input_ids) == max_seq_length
        assert len(input_mask) == max_seq_length
        assert len(segment_ids) == max_seq_length

        masked_lm_positions = list(instance.masked_lm_positions)  # masked positions
        masked_lm_ids = tokenizer.convert_tokens_to_ids(instance.masked_lm_labels)  # masked labels to ids
        masked_lm_weights = [1.0] * len(masked_lm_ids)  # weight 1.0 for real masked positions; padded slots get 0.0 and are excluded from the loss

        while len(masked_lm_positions) < max_predictions_per_seq:  # pad with 0
          masked_lm_positions.append(0)
          masked_lm_ids.append(0)
          masked_lm_weights.append(0.0)

        next_sentence_label = 1 if instance.is_random_next else 0

        features = collections.OrderedDict()
        features["input_ids"] = create_int_feature(input_ids)
        features["input_mask"] = create_int_feature(input_mask)
        features["segment_ids"] = create_int_feature(segment_ids)
        features["masked_lm_positions"] = create_int_feature(masked_lm_positions)
        features["masked_lm_ids"] = create_int_feature(masked_lm_ids)
        features["masked_lm_weights"] = create_float_feature(masked_lm_weights)
        features["next_sentence_labels"] = create_int_feature([next_sentence_label])

        tf_example = tf.train.Example(features=tf.train.Features(feature=features))  # build the training example

        writers[writer_index].write(tf_example.SerializeToString())  # write it to a file
        writer_index = (writer_index + 1) % len(writers)

        total_written += 1

        if inst_index < 20:  # log the first 20 training examples
          tf.logging.info("*** Example ***")
          tf.logging.info("tokens: %s" % " ".join(
              [tokenization.printable_text(x) for x in instance.tokens]))

          for feature_name in features.keys():
            feature = features[feature_name]
            values = []
            if feature.int64_list.value:
              values = feature.int64_list.value
            elif feature.float_list.value:
              values = feature.float_list.value
            tf.logging.info(
                "%s: %s" % (feature_name, " ".join([str(x) for x in values])))

      for writer in writers:
        writer.close()

      tf.logging.info("Wrote %d total instances", total_written)
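
    When several output files are given, the line writer_index = (writer_index + 1) % len(writers) spreads the examples across them in round-robin order. A small sketch with plain lists standing in for the TFRecord writers (shard_round_robin is an illustrative name, not part of the script):

```python
def shard_round_robin(items, num_shards):
    """Distribute items over num_shards lists in round-robin order."""
    shards = [[] for _ in range(num_shards)]
    shard_index = 0
    for item in items:
        shards[shard_index].append(item)
        # Advance to the next shard, wrapping around at the end.
        shard_index = (shard_index + 1) % num_shards
    return shards


print(shard_round_robin(list(range(7)), 3))  # [[0, 3, 6], [1, 4], [2, 5]]
```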
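
    A look at the results
    The final log output looks like this:

    That is all for Google's processing of the training data. If you spot any mistakes, corrections are welcome, and if you have questions, feel free to ask so we can discuss them. The walkthrough of the model code will follow in the next post.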
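    ---------------------
    Author: 保持一份率性
    Source: CSDN
    Original: https://blog.csdn.net/weixin_39470744/article/details/84373933
    Copyright notice: This is an original article by the author; please include a link to the post when reposting.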

  • Original address: https://www.cnblogs.com/jfdwd/p/11264942.html