  • BERT-Based Named Entity Recognition (Part 1): Data Processing

    To fine-tune the official TensorFlow implementation of BERT for our own named entity recognition task, we first have to get the data into the format BERT expects. The data-processing code lives mainly in run_classifier.py, for example:

    class MnliProcessor(DataProcessor):
      """Processor for the MultiNLI data set (GLUE version)."""
    
      def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
    
      def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev_matched.tsv")),
            "dev_matched")
    
      def get_test_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "test_matched.tsv")), "test")
    
      def get_labels(self):
        """See base class."""
        return ["contradiction", "entailment", "neutral"]
    
      def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
          if i == 0:
            continue
          guid = "%s-%s" % (set_type, tokenization.convert_to_unicode(line[0]))
          text_a = tokenization.convert_to_unicode(line[8])
          text_b = tokenization.convert_to_unicode(line[9])
          if set_type == "test":
            label = "contradiction"
          else:
            label = tokenization.convert_to_unicode(line[-1])
          examples.append(
              InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples

    Then, in main(), the processors are registered like this:

      processors = {
          "cola": ColaProcessor,
          "mnli": MnliProcessor,
          "mrpc": MrpcProcessor,
          "xnli": XnliProcessor,
      }

    Now suppose we have the following data.

    Part of each txt file looks like this:

    美 B-LOC
    国 I-LOC
    的 O
    华 B-PER
    莱 I-PER
    士 I-PER
    , O
    我 O
    和 O
    他 O
    谈 O
    笑 O
    风 O
    生 O
    。 O

    Next we need to convert this data into the corresponding format.

    The _read_data(cls, input_file) method of the DataProcessor class turns the contents of a txt file into the following format:

    [['B-LOC I-LOC O B-PER I-PER I-PER O O O O O O O O O', '美 国 的 华 莱 士 , 我 和 他 谈 笑 风 生 。'], ['O B-PER I-PER O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O', '看 包 公 断 案 的 戏 , 看 他 威 风 凛 凛 坐 公 堂 拍 桌 子 动 刑 具 , 多 少 还 有 一 点 担 心 , 总 怕 靠 这 一 套 办 法 弄 出 错 案 来 , 放 过 了 真 正 的 坏 人 ;'], ......]
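
    The reader itself is not shown here, so the following is only a minimal sketch of what such a reader could look like (a hypothetical helper, not necessarily the repository's exact code): it reads one "character label" pair per line, treats blank lines as sentence boundaries, and returns a list of [labels_string, chars_string] pairs.

    import codecs

    def read_bio_file(input_file):
        """Hypothetical sketch of _read_data: one 'char label' pair per line,
        blank lines separate sentences."""
        lines = []
        chars, labels = [], []
        with codecs.open(input_file, "r", encoding="utf-8") as reader:
            for line in reader:
                line = line.strip()
                if line:
                    parts = line.split()
                    chars.append(parts[0])
                    labels.append(parts[-1])
                elif chars:
                    # sentence boundary: emit [label string, character string]
                    lines.append([" ".join(labels), " ".join(chars)])
                    chars, labels = [], []
        if chars:  # last sentence if the file does not end with a blank line
            lines.append([" ".join(labels), " ".join(chars)])
        return lines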

    Now we can define our own data-processing class:

    class NerProcessor(DataProcessor):
        def get_train_examples(self, data_dir):
            return self._create_example(
                self._read_data(os.path.join(data_dir, "train.txt")), "train"
            )
    
        def get_dev_examples(self, data_dir):
            return self._create_example(
                self._read_data(os.path.join(data_dir, "dev.txt")), "dev"
            )
    
        def get_test_examples(self,data_dir):
            return self._create_example(
                self._read_data(os.path.join(data_dir, "test.txt")), "test")
    
    
        def get_labels(self):
            # prevent potential bug for chinese text mixed with english text
            # return ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "[CLS]","[SEP]"]
            return ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "X","[CLS]","[SEP]"]
    
        def _create_example(self, lines, set_type):
            examples = []
            for (i, line) in enumerate(lines):
                guid = "%s-%s" % (set_type, i)
                text = tokenization.convert_to_unicode(line[1])
                label = tokenization.convert_to_unicode(line[0])
                examples.append(InputExample(guid=guid, text=text, label=label))
            return examples

    Two things are used here: the function tokenization.convert_to_unicode() and the class InputExample. Let's look at each of them.

    tokenization.convert_to_unicode() lives in tokenization.py in the same directory. For example, with the following input:

    line = ['B-LOC I-LOC O B-PER I-PER I-PER O O O O O O O O O', '美 国 的 华 莱 士 , 我 和 他 谈 笑 风 生 。']
    import six
    def convert_to_unicode(text):
      """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
      if six.PY3:
        if isinstance(text, str):
          return text
        elif isinstance(text, bytes):
          return text.decode("utf-8", "ignore")
        else:
          raise ValueError("Unsupported string type: %s" % (type(text)))
      elif six.PY2:
        if isinstance(text, str):
          return text.decode("utf-8", "ignore")
        elif isinstance(text, unicode):
          return text
        else:
          raise ValueError("Unsupported string type: %s" % (type(text)))
      else:
        raise ValueError("Not running on Python2 or Python 3?")
    text = convert_to_unicode(line[1])
    label = convert_to_unicode(line[0])
    print(text)
    print(label)

    Output:

    美 国 的 华 莱 士 , 我 和 他 谈 笑 风 生 。
    B-LOC I-LOC O B-PER I-PER I-PER O O O O O O O O O

    The InputExample class looks like this:

    class InputExample(object):
        """A single training/test example for simple sequence classification."""
    
        def __init__(self, guid, text, label=None):
            """Constructs a InputExample.
            Args:
              guid: Unique id for the example.
              text: string. The untokenized text of the sequence. For this NER
                task it is a whitespace-separated sequence of characters.
              label: (Optional) string. The label of the example. This should be
                specified for train and dev examples, but not for test examples.
            """
            self.guid = guid
            self.text = text
            self.label = label

    self.guid assigns a unique id to every sentence and also encodes which split (train, dev or test) it belongs to.

    Now let's go back to main() and follow only the parts related to data processing.

    Build the following dictionary:

      processors = {
            "ner": NerProcessor
        }

    Get the label list:

    label_list = processor.get_labels()

    Map the characters in the vocabulary to ids:

    tokenizer = tokenization.FullTokenizer(
            vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)

    This uses tokenization.FullTokenizer(); let's see what it is:

    class FullTokenizer(object):
      """Runs end-to-end tokenziation."""
    
      def __init__(self, vocab_file, do_lower_case=True):
        self.vocab = load_vocab(vocab_file)
        self.inv_vocab = {v: k for k, v in self.vocab.items()}
        self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
    
      def tokenize(self, text):
        split_tokens = []
        for token in self.basic_tokenizer.tokenize(text):
          for sub_token in self.wordpiece_tokenizer.tokenize(token):
            split_tokens.append(sub_token)
    
        return split_tokens
    
      def convert_tokens_to_ids(self, tokens):
        return convert_by_vocab(self.vocab, tokens)
    
      def convert_ids_to_tokens(self, ids):
        return convert_by_vocab(self.inv_vocab, ids)
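
    A quick usage sketch (the vocab path is a placeholder for the Chinese BERT checkpoint's vocab.txt; the exact sub-word split of English words depends on that vocabulary):

    import tokenization

    tokenizer = tokenization.FullTokenizer(
        vocab_file="chinese_L-12_H-768_A-12/vocab.txt", do_lower_case=True)

    tokens = tokenizer.tokenize("美国的Wallace")
    # Chinese characters come out one per token; English is lowercased and may be
    # split into WordPieces, e.g. something like ['美', '国', '的', 'wall', '##ace']
    ids = tokenizer.convert_tokens_to_ids(tokens)   # list of vocabulary indices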

    It relies on a few helper functions and classes:

    def load_vocab(vocab_file):
      """Loads a vocabulary file into a dictionary."""
      vocab = collections.OrderedDict()
      index = 0
      with tf.gfile.GFile(vocab_file, "r") as reader:
        while True:
          token = convert_to_unicode(reader.readline())
          if not token:
            break
          token = token.strip()
          vocab[token] = index
          index += 1
      return vocab
    
    
    def convert_by_vocab(vocab, items):
      """Converts a sequence of [tokens|ids] using the vocab."""
      output = []
      for item in items:
        output.append(vocab[item])
      return output

    load_vocab() maps every token to an id, for example:

    OrderedDict([('[PAD]', 0), ('[unused1]', 1), ('[unused2]', 2), ('[unused3]', 3), ('[unused4]', 4), ......
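
    As a quick illustration, convert_by_vocab works in both directions depending on which dictionary you pass in (the vocab path is a placeholder; the ids follow the vocabulary fragment above):

    vocab = load_vocab("vocab.txt")                  # OrderedDict([('[PAD]', 0), ('[unused1]', 1), ...])
    inv_vocab = {v: k for k, v in vocab.items()}
    convert_by_vocab(vocab, ["[PAD]", "[unused1]"])  # -> [0, 1]
    convert_by_vocab(inv_vocab, [0, 1])              # -> ['[PAD]', '[unused1]']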

    Next come the two classes BasicTokenizer and WordpieceTokenizer:

    class BasicTokenizer(object):
      """Runs basic tokenization (punctuation splitting, lower casing, etc.)."""
    
      def __init__(self, do_lower_case=True):
        """Constructs a BasicTokenizer.
        Args:
          do_lower_case: Whether to lower case the input.
        """
        self.do_lower_case = do_lower_case
    
      def tokenize(self, text):
        """Tokenizes a piece of text."""
        text = convert_to_unicode(text)
        text = self._clean_text(text)
    
        # This was added on November 1st, 2018 for the multilingual and Chinese
        # models. This is also applied to the English models now, but it doesn't
        # matter since the English models were not trained on any Chinese data
        # and generally don't have any Chinese data in them (there are Chinese
        # characters in the vocabulary because Wikipedia does have some Chinese
        # words in the English Wikipedia.).
        text = self._tokenize_chinese_chars(text)
    
        orig_tokens = whitespace_tokenize(text)
        split_tokens = []
        for token in orig_tokens:
          if self.do_lower_case:
            token = token.lower()
            token = self._run_strip_accents(token)
          split_tokens.extend(self._run_split_on_punc(token))
    
        output_tokens = whitespace_tokenize(" ".join(split_tokens))
        return output_tokens
    
      def _run_strip_accents(self, text):
        """Strips accents from a piece of text."""
        text = unicodedata.normalize("NFD", text)
        output = []
        for char in text:
          cat = unicodedata.category(char)
          if cat == "Mn":
            continue
          output.append(char)
        return "".join(output)
    
      def _run_split_on_punc(self, text):
        """Splits punctuation on a piece of text."""
        chars = list(text)
        i = 0
        start_new_word = True
        output = []
        while i < len(chars):
          char = chars[i]
          if _is_punctuation(char):
            output.append([char])
            start_new_word = True
          else:
            if start_new_word:
              output.append([])
            start_new_word = False
            output[-1].append(char)
          i += 1
    
        return ["".join(x) for x in output]
    
      def _tokenize_chinese_chars(self, text):
        """Adds whitespace around any CJK character."""
        output = []
        for char in text:
          cp = ord(char)
          if self._is_chinese_char(cp):
            output.append(" ")
            output.append(char)
            output.append(" ")
          else:
            output.append(char)
        return "".join(output)
    
      def _is_chinese_char(self, cp):
        """Checks whether CP is the codepoint of a CJK character."""
        # This defines a "chinese character" as anything in the CJK Unicode block:
        #   https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
        #
        # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
        # despite its name. The modern Korean Hangul alphabet is a different block,
        # as is Japanese Hiragana and Katakana. Those alphabets are used to write
        # space-separated words, so they are not treated specially and handled
        # like the all of the other languages.
        if ((cp >= 0x4E00 and cp <= 0x9FFF) or  #
            (cp >= 0x3400 and cp <= 0x4DBF) or  #
            (cp >= 0x20000 and cp <= 0x2A6DF) or  #
            (cp >= 0x2A700 and cp <= 0x2B73F) or  #
            (cp >= 0x2B740 and cp <= 0x2B81F) or  #
            (cp >= 0x2B820 and cp <= 0x2CEAF) or
            (cp >= 0xF900 and cp <= 0xFAFF) or  #
            (cp >= 0x2F800 and cp <= 0x2FA1F)):  #
          return True
    
        return False
    
      def _clean_text(self, text):
        """Performs invalid character removal and whitespace cleanup on text."""
        output = []
        for char in text:
          cp = ord(char)
          if cp == 0 or cp == 0xfffd or _is_control(char):
            continue
          if _is_whitespace(char):
            output.append(" ")
          else:
            output.append(char)
        return "".join(output)
    
    
    class WordpieceTokenizer(object):
      """Runs WordPiece tokenziation."""
    
      def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
        self.vocab = vocab
        self.unk_token = unk_token
        self.max_input_chars_per_word = max_input_chars_per_word
    
      def tokenize(self, text):
        """Tokenizes a piece of text into its word pieces.
        This uses a greedy longest-match-first algorithm to perform tokenization
        using the given vocabulary.
        For example:
          input = "unaffable"
          output = ["un", "##aff", "##able"]
        Args:
          text: A single token or whitespace separated tokens. This should have
            already been passed through `BasicTokenizer.
        Returns:
          A list of wordpiece tokens.
        """
    
        text = convert_to_unicode(text)
    
        output_tokens = []
        for token in whitespace_tokenize(text):
          chars = list(token)
          if len(chars) > self.max_input_chars_per_word:
            output_tokens.append(self.unk_token)
            continue
    
          is_bad = False
          start = 0
          sub_tokens = []
          while start < len(chars):
            end = len(chars)
            cur_substr = None
            while start < end:
              substr = "".join(chars[start:end])
              if start > 0:
                substr = "##" + substr
              if substr in self.vocab:
                cur_substr = substr
                break
              end -= 1
            if cur_substr is None:
              is_bad = True
              break
            sub_tokens.append(cur_substr)
            start = end
    
          if is_bad:
            output_tokens.append(self.unk_token)
          else:
            output_tokens.extend(sub_tokens)
        return output_tokens
    
    
    def _is_whitespace(char):
      """Checks whether `chars` is a whitespace character."""
      # \t, \n, and \r are technically control characters but we treat them
      # as whitespace since they are generally considered as such.
      if char == " " or char == "\t" or char == "\n" or char == "\r":
        return True
      cat = unicodedata.category(char)
      if cat == "Zs":
        return True
      return False
    
    
    def _is_control(char):
      """Checks whether `chars` is a control character."""
      # These are technically control characters but we count them as whitespace
      # characters.
      if char == "	" or char == "
    " or char == "
    ":
        return False
      cat = unicodedata.category(char)
      if cat in ("Cc", "Cf"):
        return True
      return False

    Both classes do their work in a tokenize() method.

    Let's first look at BasicTokenizer's tokenize() method:

      def tokenize(self, text):
        """Tokenizes a piece of text."""
        text = convert_to_unicode(text)
        text = self._clean_text(text)
    
        # This was added on November 1st, 2018 for the multilingual and Chinese
        # models. This is also applied to the English models now, but it doesn't
        # matter since the English models were not trained on any Chinese data
        # and generally don't have any Chinese data in them (there are Chinese
        # characters in the vocabulary because Wikipedia does have some Chinese
        # words in the English Wikipedia.).
        text = self._tokenize_chinese_chars(text)
    
        orig_tokens = whitespace_tokenize(text)
        split_tokens = []
        for token in orig_tokens:
          if self.do_lower_case:
            token = token.lower()
            token = self._run_strip_accents(token)
          split_tokens.extend(self._run_split_on_punc(token))
    
        output_tokens = whitespace_tokenize(" ".join(split_tokens))
        return output_tokens
    • convert_to_unicode(text): converts the characters in text to Unicode.
    • self._clean_text(text): removes invalid and meaningless characters and normalizes whitespace.
    • self._tokenize_chinese_chars(text): handles Chinese. The "Chinese word segmentation" here is trivial: a space is added before and after every Chinese character, so the later whitespace split treats each character as its own token. The key is the _is_chinese_char function, which checks whether a Unicode code point is a CJK character.
    • whitespace_tokenize(text): splits the text into a list of tokens on whitespace.
    • For every token, lowercase it (this only matters for English) and call self._run_strip_accents(token), which removes accents.
    • self._run_split_on_punc(token): splits the token on punctuation. Internally it builds a list of character groups, e.g. for the input he's it builds [[h, e], ['], [s]], and then joins each group, so the result is ['he', "'", 's']. It uses _is_punctuation to decide whether a character is punctuation. A small end-to-end example is sketched below.
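
    A small end-to-end sketch of BasicTokenizer (no vocabulary is involved at this stage, so the output follows directly from the rules above):

    basic = BasicTokenizer(do_lower_case=True)
    print(basic.tokenize("美国的Wallace,他说:Hello!"))
    # roughly: ['美', '国', '的', 'wallace', ',', '他', '说', ':', 'hello', '!']
    # Chinese characters are split one per token, English is lowercased, and
    # punctuation (including full-width punctuation) becomes its own token.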

    Next, WordpieceTokenizer's tokenize():

    WordpieceTokenizer splits words further into finer-grained WordPieces. WordPiece (closely related to Byte Pair Encoding) is a way of handling out-of-vocabulary words; ignoring the details, you can simply think of WordPieces as sub-word units smaller than words. For Chinese, WordpieceTokenizer effectively does nothing, because the previous step has already split the text into single characters.
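
    For English words, the greedy longest-match-first split works as in the docstring example (assuming 'un', '##aff' and '##able' are in the vocabulary; the vocab path below is a placeholder):

    wp = WordpieceTokenizer(vocab=load_vocab("vocab.txt"))
    wp.tokenize("unaffable")   # -> ['un', '##aff', '##able'] if those pieces are in the vocab
    wp.tokenize("美")          # -> ['美']; single Chinese characters pass through unchanged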

    Continuing with main(), the next step is:

       if FLAGS.do_train:
            train_examples = processor.get_train_examples(FLAGS.data_dir)
            num_train_steps = int(
                len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs)
            num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion)
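
    For example, with 20,000 training examples, train_batch_size=32 and num_train_epochs=3 (hypothetical values), num_train_steps = int(20000 / 32 * 3) = 1875; with warmup_proportion=0.1, num_warmup_steps = int(1875 * 0.1) = 187.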

    Then:

            filed_based_convert_examples_to_features(
                train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file)

    This filed_based_convert_examples_to_features() function is the key to producing the final data:

    def filed_based_convert_examples_to_features(
            examples, label_list, max_seq_length, tokenizer, output_file,mode=None
    ):
        label_map = {}
        for (i, label) in enumerate(label_list,1):
            label_map[label] = i
        with open('./output/label2id.pkl','wb') as w:
            pickle.dump(label_map,w)
    
        writer = tf.python_io.TFRecordWriter(output_file)
        for (ex_index, example) in enumerate(examples):
            if ex_index % 5000 == 0:
                tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
            feature = convert_single_example(ex_index, example, label_map, max_seq_length, tokenizer,mode)
            
            def create_int_feature(values):
                f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
                return f
    
            features = collections.OrderedDict()
            features["input_ids"] = create_int_feature(feature.input_ids)
            features["input_mask"] = create_int_feature(feature.input_mask)
            features["segment_ids"] = create_int_feature(feature.segment_ids)
            features["label_ids"] = create_int_feature(feature.label_ids)
            #features["label_mask"] = create_int_feature(feature.label_mask)
            tf_example = tf.train.Example(features=tf.train.Features(feature=features))
            writer.write(tf_example.SerializeToString())
        # close the TFRecord writer so all records are flushed to disk
        writer.close()

    The filed_based_convert_examples_to_features function iterates over every example (an InputExample object) and uses convert_single_example to turn each one into an InputFeatures object. InputFeatures simply stores the features: input_ids, input_mask, segment_ids and label_ids. For this NER task all four are lists of ints of length max_seq_length (unlike the original classification task, where label_id is a single int), so each one is turned into a tf.train.Feature with create_int_feature; a tf.train.Example is then built from them and written to the TFRecord file, which the Estimator's input_fn will read later.
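
    Since label_map is pickled to ./output/label2id.pkl in the code above, it can be loaded back later (for example at prediction time) to map predicted ids back to label strings; a minimal sketch:

    import pickle

    with open('./output/label2id.pkl', 'rb') as f:
        label2id = pickle.load(f)
    id2label = {v: k for k, v in label2id.items()}   # e.g. {1: 'O', 2: 'B-PER', ...}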

    The most important function here is convert_single_example; once you understand it, you really understand how BERT represents its input as ids, so please read the code and the comments carefully.

    def convert_single_example(ex_index, example, label_map, max_seq_length, tokenizer,mode):
        textlist = example.text.split(' ')
        labellist = example.label.split(' ')
        tokens = []
        labels = []
        # print(textlist)
        for i, word in enumerate(textlist):
            token = tokenizer.tokenize(word)
            # print(token)
            tokens.extend(token)
            label_1 = labellist[i]
            # print(label_1)
            for m in range(len(token)):
                if m == 0:
                    labels.append(label_1)
                else:
                    labels.append("X")
            # print(tokens, labels)
        # tokens = tokenizer.tokenize(example.text)
        if len(tokens) >= max_seq_length - 1:
            tokens = tokens[0:(max_seq_length - 2)]
            labels = labels[0:(max_seq_length - 2)]
        ntokens = []
        segment_ids = []
        label_ids = []
        ntokens.append("[CLS]")
        segment_ids.append(0)
        # append("O") or append("[CLS]") not sure!
        label_ids.append(label_map["[CLS]"])
        for i, token in enumerate(tokens):
            ntokens.append(token)
            segment_ids.append(0)
            label_ids.append(label_map[labels[i]])
        ntokens.append("[SEP]")
        segment_ids.append(0)
        # append("O") or append("[SEP]") not sure!
        label_ids.append(label_map["[SEP]"])
        input_ids = tokenizer.convert_tokens_to_ids(ntokens)
        input_mask = [1] * len(input_ids)
        #label_mask = [1] * len(input_ids)
        while len(input_ids) < max_seq_length:
            input_ids.append(0)
            input_mask.append(0)
            segment_ids.append(0)
            # we don't concerned about it!
            label_ids.append(0)
            ntokens.append("**NULL**")
            #label_mask.append(0)
        # print(len(input_ids))
        assert len(input_ids) == max_seq_length
        assert len(input_mask) == max_seq_length
        assert len(segment_ids) == max_seq_length
        assert len(label_ids) == max_seq_length
        #assert len(label_mask) == max_seq_length
    
        if ex_index < 5:
            tf.logging.info("*** Example ***")
            tf.logging.info("guid: %s" % (example.guid))
            tf.logging.info("tokens: %s" % " ".join(
                [tokenization.printable_text(x) for x in tokens]))
            tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
            tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
            tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
            tf.logging.info("label_ids: %s" % " ".join([str(x) for x in label_ids]))
            #tf.logging.info("label_mask: %s" % " ".join([str(x) for x in label_mask]))
    
        feature = InputFeatures(
            input_ids=input_ids,
            input_mask=input_mask,
            segment_ids=segment_ids,
            label_ids=label_ids,
            #label_mask = label_mask
        )
        write_tokens(ntokens,mode)
        return feature
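
    As a small worked example (using the label ids produced by enumerate(label_list, 1) and a hypothetical max_seq_length=8), the sentence "美 国 的" with labels "B-LOC I-LOC O" would come out as:

    ntokens     = ['[CLS]', '美', '国', '的', '[SEP]', '**NULL**', '**NULL**', '**NULL**']
    segment_ids = [0, 0, 0, 0, 0, 0, 0, 0]
    input_mask  = [1, 1, 1, 1, 1, 0, 0, 0]
    label_ids   = [9, 6, 7, 1, 10, 0, 0, 0]
    # input_ids are the vocab indices of '[CLS] 美 国 的 [SEP]' followed by zero
    # padding; the exact values depend on vocab.txt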

    The resulting data looks like this:

    INFO:tensorflow:*** Example ***
    INFO:tensorflow:guid: train-0
    INFO:tensorflow:tokens: 当 希 望 工 程 救 助 的 百 万 儿 童 成 长 起 来 , 科 教 兴 国 蔚 然 成 风 时 , 今 天 有 收 藏 价 值 的 书 你 没 买 , 明 日 就 叫 你 悔 不 当 初 !
    INFO:tensorflow:input_ids: 101 2496 2361 3307 2339 4923 3131 1221 4638 4636 674 1036 4997 2768 7270 6629 3341 8024 4906 3136 1069 1744 5917 4197 2768 7599 3198 8024 791 1921 3300 3119 5966 817 966 4638 741 872 3766 743 8024 3209 3189 2218 1373 872 2637 679 2496 1159 8013 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    INFO:tensorflow:label_ids: 9 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

    Note: the label list is ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'X', '[CLS]', '[SEP]'].

    • tokens: the result of tokenization
    • input_ids: each token converted to its vocabulary id
    • input_mask: 1 for real tokens; when the sentence is shorter than the maximum length, the remaining positions are padded with 0
    • segment_ids: 0 for the first sentence and 1 for the second; since NER uses a single sentence, they are all 0
    • label_ids: the id of each token's label, with [CLS] prepended and [SEP] appended to every sentence. Note that these ids start from 1 (so 1 means O), because 0 is reserved for the padded positions. The resulting mapping is sketched below.
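
    Since label_map is built with enumerate(label_list, 1), the mapping for the label list above is (0 is left for padding):

    label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "X", "[CLS]", "[SEP]"]
    label_map = {label: i for i, label in enumerate(label_list, 1)}
    # {'O': 1, 'B-PER': 2, 'I-PER': 3, 'B-ORG': 4, 'I-ORG': 5,
    #  'B-LOC': 6, 'I-LOC': 7, 'X': 8, '[CLS]': 9, '[SEP]': 10}
    # which is why label_ids in the log above starts with 9 ([CLS]) and the last
    # non-zero id is 10 ([SEP]).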

    Finally, everything is wrapped into an InputFeatures object:

    class InputFeatures(object):
        """A single set of features of data."""
    
        def __init__(self, input_ids, input_mask, segment_ids, label_ids,):
            self.input_ids = input_ids
            self.input_mask = input_mask
            self.segment_ids = segment_ids
            self.label_ids = label_ids
            #self.label_mask = label_mask
    feature = InputFeatures(
            input_ids=input_ids,
            input_mask=input_mask,
            segment_ids=segment_ids,
            label_ids=label_ids,
            #label_mask = label_mask
        )

    The TFRecord file is then consumed like this:

    def file_based_input_fn_builder(input_file, seq_length, is_training,
                                    drop_remainder):
      """Creates an `input_fn` closure to be passed to TPUEstimator."""
    
      name_to_features = {
          "input_ids": tf.FixedLenFeature([seq_length], tf.int64),
          "input_mask": tf.FixedLenFeature([seq_length], tf.int64),
          "segment_ids": tf.FixedLenFeature([seq_length], tf.int64),
          "label_ids": tf.FixedLenFeature([], tf.int64),
          "is_real_example": tf.FixedLenFeature([], tf.int64),
      }
    
      def _decode_record(record, name_to_features):
        """Decodes a record to a TensorFlow example."""
        example = tf.parse_single_example(record, name_to_features)
    
        # tf.Example only supports tf.int64, but the TPU only supports tf.int32.
        # So cast all int64 to int32.
        for name in list(example.keys()):
          t = example[name]
          if t.dtype == tf.int64:
            t = tf.to_int32(t)
          example[name] = t
    
        return example
    
      def input_fn(params):
        """The actual input function."""
        batch_size = params["batch_size"]
    
        # For training, we want a lot of parallel reading and shuffling.
        # For eval, we want no shuffling and parallel reading doesn't matter.
        d = tf.data.TFRecordDataset(input_file)
        if is_training:
          d = d.repeat()
          d = d.shuffle(buffer_size=100)
    
        d = d.apply(
            tf.contrib.data.map_and_batch(
                lambda record: _decode_record(record, name_to_features),
                batch_size=batch_size,
                drop_remainder=drop_remainder))
    
        return d
    
      return input_fn

    This function returns another function, input_fn. input_fn first builds a TFRecordDataset from the file and, if we are training, shuffles and repeats it. It then uses apply with map_and_batch to run _decode_record over every TFRecord; _decode_record parses each serialized record back into a dict of Tensors (input_ids, input_mask, segment_ids and label_ids) that will be used for training.
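
    In main(), the returned input_fn is then handed to the Estimator roughly like this (mirroring run_classifier.py; the construction of the estimator itself is omitted here):

    train_input_fn = file_based_input_fn_builder(
        input_file=train_file,
        seq_length=FLAGS.max_seq_length,
        is_training=True,
        drop_remainder=True)
    estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)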

    Reference:

    https://blog.csdn.net/jiaowoshouzi/article/details/89388794
