zoukankan      html  css  js  c++  java
  • 关于bertTokenizer

    具体实例

    from transformers import BertTokenizer
    import os
    
    tokens = ['我','爱','北','京','天','安','门']
    
    tokenizer = BertTokenizer(os.path.join('/content/drive/MyDrive/simpleNLP/model_hub/bert-base-case','vocab.txt'))
    encode_dict = tokenizer.encode_plus(text=tokens,
                      max_length=256,
                      pad_to_max_length=True,
                      is_pretokenized=True,
                      return_token_type_ids=True,
                      return_attention_mask=True)
    tokens = ['[CLS]'] + tokens + ['[SEP]']
    print(' '.join(tokens))
    print(encode_dict['input_ids'])
    

    结果:

    Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
    [CLS] 我 爱 北 京 天 安 门 [SEP]
    [101, 100, 100, 993, 984, 1010, 1016, 100, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    /usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py:2079: FutureWarning: The `pad_to_max_length` argument is deprecated and will be removed in a future version, use `padding=True` or `padding='longest'` to pad to the longest sequence in the batch, or use `padding='max_length'` to pad to a max length. In this case, you can give a specific length with `max_length` (e.g. `max_length=45`) or leave max_length to None to pad to the maximal input size of the model (e.g. 512 for Bert).
      FutureWarning,
    
    from transformers import BertTokenizer
    import os
    
    tokens = ['我','爱','北','京','天','安','门']
    
    tokenizer = BertTokenizer(os.path.join('/content/drive/MyDrive/simpleNLP/model_hub/bert-base-case','vocab.txt'))
    tokens_a = '我 爱 北 京 天 安 门'.split(' ')
    tokens_b = '我 爱 打 英 雄 联 盟 啊 啊'.split(' ')
    
    encode_dict = tokenizer.encode_plus(text=tokens_a,
                      text_pair=tokens_b,
                      max_length=20,
                      pad_to_max_length=True,
                      truncation_strategy='only_second',
                      is_pretokenized=True,
                      return_token_type_ids=True,
                      return_attention_mask=True)
    tokens = " ".join(['[CLS]'] + tokens_a + ['[SEP]'] + tokens_b + ['[SEP]'])
    token_ids = encode_dict['input_ids']
    attention_masks = encode_dict['attention_mask']
    token_type_ids = encode_dict['token_type_ids']
    
    print(tokens)
    print(token_ids)
    print(attention_masks)
    print(token_type_ids)
    

    结果:

    Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
    [CLS] 我 爱 北 京 天 安 门 [SEP] 我 爱 打 英 雄 联 盟 啊 啊 [SEP]
    [101, 100, 100, 993, 984, 1010, 1016, 100, 102, 100, 100, 100, 100, 100, 100, 100, 100, 100, 102, 0]
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
    /usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py:2079: FutureWarning: The `pad_to_max_length` argument is deprecated and will be removed in a future version, use `padding=True` or `padding='longest'` to pad to the longest sequence in the batch, or use `padding='max_length'` to pad to a max length. In this case, you can give a specific length with `max_length` (e.g. `max_length=45`) or leave max_length to None to pad to the maximal input size of the model (e.g. 512 for Bert).
      FutureWarning,
    
  • 相关阅读:
    IDEA 配置Springboot项目热部署
    一文读懂类加载机制
    面试必问的MySQL锁与事务隔离级别
    工作中遇到的99%SQL优化,这里都能给你解决方案(三)
    谁有好的oracle数据库学习书籍,麻烦提供一下,感激不尽
    静态资源上传至远程ftp服务器,ftp工具类封装
    进程和线程,并发和并行,同步和异步,高并发和多线程,理一理概念
    使用springboot集成腾讯云短信服务,解决配置文件读取乱码问题
    曾经天真的以为单例只有懒汉和饿汉两种!原来单例模式还能被破解!!!
    了解一下zookeeper,搭建单机版和集群版的环境玩玩,需要手稿的,留下邮箱
  • 原文地址:https://www.cnblogs.com/xiximayou/p/14880999.html
Copyright © 2011-2022 走看看