  • sentencepiece study notes

    Introduction

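    I have recently been reading the speechbrain speech recognition project, whose first step is to tokenize the text labels. The many parameters were confusing, so this note is a systematic summary of how to use Google's sentencepiece.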

    Reference: https://github.com/google/sentencepiece

    1. Installation

    pip install sentencepiece

    2. Supported segmentation methods
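The --model_type option described in the next section accepts unigram, bpe, char, or word. To give a feel for what a subword method such as bpe does, the sketch below learns merge rules on a toy word list in plain Python. It is a simplified illustration of byte-pair encoding, not SentencePiece's actual implementation; the function names (`merge_pair`, `learn_bpe`, `segment`), the tie-breaking rule, and the toy corpus are all assumptions made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Each distinct word starts as a tuple of single characters, with its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties go to the lexicographically smallest pair
        # (an arbitrary choice made here only to keep the example deterministic).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a word into subwords by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus: word frequencies are encoded by repetition.
words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(words, 4)
print(merges)                     # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
print(segment("lowest", merges))  # ['low', 'est']
```

The word "lowest" never appears in the toy corpus, yet it is split into two known subwords. This is the practical reason for choosing a subword model type over word.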

    3. Using the Python interface

    import sentencepiece as spm
    # Model Training
    '''
    --input: one-sentence-per-line raw corpus file. No need to run tokenizer, normalizer or preprocessor. By default, SentencePiece normalizes the input with Unicode NFKC. You can pass a comma-separated list of files.
    --model_prefix: output model name prefix. <model_name>.model and <model_name>.vocab are generated.
    --vocab_size: vocabulary size, e.g., 8000, 16000, or 32000
    --character_coverage: amount of characters covered by the model, good defaults are: 0.9995 for languages with rich character set like Japanese or Chinese and 1.0 for other languages with small character set.
    --model_type: model type. Choose from unigram (default), bpe, char, or word. The input sentence must be pretokenized when using word type.
    '''
    # Handling of special tokens
    '''
    1. By default, SentencePiece uses the Unknown (<unk>), BOS (<s>) and EOS (</s>) tokens, which have the ids 0, 1, and 2 respectively.
    2. This mapping can be redefined in the training phase, e.g. --bos_id=0 --eos_id=1 --unk_id=5
    3. Setting an id to -1, e.g. bos_id=-1, disables that special token. Note that the unknown id cannot be disabled. An id for padding (<pad>) can be defined with --pad_id=3.
    '''
    # Writes m.model and m.vocab to the current directory
    spm.SentencePieceTrainer.Train(input='botchan.txt', model_prefix='m', model_type="unigram", vocab_size=1000)
    
    # Load the trained model and use it to segment text
    sp = spm.SentencePieceProcessor(model_file='m.model')
    
    # Encode: text -> ids (out_type=int) or text -> pieces (out_type=str)
    result = sp.encode(['This is a test', 'Hello world'], out_type=int)
    print(result)
    result = sp.encode(['This is a test', 'Hello world'], out_type=str)
    print(result)
    
    # Decode: ids -> text, or pieces -> text
    result = sp.decode([285, 46, 10, 170, 382])
    print(result)
    result = sp.decode(['▁This', '▁is', '▁a', '▁t', 'est'])
    print(result)
    
    # Sampling: each call may return a different segmentation of the same text
    for _ in range(10):
        result = sp.encode('This is a test', out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)
        print(result)
    
    # Other commonly used methods
    sp.get_piece_size()
    sp.id_to_piece(2)
    sp.id_to_piece([2, 3, 4])
    sp.piece_to_id('<s>')
    sp.piece_to_id(['</s>', '\r', '▁'])
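The piece-level decode call above works because SentencePiece marks whitespace with the meta symbol "▁" (U+2581), so the original text can be recovered from the pieces alone. The plain-Python sketch below mimics that step for the default settings. The helper name `pieces_to_text` and the second segmentation are made up for illustration, and the real decoder additionally handles control symbols and unknown pieces, which this sketch ignores.

```python
WS = "\u2581"  # "▁", the meta symbol SentencePiece uses for whitespace


def pieces_to_text(pieces):
    """Join pieces and turn the whitespace marker back into ordinary spaces."""
    text = "".join(pieces).replace(WS, " ")
    # With default settings a marker is also put in front of the first word,
    # so drop the single leading space it produces.
    return text.lstrip(" ")


# Segmentation taken from the decode example above.
a = pieces_to_text(["▁This", "▁is", "▁a", "▁t", "est"])
# A different, hypothetical segmentation such as sampling might produce.
b = pieces_to_text(["▁Th", "is", "▁is", "▁a", "▁test"])

print(a)       # This is a test
print(a == b)  # True
```

This is why sampled segmentations can be used interchangeably: the pieces differ from call to call, but they decode to the same string.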
  • Original article: https://www.cnblogs.com/hypnus-ly/p/15311847.html