Tokenizers supported by torchtext

    torchtext is the text-processing toolkit that ships alongside PyTorch.


    from torchtext.data.utils import get_tokenizer
    
    tokenizer = get_tokenizer('basic_english')
    

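    The call returns a plain callable that maps a string to a list of tokens. A minimal usage sketch of the basic_english tokenizer (the sample sentence is made up):

    from torchtext.data.utils import get_tokenizer

    tokenizer = get_tokenizer('basic_english')
    # basic_english lowercases the text and splits off punctuation
    print(tokenizer("You can now install TorchText using pip!"))
    # ['you', 'can', 'now', 'install', 'torchtext', 'using', 'pip', '!']
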
    The definition of get_tokenizer can be found in /Users/xuehuiping/anaconda3/envs/my_transformer/lib/python3.7/site-packages/torchtext/data/utils.py:

    def get_tokenizer(tokenizer, language='en')
    

    The tokenizer argument can be one of the following (a usage sketch follows the table):

    tokenizer value   tokenization behavior
    None              default: plain whitespace split (str.split)
    basic_english     built-in normalize-and-split tokenizer; language must be 'en'
    spacy             spacy = spacy.load(language); uses the spaCy tokenizer
    moses             from sacremoses import MosesTokenizer
                      moses_tokenizer = MosesTokenizer()
                      return moses_tokenizer.tokenize
    toktok            from nltk.tokenize.toktok import ToktokTokenizer
                      toktok = ToktokTokenizer()
                      return toktok.tokenize
    revtok            import revtok
                      return revtok.tokenize
    subword           import revtok
                      return partial(revtok.tokenize, decap=True)
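
    The non-default back ends are not bundled with torchtext, so each must be installed separately. A minimal comparison sketch, assuming sacremoses and nltk are pip-installed (the sample sentence is made up):

    from torchtext.data.utils import get_tokenizer

    sentence = "Don't stop, it's working!"

    # built-in tokenizer: lowercase + split off punctuation, English only
    basic = get_tokenizer('basic_english')
    print('basic_english:', basic(sentence))

    # Moses tokenizer via sacremoses (pip install sacremoses)
    moses = get_tokenizer('moses')
    print('moses:', moses(sentence))

    # NLTK's ToktokTokenizer (pip install nltk)
    toktok = get_tokenizer('toktok')
    print('toktok:', toktok(sentence))

    spacy, revtok and subword work the same way: get_tokenizer imports the package named in the first column and returns its tokenize function, so the corresponding package (and, for spacy, the model named by the language argument) has to be installed first.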
Original post: https://www.cnblogs.com/xuehuiping/p/15343250.html