torchtext is the text-processing toolkit that ships with PyTorch.

Tokenizers supported by torchtext
```python
from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer('basic_english')
```
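The returned object is a plain callable that maps a string to a list of tokens. As a quick sanity check (the sample sentence is the one from torchtext's own docstring; any string works):

```python
from torchtext.data.utils import get_tokenizer

# basic_english lowercases the text and splits out punctuation
tokenizer = get_tokenizer('basic_english')
print(tokenizer("You can now install TorchText using pip!"))
# ['you', 'can', 'now', 'install', 'torchtext', 'using', 'pip', '!']
```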
The definition of get_tokenizer can be inspected in /Users/xuehuiping/anaconda3/envs/my_transformer/lib/python3.7/site-packages/torchtext/data/utils.py:

```python
def get_tokenizer(tokenizer, language='en')
```
The tokenizer argument can take the following values:
| `tokenizer` value | Tokenization behavior |
| --- | --- |
| `None` | no tokenizer applied; falls back to a plain whitespace split (`str.split`) |
| `'basic_english'` | `language` can only be `en` |
| `'spacy'` | `spacy = spacy.load(language)` |
| `'moses'` | `from sacremoses import MosesTokenizer; moses_tokenizer = MosesTokenizer(); return moses_tokenizer.tokenize` |
| `'toktok'` | `from nltk.tokenize.toktok import ToktokTokenizer; toktok = ToktokTokenizer(); return toktok.tokenize` |
| `'revtok'` | `import revtok; return revtok.tokenize` |
| `'subword'` | `import revtok; return partial(revtok.tokenize, decap=True)` |
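For comparison, a minimal sketch exercising two of these options. The `'moses'` branch assumes sacremoses is installed; `'spacy'` additionally needs a downloaded model, so it is left commented out. The sample sentence and the outputs shown in comments are illustrative:

```python
from torchtext.data.utils import get_tokenizer

sentence = "A Moses tokenizer, for example, splits off punctuation."

# None falls back to str.split: punctuation stays attached to the words
split_tok = get_tokenizer(None)
print(split_tok(sentence))
# ['A', 'Moses', 'tokenizer,', 'for', 'example,', 'splits', 'off', 'punctuation.']

# 'moses' wraps sacremoses.MosesTokenizer().tokenize (pip install sacremoses)
moses_tok = get_tokenizer('moses')
print(moses_tok(sentence))
# ['A', 'Moses', 'tokenizer', ',', 'for', 'example', ',', 'splits', 'off', 'punctuation', '.']

# 'spacy' calls spacy.load(language), so a model must be downloaded first,
# e.g. `python -m spacy download en`:
# spacy_tok = get_tokenizer('spacy', language='en')
```

As the table suggests, each non-built-in option imports its backing package (sacremoses, nltk, revtok) only when get_tokenizer is called, so the corresponding package must be installed before requesting that tokenizer.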