nltk RegexpTokenizer class: Python natural language processing


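The tokenizers covered earlier all work from predefined rules.

If we want to tokenize text according to our own rules, we can use the regular-expression tokenizer.

1. The RegexpTokenizer class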

from nltk.tokenize import RegexpTokenizer

text = " I won't just survive, Oh, you will see me thrive. Can't write my story,I'm beyond the archetype."

# By default, RegexpTokenizer treats the pattern as the tokens themselves, like re.findall().
# Use a raw string so that \w reaches the regex engine intact.
regexp_tokenizer = RegexpTokenizer(pattern=r"\w+")
# With gaps=True the pattern describes the separators instead, like re.split().
# Empty strings between adjacent separators are discarded by default.
regexp_tokenizer1 = RegexpTokenizer(r"[\s,'.]", gaps=True)
print(regexp_tokenizer.tokenize(text))
# ['I', 'won', 't', 'just', 'survive', 'Oh', 'you', 'will', 'see', 'me', 'thrive', 'Can', 't', 'write', 'my', 'story', 'I', 'm', 'beyond', 'the', 'archetype']
print(regexp_tokenizer1.tokenize(text))
# ['I', 'won', 't', 'just', 'survive', 'Oh', 'you', 'will', 'see', 'me', 'thrive', 'Can', 't', 'write', 'my', 'story', 'I', 'm', 'beyond', 'the', 'archetype']
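
To make the two modes more concrete, here is a rough sketch that reproduces them with only the standard re module. It illustrates the behavior described in the comments above and is not NLTK's actual implementation. The helper names tokenize_by_matches and tokenize_by_gaps are made up for this example, and so is the last pattern, which shows one possible custom rule that keeps contractions together.

```python
import re

text = " I won't just survive, Oh, you will see me thrive. Can't write my story,I'm beyond the archetype."


def tokenize_by_matches(pattern, s):
    # Hypothetical helper: every non-overlapping match of the pattern is a token
    # (the default, gaps=False style of tokenizing).
    return re.findall(pattern, s)


def tokenize_by_gaps(pattern, s):
    # Hypothetical helper: the pattern marks the separators (gaps=True style).
    # Adjacent separators produce empty strings, which are dropped here.
    return [tok for tok in re.split(pattern, s) if tok]


words = tokenize_by_matches(r"\w+", text)
pieces = tokenize_by_gaps(r"[\s,'.]", text)
print(words == pieces)  # True: both rules give the same tokens for this text

# An assumed custom rule: a word optionally followed by an apostrophe and more
# word characters, so "won't" and "I'm" stay in one piece.
contractions = tokenize_by_matches(r"\w+(?:'\w+)?", text)
print(contractions[:3])  # ['I', "won't", 'just']
```

The same patterns can be passed to RegexpTokenizer itself; the re version is used here only so the sketch runs without NLTK installed.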
---------------------
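Author: qq_41864652
Source: CSDN
Original post: https://blog.csdn.net/qq_41864652/article/details/81505768
Copyright notice: This is an original article by the blogger. Please include a link to the original post when reposting.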


Original URL: https://www.cnblogs.com/btschang/p/10237021.html
