An experiment comparing several NLTK tokenization approaches:
1, 2 : nltk.word_tokenize : splits contractions ("Don't" => 'Do', "n't"); sentence-delimiting "," and "." become separate tokens.
3 : TreebankWordTokenizer : splits contractions; the sentence-delimiting "," becomes a separate token, but sentence-internal periods stay attached to the preceding word ('that.', '29.') rather than being deleted — the tokenizer assumes its input is a single sentence, so only the final "." is split off. (nltk.word_tokenize first runs sent_tokenize and then applies TreebankWordTokenizer per sentence, which is why it separates every sentence-final period.)
4 : PunktWordTokenizer : raises ImportError: cannot import name 'PunktWordTokenizer' — it was removed in NLTK 3.x; PunktSentenceTokenizer remains for sentence splitting.
5 : WordPunctTokenizer : splits runs of alphanumeric characters and runs of punctuation into separate tokens ("Don't" => 'Don', "'", 't').
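Item 5's behavior is easy to reproduce without NLTK: WordPunctTokenizer is documented as a RegexpTokenizer with the pattern r"\w+|[^\w\s]+". A minimal stdlib sketch (the function name wordpunct_like is my own):

```python
import re

# WordPunctTokenizer's pattern: runs of word characters,
# or runs of non-word non-space characters (punctuation).
PATTERN = re.compile(r"\w+|[^\w\s]+")

def wordpunct_like(text):
    return PATTERN.findall(text)

print(wordpunct_like("Don't tell him."))
# -> ['Don', "'", 't', 'tell', 'him', '.']
```

This explains why the apostrophe in "Don't" becomes its own token: \w+ stops at the quote, and the quote itself matches [^\w\s]+.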
import nltk
text = "We're excited to let you know that. Harry, 18 years old, will join us on Nov. 29. Don't tell him."

text_tokenized = nltk.word_tokenize(text)
print("1: word_tokenize:", text_tokenized)
print("length: ", len(text_tokenized))

from nltk import word_tokenize
text_tokenized_2 = word_tokenize(text)
print("2: word_tokenize:", text_tokenized_2)
print("length: ", len(text_tokenized_2))

from nltk.tokenize import TreebankWordTokenizer
tokenizer3 = TreebankWordTokenizer()
text_tokenized_3 = tokenizer3.tokenize(text)
print("3: TreebankWordTokenizer", text_tokenized_3)
print("length: ", len(text_tokenized_3))

# from nltk.tokenize import PunktWordTokenizer
# tokenizer4 = PunktWordTokenizer()
# text_tokenized_4 = tokenizer4.tokenize(text)
# print("4: PunktWordTokenizer", text_tokenized_4)
# print("length: ", len(text_tokenized_4))

from nltk.tokenize import WordPunctTokenizer
tokenizer5 = WordPunctTokenizer()
text_tokenized_5 = tokenizer5.tokenize(text)
print("5: WordPunctTokenizer", text_tokenized_5)
print("length: ", len(text_tokenized_5))
Output:
1: word_tokenize: ['We', "'re", 'excited', 'to', 'let', 'you', 'know', 'that', '.', 'Harry', ',', '18', 'years', 'old', ',', 'will', 'join', 'us', 'on', 'Nov.', '29', '.', 'Do', "n't", 'tell', 'him', '.']
length:  27
2: word_tokenize: ['We', "'re", 'excited', 'to', 'let', 'you', 'know', 'that', '.', 'Harry', ',', '18', 'years', 'old', ',', 'will', 'join', 'us', 'on', 'Nov.', '29', '.', 'Do', "n't", 'tell', 'him', '.']
length:  27
3: TreebankWordTokenizer ['We', "'re", 'excited', 'to', 'let', 'you', 'know', 'that.', 'Harry', ',', '18', 'years', 'old', ',', 'will', 'join', 'us', 'on', 'Nov.', '29.', 'Do', "n't", 'tell', 'him', '.']
length:  25
5: WordPunctTokenizer ['We', "'", 're', 'excited', 'to', 'let', 'you', 'know', 'that', '.', 'Harry', ',', '18', 'years', 'old', ',', 'will', 'join', 'us', 'on', 'Nov', '.', '29', '.', 'Don', "'", 't', 'tell', 'him', '.']
length:  30
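The contraction splitting seen in outputs 1–3 ('Do' + "n't", 'We' + "'re") can be sketched with a small regex. This is a simplified assumption, not NLTK's actual rule tables, and the names CONTRACTION and split_contraction are my own:

```python
import re

# Peel a known clitic suffix ("n't", "'re", "'s", ...) off the end of a
# token, keeping a non-empty stem: "Don't" -> ["Do", "n't"].
CONTRACTION = re.compile(r"(?i)^(.*?)(n't|'re|'s|'ll|'ve|'d|'m)$")

def split_contraction(token):
    m = CONTRACTION.match(token)
    if m and m.group(1):          # require a non-empty stem
        return [m.group(1), m.group(2)]
    return [token]                # no recognized contraction suffix

print(split_contraction("Don't"))   # -> ['Do', "n't"]
print(split_contraction("We're"))   # -> ['We', "'re"]
print(split_contraction("him"))     # -> ['him']
```

The lazy (.*?) stem lets the suffix alternation claim the longest recognizable ending, which is why "Don't" yields 'Do' + "n't" rather than 'Don' + "'t" — matching what word_tokenize produces above.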