  • [Mastering Natural Language Processing with Python] Ch1

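    A quick experiment comparing three tokenization approaches on the same text (word_tokenize, TreebankWordTokenizer, and WordPunctTokenizer; a fourth, PunktWordTokenizer, could not be imported):

    1, 2: nltk.word_tokenize: splits contractions ("Don't" => 'Do', "n't"); the sentence-delimiting "," and "." each become separate tokens, while the abbreviation "Nov." keeps its period.

    3: TreebankWordTokenizer: splits contractions; "," becomes a separate token, but a "." that ends a sentence in the middle of the text stays attached to the preceding word ('that.', '29.'). Only the final "." of the text is split off.

    4: PunktWordTokenizer: fails with ImportError: cannot import name 'PunktWordTokenizer' (the class is not available in the installed NLTK version).

    5: WordPunctTokenizer: splits every run of punctuation into its own token ("Don't" => 'Don', "'", 't'), so "Nov." also becomes 'Nov', '.'.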
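
    The failure in item 4 stops the whole script unless the import is commented out. A minimal sketch of a guarded import, assuming the only goal is to let the remaining experiments run; the fallback to None and the short sample text are illustrative choices, not part of the original experiment:

```python
# Guarded import: PunktWordTokenizer is missing from the NLTK version used in this post.
try:
    from nltk.tokenize import PunktWordTokenizer
except ImportError as err:
    # Illustrative fallback: remember that the class is unavailable and carry on.
    PunktWordTokenizer = None
    print("4: PunktWordTokenizer unavailable:", err)

text = "Don't tell him."
if PunktWordTokenizer is not None:
    # Only reached on an installation that still ships the class.
    print("4: PunktWordTokenizer", PunktWordTokenizer().tokenize(text))
```

    With this guard the script reports the missing class and continues instead of aborting.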

    import nltk

    text = "We're excited to let you know that. Harry, 18 years old, will join us on Nov. 29. Don't tell him."

    # 1: word_tokenize called through the package namespace.
    # It needs the Punkt sentence tokenizer data (e.g. nltk.download('punkt')).
    text_tokenized = nltk.word_tokenize(text)
    print("1: word_tokenize:", text_tokenized)
    print("length: ", len(text_tokenized))

    # 2: the same function imported directly; the result is identical to 1.
    from nltk import word_tokenize
    text_tokenized_2 = word_tokenize(text)
    print("2: word_tokenize:", text_tokenized_2)
    print("length: ", len(text_tokenized_2))

    # 3: TreebankWordTokenizer applied to the whole text at once.
    from nltk.tokenize import TreebankWordTokenizer
    tokenizer3 = TreebankWordTokenizer()
    text_tokenized_3 = tokenizer3.tokenize(text)
    print("3: TreebankWordTokenizer", text_tokenized_3)
    print("length: ", len(text_tokenized_3))

    # 4: commented out because the import raises
    # ImportError: cannot import name 'PunktWordTokenizer'
    # from nltk.tokenize import PunktWordTokenizer
    # tokenizer4 = PunktWordTokenizer()
    # text_tokenized_4 = tokenizer4.tokenize(text)
    # print("4: PunktWordTokenizer", text_tokenized_4)
    # print("length: ", len(text_tokenized_4))

    # 5: WordPunctTokenizer splits punctuation into separate tokens.
    from nltk.tokenize import WordPunctTokenizer
    tokenizer5 = WordPunctTokenizer()
    text_tokenized_5 = tokenizer5.tokenize(text)
    print("5: WordPunctTokenizer", text_tokenized_5)
    print("length: ", len(text_tokenized_5))

    Output:

    1: word_tokenize: ['We', "'re", 'excited', 'to', 'let', 'you', 'know', 'that', '.', 'Harry', ',', '18', 'years', 'old', ',', 'will', 'join', 'us', 'on', 'Nov.', '29', '.', 'Do', "n't", 'tell', 'him', '.']
    length:  27
    2: word_tokenize: ['We', "'re", 'excited', 'to', 'let', 'you', 'know', 'that', '.', 'Harry', ',', '18', 'years', 'old', ',', 'will', 'join', 'us', 'on', 'Nov.', '29', '.', 'Do', "n't", 'tell', 'him', '.']
    length:  27
    3: TreebankWordTokenizer ['We', "'re", 'excited', 'to', 'let', 'you', 'know', 'that.', 'Harry', ',', '18', 'years', 'old', ',', 'will', 'join', 'us', 'on', 'Nov.', '29.', 'Do', "n't", 'tell', 'him', '.']
    length:  25
    5: WordPunctTokenizer ['We', "'", 're', 'excited', 'to', 'let', 'you', 'know', 'that', '.', 'Harry', ',', '18', 'years', 'old', ',', 'will', 'join', 'us', 'on', 'Nov', '.', '29', '.', 'Don', "'", 't', 'tell', 'him', '.']
    length:  30
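
    The WordPunctTokenizer result can be reproduced without NLTK. As far as I know, the class is documented as a regular-expression tokenizer using the pattern \w+|[^\w\s]+, so the sketch below applies that pattern with the standard re module. The name WORD_OR_PUNCT is made up for this illustration, and the sketch only covers this sample text rather than every input NLTK handles.

```python
import re

text = "We're excited to let you know that. Harry, 18 years old, will join us on Nov. 29. Don't tell him."

# A token is either a run of word characters, or a run of characters
# that are neither word characters nor whitespace (i.e. punctuation).
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")

tokens = WORD_OR_PUNCT.findall(text)
print(tokens)
print("length: ", len(tokens))
```

    Because the apostrophe and the period are not word characters, "Don't" falls apart into three tokens and "Nov." into two, which accounts for the 30 tokens in output 5.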

  • Original post: https://www.cnblogs.com/shiyublog/p/10130088.html