  • Natural Language Processing: Stemmers

    This post mainly introduces some of the ready-made stemmers in nltk, Porter and Lancaster.

    1. Porter

    >>> import nltk
    >>> porter=nltk.PorterStemmer()
    >>> raw='''Listen, strange women lying in ponds distributing swords is no basis
    ... for a system of government. Supreme executive power derives from a mandate from
    ... the masses, not from some farcical aquatic'''
    >>> tokens=nltk.word_tokenize(raw)
    >>> [porter.stem(t) for t in tokens]
    ['listen', ',', u'strang', 'women', u'lie', 'in', u'pond', u'distribut', u'sword', 'is', 'no', u'basi', 'for', 'a', 'system', 'of', u'govern', '.', u'suprem', u'execut', 'power', u'deriv', 'from',
    'a', u'mandat', 'from', 'the', u'mass', ',', 'not', 'from', 'some', u'farcic', u'aquat']

    2. Lancaster

    >>> lancaster=nltk.LancasterStemmer()
    >>> [lancaster.stem(t) for t in tokens]
    ['list', ',', 'strange', 'wom', 'lying', 'in', 'pond', 'distribut', 'sword', 'is', 'no', 'bas', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem', 'execut', 'pow', 'der', 'from', 'a', 'mand', 'from'
    , 'the', 'mass', ',', 'not', 'from', 'som', 'farc', 'aqu']

    3. Lemmatizer: removes affixes only if the result is a word in its dictionary. A common choice is WordNetLemmatizer

    >>> wnl=nltk.WordNetLemmatizer()
    >>> [wnl.lemmatize(t) for t in tokens]
    ['Listen', ',', 'strange', u'woman', 'lying', 'in', u'pond', 'distributing', u'sword', 'is', 'no', 'basis', 'for', 'a', 'system', 'of', 'government', '.', 'Supreme', 'executive', 'power', 'derives',
    'from', 'a', 'mandate', 'from', 'the', u'mass', ',', 'not', 'from', 'some', 'farcical', 'aquatic']

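    Comparing the three outputs on this sentence, the Porter stemmer gives the better results: for example, it reduces lying to lie, whereas Lancaster leaves lying unchanged and truncates women to wom.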
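
    To inspect the differences word by word rather than by scanning two long lists, the two stemmers can be run side by side. This is only a sketch: the function name compare_stemmers and the sample words are chosen here for illustration, and exact Lancaster stems may differ slightly between nltk versions.

```python
from nltk.stem import LancasterStemmer, PorterStemmer


def compare_stemmers(words):
    """Return (word, porter_stem, lancaster_stem) triples for side-by-side inspection."""
    porter = PorterStemmer()
    lancaster = LancasterStemmer()
    return [(w, porter.stem(w), lancaster.stem(w)) for w in words]


if __name__ == "__main__":
    # Words taken from the sample sentence used in this post.
    for word, p_stem, l_stem in compare_stemmers(["lying", "women", "power", "some"]):
        print(f"{word:10} porter={p_stem:10} lancaster={l_stem}")
```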

    4. SnowballStemmer, a stemming tool based on the Porter stemming algorithm

    >>> from nltk.stem import SnowballStemmer
    >>> stemmer=SnowballStemmer('english')
    >>> import nltk
    >>> raw='''Listen, strange women lying in ponds distributing swords is no basis
    ... for a system of government. Supreme executive power derives from a mandate from
    ... the masses, not from some farcical aquatic'''
    >>> tokens=nltk.word_tokenize(raw)
    >>> [stemmer.stem(t) for t in tokens]
    [u'listen', ',', u'strang', u'women', u'lie', 'in', u'pond', u'distribut', u'sword', 'is', 'no', u'basi', u'for', 'a', u'system', 'of', u'govern', '.', u'suprem', u'execut', u'power', u'deriv',
    u'from', 'a', u'mandat', u'from', u'the', u'mass', ',', u'not', u'from', u'some', u'farcic', u'aquat']
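
    SnowballStemmer takes the language as its first argument, and the supported names are listed in SnowballStemmer.languages. Below is a small sketch that checks the language before stemming. The wrapper stem_words is a hypothetical helper written for this example. Plain words are passed in directly so that no tokenizer data is needed.

```python
from nltk.stem import SnowballStemmer


def stem_words(words, language="english"):
    """Stem each word with the Snowball stemmer for `language`.

    Hypothetical convenience wrapper; raises ValueError for unknown languages.
    """
    if language not in SnowballStemmer.languages:
        raise ValueError(f"unsupported Snowball language: {language!r}")
    stemmer = SnowballStemmer(language)
    return [stemmer.stem(w) for w in words]


if __name__ == "__main__":
    print(SnowballStemmer.languages)
    print(stem_words(["Listen", "lying", "distributing", "swords"]))
```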
  • Original article: https://www.cnblogs.com/no-tears-girl/p/6964910.html