  • Python nltk English Detection

    Source: http://blog.alejandronolla.com/2013/05/15/detecting-text-language-with-python-and-nltk/

    >>> from nltk import wordpunct_tokenize

    >>> wordpunct_tokenize("That's thirty minutes away. I'll be there in ten.")

    ['That', "'", 's', 'thirty', 'minutes', 'away', '.', 'I', "'", 'll', 'be', 'there', 'in', 'ten', '.']

    >>> from nltk.corpus import stopwords

    >>> stopwords.fileids()

    ['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'portuguese', 'russian', 'spanish', 'swedish', 'turkish']

    >>>

    >>> stopwords.words('english')[0:10]

    ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your']

    >>> languages_ratios = {}

    >>> # text is assumed to already hold the string to classify

    >>> tokens = wordpunct_tokenize(text)

    >>> words = [word.lower() for word in tokens]

    >>> for language in stopwords.fileids():

    ...     stopwords_set = set(stopwords.words(language))

    ...     words_set = set(words)

    ...     common_elements = words_set.intersection(stopwords_set)

    ...     # language "score": number of distinct stopwords found in the text

    ...     languages_ratios[language] = len(common_elements)

    ...

    >>> languages_ratios

    {'swedish': 1, 'danish': 1, 'hungarian': 2, 'finnish': 0, 'portuguese': 0, 'german': 1, 'dutch': 1, 'french': 1, 'spanish': 0, 'norwegian': 1, 'english': 6, 'russian': 0, 'turkish': 0, 'italian': 2}

    >>> most_rated_language = max(languages_ratios, key=languages_ratios.get)

    >>> most_rated_language

    'english'
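
The session above can be folded into a pair of small functions. The following is only a minimal sketch under stated assumptions: `score_languages`, `detect_language`, and the tiny `SAMPLE_STOPWORDS` table are hypothetical names introduced here so the example runs without downloading the NLTK corpus. In real use the table would presumably be built from `stopwords.words(language)` for each entry in `stopwords.fileids()`. The regular expression is the pattern that `wordpunct_tokenize` is based on.

```python
import re

# Hypothetical, heavily truncated stopword lists for illustration only;
# in practice these would come from nltk.corpus.stopwords.words(language).
SAMPLE_STOPWORDS = {
    "english": {"i", "the", "be", "in", "there", "that", "and", "of"},
    "spanish": {"el", "la", "de", "que", "y", "en", "los", "un"},
    "german": {"der", "die", "und", "in", "das", "ist", "nicht", "ich"},
}

# Same pattern NLTK's WordPunctTokenizer uses: runs of word characters,
# or runs of characters that are neither word characters nor whitespace.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]+")


def score_languages(text, stopwords_by_language):
    """Count, per language, how many distinct stopwords occur in text."""
    words = {token.lower() for token in _TOKEN_RE.findall(text)}
    return {
        language: len(words & set(stopword_list))
        for language, stopword_list in stopwords_by_language.items()
    }


def detect_language(text, stopwords_by_language):
    """Return the best-scoring language, or None if nothing matched."""
    scores = score_languages(text, stopwords_by_language)
    if not scores:
        return None
    best = max(scores, key=scores.get)
    # A top score of zero means no stopword matched, so make no guess.
    return best if scores[best] > 0 else None


sample = "That's thirty minutes away. I'll be there in ten."
print(score_languages(sample, SAMPLE_STOPWORDS))
print(detect_language(sample, SAMPLE_STOPWORDS))
```

Returning `None` when no stopword matches is a design choice of this sketch, not something the original session does: `max` on an all-zero dictionary would otherwise pick an arbitrary language. Short texts remain unreliable either way, since a single shared word such as "in" can score for several languages.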

  • Original article: https://www.cnblogs.com/turtle920/p/5597829.html