  • Python NLTK English Detection

    http://blog.alejandronolla.com/2013/05/15/detecting-text-language-with-python-and-nltk/

    >>> from nltk import wordpunct_tokenize
    >>> wordpunct_tokenize("That's thirty minutes away. I'll be there in ten.")
    ['That', "'", 's', 'thirty', 'minutes', 'away', '.', 'I', "'", 'll', 'be', 'there', 'in', 'ten', '.']
    >>> from nltk.corpus import stopwords
    >>> stopwords.fileids()
    ['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'portuguese', 'russian', 'spanish', 'swedish', 'turkish']
    >>>
    >>> stopwords.words('english')[0:10]
    ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your']

    >>> languages_ratios = {}
    >>>
    >>> # `text` must already hold the string to classify; its definition is not shown here
    >>> tokens = wordpunct_tokenize(text)
    >>> words = [word.lower() for word in tokens]
    >>> for language in stopwords.fileids():
    ...     stopwords_set = set(stopwords.words(language))
    ...     words_set = set(words)
    ...     # stopwords of this language that also occur in the text
    ...     common_elements = words_set.intersection(stopwords_set)
    ...     languages_ratios[language] = len(common_elements)  # language "score"
    ...
    >>>

    >>> languages_ratios
    {'swedish': 1, 'danish': 1, 'hungarian': 2, 'finnish': 0, 'portuguese': 0, 'german': 1, 'dutch': 1, 'french': 1, 'spanish': 0, 'norwegian': 1, 'english': 6, 'russian': 0, 'turkish': 0, 'italian': 2}
    >>> most_rated_language = max(languages_ratios, key=languages_ratios.get)
    >>> most_rated_language
    'english'
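
    The session above can be folded into a pair of reusable functions. What follows is a minimal sketch, not the original author's code. To keep it runnable without downloading the NLTK stopwords corpus, it makes two assumptions: `SAMPLE_STOPWORDS` is a tiny hand-picked stand-in for `set(stopwords.words(language))`, and the regular expression approximates what `wordpunct_tokenize` does (runs of word characters, or runs of punctuation). The names `score_languages` and `detect_language` are hypothetical.

```python
import re

# Tiny illustrative stopword samples (assumption): stand-ins for
# set(stopwords.words(language)) from the NLTK stopwords corpus.
SAMPLE_STOPWORDS = {
    "english": {"i", "be", "there", "in", "that", "the", "and", "to"},
    "spanish": {"yo", "en", "que", "el", "la", "y", "de", "los"},
    "french": {"je", "dans", "que", "le", "la", "et", "de", "les"},
}


def score_languages(text, stopword_sets=SAMPLE_STOPWORDS):
    """Return {language: number of distinct stopwords found in text}."""
    # Split into runs of word characters or runs of punctuation,
    # approximating wordpunct_tokenize (assumption).
    tokens = re.findall(r"\w+|[^\w\s]+", text)
    words = {token.lower() for token in tokens}
    return {
        language: len(words & stops)
        for language, stops in stopword_sets.items()
    }


def detect_language(text, stopword_sets=SAMPLE_STOPWORDS):
    """Return the language whose stopwords overlap the text the most."""
    scores = score_languages(text, stopword_sets)
    # On a tie, max() returns the first language in iteration order.
    return max(scores, key=scores.get)


if __name__ == "__main__":
    sample = "That's thirty minutes away. I'll be there in ten."
    print(score_languages(sample))
    print(detect_language(sample))
```

    With the real corpus, passing `{lang: set(stopwords.words(lang)) for lang in stopwords.fileids()}` as `stopword_sets` should reproduce the approach from the session. Because the score is a raw count of shared stopwords, very short texts can tie or score zero for every language, so the result is best treated as a guess.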

  • Original article: https://www.cnblogs.com/turtle920/p/5597829.html