  • Natural Language 2: Common Functions


    Reference book: http://nltk.googlecode.com/svn/trunk/doc/book/

    http://blog.csdn.net/tanzhangwen/article/details/8469491

    An NLP enthusiast's blog:

    http://blog.csdn.net/tanzhangwen/article/category/1297154

    1. Downloading data through a proxy

    import nltk

    nltk.set_proxy("**.com:80")

    nltk.download()


    2. Calling sents(fileid) fails with: Resource 'tokenizers/punkt/english.pickle' not found.  Please use the NLTK Downloader to obtain the resource:

    import nltk

    nltk.download()

    In the downloader window, select the 'Models' tab, find 'punkt' in the 'Identifier' column, and click Download to install that data package.
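The same package can be fetched without the GUI. Below is a minimal sketch, assuming nltk is installed. `ensure_resource` and its `downloader` parameter are hypothetical names introduced here for illustration; `nltk.data.find` and `nltk.download` are the NLTK calls it wraps. Newer NLTK releases may ask for `punkt_tab` rather than `punkt`, so pass whichever identifier the error message names.

```python
import nltk


def ensure_resource(resource_path, package_id, downloader=nltk.download):
    """Download package_id only if resource_path is not installed locally.

    Returns True if a download was attempted, False if the resource was found.
    (Hypothetical helper, not part of NLTK.)
    """
    try:
        nltk.data.find(resource_path)  # raises LookupError when missing
        return False
    except LookupError:
        downloader(package_id)
        return True


# Typical call (needs network access the first time):
# ensure_resource("tokenizers/punkt", "punkt")
```

Checking first avoids contacting the download server on every run.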


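    3. Functions for accessing corpus elements

    from nltk.corpus import webtext

    webtext.fileids()      # ids of all files in the corpus

    webtext.raw(fileid)    # the given file's full text as one string

    webtext.words(fileid)  # all words in the file

    webtext.sents(fileid)  # all sentences in the file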

    Example                     Description
    fileids()                   the files of the corpus
    fileids([categories])       the files of the corpus corresponding to these categories
    categories()                the categories of the corpus
    categories([fileids])       the categories of the corpus corresponding to these files
    raw()                       the raw content of the corpus
    raw(fileids=[f1,f2,f3])     the raw content of the specified files
    raw(categories=[c1,c2])     the raw content of the specified categories
    words()                     the words of the whole corpus
    words(fileids=[f1,f2,f3])   the words of the specified fileids
    words(categories=[c1,c2])   the words of the specified categories
    sents()                     the sentences of the whole corpus
    sents(fileids=[f1,f2,f3])   the sentences of the specified fileids
    sents(categories=[c1,c2])   the sentences of the specified categories
    abspath(fileid)             the location of the given file on disk
    encoding(fileid)            the encoding of the file (if known)
    open(fileid)                open a stream for reading the given corpus file
    root()                      the path to the root of locally installed corpus
    readme()                    the contents of the README file of the corpus
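These accessors can be tried without downloading any corpus. The sketch below builds a throwaway two-file corpus in a temporary directory and reads it with `PlaintextCorpusReader`. The file names and text are made up for illustration, and the example assumes nltk is installed.

```python
import os
import tempfile

from nltk.corpus.reader import PlaintextCorpusReader

# Build a tiny throwaway corpus on disk (file names and text are made up).
corpus_dir = tempfile.mkdtemp()
samples = {
    "a.txt": "Hello world. Hello again.",
    "b.txt": "Another small file.",
}
for name, content in samples.items():
    with open(os.path.join(corpus_dir, name), "w", encoding="utf-8") as f:
        f.write(content)

# Treat every .txt file under corpus_dir as one corpus file.
corpus = PlaintextCorpusReader(corpus_dir, r".*\.txt")

file_ids = corpus.fileids()             # ids of all files in the corpus
raw_a = corpus.raw("a.txt")             # full text of one file as a string
words_a = list(corpus.words("a.txt"))   # tokens of one file
all_words = list(corpus.words())        # tokens of the whole corpus
```

`sents()` is left out on purpose: it relies on the punkt sentence tokenizer, so it raises the lookup error from item 2 until that package is installed.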

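    4. Common text-processing functions

    Suppose text is a list of words. The list operations, FreqDist and bigrams work on a plain list; collocations() and the methods after it require an nltk.Text object, e.g. text = Text(words).

    from nltk import FreqDist, Text, bigrams

    len(text)  # number of words

    set(text)  # remove duplicates

    sorted(text) # sort

    text.count('a') # count occurrences of the given word

    text.index('a') # position of the first occurrence of the given word

    FreqDist(text) # words with their frequencies: keys() gives the words, FreqDist(text)[key] gives a word's count

    FreqDist(text).plot(50,cumulative=True) # cumulative frequency plot of the 50 most frequent words

    bigrams(text) # all pairs of adjacent words

    text.collocations() # find word pairs that frequently occur together in the text

    text.concordance("word") # show every occurrence of the given word with its context

    text.similar("word") # find words that appear in contexts similar to those of the given word

    text.common_contexts(["a", "b"]) # find the contexts shared by the two words

    text.dispersion_plot(['a','b','c',...]) # plot comparing where the words occur across the text

    text.generate() # generate a random passage of text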
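A few of the calls above can be run on a short in-memory sentence. This is a minimal sketch, assuming nltk is installed; the sample sentence is made up. It sticks to calls that need no downloaded data and no plotting backend.

```python
from nltk import FreqDist, Text, bigrams

tokens = "the cat sat on the mat and the dog sat on the log".split()

n_tokens = len(tokens)             # number of tokens
vocab = sorted(set(tokens))        # distinct words, sorted
first_sat = tokens.index("sat")    # position of the first "sat"

fd = FreqDist(tokens)              # maps each word to its count
top = fd.most_common(1)            # most frequent word with its count
pairs = list(bigrams(tokens))      # bigrams() yields tuples lazily

text = Text(tokens)                # wrapper offering concordance(), similar(), ...
the_count = text.count("the")      # occurrences of "the"
text.concordance("sat")            # prints each occurrence with its context
```

`collocations()` is not shown because it reads the stopwords corpus, and the plotting methods need matplotlib.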


    NLTK's Conditional Frequency Distributions: commonly used methods and idioms for defining, accessing, and visualizing a conditional frequency distribution of counters.

    Example                               Description
    cfdist = ConditionalFreqDist(pairs)   create a conditional frequency distribution from a list of pairs
    cfdist.conditions()                   alphabetically sorted list of conditions
    cfdist[condition]                     the frequency distribution for this condition
    cfdist[condition][sample]             frequency for the given sample for this condition
    cfdist.tabulate()                     tabulate the conditional frequency distribution
    cfdist.tabulate(samples, conditions)  tabulation limited to the specified samples and conditions
    cfdist.plot()                         graphical plot of the conditional frequency distribution
    cfdist.plot(samples, conditions)      graphical plot limited to the specified samples and conditions
    cfdist1 < cfdist2                     test if samples in cfdist1 occur less frequently than in cfdist2
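The idioms in this table can be tried on a handful of hand-written pairs. This is a minimal sketch, assuming nltk is installed; the genre labels and words are made up for illustration.

```python
from nltk import ConditionalFreqDist

# (condition, sample) pairs; here the condition is a made-up genre label.
pairs = [
    ("news", "the"), ("news", "market"), ("news", "the"),
    ("fiction", "the"), ("fiction", "dragon"),
]

cfdist = ConditionalFreqDist(pairs)

conditions = sorted(cfdist.conditions())   # sorted explicitly for a stable order
news_the = cfdist["news"]["the"]           # count of "the" under "news"
news_total = cfdist["news"].N()            # total number of samples under "news"

# Text table restricted to the chosen conditions and samples.
cfdist.tabulate(conditions=["news", "fiction"], samples=["the", "dragon"])
```

Each `cfdist[condition]` is an ordinary `FreqDist`, so the frequency methods from section 4 apply to it directly.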


  • Original article: https://www.cnblogs.com/webRobot/p/6058205.html