  • Learning R: Text Mining with R (the tm Package)

    First, install and load the tm package.
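A minimal sketch of that setup step. The CRAN mirror URL is an assumption; any mirror works.

```r
# Install tm only if it is not already available, then attach it.
if (!requireNamespace("tm", quietly = TRUE)) {
  install.packages("tm", repos = "https://cloud.r-project.org")
}
library(tm)
```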


    1. Read the text

    x = readLines("222.txt")
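readLines() returns one character string per line, so each line of the file later becomes one document. The sketch below uses a hypothetical temporary file in place of 222.txt and drops blank lines, which would otherwise turn into empty documents. The UTF-8 encoding is an assumption about the input.

```r
# Hypothetical sample file written to a temporary path so the sketch runs anywhere.
path <- tempfile(fileext = ".txt")
writeLines(c("Female; Humans", "", "Stem cells; Humans"), path)

x <- readLines(path, encoding = "UTF-8")

# Keep only lines that contain something other than whitespace.
x <- x[nzchar(trimws(x))]
length(x)
```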

    2. Build the corpus

     > r=Corpus(VectorSource(x))
    
     > r
    
     A corpus with 7012 text documents
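For a quick check of how VectorSource() maps a character vector to documents, here is a small sketch with three hypothetical lines. It uses VCorpus() rather than Corpus(); that is an assumption made here because VCorpus() behaves the same way across recent tm releases. The printed summary in newer versions may look different from the output shown above.

```r
library(tm)

# Hypothetical input standing in for the lines read from 222.txt.
x <- c("stem cells in leukemia",
       "breast cancer therapy",
       "stem cells and cancer")

# Each element of x becomes one document in the corpus.
r <- VCorpus(VectorSource(x))

length(r)          # number of documents
content(r[[2]])    # text of the second document
```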

    3. Export the corpus and save it to disk

    > writeCorpus(r)
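Called with no other arguments, writeCorpus() writes one text file per document into the current working directory. With 7012 documents it is probably more convenient to pick a target directory. The sketch below assumes a hypothetical output directory and file-name pattern, passed through the `path` and `filenames` arguments.

```r
library(tm)

r <- VCorpus(VectorSource(c("stem cells in leukemia",
                            "breast cancer therapy",
                            "stem cells and cancer")))

# Hypothetical output directory; one file is written per document.
out_dir <- tempfile("corpus_out")
dir.create(out_dir)

writeCorpus(r, path = out_dir,
            filenames = sprintf("doc%04d.txt", seq_along(r)))

list.files(out_dir)
```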

    4. Inspect the corpus

    > print(r)
    A corpus with 7012 text documents
    > summary(r)
    A corpus with 7012 text documents
    
    The metadata consists of 2 tag-value pairs and a data frame
    Available tags are:
      create_date creator 
    Available variables in the data frame are:
      MetaID 

      > inspect(r[2:2])
      A corpus with 1 text document

      The metadata consists of 2 tag-value pairs and a data frame
      Available tags are:
      create_date creator
      Available variables in the data frame are:
      MetaID

      [[1]]
      Female; Genital Neoplasms, Female/*therapy; Humans

      > r[[2]]
      Female; Genital Neoplasms, Female/*therapy; Humans

    5. Build the document-term matrix

    > dtm = DocumentTermMatrix(r)
    > head(dtm)
    A document-term matrix (6 documents, 16381 terms)
    
    Non-/sparse entries: 110/98176
    Sparsity           : 100%
    Maximal term length: 81 
    Weighting          : term frequency (tf)
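By default the matrix is built from the raw tokens, which is why entries such as "and", "the" and "cancer:" show up among the frequent terms in step 7. One possible refinement, sketched here on hypothetical documents, is to pass a `control` list that strips punctuation and English stop words. This is an optional variation, not what the listing above does.

```r
library(tm)

docs <- c("Stem cells and the leukemia trial:",
          "The breast cancer: therapy",
          "Stem cells for cancer")
r <- VCorpus(VectorSource(docs))

# Same call as in the listing: no cleaning beyond the defaults.
dtm_raw <- DocumentTermMatrix(r)

# Variation: remove punctuation and English stop words while counting terms.
dtm_clean <- DocumentTermMatrix(
  r,
  control = list(removePunctuation = TRUE, stopwords = TRUE)
)

Terms(dtm_raw)
Terms(dtm_clean)
```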

    6. Inspect the document-term matrix

    > inspect(dtm[1:2,1:4])

    7. Find terms that occur at least 200 times

    > findFreqTerms(dtm,200)
     [1] "acute"          "adjuvant"       "advanced"       "after"         
     [5] "and"            "breast"         "cancer"         "cancer:"       
     [9] "carcinoma"      "cell"           "chemotherapy"   "clinical"      
    [13] "colorectal"     "factor"         "for"            "from"          
    [17] "group"          "growth"         "iii"            "leukemia"      
    [21] "lung"           "lymphoma"       "metastatic"     "non-small-cell"
    [25] "oncology"       "patients"       "phase"          "plus"          
    [29] "prostate"       "randomized"     "receptor"       "response"      
    [33] "results"        "risk"           "study"          "survival"      
    [37] "the"            "therapy"        "treatment"      "trial"         
    [41] "tumor"          "with"          
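The second argument of findFreqTerms() is `lowfreq`, a lower bound on the total count of a term across all documents. To see the counts themselves, one option is to sum the columns of the matrix. The sketch below uses slam::col_sums(), which works on the sparse representation directly (slam is installed as a dependency of tm). The three documents are hypothetical.

```r
library(tm)

r <- VCorpus(VectorSource(c("stem cells in leukemia",
                            "breast cancer therapy",
                            "stem cells and cancer")))
dtm <- DocumentTermMatrix(r)

# Terms whose total count over the corpus is at least 2.
findFreqTerms(dtm, lowfreq = 2)

# Total count per term, most frequent first, without densifying the matrix.
freq <- sort(slam::col_sums(dtm), decreasing = TRUE)
head(freq)
```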

    8. Remove sparse terms (terms that appear in only a few documents)

    inspect(removeSparseTerms(dtm, 0.4))
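The `sparse` argument is a sparsity ceiling between 0 and 1: a term is kept only if it appears in more than a fraction (1 - sparse) of the documents. A value of 0.4 therefore keeps only terms found in more than 60% of the documents, which is very strict for a corpus of short records; values close to 1 are more common. A small sketch on hypothetical documents:

```r
library(tm)

r <- VCorpus(VectorSource(c("stem cells in leukemia",
                            "breast cancer therapy",
                            "stem cells and cancer")))
dtm <- DocumentTermMatrix(r)

# With 3 documents and sparse = 0.4, a term must occur in more than
# 3 * (1 - 0.4) = 1.8 documents, i.e. in at least 2 of them.
small <- removeSparseTerms(dtm, sparse = 0.4)
Terms(small)
```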

    9. Find terms whose correlation with "stem" is at least 0.5

    > findAssocs(dtm, "stem", 0.5)
     stem cells 
     1.00  0.61 
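findAssocs() computes the correlation between the count column of the given term and every other term's column, and reports those at or above `corlimit`. In recent tm releases the result is, as far as I can tell, a list with one named numeric vector per queried term, and the term itself is not listed, so the output may differ from the one shown above. A sketch on hypothetical documents:

```r
library(tm)

r <- VCorpus(VectorSource(c("stem cells in leukemia",
                            "breast cancer therapy",
                            "stem cells and cancer")))
dtm <- DocumentTermMatrix(r)

# "stem" and "cells" occur in exactly the same documents,
# so their count columns are perfectly correlated.
assoc <- findAssocs(dtm, "stem", corlimit = 0.8)
assoc$stem
```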

    10. Compute document dissimilarity (cosine distance)

    > dist_dtm <- dissimilarity(dtm, method = 'cosine')
    > head(dist_dtm)
    [1] 1.0000000 0.7958759 0.8567770 0.9183503 0.9139337 0.9309934
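dissimilarity() belongs to older tm releases; to my knowledge it was removed in tm 0.6, so the call above may fail on a current installation. The cosine distance can also be computed directly in base R as 1 minus the cosine similarity of the document rows. The sketch below uses a small hypothetical count matrix in place of as.matrix(dtm).

```r
# Hypothetical document-term counts standing in for as.matrix(dtm).
m <- rbind(d1 = c(stem = 1, cells = 1, cancer = 0),
           d2 = c(stem = 0, cells = 1, cancer = 1),
           d3 = c(stem = 1, cells = 0, cancer = 0))

# Cosine similarity: dot product of two rows divided by the product of their norms.
norms <- sqrt(rowSums(m^2))
sim <- (m %*% t(m)) / (norms %o% norms)

# Convert to a "dist" object so it can be passed to hclust().
dist_dtm <- as.dist(1 - sim)
dist_dtm
```

Converting a 7012 x 16381 matrix to dense form needs a lot of memory, so on the full corpus it may be worth applying removeSparseTerms() first.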

    11. Clustering

    > hc <- hclust(dist_dtm, method = 'average')
    > plot(hc,xlab='')
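With thousands of documents the dendrogram is hard to read, so it can help to cut the tree into a fixed number of groups with cutree(). The sketch below is base R only and uses a hypothetical count matrix with two obvious topics; the choice of k = 2 is an assumption for illustration.

```r
# Hypothetical counts: d1 and d2 share one vocabulary, d3 and d4 another.
m <- rbind(d1 = c(2, 1, 0, 0),
           d2 = c(1, 2, 0, 0),
           d3 = c(0, 0, 1, 2),
           d4 = c(0, 0, 2, 1))

# Cosine distance between documents.
norms <- sqrt(rowSums(m^2))
dist_m <- as.dist(1 - (m %*% t(m)) / (norms %o% norms))

# Average-linkage hierarchical clustering, then cut into 2 groups.
hc <- hclust(dist_m, method = "average")
groups <- cutree(hc, k = 2)
groups
```

The plot(hc, xlab = '') call from the listing works unchanged on this `hc` object.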

         

  • Original source: https://www.cnblogs.com/todoit/p/2589741.html