  • Topic extraction with NMF (Non-negative Matrix Factorization) and LDA (Latent Dirichlet Allocation) in scikit-learn

    English original: http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html

    This is an example of applying NMF and LDA to a corpus of documents to extract topics.

    The inputs are a tf-idf matrix (for NMF) and a tf matrix of raw term counts (for LDA).
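
    To make the difference between the two inputs concrete, here is a minimal pure-Python sketch on a made-up three-document corpus. It assumes the smoothed idf formula ln((1 + n) / (1 + df)) + 1 followed by L2 row normalisation, which is my understanding of the TfidfVectorizer defaults; the corpus and the helper names are hypothetical and not part of the original example.

```python
import math
from collections import Counter

# Hypothetical toy corpus, for illustration only.
docs = [
    "the drive failed and the disk is dead",
    "the shuttle reached orbit",
    "the disk controller and the drive cable",
]


def term_counts(doc):
    """Raw term counts (the 'tf' features fed to LDA)."""
    return Counter(doc.split())


def tf_and_tfidf(docs):
    """Return per-document raw counts and L2-normalised tf-idf weights."""
    counts = [term_counts(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for c in counts for term in c)
    rows = []
    for c in counts:
        # Assumed smoothed idf: ln((1 + n) / (1 + df)) + 1.
        weights = {t: n * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
                   for t, n in c.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return counts, rows


tf, tfidf = tf_and_tfidf(docs)
print(tf[1])     # raw counts: every word of document 1 appears once
print(tfidf[1])  # tf-idf: "the" is down-weighted because all documents contain it
```

    In the raw counts every word of the second document has the same value, whereas tf-idf gives the rarer words a larger weight than "the".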

    The output is a list of topics, each represented by a list of words.
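
    How a matrix factorisation turns into "topics" can be sketched as follows. This is a toy NMF using the classic multiplicative update rule, which is not necessarily the solver scikit-learn uses, and it ignores the `alpha` and `l1_ratio` regularisation of the real example. The matrix, the vocabulary and all function names are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factorise V (docs x terms) into non-negative W (docs x topics)
    and H (topics x terms) with multiplicative updates."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + 0.1 for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps)
              for j in range(n_terms)] for k in range(n_topics)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps)
              for k in range(n_topics)] for i in range(n_docs)]
    return w, h


# Toy document-term matrix: two "hardware" documents, two "space" documents.
vocab = ["drive", "disk", "nasa", "orbit"]
v = [[3, 2, 0, 0],
     [2, 3, 0, 0],
     [0, 0, 4, 1],
     [0, 0, 1, 4]]

w, h = nmf(v, n_topics=2)
for k, row in enumerate(h):
    # Each row of H is one topic; its largest entries are the topic's top words.
    ranked = sorted(range(len(vocab)), key=lambda j: row[j], reverse=True)
    print("Topic #%d:" % k, " ".join(vocab[j] for j in ranked[:2]))
```

    Each row of H plays the role of one row of `model.components_` in the full example: a non-negative weight per vocabulary term.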

    With the default parameters (n_samples / n_features / n_topics), the example takes a few tens of seconds to run.

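    You can try changing the size of the problem, but keep in mind that the time complexity of NMF is polynomial, while that of LDA is proportional to (n_samples * iterations).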

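    A few notes:

    (1) Line 61 of the code (the `exit()` call) must be commented out, otherwise no topics are printed.

    (2) On the first run the program downloads the newsgroups data from the internet and stores it in a cache directory; later runs reuse the cache and do not download it again.

    (3) For the parameters of NMF and LDA, see the NMF and LDA pages of the official scikit-learn documentation.

    (4) The code corresponds to scikit-learn 0.17.1.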
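
    Because the listing targets 0.17.1, some of the calls it uses were renamed in later scikit-learn releases: for example, `get_feature_names` gave way to `get_feature_names_out`, and the `n_topics` argument of `LatentDirichletAllocation` became `n_components`. A small compatibility helper along the following lines is one way to cope with the first rename. The helper name and the two stub classes are hypothetical; the stubs only stand in for vectorizers so that the snippet runs without scikit-learn installed.

```python
def feature_names_compat(vectorizer):
    """Return the vocabulary in column order on old and new scikit-learn.

    Prefers get_feature_names_out() when the object has it and falls back
    to the older get_feature_names() otherwise.
    """
    if hasattr(vectorizer, "get_feature_names_out"):
        return list(vectorizer.get_feature_names_out())
    return list(vectorizer.get_feature_names())


# Stand-ins for a fitted vectorizer (hypothetical, for demonstration only).
class OldStyleVectorizer:
    def get_feature_names(self):
        return ["disk", "drive"]


class NewStyleVectorizer:
    def get_feature_names_out(self):
        return ("disk", "drive")


print(feature_names_compat(OldStyleVectorizer()))
print(feature_names_compat(NewStyleVectorizer()))
```

    In the listing below, the two `get_feature_names()` calls could then be replaced by `feature_names_compat(tfidf_vectorizer)` and `feature_names_compat(tf_vectorizer)`.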

    Code:

     1 # Author: Olivier Grisel <olivier.grisel@ensta.org>
     2 #         Lars Buitinck <L.J.Buitinck@uva.nl>
     3 #         Chyi-Kwei Yau <chyikwei.yau@gmail.com>
     4 # License: BSD 3 clause
     5 
     6 from __future__ import print_function
     7 from time import time
     8 
     9 from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
    10 from sklearn.decomposition import NMF, LatentDirichletAllocation
    11 from sklearn.datasets import fetch_20newsgroups
    12 
    13 n_samples = 2000
    14 n_features = 1000
    15 n_topics = 10
    16 n_top_words = 20
    17 
    18 
    19 def print_top_words(model, feature_names, n_top_words):
    20     for topic_idx, topic in enumerate(model.components_):
    21         print("Topic #%d:" % topic_idx)
    22         print(" ".join([feature_names[i]
    23                         for i in topic.argsort()[:-n_top_words - 1:-1]]))
    24     print()
    25 
    26 
    27 # Load the 20 newsgroups dataset and vectorize it. We use a few heuristics
    28 # to filter out useless terms early on: the posts are stripped of headers,
    29 # footers and quoted replies, and common English words, words occurring in
    30 # only one document or in at least 95% of the documents are removed.
    31 
    32 print("Loading dataset...")
    33 t0 = time()
    34 dataset = fetch_20newsgroups(shuffle=True, random_state=1,
    35                              remove=('headers', 'footers', 'quotes'))
    36 data_samples = dataset.data
    37 print("done in %0.3fs." % (time() - t0))
    38 
    39 # Use tf-idf features for NMF.
    40 print("Extracting tf-idf features for NMF...")
    41 tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, #max_features=n_features,
    42                                    stop_words='english')
    43 t0 = time()
    44 tfidf = tfidf_vectorizer.fit_transform(data_samples)
    45 print("done in %0.3fs." % (time() - t0))
    46 
    47 # Use tf (raw term count) features for LDA.
    48 print("Extracting tf features for LDA...")
    49 tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=n_features,
    50                                 stop_words='english')
    51 t0 = time()
    52 tf = tf_vectorizer.fit_transform(data_samples)
    53 print("done in %0.3fs." % (time() - t0))
    54 
    55 # Fit the NMF model
    56 print("Fitting the NMF model with tf-idf features,"
    57       "n_samples=%d and n_features=%d..."
    58       % (n_samples, n_features))
    59 t0 = time()
    60 nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf)
    61 exit()
    62 print("done in %0.3fs." % (time() - t0))
    63 
    64 print("\nTopics in NMF model:")
    65 tfidf_feature_names = tfidf_vectorizer.get_feature_names()
    66 print_top_words(nmf, tfidf_feature_names, n_top_words)
    67 
    68 print("Fitting LDA models with tf features, n_samples=%d and n_features=%d..."
    69       % (n_samples, n_features))
    70 lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
    71                                 learning_method='online', learning_offset=50.,
    72                                 random_state=0)
    73 t0 = time()
    74 lda.fit(tf)
    75 print("done in %0.3fs." % (time() - t0))
    76 
    77 print("\nTopics in LDA model:")
    78 tf_feature_names = tf_vectorizer.get_feature_names()
    79 print_top_words(lda, tf_feature_names, n_top_words)
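
    The slice `topic.argsort()[:-n_top_words - 1:-1]` in `print_top_words` is compact but easy to misread: `argsort()` returns indices in ascending order of weight, and the negative-step slice walks that list backwards from the end, keeping the `n_top_words` largest. The same idea in plain Python, on made-up weights (the helper name and the data are hypothetical):

```python
def top_word_indices(topic_weights, n_top_words):
    """Indices of the n_top_words largest weights, largest first."""
    # Ascending order of weight, like numpy's argsort().
    ascending = sorted(range(len(topic_weights)),
                       key=lambda i: topic_weights[i])
    # Walk backwards from the last element and stop after n_top_words items.
    return ascending[:-n_top_words - 1:-1]


vocab = ["disk", "drive", "god", "nasa", "orbit"]
topic = [0.1, 0.7, 0.0, 0.9, 0.4]  # one row of model.components_
top = [vocab[i] for i in top_word_indices(topic, 3)]
print(" ".join(top))  # nasa drive orbit
```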

    Results:

    Loading dataset...
    done in 2.222s.
    Extracting tf-idf features for NMF...
    done in 2.730s.
    Extracting tf features for LDA...
    done in 2.702s.
    Fitting the NMF model with tf-idf features,n_samples=2000 and n_features=1000...
    done in 1.904s.
    
    Topics in NMF model:
    Topic #0:
    don just people think like know good time right ve say did make really way want going new year ll
    Topic #1:
    windows thanks file card does dos mail files know program use advance hi window help software looking ftp video pc
    Topic #2:
    drive scsi ide drives disk controller hard floppy bus hd cd boot mac cable card isa rom motherboard mb internal
    Topic #3:
    key chip encryption clipper keys escrow government algorithm security secure encrypted public nsa des enforcement law privacy bit use secret
    Topic #4:
    00 sale 50 shipping 20 10 price 15 new 25 30 dos offer condition 40 cover asking 75 01 interested
    Topic #5:
    armenian armenians turkish genocide armenia turks turkey soviet people muslim azerbaijan russian greek argic government serdar kurds population ottoman million
    Topic #6:
    god jesus bible christ faith believe christians christian heaven sin life hell church truth lord does say belief people existence
    Topic #7:
    mouse driver keyboard serial com1 port bus com3 irq button com sys microsoft ball problem modem adb drivers card com2
    Topic #8:
    space nasa shuttle launch station sci gov orbit moon earth lunar satellite program mission center cost research data solar mars
    Topic #9:
    msg food chinese flavor eat glutamate restaurant foods reaction taste restaurants salt effects carl brain people ingredients natural causes olney
    
    Fitting LDA models with tf features, n_samples=2000 and n_features=1000...
    done in 22.548s.
    
    Topics in LDA model:
    Topic #0:
    government people mr law gun state president states public use right rights national new control american security encryption health united
    Topic #1:
    drive card disk bit scsi use mac memory thanks pc does video hard speed apple problem used data monitor software
    Topic #2:
    said people armenian armenians turkish did saw went came women killed children turkey told dead didn left started greek war
    Topic #3:
    year good just time game car team years like think don got new play games ago did season better ll
    Topic #4:
    10 00 15 25 12 11 20 14 17 16 db 13 18 24 30 19 27 50 21 40
    Topic #5:
    windows window program version file dos use files available display server using application set edu motif package code ms software
    Topic #6:
    edu file space com information mail data send available program ftp email entry info list output nasa address anonymous internet
    Topic #7:
    ax max b8f g9v a86 pl 145 1d9 0t 34u 1t 3t giz bhj wm 2di 75u 2tm bxn 7ey
    Topic #8:
    god people jesus believe does say think israel christian true life jews did bible don just know world way church
    Topic #9:
    don know like just think ve want does use good people key time way make problem really work say need
  • Original post: https://www.cnblogs.com/CheeseZH/p/5254082.html