  • Topic extraction with NMF (Non-negative Matrix Factorization) and LDA (Latent Dirichlet Allocation) in scikit-learn

    English original: http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html

    This is an example of using NMF and LDA to extract topics from a corpus of documents.

    The inputs are a tf-idf matrix (for NMF) and a tf matrix of raw term counts (for LDA).
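To make the difference between the two inputs concrete, here is a minimal sketch on a hypothetical three-document toy corpus (the documents are invented for illustration and are not part of the 20 newsgroups data). It assumes both vectorizers are used with their default settings, so they build the same vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, used only for illustration.
docs = [
    "the disk drive failed",
    "the drive controller and the disk",
    "the shuttle reached orbit",
]

# tf: raw term counts, the kind of input the example feeds to LDA.
tf_vectorizer = CountVectorizer()
tf = tf_vectorizer.fit_transform(docs)

# tf-idf: counts reweighted by inverse document frequency, the input for NMF.
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(docs)

# Column indices of a common term ("the") and a rare term ("orbit").
the_col = tf_vectorizer.vocabulary_["the"]
orbit_col = tf_vectorizer.vocabulary_["orbit"]

print(tf.shape, tfidf.shape)                 # same shape, different values
print(tf[2, the_col], tf[2, orbit_col])      # raw counts in the third document
print(tfidf[2, the_col] < tfidf[2, orbit_col])  # the rare term gets more weight
```

In the third document both terms occur once, so their tf values are equal, while tf-idf gives the rarer term "orbit" a larger weight than "the".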

    The output is a list of topics, each represented by a list of words.
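The word list for a topic is obtained by ranking the vocabulary by that topic's weights. The sketch below reproduces the `argsort` slice used in `print_top_words` on a hypothetical five-word vocabulary with made-up weights standing in for one row of `model.components_`.

```python
import numpy as np

# Hypothetical vocabulary and one topic's weight row (a stand-in for a row of
# model.components_); the numbers are made up for illustration.
feature_names = ["bus", "disk", "drive", "floppy", "scsi"]
topic = np.array([0.1, 0.7, 0.9, 0.05, 0.4])
n_top_words = 3

# argsort() returns indices ordered by ascending weight; the slice
# [:-n_top_words - 1:-1] walks backwards from the end and stops after
# n_top_words indices, i.e. it yields the highest-weighted terms first.
top_indices = topic.argsort()[:-n_top_words - 1:-1]
top_words = [feature_names[i] for i in top_indices]
print(" ".join(top_words))  # drive disk scsi
```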

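    With the default parameters (n_samples/n_features/n_topics), the example takes a few tens of seconds to run.

    You can try changing the size of the problem, but keep in mind that the time complexity of NMF is polynomial, while that of LDA is proportional to (n_samples * iterations).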
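The problem size is set by how many documents are vectorized and how many vocabulary terms are kept. Note that the listing in this post defines `n_samples` but passes all of `dataset.data` to the vectorizers, and comments out `max_features` for the tf-idf vectorizer. A minimal sketch of applying both limits, assuming a hypothetical toy list in place of `dataset.data`:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for dataset.data; a real run would use the
# 20 newsgroups posts instead.
all_docs = [
    "disk drive controller",
    "drive floppy disk",
    "shuttle orbit launch",
    "orbit moon launch",
    "god faith church",
]

n_samples = 4    # number of documents to keep
n_features = 5   # maximum vocabulary size

# Limit the number of documents by slicing the list.
data_samples = all_docs[:n_samples]

# Limit the vocabulary to the n_features most frequent terms.
vectorizer = CountVectorizer(max_features=n_features)
tf = vectorizer.fit_transform(data_samples)
print(tf.shape)  # (documents kept, terms kept)
```

Smaller values of either limit shrink the matrix that NMF and LDA have to fit, which is the main lever on running time.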

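    A few notes:

    (1) Line 61 of the code (the exit() call) must be commented out to see the topic output.

    (2) The first time the code runs, it downloads the newsgroups data from the internet and saves it in a cache directory; later runs reuse the cache and do not download it again.

    (3) For the parameter settings of NMF and LDA, see the official scikit-learn documentation (the NMF and LatentDirichletAllocation pages).

    (4) This code corresponds to scikit-learn 0.17.1.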

    Code:

     1 # Author: Olivier Grisel <olivier.grisel@ensta.org>
     2 #         Lars Buitinck <L.J.Buitinck@uva.nl>
     3 #         Chyi-Kwei Yau <chyikwei.yau@gmail.com>
     4 # License: BSD 3 clause
     5 
     6 from __future__ import print_function
     7 from time import time
     8 
     9 from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
    10 from sklearn.decomposition import NMF, LatentDirichletAllocation
    11 from sklearn.datasets import fetch_20newsgroups
    12 
    13 n_samples = 2000
    14 n_features = 1000
    15 n_topics = 10
    16 n_top_words = 20
    17 
    18 
    19 def print_top_words(model, feature_names, n_top_words):
    20     for topic_idx, topic in enumerate(model.components_):
    21         print("Topic #%d:" % topic_idx)
    22         print(" ".join([feature_names[i]
    23                         for i in topic.argsort()[:-n_top_words - 1:-1]]))
    24     print()
    25 
    26 
    27 # Load the 20 newsgroups dataset and vectorize it. We use a few heuristics
    28 # to filter out useless terms early on: the posts are stripped of headers,
    29 # footers and quoted replies, and common English words, words occurring in
    30 # only one document or in at least 95% of the documents are removed.
    31 
    32 print("Loading dataset...")
    33 t0 = time()
    34 dataset = fetch_20newsgroups(shuffle=True, random_state=1,
    35                              remove=('headers', 'footers', 'quotes'))
    36 data_samples = dataset.data
    37 print("done in %0.3fs." % (time() - t0))
    38 
    39 # Use tf-idf features for NMF.
    40 print("Extracting tf-idf features for NMF...")
    41 tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, #max_features=n_features,
    42                                    stop_words='english')
    43 t0 = time()
    44 tfidf = tfidf_vectorizer.fit_transform(data_samples)
    45 print("done in %0.3fs." % (time() - t0))
    46 
    47 # Use tf (raw term count) features for LDA.
    48 print("Extracting tf features for LDA...")
    49 tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=n_features,
    50                                 stop_words='english')
    51 t0 = time()
    52 tf = tf_vectorizer.fit_transform(data_samples)
    53 print("done in %0.3fs." % (time() - t0))
    54 
    55 # Fit the NMF model
    56 print("Fitting the NMF model with tf-idf features,"
    57       "n_samples=%d and n_features=%d..."
    58       % (n_samples, n_features))
    59 t0 = time()
    60 nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf)
    61 exit()
    62 print("done in %0.3fs." % (time() - t0))
    63 
    64 print("\nTopics in NMF model:")
    65 tfidf_feature_names = tfidf_vectorizer.get_feature_names()
    66 print_top_words(nmf, tfidf_feature_names, n_top_words)
    67 
    68 print("Fitting LDA models with tf features, n_samples=%d and n_features=%d..."
    69       % (n_samples, n_features))
    70 lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
    71                                 learning_method='online', learning_offset=50.,
    72                                 random_state=0)
    73 t0 = time()
    74 lda.fit(tf)
    75 print("done in %0.3fs." % (time() - t0))
    76 
    77 print("\nTopics in LDA model:")
    78 tf_feature_names = tf_vectorizer.get_feature_names()
    79 print_top_words(lda, tf_feature_names, n_top_words)

    Results:

    Loading dataset...
    done in 2.222s.
    Extracting tf-idf features for NMF...
    done in 2.730s.
    Extracting tf features for LDA...
    done in 2.702s.
    Fitting the NMF model with tf-idf features,n_samples=2000 and n_features=1000...
    done in 1.904s.
    
    Topics in NMF model:
    Topic #0:
    don just people think like know good time right ve say did make really way want going new year ll
    Topic #1:
    windows thanks file card does dos mail files know program use advance hi window help software looking ftp video pc
    Topic #2:
    drive scsi ide drives disk controller hard floppy bus hd cd boot mac cable card isa rom motherboard mb internal
    Topic #3:
    key chip encryption clipper keys escrow government algorithm security secure encrypted public nsa des enforcement law privacy bit use secret
    Topic #4:
    00 sale 50 shipping 20 10 price 15 new 25 30 dos offer condition 40 cover asking 75 01 interested
    Topic #5:
    armenian armenians turkish genocide armenia turks turkey soviet people muslim azerbaijan russian greek argic government serdar kurds population ottoman million
    Topic #6:
    god jesus bible christ faith believe christians christian heaven sin life hell church truth lord does say belief people existence
    Topic #7:
    mouse driver keyboard serial com1 port bus com3 irq button com sys microsoft ball problem modem adb drivers card com2
    Topic #8:
    space nasa shuttle launch station sci gov orbit moon earth lunar satellite program mission center cost research data solar mars
    Topic #9:
    msg food chinese flavor eat glutamate restaurant foods reaction taste restaurants salt effects carl brain people ingredients natural causes olney
    
    Fitting LDA models with tf features, n_samples=2000 and n_features=1000...
    done in 22.548s.
    
    Topics in LDA model:
    Topic #0:
    government people mr law gun state president states public use right rights national new control american security encryption health united
    Topic #1:
    drive card disk bit scsi use mac memory thanks pc does video hard speed apple problem used data monitor software
    Topic #2:
    said people armenian armenians turkish did saw went came women killed children turkey told dead didn left started greek war
    Topic #3:
    year good just time game car team years like think don got new play games ago did season better ll
    Topic #4:
    10 00 15 25 12 11 20 14 17 16 db 13 18 24 30 19 27 50 21 40
    Topic #5:
    windows window program version file dos use files available display server using application set edu motif package code ms software
    Topic #6:
    edu file space com information mail data send available program ftp email entry info list output nasa address anonymous internet
    Topic #7:
    ax max b8f g9v a86 pl 145 1d9 0t 34u 1t 3t giz bhj wm 2di 75u 2tm bxn 7ey
    Topic #8:
    god people jesus believe does say think israel christian true life jews did bible don just know world way church
    Topic #9:
    don know like just think ve want does use good people key time way make problem really work say need
  • Original post: https://www.cnblogs.com/CheeseZH/p/5254082.html