  • Python topic modeling: interactive visualization with LDA and t-SNE

    Original article: http://tecdat.cn/?p=6917

    I use Latent Dirichlet Allocation (LDA) to extract topics from a collection of papers. This tutorial features an end-to-end natural language processing pipeline, starting from raw data and running through preparing, modeling, and visualizing the papers.

    We will cover the following points:

    Topic modeling with LDA
    Visualizing topic models with pyLDAvis
    Visualizing the LDA results with t-SNE and Bokeh

    In [1]:

    # %pylab pulls numpy and matplotlib names (array, argsort, int0, ...) into the
    # namespace; later cells rely on them.
    %pylab inline
    from scipy import sparse as sp
    
    Populating the interactive namespace from numpy and matplotlib
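
    The papers themselves are loaded into a pandas DataFrame p_df before the next cell, but that loading code is not shown in the original post. A minimal sketch, assuming the NIPS papers sit in a CSV file with at least 'Title' and 'PaperText' columns (the file name here is hypothetical):

    import pandas as pd
    
    # Hypothetical path: load the papers into a DataFrame with 'Title' and
    # 'PaperText' columns, which the rest of the notebook expects.
    p_df = pd.read_csv('Papers.csv')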
    

    In [2]:

    docs = array(p_df['PaperText'])
    

     Preprocess and vectorize the documents

    In [3]:

    from nltk.stem.wordnet import WordNetLemmatizer
    from nltk.tokenize import RegexpTokenizer
    
    def docs_preprocessor(docs):
        tokenizer = RegexpTokenizer(r'\w+')
        for idx in range(len(docs)):
            docs[idx] = docs[idx].lower()  # Convert to lowercase.
            docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.
    
        # Remove numbers, but not words that contain numbers.
        docs = [[token for token in doc if not token.isdigit()] for doc in docs]
        
        # Remove short words (three characters or fewer).
        docs = [[token for token in doc if len(token) > 3] for doc in docs]
        
        # Lemmatize all words in documents.
        lemmatizer = WordNetLemmatizer()
        docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]
      
        return docs
    

    In [4]:

    docs = docs_preprocessor(docs)
    

     Compute bigrams and trigrams:

     Since some topics are very similar, it is phrases rather than single words that distinguish them.

    In [5]:

    from gensim.models import Phrases
    # Add bigrams and trigrams to docs (only ones that appear 10 times or more).
    bigram = Phrases(docs, min_count=10)
    trigram = Phrases(bigram[docs])
    
    for idx in range(len(docs)):
        for token in bigram[docs[idx]]:
            if '_' in token:
                # Token is a bigram, add to document.
                docs[idx].append(token)
        for token in trigram[docs[idx]]:
            if '_' in token:
                # Token is a trigram, add to document.
                docs[idx].append(token)
    
    Using TensorFlow backend.
    /opt/conda/lib/python3.6/site-packages/gensim/models/phrases.py:316: UserWarning: For a faster implementation, use the gensim.models.phrases.Phraser class
      warnings.warn("For a faster implementation, use the gensim.models.phrases.Phraser class")
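
    As the warning above notes, the trained Phrases models can be wrapped in gensim's Phraser class for faster phrase detection. An optional optimization, not in the original post (the cell above is left as the author wrote it):

    from gensim.models.phrases import Phraser
    
    # Phraser keeps only the statistics needed to apply the learned phrases,
    # so bigram_phraser[doc] is much faster than bigram[doc]; the trigram
    # phraser is applied to the bigram-transformed documents.
    bigram_phraser = Phraser(bigram)
    trigram_phraser = Phraser(trigram)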
    

    Remove rare and common tokens

    In [6]:

    from gensim.corpora import Dictionary
    
    # Create a dictionary representation of the documents.
    dictionary = Dictionary(docs)
    print('Number of unique words in initial documents:', len(dictionary))
    
    # Filter out words that occur in fewer than 10 documents, or in more than 20% of the documents.
    dictionary.filter_extremes(no_below=10, no_above=0.2)
    print('Number of unique words after removing rare and common words:', len(dictionary))
    
    Number of unique words in initial documents: 39534
    Number of unique words after removing rare and common words: 6001
    

    After pruning common and rare words, we are left with about 6,000 unique words, roughly 15% of the original vocabulary.

    Vectorize the data:
    The first step is to get a bag-of-words representation of each document.

    In [7]:

    corpus = [dictionary.doc2bow(doc) for doc in docs]
    

    In [8]:

    print('Number of unique tokens: %d' % len(dictionary))
    print('Number of documents: %d' % len(corpus))
    
    Number of unique tokens: 6001
    Number of documents: 403
    

    With the bag-of-words corpus in hand, we can go on to learn our topic model from the documents.

    Train the LDA model

    In [9]:

    from gensim.models import LdaModel
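
    The training call in the next cell references several parameters that are not defined in any of the cells shown. A plausible setup, following the standard gensim LDA tutorial (every value except num_topics=4, which the author settles on below, is an assumption):

    # Training parameters (values other than num_topics are assumptions).
    num_topics = 4        # the author picks four topics below using pyLDAvis
    chunksize = 500       # documents processed per training chunk
    passes = 20           # full passes over the corpus
    iterations = 400      # maximum iterations per document
    eval_every = None     # skip perplexity evaluation during training (it is slow)
    id2word = dictionary  # gensim accepts the Dictionary directly as id2word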
    

    In [10]:

    %time model = LdaModel(corpus=corpus, id2word=id2word, chunksize=chunksize, 
                           alpha='auto', eta='auto', 
                           iterations=iterations, num_topics=num_topics, 
                           passes=passes, eval_every=eval_every)
    
    CPU times: user 3min 58s, sys: 348 ms, total: 3min 58s
    Wall time: 3min 59s
    

    How do we choose the number of topics?


    LDA is an unsupervised technique, which means that we do not know, before running the model, how many topics exist in our corpus. Topic coherence is one of the main techniques used to decide on the number of topics.

    However, I used the LDA visualization tool pyLDAvis, tried several numbers of topics, and compared the results. Four appeared to be the number of topics that separates the themes best.
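
    For reference, topic coherence can be computed with gensim's CoherenceModel; a minimal sketch (not part of the original post) that scores the trained model. Repeating it for several values of num_topics is the usual way to pick a topic count:

    from gensim.models import CoherenceModel
    
    # c_v coherence of the trained model on the preprocessed documents;
    # higher values generally indicate more interpretable topics.
    coherence_model = CoherenceModel(model=model, texts=docs,
                                     dictionary=dictionary, coherence='c_v')
    print('Coherence score:', coherence_model.get_coherence())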

    In [11]:

    import pyLDAvis.gensim
    pyLDAvis.enable_notebook()
    
    import warnings
    warnings.filterwarnings("ignore", category=DeprecationWarning) 
    

    In [12]:

    pyLDAvis.gensim.prepare(model, corpus, dictionary)
    

    Out[12]:

    What are we looking at here?

    In the left panel, labelled Intertopic Distance Map, the circles represent the topics and the distances between them. Similar topics appear closer together and dissimilar topics further apart; the relative size of a topic's circle corresponds to the relative frequency of that topic in the corpus.
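
    As an aside (not in the original post), the same interactive view can be written to a standalone HTML file with pyLDAvis:

    # Save the interactive visualization outside the notebook; the output
    # file name is arbitrary.
    vis = pyLDAvis.gensim.prepare(model, corpus, dictionary)
    pyLDAvis.save_html(vis, 'lda_topics.html')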

    How do we evaluate our model?

    Split each document into two parts and check whether the topics assigned to the two halves are similar. => The more similar, the better.

    Compare randomly chosen documents with each other. => The less similar, the better.

    In [13]:

    from sklearn.metrics.pairwise import cosine_similarity
    
    p_df['tokenz'] = docs
    
    docs1 = p_df['tokenz'].apply(lambda l: l[:int0(len(l)/2)])
    docs2 = p_df['tokenz'].apply(lambda l: l[int0(len(l)/2):])
    

    Transform the data

    In [14]:

    corpus1 = [dictionary.doc2bow(doc) for doc in docs1]
    corpus2 = [dictionary.doc2bow(doc) for doc in docs2]
    
    # Using the corpus LDA model transformation
    lda_corpus1 = model[corpus1]
    lda_corpus2 = model[corpus2]
    

    In [15]:

    from collections import OrderedDict
    def get_doc_topic_dist(model, corpus, kwords=False):
        
        '''
        LDA transformation, for each doc only returns topics with non-zero weight
        This function makes a matrix transformation of docs in the topic space.
        '''
        top_dist =[]
        keys = []
    
        for d in corpus:
            tmp = {i:0 for i in range(num_topics)}
            tmp.update(dict(model[d]))
            vals = list(OrderedDict(tmp).values())
            top_dist += [array(vals)]
            if kwords:
                keys += [array(vals).argmax()]
    
        return array(top_dist), keys
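
    The cell that actually computes the similarity numbers below (presumably In [16]) is not shown in the original post. A minimal sketch of what it likely does, using the helper above and scikit-learn's cosine_similarity (variable names are assumptions):

    import numpy as np
    
    # Topic distributions for the two halves of every paper.
    top_dist1, _ = get_doc_topic_dist(model, corpus1)
    top_dist2, _ = get_doc_topic_dist(model, corpus2)
    
    # Pairwise cosine similarities between first halves and second halves.
    sims = cosine_similarity(top_dist1, top_dist2)
    
    print("Intra similarity: cosine similarity for corresponding parts of a doc (higher is better):")
    print(np.mean(np.diag(sims)))                         # matching halves of the same paper
    
    print("Inter similarity: cosine similarity between random parts (lower is better):")
    print(np.mean(sims[~np.eye(len(sims), dtype=bool)]))  # halves from different papers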
    
    Intra similarity: cosine similarity for corresponding parts of a doc (higher is better):
    0.906086532099
    Inter similarity: cosine similarity between random parts (lower is better):
    0.846485334252
    

     Let's look at the terms that appear in each topic.

    In [17]:

    def explore_topic(lda_model, topic_number, topn, output=True):
        """
        accept an LDA model, a topic number and the topn vocabs of interest
        prints a formatted list of the topn terms
        """
        terms = []
        for term, frequency in lda_model.show_topic(topic_number, topn=topn):
            terms += [term]
            if output:
                print(u'{:20} {:.3f}'.format(term, round(frequency, 3)))
        
        return terms
    

    In [18]:
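
    The code in this cell is omitted from the original post; presumably it prints a header and calls explore_topic for each of the four topics, along these lines (a sketch):

    print('{:20} {}\n'.format(u'term', u'frequency'))
    for i in range(num_topics):
        print('Topic ' + str(i) + ' |---------------------\n')
        explore_topic(model, topic_number=i, topn=10)
        print()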

    term                 frequency
    
    Topic 0 |---------------------
    
    data_set             0.006
    embedding            0.004
    query                0.004
    document             0.003
    tensor               0.003
    multi_label          0.003
    graphical_model      0.003
    singular_value       0.003
    topic_model          0.003
    margin               0.003
    Topic 1 |---------------------
    
    policy               0.007
    regret               0.007
    bandit               0.006
    reward               0.006
    active_learning      0.005
    agent                0.005
    vertex               0.005
    item                 0.005
    reward_function      0.005
    submodular           0.004
    Topic 2 |---------------------
    
    convolutional        0.005
    generative_model     0.005
    variational_inference 0.005
    recurrent            0.004
    gaussian_process     0.004
    fully_connected      0.004
    recurrent_neural     0.004
    hidden_unit          0.004
    deep_learning        0.004
    hidden_layer         0.004
    Topic 3 |---------------------
    
    convergence_rate     0.007
    step_size            0.006
    matrix_completion    0.006
    rank_matrix          0.005
    gradient_descent     0.005
    regret               0.004
    sample_complexity    0.004
    strongly_convex      0.004
    line_search          0.003
    sample_size          0.003
    

     From the output above we can inspect each topic and assign it a human-interpretable label. Here I label them as follows:

    In [19]:

    top_labels = {0: 'Statistics', 1:'Numerical Analysis', 2:'Online Learning', 3:'Deep Learning'}
    

    In [20]:

    import re
    import nltk
    from nltk.corpus import stopwords
    
    stops = set(stopwords.words("english"))
    
    def paper_to_wordlist(paper):
        '''
        Convert raw paper text into a list of cleaned, lemmatized words
        (definition reconstructed; the original cell is truncated).
        '''
        # 1. Remove non-letters
        paper_text = re.sub("[^a-zA-Z]"," ", paper)
        # 2. Convert words to lower case and split them
        words = paper_text.lower().split()
        # 3. Remove stop words
        words = [w for w in words if not w in stops]
        # 4. Remove short words
        words = [t for t in words if len(t) > 2]
        # 5. Lemmatize
        words = [nltk.stem.WordNetLemmatizer().lemmatize(t) for t in words]
        return words
    
    In [21]:
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    tvectorizer = TfidfVectorizer(input='content', analyzer = 'word', lowercase=True, stop_words='english',
                                      tokenizer=paper_to_wordlist, ngram_range=(1, 3), min_df=40, max_df=0.20,
                                      norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=True)
    
    dtm = tvectorizer.fit_transform(p_df['PaperText']).toarray()
    

    In [22]:

    top_dist =[]
    for d in corpus:
        tmp = {i:0 for i in range(num_topics)}
        tmp.update(dict(model[d]))
        vals = list(OrderedDict(tmp).values())
        top_dist += [array(vals)]
    

    In [23]:

    top_dist, lda_keys= get_doc_topic_dist(model, corpus, True)
    features = tvectorizer.get_feature_names()
    

    In [24]:

    top_ws = []
    for n in range(len(dtm)):
        inds = int0(argsort(dtm[n])[::-1][:4])
        tmp = [features[i] for i in inds]
        
        top_ws += [' '.join(tmp)]
    
    # (Assumed) hover text for the Bokeh plot below: each paper's top TF-IDF words.
    p_df['Text_Rep'] = top_ws
    
    # Assign each paper to its dominant LDA topic (lda_keys was computed above),
    # so it can be colored and labelled in the plot.
    p_df['clusters'] = lda_keys
    
    cluster_colors = {0: 'blue', 1: 'green', 2: 'yellow', 3: 'red', 4: 'skyblue', 5:'salmon', 6:'orange', 7:'maroon', 8:'crimson', 9:'black', 10:'gray'}
    
    p_df['colors'] = p_df['clusters'].apply(lambda l: cluster_colors[l])
    

    In [25]:

    from sklearn.manifold import TSNE
    tsne = TSNE(n_components=2)
    X_tsne = tsne.fit_transform(top_dist)
    

    In [26]:

    p_df['X_tsne'] =X_tsne[:, 0]
    p_df['Y_tsne'] =X_tsne[:, 1]
    

    In [27]:

    from bokeh.plotting import figure, show, output_notebook, save#, output_file
    from bokeh.models import HoverTool, value, LabelSet, Legend, ColumnDataSource
    output_notebook()
    

     BokehJS 0.12.5 successfully loaded.

    In [28]:

    source = ColumnDataSource(dict(
        x=p_df['X_tsne'],
        y=p_df['Y_tsne'],
        color=p_df['colors'],
        label=p_df['clusters'].apply(lambda l: top_labels[l]),
    #     msize= p_df['marker_size'],
        topic_key= p_df['clusters'],
        title= p_df[u'Title'],
        content = p_df['Text_Rep']
    ))
    

    In [29]:

    title = 'T-SNE visualization of topics'
    
    # The figure construction was omitted from the original cell; a typical
    # Bokeh 0.12.x setup with a hover tool would look like this.
    plot_lda = figure(plot_width=1000, plot_height=600, title=title,
                      tools="pan,wheel_zoom,box_zoom,reset,hover,save",
                      x_axis_type=None, y_axis_type=None, min_border=1)
    
    plot_lda.scatter(x='x', y='y', legend='label', source=source,
                     color='color', alpha=0.8, size=10)#'msize', )
    
    # Show the paper title and its key words when hovering over a point.
    hover = plot_lda.select(dict(type=HoverTool))
    hover.tooltips = [("Title", "@title"), ("KeyWords", "@content")]
    
    show(plot_lda)
    
    

     

     

    If you have any questions, please leave a comment below.
