  • [ML] K-means clustering with word2vec

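    This post clusters with word2vec vectors (100-dimensional). Each line of the training text is one record, already tokenized. The code is as follows: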

    from sklearn.cluster import KMeans  
    from sklearn import preprocessing
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    #from sklearn.decomposition import PCA
    from gensim.models import Word2Vec
    import nltk
    from nltk.corpus import stopwords
    #from sklearn.model_selection import train_test_split
    import random
    import matplotlib.pyplot as plt
    %matplotlib inline  # Jupyter/IPython magic; drop this line in a plain .py script
    #from sklearn.datasets.samples_generator import make_blob

    Load the text:

    sents = []
    # sents: pre-tokenized file, one record per line, stop words already removed
    with open('generate_data/sents_for_kmeans.txt','r',encoding='utf-8') as f:
        for line in f:
            sents.append(line.replace('\n',''))

    Deduplicate the text:

    sents = list(set(sents))
    print(len(sents))
    print(sents[10])

    The output is:

    67760
    含羞草 芒果 500g 大礼包 散装 无丝 软糯 芒果 100g
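
    Note that `set` does not preserve order, so `sents[10]` can print a different line on each run. If the original line order matters, one alternative (an assumption about intent, not something the original code does) is to deduplicate with `dict.fromkeys`, which keeps the first occurrence of each line in place. A minimal sketch on made-up data:

```python
# Hypothetical sample lines standing in for the contents of the data file.
sample_sents = ["mango 500g", "dried mango", "mango 500g", "gift pack", "dried mango"]

# dict keys are unique and keep insertion order (Python 3.7+),
# so this drops repeats while keeping the first occurrence of each line.
unique_sents = list(dict.fromkeys(sample_sents))

print(len(unique_sents))
print(unique_sents)
```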
    
    Train the word2vec model:

    all_words = [sent.split(' ') for sent in sents]
    word2vec = Word2Vec(all_words)

    Inspect the vocabulary:

    # gensim >= 4.0 exposes the vocabulary as wv.key_to_index (gensim 3.x used wv.vocab)
    vocabulary = word2vec.wv.key_to_index
    print(vocabulary.keys())
    len(vocabulary)

    Collect all word vectors into a single list:

    vectors = []
    for item in vocabulary:
        vectors.append(word2vec.wv[item])
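
    The loop above produces one vector per vocabulary word, so the clustering that follows groups words rather than lines. If the goal were to cluster the lines themselves, one common approach (not what the original code does) is to average the vectors of the words in each line. A minimal sketch; `sentence_vector` and the toy `lookup` table are hypothetical, and in the real pipeline `lookup` would be `word2vec.wv`, which supports the same `in` and `[]` operations:

```python
def sentence_vector(tokens, lookup, dim):
    """Average the vectors of the tokens found in `lookup`.

    Tokens missing from `lookup` are skipped; a line with no known
    tokens gets an all-zero vector of length `dim`.
    """
    found = [lookup[t] for t in tokens if t in lookup]
    if not found:
        return [0.0] * dim
    # zip(*found) walks the vectors component by component.
    return [sum(component) / len(found) for component in zip(*found)]


# Hypothetical 2-dimensional vectors, just to keep the arithmetic visible.
lookup = {"mango": [1.0, 3.0], "dried": [3.0, 5.0]}

print(sentence_vector(["mango", "dried", "unknown"], lookup, dim=2))
print(sentence_vector(["unknown"], lookup, dim=2))
```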

    Train the k-means model:

    num_clusters = 2
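    # n_jobs was removed from KMeans in scikit-learn 1.0, so it is not passed here
    km_cluster = KMeans(n_clusters=num_clusters, max_iter=300, n_init=40, init='k-means++')
    # fit_predict returns the index of the cluster each word vector is assigned to
    #result = km_cluster.fit_predict(vectors)
    #print("Predicting result: ", result)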
    km_cluster.fit(vectors)
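
    To make clearer what `fit` computes, here is a minimal pure-Python sketch of the basic k-means loop (Lloyd's algorithm). It is an illustration only, not scikit-learn's implementation: it takes the first `k` points as initial centers instead of using k-means++, and it does not repeat the run `n_init` times. The function name `simple_kmeans` and the sample points are made up.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def simple_kmeans(points, k, max_iter=300):
    """Basic Lloyd's algorithm; returns (labels, centers)."""
    centers = [list(p) for p in points[:k]]  # naive initialization
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centers


points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (0.2, 0.1), (4.9, 5.2)]
labels, centers = simple_kmeans(points, k=2)
print(labels)
```

    The points near the origin end up in one cluster and the points near (5, 5) in the other, even though both initial centers start near the origin.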

    Visualize the result:

    cents = km_cluster.cluster_centers_
    labels = km_cluster.labels_
    inertia = km_cluster.inertia_
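    # one marker style per cluster: red circles and blue circles
    mark = ['or','ob']
    # each vector is drawn as its component values against the component index
    for vec, label in zip(vectors, labels):
        plt.plot(vec, mark[label], markersize=5)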
    plt.show()
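
    The plot above draws the 100 components of every vector against their index, which makes the two clusters hard to tell apart. A common alternative, hinted at by the commented-out `PCA` import but not part of the original code, is to project the vectors to two dimensions first and draw one point per word. A hedged sketch on random stand-in data; `project_2d` is a hypothetical helper name:

```python
import numpy as np
from sklearn.decomposition import PCA


def project_2d(vectors, seed=0):
    """Reduce a list of equal-length vectors to 2-D coordinates with PCA."""
    X = np.asarray(vectors, dtype=float)
    return PCA(n_components=2, random_state=seed).fit_transform(X)


# Random stand-in for the word vectors: 50 vectors of dimension 100.
rng = np.random.default_rng(0)
demo_vectors = rng.normal(size=(50, 100))

coords = project_2d(demo_vectors)
print(coords.shape)
```

    With the post's own `vectors` and `labels`, `plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5)` would then draw each word as a single point colored by its cluster.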
  • Original article: https://www.cnblogs.com/mj-selina/p/14357708.html