  • Clustering text documents using k-means in scikit-learn

    Clustering text documents using k-means

    https://scikit-learn.org/stable/auto_examples/text/plot_document_clustering.html#sphx-glr-auto-examples-text-plot-document-clustering-py

        Download the 20 newsgroups dataset,

        extract document features with a term-frequency vectorizer (among other tools),

        run k-means clustering on those features,

        and finally evaluate the clustering quality.

    This is an example showing how scikit-learn can be used to cluster documents by topic using a bag-of-words approach. The example uses a scipy.sparse matrix to store the features instead of standard numpy arrays.

    Two feature extraction methods can be used in this example:

    • TfidfVectorizer uses an in-memory vocabulary (a Python dict) to map the most frequent words to feature indices and hence compute a word occurrence frequency (sparse) matrix. The word frequencies are then reweighted using the Inverse Document Frequency (IDF) vector collected feature-wise over the corpus.

    • HashingVectorizer hashes word occurrences to a fixed-dimensional space, possibly with collisions. The word count vectors are then normalized so that each has an l2-norm equal to one (projected onto the Euclidean unit ball), which seems to be important for k-means to work in high-dimensional space.

      HashingVectorizer does not provide IDF weighting as this is a stateless model (the fit method does nothing). When IDF weighting is needed it can be added by pipelining its output to a TfidfTransformer instance.
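
    For instance, a HashingVectorizer followed by a TfidfTransformer might be wired up as in the minimal sketch below (the documents and parameter values here are purely illustrative; the full example script later in this post builds the same pipeline from command-line options):

    # Sketch: hashed term counts re-weighted with IDF via a pipeline.
    from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
    from sklearn.pipeline import make_pipeline

    docs = ["the cat sat on the mat", "the dog sat on the log"]

    hasher = HashingVectorizer(n_features=2 ** 10, alternate_sign=False, norm=None)
    vectorizer = make_pipeline(hasher, TfidfTransformer())

    X = vectorizer.fit_transform(docs)
    print(X.shape)  # (2, 1024) sparse matrix of IDF-weighted hashed counts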

    Two algorithms are demoed: ordinary k-means and its more scalable cousin minibatch k-means.

    Additionally, latent semantic analysis can also be used to reduce dimensionality and discover latent patterns in the data.

    It can be noted that k-means (and minibatch k-means) are very sensitive to feature scaling and that in this case the IDF weighting helps improve the quality of the clustering by quite a lot as measured against the “ground truth” provided by the class label assignments of the 20 newsgroups dataset.

    This improvement is not visible in the Silhouette Coefficient, which is small for both, as this measure seems to suffer from the phenomenon called “Concentration of Measure” or “Curse of Dimensionality” for high-dimensional datasets such as text data. Other measures such as V-measure and Adjusted Rand Index are information-theoretic evaluation scores: because they are based only on cluster assignments rather than distances, they are not affected by the curse of dimensionality.

    Note: as k-means is optimizing a non-convex objective function, it will likely end up in a local optimum. Several runs with independent random init might be necessary to get a good convergence.
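
    As a small, hedged illustration of that note (the data here is synthetic and only stands in for the real document-term matrix), scikit-learn can run several independent initializations and keep the run with the lowest inertia by raising n_init; the example script below uses n_init=1 for speed:

    # Illustrative only: 10 independent k-means++ initializations; the run
    # with the lowest inertia (sum of squared distances) is kept.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.RandomState(42)
    X_demo = rng.rand(100, 20)  # stand-in for the real feature matrix

    km = KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=42)
    km.fit(X_demo)
    print("best inertia over 10 inits: %.3f" % km.inertia_)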

    Code

    # Author: Peter Prettenhofer <peter.prettenhofer@gmail.com>
    #         Lars Buitinck
    # License: BSD 3 clause
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import Normalizer
    from sklearn import metrics
    
    from sklearn.cluster import KMeans, MiniBatchKMeans
    
    import logging
    from optparse import OptionParser
    import sys
    from time import time
    
    import numpy as np
    
    
    # Display progress logs on stdout
    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s %(levelname)s %(message)s')
    
    # parse commandline arguments
    op = OptionParser()
    op.add_option("--lsa",
                  dest="n_components", type="int",
                  help="Preprocess documents with latent semantic analysis.")
    op.add_option("--no-minibatch",
                  action="store_false", dest="minibatch", default=True,
                  help="Use ordinary k-means algorithm (in batch mode).")
    op.add_option("--no-idf",
                  action="store_false", dest="use_idf", default=True,
                  help="Disable Inverse Document Frequency feature weighting.")
    op.add_option("--use-hashing",
                  action="store_true", default=False,
                  help="Use a hashing feature vectorizer")
    op.add_option("--n-features", type=int, default=10000,
                  help="Maximum number of features (dimensions)"
                       " to extract from text.")
    op.add_option("--verbose",
                  action="store_true", dest="verbose", default=False,
                  help="Print progress reports inside k-means algorithm.")
    
    print(__doc__)
    op.print_help()
    
    
    def is_interactive():
        return not hasattr(sys.modules['__main__'], '__file__')
    
    
    # work-around for Jupyter notebook and IPython console
    argv = [] if is_interactive() else sys.argv[1:]
    (opts, args) = op.parse_args(argv)
    if len(args) > 0:
        op.error("this script takes no arguments.")
        sys.exit(1)
    
    
    # #############################################################################
    # Load some categories from the training set
    categories = [
        'alt.atheism',
        'talk.religion.misc',
        'comp.graphics',
        'sci.space',
    ]
    # Uncomment the following to do the analysis on all the categories
    # categories = None
    
    print("Loading 20 newsgroups dataset for categories:")
    print(categories)
    
    dataset = fetch_20newsgroups(subset='all', categories=categories,
                                 shuffle=True, random_state=42)
    
    print("%d documents" % len(dataset.data))
    print("%d categories" % len(dataset.target_names))
    print()
    
    labels = dataset.target
    true_k = np.unique(labels).shape[0]
    
    print("Extracting features from the training dataset "
          "using a sparse vectorizer")
    t0 = time()
    if opts.use_hashing:
        if opts.use_idf:
            # Perform an IDF normalization on the output of HashingVectorizer
            hasher = HashingVectorizer(n_features=opts.n_features,
                                       stop_words='english', alternate_sign=False,
                                       norm=None)
            vectorizer = make_pipeline(hasher, TfidfTransformer())
        else:
            vectorizer = HashingVectorizer(n_features=opts.n_features,
                                           stop_words='english',
                                           alternate_sign=False, norm='l2')
    else:
        vectorizer = TfidfVectorizer(max_df=0.5, max_features=opts.n_features,
                                     min_df=2, stop_words='english',
                                     use_idf=opts.use_idf)
    X = vectorizer.fit_transform(dataset.data)
    
    print("done in %fs" % (time() - t0))
    print("n_samples: %d, n_features: %d" % X.shape)
    print()
    
    if opts.n_components:
        print("Performing dimensionality reduction using LSA")
        t0 = time()
        # Vectorizer results are normalized, which makes KMeans behave as
        # spherical k-means for better results. Since LSA/SVD results are
        # not normalized, we have to redo the normalization.
        svd = TruncatedSVD(opts.n_components)
        normalizer = Normalizer(copy=False)
        lsa = make_pipeline(svd, normalizer)
    
        X = lsa.fit_transform(X)
    
        print("done in %fs" % (time() - t0))
    
        explained_variance = svd.explained_variance_ratio_.sum()
        print("Explained variance of the SVD step: {}%".format(
            int(explained_variance * 100)))
    
        print()
    
    
    # #############################################################################
    # Do the actual clustering
    
    if opts.minibatch:
        km = MiniBatchKMeans(n_clusters=true_k, init='k-means++', n_init=1,
                             init_size=1000, batch_size=1000, verbose=opts.verbose)
    else:
        km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1,
                    verbose=opts.verbose)
    
    print("Clustering sparse data with %s" % km)
    t0 = time()
    km.fit(X)
    print("done in %0.3fs" % (time() - t0))
    print()
    
    print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_))
    print("Completeness: %0.3f" % metrics.completeness_score(labels, km.labels_))
    print("V-measure: %0.3f" % metrics.v_measure_score(labels, km.labels_))
    print("Adjusted Rand-Index: %.3f"
          % metrics.adjusted_rand_score(labels, km.labels_))
    print("Silhouette Coefficient: %0.3f"
          % metrics.silhouette_score(X, km.labels_, sample_size=1000))
    
    print()
    
    
    if not opts.use_hashing:
        print("Top terms per cluster:")
    
        if opts.n_components:
            original_space_centroids = svd.inverse_transform(km.cluster_centers_)
            order_centroids = original_space_centroids.argsort()[:, ::-1]
        else:
            order_centroids = km.cluster_centers_.argsort()[:, ::-1]
    
        terms = vectorizer.get_feature_names()
        for i in range(true_k):
            print("Cluster %d:" % i, end='')
            for ind in order_centroids[i, :10]:
                print(' %s' % terms[ind], end='')
            print()

    Output

    3387 documents
    4 categories
    
    Extracting features from the training dataset using a sparse vectorizer
    done in 0.820565s
    n_samples: 3387, n_features: 10000
    
    Clustering sparse data with MiniBatchKMeans(batch_size=1000, init_size=1000, n_clusters=4, n_init=1,
                    verbose=False)
    done in 0.065s
    
    Homogeneity: 0.219
    Completeness: 0.338
    V-measure: 0.266
    Adjusted Rand-Index: 0.113
    Silhouette Coefficient: 0.005
    
    Top terms per cluster:
    Cluster 0: cc ibm au buffalo monash com vnet software nicho university
    Cluster 1: space nasa henry access digex toronto gov pat alaska shuttle
    Cluster 2: com god university article don know graphics people posting like
    Cluster 3: sgi keith livesey morality jon solntze wpd caltech objective moral

    Normalizer

    https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html

          Normalizes each sample to unit norm, which is helpful when comparing vectors by similarity.

    Normalize samples individually to unit norm.

    Each sample (i.e. each row of the data matrix) with at least one non zero component is rescaled independently of other samples so that its norm (l1, l2 or inf) equals one.

    This transformer is able to work both with dense numpy arrays and scipy.sparse matrix (use CSR format if you want to avoid the burden of a copy / conversion).

    Scaling inputs to unit norms is a common operation for text classification or clustering. For instance, the dot product of two l2-normalized TF-IDF vectors is the cosine similarity of the vectors, which is the base similarity metric for the Vector Space Model commonly used by the Information Retrieval community (a small sketch after the example below illustrates this).

    The default norm is L2.

    >>> from sklearn.preprocessing import Normalizer
    >>> X = [[4, 1, 2, 2],
    ...      [1, 3, 9, 3],
    ...      [5, 7, 5, 1]]
    >>> transformer = Normalizer().fit(X)  # fit does nothing.
    >>> transformer
    Normalizer()
    >>> transformer.transform(X)
    array([[0.8, 0.2, 0.4, 0.4],
           [0.1, 0.3, 0.9, 0.3],
           [0.5, 0.7, 0.5, 0.1]])
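
    As a small sketch of the cosine-similarity remark above (the documents here are made up for illustration): after l2 normalization, the plain dot product of two rows equals their cosine similarity.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["space shuttle launch", "nasa space mission", "graphics rendering"]
    X = TfidfVectorizer().fit_transform(docs)   # rows are l2-normalized by default

    dot = X[0].multiply(X[1]).sum()             # dot product of two normalized rows
    cos = cosine_similarity(X[0], X[1])[0, 0]   # explicit cosine similarity
    print(dot, cos)                             # the two values coincide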

    https://www.jianshu.com/p/6cf5d60db634

         Norms

    Understanding the L1 and L2 norms

    The L1 and L2 norms are simply the L1-norm and L2-norm; naturally, alongside L1 and L2 there are also L0, L3, and so on. The L1 and L2 norms see the most use in machine learning, for example as regularization terms in regression: Lasso Regression (L1) and Ridge Regression (L2).

    Consequently, the distinction between the two is frequently discussed and frequently tested. But before explaining their definitions and differences, let us first talk about what a norm is.

    What is a norm?

    In linear algebra and some other areas of mathematics, a norm is defined as

    a function that assigns a strictly positive length or size to each vector in a vector space, except for the zero vector. ——Wikipedia

    Put simply, the norm of a vector maps it to a value in the range [0, ∞), where 0 is attained only by the zero vector. Given such a range, it is natural to draw an analogy with distance in the real world, which is why norms are routinely used in machine learning to express distance relations: according to such-and-such a norm, how far apart are these two vectors?

    The “such-and-such” above refers to the kind of norm. In general we call it the p-norm, strictly defined as

    $\|x\|_p = \left( \sum_i |x_i|^p \right)^{1/p}$

    where taking p = 1 gives the 1-norm, i.e. the L1-norm mentioned above, and the L2-norm follows in the same way.

    Definitions of the L1 and L2 norms

    From the formula above, the definitions of the L1-norm and L2-norm follow naturally.

    First substitute p = 1 to get the definition of the L1-norm:

    $\|x\|_1 = \sum_i |x_i|$

    Then substitute p = 2 to obtain the L2-norm:

    $\|x\|_2 = \left( \sum_i |x_i|^2 \right)^{1/2}$

    Expanding L2 gives the familiar Euclidean norm:

    $\|x\|_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$
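
    A quick numerical check of the definitions above (assuming numpy is available):

    import numpy as np

    x = np.array([3.0, -4.0])
    print(np.abs(x).sum())                     # L1 norm: |3| + |-4| = 7
    print(np.sqrt((x ** 2).sum()))             # L2 norm: sqrt(9 + 16) = 5
    print(np.linalg.norm(x, 1), np.linalg.norm(x, 2))  # same values via numpy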

    As an aside, the L1-norm is also called the taxicab-norm or Manhattan-norm; whoever coined the term presumably had riding a taxi through Manhattan in mind. In the figure from the original post (not reproduced here), the green line is the L2 distance between two black points, while the other lines are taxicab, i.e. L1, distances, which indeed resemble the routes we actually follow on a street map.

    The main applications of the L1 and L2 norms in machine learning fall roughly into two categories:

    • used as loss functions;

    • used as regularization terms, the so-called L1-regularization and L2-regularization (a minimal sketch follows this list).
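
    A minimal sketch of the regularization use case (the data is synthetic and purely illustrative): the L1 penalty in Lasso drives many coefficients exactly to zero, while the L2 penalty in Ridge only shrinks them.

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.RandomState(0)
    X = rng.randn(50, 10)
    y = X[:, 0] * 3.0 + rng.randn(50) * 0.1    # only the first feature matters

    lasso = Lasso(alpha=0.1).fit(X, y)         # L1 regularization
    ridge = Ridge(alpha=1.0).fit(X, y)         # L2 regularization
    print(np.sum(lasso.coef_ == 0), "coefficients zeroed by Lasso")
    print(np.sum(ridge.coef_ == 0), "coefficients zeroed by Ridge")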
