  • Kmeans clustering and vector quantization

    K-means clustering and vector quantization (scipy.cluster.vq) — SciPy v0.11 Reference Guide (DRAFT)

    Provides routines for k-means clustering, generating code books from k-means models, and quantizing vectors by comparing them with centroids in a code book.

    whiten(obs): Normalize a group of observations on a per-feature basis.
    vq(obs, code_book): Assign codes from a code book to observations.
    kmeans(obs, k_or_guess[, iter, thresh]): Perform k-means on a set of observation vectors, forming k clusters.
    kmeans2(data, k[, iter, thresh, minit, missing]): Classify a set of observations into k clusters using the k-means algorithm.
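The routines listed above are typically chained together. The following is a minimal sketch, assuming NumPy and SciPy are installed; the synthetic two-blob data set and the choice of k=2 are illustrative assumptions, not part of the reference guide.

```python
import numpy as np
from scipy.cluster.vq import whiten, kmeans, vq

# Hypothetical data: two well-separated blobs of 50 two-dimensional points each.
rng = np.random.RandomState(0)
obs = np.vstack([
    rng.normal(0.0, 0.5, (50, 2)),
    rng.normal(5.0, 0.5, (50, 2)),
])

# Scale each feature (column) to unit variance before clustering.
whitened = whiten(obs)

# Build a code book with k=2 centroids; kmeans also returns a distortion value.
code_book, distortion = kmeans(whitened, 2)

# Assign each observation the code of its closest centroid.
codes, dists = vq(whitened, code_book)

print(code_book.shape)  # one row per centroid, one column per feature
print(codes[:5], codes[-5:])
```

Because the initial centroids are chosen at random, which blob receives code 0 and which receives code 1 can differ from run to run.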

    Background information

    The k-means algorithm takes as input the number of clusters to generate, k, and a set of observation vectors to cluster. It returns a set of centroids, one for each of the k clusters. An observation vector is classified with the cluster number or centroid index of the centroid closest to it.
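The classification rule described above can be sketched in a few lines of plain Python. The function name `classify` and the sample centroids are hypothetical and chosen only for illustration; squared Euclidean distance is assumed as the distance measure.

```python
def classify(v, centroids):
    """Return the index of the centroid closest to v (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
    return distances.index(min(distances))

# Hypothetical centroids for k = 3 clusters in two dimensions.
centroids = [(0.0, 0.0), (5.0, 5.0), (0.0, 10.0)]
observations = [(1.0, 1.0), (4.0, 6.0), (1.0, 8.0)]

labels = [classify(v, centroids) for v in observations]
print(labels)  # [0, 1, 2]
```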

    A vector v belongs to cluster i if it is closer to centroid i than to any other centroid. If v belongs to i, we say centroid i is the dominating centroid of v. The k-means algorithm tries to minimize distortion, which is defined as the sum of the squared distances between each observation vector and its dominating centroid. Each step of the k-means algorithm refines the choice of centroids to reduce distortion. The change in distortion is used as a stopping criterion: when the change is lower than a threshold, the k-means algorithm is no longer making sufficient progress and terminates. One can also define a maximum number of iterations.
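To make the refinement loop and the stopping criterion concrete, here is a minimal pure-Python sketch of one common formulation (assign each vector to its dominating centroid, then move each centroid to the mean of its members). It is not SciPy's implementation; the names `kmeans_sketch`, `thresh`, and `max_iter` and the sample data are assumptions made for illustration.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def assign(observations, centroids):
    """For each observation, return the index of its dominating centroid."""
    return [
        min(range(len(centroids)), key=lambda i: squared_distance(v, centroids[i]))
        for v in observations
    ]

def distortion(observations, centroids, codes):
    """Sum of squared distances between each observation and its dominating centroid."""
    return sum(squared_distance(v, centroids[c]) for v, c in zip(observations, codes))

def kmeans_sketch(observations, centroids, thresh=1e-5, max_iter=100):
    centroids = [list(c) for c in centroids]
    previous = float("inf")
    for _ in range(max_iter):
        codes = assign(observations, centroids)
        current = distortion(observations, centroids, codes)
        # Stop when the distortion improves by less than the threshold.
        if previous - current < thresh:
            break
        previous = current
        # Move each centroid to the mean of the vectors it dominates.
        for i in range(len(centroids)):
            members = [v for v, c in zip(observations, codes) if c == i]
            if members:  # an empty cluster keeps its old centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    codes = assign(observations, centroids)
    return centroids, codes, distortion(observations, centroids, codes)

# Hypothetical data: two small groups of points and one initial guess per group.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
final_centroids, final_codes, final_distortion = kmeans_sketch(points, [(0, 0), (10, 10)])
print(final_codes)  # [0, 0, 0, 1, 1, 1]
```

In this run the distortion drops from 4 to 8/3 after the first centroid update and then stops changing, so the loop terminates on the threshold test rather than on the iteration limit.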

    Since vector quantization is a natural application for k-means, information theory terminology is often used. The centroid index or cluster index is also referred to as a “code”, and the table mapping codes to centroids and vice versa is often referred to as a “code book”. The result of k-means, a set of centroids, can be used to quantize vectors. Quantization aims to find an encoding of vectors that reduces the expected distortion.

    All routines expect obs to be an M by N array where the rows are the observation vectors. The code book is a k by N array where the i’th row is the centroid of code word i. The observation vectors and centroids have the same feature dimension.

    As an example, suppose we wish to compress a 24-bit color image (each pixel is represented by one byte for red, one for blue, and one for green) before sending it over the web. By using a smaller 8-bit encoding, we can reduce the amount of data by two thirds. Ideally, the colors for each of the 256 possible 8-bit encoding values should be chosen to minimize distortion of the color. Running k-means with k=256 generates a code book of 256 codes, which fills up all possible 8-bit sequences. Instead of sending a 3-byte value for each pixel, the 8-bit centroid index (or code word) of the dominating centroid is transmitted. The code book is also sent over the wire so each 8-bit code can be translated back to a 24-bit pixel value representation. If the image of interest was of an ocean, we would expect many 24-bit blues to be represented by 8-bit codes. If it was an image of a human face, more flesh tone colors would be represented in the code book.
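The arithmetic behind this example can be checked with a short sketch. The image size (640 by 480), the two-entry palette, and the helper names are hypothetical; a real code book for this scheme would hold 256 entries produced by k-means.

```python
def raw_size_bytes(num_pixels, bytes_per_pixel=3):
    """Size of the uncompressed 24-bit image data."""
    return num_pixels * bytes_per_pixel

def compressed_size_bytes(num_pixels, k=256, bytes_per_color=3):
    """One 8-bit code per pixel plus the code book of k 24-bit colors."""
    return num_pixels + k * bytes_per_color

num_pixels = 640 * 480  # hypothetical image dimensions
raw = raw_size_bytes(num_pixels)
compressed = compressed_size_bytes(num_pixels)
print(raw, compressed)  # 921600 307968

def encode(pixels, palette):
    """Replace each pixel with the index of the closest palette color."""
    def closest(p):
        return min(
            range(len(palette)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(p, palette[i])),
        )
    return bytes(closest(p) for p in pixels)

def decode(codes, palette):
    """Translate each code back to its 24-bit color."""
    return [palette[c] for c in codes]

# Hypothetical two-color code book: a dark blue and a skin tone.
palette = [(0, 0, 128), (224, 172, 105)]
pixels = [(10, 20, 140), (230, 170, 100), (0, 5, 120)]
codes = encode(pixels, palette)
restored = decode(codes, palette)
print(list(codes))  # [0, 1, 0]
```

For this image size the 768-byte code book adds well under one percent to the transmitted data, so the total stays close to one third of the original.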

  • Original article: https://www.cnblogs.com/lexus/p/2808325.html