    Clustering

    This page describes the clustering algorithms in MLlib. The guide for clustering in the RDD-based API also has relevant information about these algorithms.

    K-means

    k-means is one of the most commonly used clustering algorithms that clusters the data points into a predefined number of clusters. The MLlib implementation includes a parallelized variant of the k-means++ method called kmeans||.

    KMeans is implemented as an Estimator and generates a KMeansModel as the base model.
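
    To make the idea of grouping points into a predefined number of clusters concrete, here is a minimal plain-Scala sketch of the classic k-means (Lloyd) iteration. It is an illustration only and not MLlib's implementation: the object name KMeansSketch, the toy 2-D points, and the naive choice of the first k points as initial centers (rather than kmeans||) are all assumptions made for this example.

```scala
object KMeansSketch {
  type Point = Vector[Double]

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the center closest to p.
  def closest(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy(i => squaredDistance(p, centers(i)))

  // Component-wise mean of a non-empty group of points.
  def mean(points: Seq[Point]): Point =
    points.transpose.map(col => col.sum / col.size).toVector

  // Alternate between assigning points to the nearest center and
  // moving each center to the mean of its assigned points.
  def lloyd(points: Seq[Point], initial: Seq[Point], maxIter: Int): Seq[Point] = {
    var centers = initial
    for (_ <- 0 until maxIter) {
      val groups = points.groupBy(p => closest(p, centers))
      // A center that attracted no points keeps its previous position.
      centers = centers.indices.map(i => groups.get(i).map(mean).getOrElse(centers(i)))
    }
    centers
  }

  // Hypothetical toy data: two well-separated groups.
  val points: Seq[Point] = Seq(
    Vector(0.0, 0.0), Vector(0.1, 0.1), Vector(0.2, 0.2),
    Vector(9.0, 9.0), Vector(9.1, 9.1), Vector(9.2, 9.2))

  // Naive initialization (assumption of this sketch): the first k = 2 points.
  val centers: Seq[Point] = lloyd(points, initial = points.take(2), maxIter = 10)

  def main(args: Array[String]): Unit = {
    centers.foreach(println)
    println(points.map(p => closest(p, centers)))
  }
}
```

    Even with the poor starting centers, the two groups end up in separate clusters after a couple of iterations. A better seeding strategy such as kmeans|| mainly reduces how many iterations are needed and how often a run gets stuck in a bad local optimum.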

    Examples

    Refer to the Scala API docs for more details.

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.evaluation.ClusteringEvaluator

    // Loads data.
    val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

    // Trains a k-means model.
    val kmeans = new KMeans().setK(2).setSeed(1L)
    val model = kmeans.fit(dataset)

    // Make predictions
    val predictions = model.transform(dataset)

    // Evaluate clustering by computing Silhouette score
    val evaluator = new ClusteringEvaluator()

    val silhouette = evaluator.evaluate(predictions)
    println(s"Silhouette with squared euclidean distance = $silhouette")

    // Shows the result.
    println("Cluster Centers: ")
    model.clusterCenters.foreach(println)

    Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala" in the Spark repo.
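
    The example above relies on the estimator's defaults. As a hedged sketch, the initialization mode and stopping criteria can also be set explicitly. The object name KMeansParamsSketch, the local SparkSession, the in-memory toy data, and the specific parameter values below are assumptions chosen for illustration, not recommended settings.

```scala
import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object KMeansParamsSketch {
  def fit(spark: SparkSession): KMeansModel = {
    // Hypothetical toy data: a single "features" vector column.
    val dataset = spark.createDataFrame(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1), Vectors.dense(0.2, 0.2),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1), Vectors.dense(9.2, 9.2)
    ).map(Tuple1.apply)).toDF("features")

    new KMeans()
      .setK(2)
      .setInitMode("k-means||") // parallel k-means++ style seeding
      .setInitSteps(2)          // number of seeding rounds
      .setMaxIter(20)           // upper bound on Lloyd iterations
      .setTol(1e-4)             // convergence tolerance on center movement
      .setSeed(1L)
      .fit(dataset)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("KMeansParamsSketch")
      .getOrCreate()

    val model = fit(spark)
    model.clusterCenters.foreach(println)
    // The training summary reports how many points landed in each cluster.
    println(model.summary.clusterSizes.mkString(", "))

    spark.stop()
  }
}
```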

    Latent Dirichlet allocation (LDA)

    LDA is implemented as an Estimator that supports both EMLDAOptimizer and OnlineLDAOptimizer, and generates an LDAModel as the base model. Expert users may cast an LDAModel generated by EMLDAOptimizer to a DistributedLDAModel if needed.
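
    As a hedged sketch of the optimizer choice and the cast described above: the optimizer is selected by name ("em" or "online"), and a model trained with "em" can be matched against DistributedLDAModel. The object name LdaOptimizerSketch, the local SparkSession, and the tiny term-count corpus are assumptions made for this example.

```scala
import org.apache.spark.ml.clustering.{DistributedLDAModel, LDA, LDAModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object LdaOptimizerSketch {
  def fit(spark: SparkSession, optimizer: String): LDAModel = {
    // Hypothetical corpus: each row is a vector of term counts over a 4-word vocabulary.
    val corpus = spark.createDataFrame(Seq(
      Vectors.dense(2.0, 1.0, 0.0, 0.0),
      Vectors.dense(3.0, 2.0, 0.0, 1.0),
      Vectors.dense(0.0, 0.0, 2.0, 3.0),
      Vectors.dense(0.0, 1.0, 3.0, 2.0)
    ).map(Tuple1.apply)).toDF("features")

    new LDA()
      .setK(2)
      .setMaxIter(5)
      .setOptimizer(optimizer) // "em" or "online"
      .setSeed(1L)
      .fit(corpus)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("LdaOptimizerSketch")
      .getOrCreate()

    fit(spark, "em") match {
      case distributed: DistributedLDAModel =>
        // Only the distributed model exposes training-set statistics.
        println(s"Training log likelihood: ${distributed.trainingLogLikelihood}")
        // Convert to a local model when the per-document state is no longer needed.
        val local = distributed.toLocal
        println(s"Converted to a local model: ${!local.isDistributed}")
      case other =>
        println(s"Local model with vocabulary size ${other.vocabSize}")
    }

    spark.stop()
  }
}
```

    Pattern matching is used here instead of asInstanceOf so that the sketch does not fail if the model was trained with the online optimizer.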

    Examples

    Refer to the Scala API docs for more details.

    import org.apache.spark.ml.clustering.LDA

    // Loads data.
    val dataset = spark.read.format("libsvm")
      .load("data/mllib/sample_lda_libsvm_data.txt")

    // Trains an LDA model.
    val lda = new LDA().setK(10).setMaxIter(10)
    val model = lda.fit(dataset)

    val ll = model.logLikelihood(dataset)
    val lp = model.logPerplexity(dataset)
    println(s"The lower bound on the log likelihood of the entire corpus: $ll")
    println(s"The upper bound on perplexity: $lp")

    // Describe topics.
    val topics = model.describeTopics(3)
    println("The topics described by their top-weighted terms:")
    topics.show(false)

    // Shows the result.
    val transformed = model.transform(dataset)
    transformed.show(false)

    Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/LDAExample.scala" in the Spark repo.

    Bisecting k-means

    Bisecting k-means is a kind of hierarchical clustering using a divisive (or “top-down”) approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

    Bisecting k-means can often be much faster than regular k-means, but it will generally produce a different clustering.

    BisectingKMeans is implemented as an Estimator and generates a BisectingKMeansModel as the base model.
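
    The divisive, top-down idea can be sketched in a few lines of plain Scala on 1-D data. This is a toy illustration and not MLlib's algorithm: the object name BisectingSketch, the rule of always splitting the cluster with the largest sum of squared errors, and seeding each split at the cluster's extremes are all assumptions of this example.

```scala
object BisectingSketch {
  private def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Within-cluster sum of squared errors.
  def sse(xs: Seq[Double]): Double = {
    val m = mean(xs)
    xs.map(x => (x - m) * (x - m)).sum
  }

  // Split one cluster in two with a few 2-means iterations,
  // seeded at the cluster's minimum and maximum.
  // Requires at least two distinct values in xs.
  def bisect(xs: Seq[Double], iterations: Int = 10): (Seq[Double], Seq[Double]) = {
    var left = xs.min
    var right = xs.max
    var parts = xs.partition(x => math.abs(x - left) <= math.abs(x - right))
    for (_ <- 0 until iterations) {
      left = mean(parts._1)
      right = mean(parts._2)
      parts = xs.partition(x => math.abs(x - left) <= math.abs(x - right))
    }
    parts
  }

  // Start with everything in one cluster and keep splitting until
  // there are k clusters or nothing is left to split.
  def cluster(xs: Seq[Double], k: Int): List[Seq[Double]] = {
    var clusters = List(xs)
    def divisible = clusters.filter(_.distinct.size > 1)
    while (clusters.size < k && divisible.nonEmpty) {
      val target = divisible.maxBy(c => sse(c))
      val (a, b) = bisect(target)
      clusters = a :: b :: clusters.filterNot(_ eq target)
    }
    clusters
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical data with three visible groups.
    val data = Seq(1.0, 1.1, 0.9, 5.0, 5.2, 9.0, 9.1)
    cluster(data, k = 3).foreach(println)
  }
}
```

    Each split only looks at the points of one cluster, which is one intuition for why the bisecting variant can be cheaper than running regular k-means over all points for every center.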

    Examples

    Refer to the Scala API docs for more details.

    import org.apache.spark.ml.clustering.BisectingKMeans
    import org.apache.spark.ml.evaluation.ClusteringEvaluator

    // Loads data.
    val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

    // Trains a bisecting k-means model.
    val bkm = new BisectingKMeans().setK(2).setSeed(1)
    val model = bkm.fit(dataset)

    // Make predictions
    val predictions = model.transform(dataset)

    // Evaluate clustering by computing Silhouette score
    val evaluator = new ClusteringEvaluator()

    val silhouette = evaluator.evaluate(predictions)
    println(s"Silhouette with squared euclidean distance = $silhouette")

    // Shows the result.
    println("Cluster Centers: ")
    val centers = model.clusterCenters
    centers.foreach(println)

    Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/BisectingKMeansExample.scala" in the Spark repo.

    Gaussian Mixture Model (GMM)

    A Gaussian Mixture Model represents a composite distribution whereby points are drawn from one of k Gaussian sub-distributions, each with its own probability. The spark.ml implementation uses the expectation-maximization algorithm to induce the maximum-likelihood model given a set of samples.

    GaussianMixture is implemented as an Estimator and generates a GaussianMixtureModel as the base model.
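
    In symbols, the composite distribution described above can be written as a weighted sum of Gaussian densities. The notation is chosen here for illustration: w_i, \mu_i and \Sigma_i denote the mixing weight, mean and covariance of the i-th sub-distribution, which is what weights(i), gaussians(i).mean and gaussians(i).cov report on a fitted model.

```latex
p(x) = \sum_{i=1}^{k} w_i \, \mathcal{N}(x \mid \mu_i, \Sigma_i),
\qquad w_i \ge 0, \quad \sum_{i=1}^{k} w_i = 1
```

    Expectation-maximization alternates between computing, for every point, the posterior probability of each component (E-step) and re-estimating w_i, \mu_i and \Sigma_i from those probabilities (M-step).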

    Examples

    Refer to the Scala API docs for more details.

    import org.apache.spark.ml.clustering.GaussianMixture

    // Loads data
    val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

    // Trains Gaussian Mixture Model
    val gmm = new GaussianMixture()
      .setK(2)
    val model = gmm.fit(dataset)

    // Output the parameters of the mixture model
    for (i <- 0 until model.getK) {
      println(s"Gaussian $i: weight=${model.weights(i)} " +
          s"mu=${model.gaussians(i).mean} sigma= ${model.gaussians(i).cov} ")
    }

    Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/GaussianMixtureExample.scala" in the Spark repo.

    Power Iteration Clustering (PIC)

    Power Iteration Clustering (PIC) is a scalable graph clustering algorithm developed by Lin and Cohen. From the abstract: PIC finds a very low-dimensional embedding of a dataset using truncated power iteration on a normalized pair-wise similarity matrix of the data.

    spark.ml’s PowerIterationClustering implementation takes the following parameters:

    • k: the number of clusters to create
    • initMode: the initialization algorithm
    • maxIter: the maximum number of iterations
    • srcCol: the name of the input column for source vertex IDs
    • dstCol: the name of the input column for destination vertex IDs
    • weightCol: the name of the weight column
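
    To illustrate what truncated power iteration on a normalized pairwise similarity matrix means, here is a plain-Scala sketch on a tiny weighted graph given as (src, dst, weight) triples. It is a simplified illustration and not Spark's implementation: the object name PowerIterationSketch, the fixed number of iterations, the degree-based starting vector, and the midpoint threshold that stands in for the final clustering step with k = 2 are all assumptions of this example.

```scala
object PowerIterationSketch {
  // Undirected weighted edges as (src, dst, weight); hypothetical toy graph.
  val edges: Seq[(Int, Int, Double)] = Seq(
    (0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0), (3, 4, 1.0), (4, 0, 0.1))
  val n = 5

  // Symmetric affinity matrix A.
  val affinity: Array[Array[Double]] = {
    val a = Array.fill(n, n)(0.0)
    edges.foreach { case (i, j, w) => a(i)(j) = w; a(j)(i) = w }
    a
  }

  // Row sums (vertex degrees) and the row-normalized matrix W = D^-1 A.
  val degrees: Array[Double] = affinity.map(_.sum)
  val normalized: Array[Array[Double]] =
    affinity.zip(degrees).map { case (row, d) => row.map(_ / d) }

  // One power-iteration step: v <- W v, rescaled to unit L1 norm.
  def step(v: Array[Double]): Array[Double] = {
    val next = normalized.map(row => row.zip(v).map { case (w, x) => w * x }.sum)
    val norm = next.map(math.abs).sum
    next.map(_ / norm)
  }

  // Truncated iteration: start from the normalized degree vector and
  // stop after a few steps, long before the vector becomes constant.
  def embed(iterations: Int): Array[Double] = {
    val start = degrees.map(_ / degrees.sum)
    (0 until iterations).foldLeft(start)((v, _) => step(v))
  }

  // Stand-in for the final clustering step with k = 2:
  // threshold the 1-D embedding at the midpoint of its range.
  def assign(embedding: Array[Double]): Seq[Int] = {
    val mid = (embedding.min + embedding.max) / 2
    embedding.toSeq.map(x => if (x > mid) 0 else 1)
  }

  def main(args: Array[String]): Unit = {
    val embedding = embed(iterations = 5)
    println(embedding.mkString(", "))
    println(assign(embedding))
  }
}
```

    Stopping early is the point of the method: run to convergence, the vector would become constant and carry no cluster information, whereas after a few steps the values of tightly connected vertices (0, 1, 2 versus 3, 4 here) have already moved close together.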

    Examples

    Refer to the Scala API docs for more details.

    import org.apache.spark.ml.clustering.PowerIterationClustering

    val dataset = spark.createDataFrame(Seq(
      (0L, 1L, 1.0),
      (0L, 2L, 1.0),
      (1L, 2L, 1.0),
      (3L, 4L, 1.0),
      (4L, 0L, 0.1)
    )).toDF("src", "dst", "weight")

    val model = new PowerIterationClustering().
      setK(2).
      setMaxIter(20).
      setInitMode("degree").
      setWeightCol("weight")

    val prediction = model.assignClusters(dataset).select("id", "cluster")

    // Shows the cluster assignment
    prediction.show(false)

    Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/PowerIterationClusteringExample.scala" in the Spark repo.

     

  • Original article: https://www.cnblogs.com/wujianming-110117/p/14595178.html