Clustering

This page describes clustering algorithms in MLlib. The guide for clustering in the RDD-based API also has relevant information about these algorithms.

K-means

k-means is one of the most commonly used clustering algorithms that clusters the data points into a predefined number of clusters. The MLlib implementation includes a parallelized variant of the k-means++ method called kmeans||.

KMeans is implemented as an Estimator and generates a KMeansModel as the base model.
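
The iteration that k-means repeats after seeding can be sketched in plain Scala without Spark. This is a minimal single-machine illustration, not MLlib's implementation: the object name LloydSketch and its helpers are hypothetical, the initial centers are supplied by the caller, and the kmeans|| seeding step is not shown.

```scala
object LloydSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the center closest to p.
  def closest(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy(i => sqDist(p, centers(i)))

  // Component-wise mean of a non-empty group of points.
  def mean(points: Seq[Point]): Point =
    points.transpose.map(col => col.sum / col.size).toVector

  // One iteration: assign each point to its nearest center, then move each
  // center to the mean of its points (a center with no points stays put).
  def step(points: Seq[Point], centers: Seq[Point]): Seq[Point] = {
    val groups = points.groupBy(p => closest(p, centers))
    centers.indices.map(i => groups.get(i).map(mean).getOrElse(centers(i)))
  }

  // Repeat until no center moves farther than tol, or maxIter is reached.
  // Assumes at least one center is given.
  @annotation.tailrec
  def fit(points: Seq[Point], centers: Seq[Point],
          maxIter: Int = 20, tol: Double = 1e-4): Seq[Point] = {
    val next = step(points, centers)
    val moved = centers.zip(next).map { case (a, b) => sqDist(a, b) }.max
    if (maxIter <= 1 || moved <= tol * tol) next
    else fit(points, next, maxIter - 1, tol)
  }
}
```

Calling LloydSketch.fit(points, initialCenters) returns the final centers; LloydSketch.closest then plays the role that a fitted model's prediction step would.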

Examples
Refer to the Scala API docs for more details.

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator

// Loads data.
val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

// Trains a k-means model.
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(dataset)

// Make predictions
val predictions = model.transform(dataset)

// Evaluate clustering by computing Silhouette score
val evaluator = new ClusteringEvaluator()
val silhouette = evaluator.evaluate(predictions)
println(s"Silhouette with squared euclidean distance = $silhouette")

// Shows the result.
println("Cluster Centers: ")
model.clusterCenters.foreach(println)

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala" in the Spark repo.
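
To make the Silhouette score in the example more concrete, here is a rough plain-Scala sketch of the textbook definition with squared Euclidean distance as the dissimilarity. The object SilhouetteSketch is hypothetical and computes all pairwise distances directly, so it only suits small inputs. It is not how ClusteringEvaluator is implemented.

```scala
object SilhouetteSketch {
  type Point = Vector[Double]

  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Mean silhouette coefficient over all points.
  // Requires at least two distinct cluster labels.
  def silhouette(points: IndexedSeq[Point], labels: IndexedSeq[Int]): Double = {
    // Point indices grouped by cluster label.
    val clusters = points.indices.groupBy(labels)
    val scores = points.indices.map { i =>
      val own = clusters(labels(i)).filter(_ != i)
      if (own.isEmpty) 0.0 // singleton cluster: score defined as 0
      else {
        // a: mean dissimilarity to the other members of the point's cluster
        val a = own.map(j => sqDist(points(i), points(j))).sum / own.size
        // b: smallest mean dissimilarity to the members of any other cluster
        val b = (clusters - labels(i)).values
          .map(c => c.map(j => sqDist(points(i), points(j))).sum / c.size)
          .min
        (b - a) / math.max(a, b)
      }
    }
    scores.sum / scores.size
  }
}
```

Values close to 1 indicate compact, well-separated clusters; values near 0 or below suggest that points sit between clusters or are assigned to the wrong one.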

Latent Dirichlet allocation (LDA)

LDA is implemented as an Estimator that supports both EMLDAOptimizer and OnlineLDAOptimizer, and generates an LDAModel as the base model. Expert users may cast an LDAModel generated by EMLDAOptimizer to a DistributedLDAModel if needed.

Examples
Refer to the Scala API docs for more details.

import org.apache.spark.ml.clustering.LDA

// Loads data.
val dataset = spark.read.format("libsvm")
  .load("data/mllib/sample_lda_libsvm_data.txt")

// Trains an LDA model.
val lda = new LDA().setK(10).setMaxIter(10)
val model = lda.fit(dataset)

val ll = model.logLikelihood(dataset)
val lp = model.logPerplexity(dataset)
println(s"The lower bound on the log likelihood of the entire corpus: $ll")
println(s"The upper bound on perplexity: $lp")

// Describe topics.
val topics = model.describeTopics(3)
println("The topics described by their top-weighted terms:")
topics.show(false)

// Shows the result.
val transformed = model.transform(dataset)
transformed.show(false)

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/LDAExample.scala" in the Spark repo.
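
The two numbers printed in the example are related: a lower bound on the log likelihood turns into an upper bound on perplexity once it is negated and divided by the number of tokens. The sketch below assumes that usual per-token definition; PerplexitySketch and its functions are hypothetical names, and each document is represented as a plain vector of term counts.

```scala
object PerplexitySketch {
  // Total number of tokens in a corpus of term-count vectors.
  def tokenCount(docs: Seq[Vector[Double]]): Double = docs.map(_.sum).sum

  // Per-token log perplexity derived from a corpus log-likelihood bound.
  // A higher (less negative) likelihood gives a lower perplexity.
  def logPerplexity(logLikelihood: Double, docs: Seq[Vector[Double]]): Double =
    -logLikelihood / tokenCount(docs)
}
```

Because the division is by the token count, the result can be compared across corpora of different sizes, which the raw log likelihood cannot.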

Bisecting k-means

Bisecting k-means is a kind of hierarchical clustering using a divisive (or “top-down”) approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

Bisecting k-means can often be much faster than regular k-means, but it will generally produce a different clustering.

BisectingKMeans is implemented as an Estimator and generates a BisectingKMeansModel as the base model.
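
The divisive idea can be sketched in plain Scala as follows. This is a simplified single-machine illustration under stated assumptions, not MLlib's algorithm: the object BisectingSketch is hypothetical, it always splits the cluster with the largest within-cluster cost, and each split is seeded with the first point and the point farthest from it. Spark's own rule for choosing which clusters to split may differ.

```scala
object BisectingSketch {
  type Point = Vector[Double]

  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def mean(ps: Seq[Point]): Point =
    ps.transpose.map(c => c.sum / c.size).toVector

  // Sum of squared distances from each point to the cluster mean.
  def cost(ps: Seq[Point]): Double = {
    val m = mean(ps)
    ps.map(p => sqDist(p, m)).sum
  }

  // Split one cluster in two with a few 2-means iterations.
  def bisect(ps: Seq[Point], iters: Int = 10): (Seq[Point], Seq[Point]) = {
    val seed0 = ps.head
    val seed1 = ps.maxBy(p => sqDist(p, seed0))
    var split = ps.partition(p => sqDist(p, seed0) <= sqDist(p, seed1))
    for (_ <- 1 to iters) {
      val (l, r) = split
      if (l.nonEmpty && r.nonEmpty) {
        val (c0, c1) = (mean(l), mean(r))
        split = ps.partition(p => sqDist(p, c0) <= sqDist(p, c1))
      }
    }
    split
  }

  // Start with everything in one cluster and keep splitting the cluster
  // with the largest cost until there are k clusters or nothing can be split.
  def cluster(points: Seq[Point], k: Int): Seq[Seq[Point]] = {
    @annotation.tailrec
    def loop(clusters: Seq[Seq[Point]]): Seq[Seq[Point]] = {
      val splittable = clusters.filter(_.size > 1)
      if (clusters.size >= k || splittable.isEmpty) clusters
      else {
        val target = splittable.maxBy(cost)
        val (left, right) = bisect(target)
        if (left.isEmpty || right.isEmpty) clusters // identical points: stop
        else loop(clusters.filterNot(_ eq target) ++ Seq(left, right))
      }
    }
    loop(Seq(points))
  }
}
```

Each split only looks at the points of one cluster, which is one intuition for why the divisive approach can be cheaper than running k-means over the whole dataset for every center.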

Examples
Refer to the Scala API docs for more details.

import org.apache.spark.ml.clustering.BisectingKMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator

// Loads data.
val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

// Trains a bisecting k-means model.
val bkm = new BisectingKMeans().setK(2).setSeed(1)
val model = bkm.fit(dataset)

// Make predictions
val predictions = model.transform(dataset)

// Evaluate clustering by computing Silhouette score
val evaluator = new ClusteringEvaluator()
val silhouette = evaluator.evaluate(predictions)
println(s"Silhouette with squared euclidean distance = $silhouette")

// Shows the result.
println("Cluster Centers: ")
val centers = model.clusterCenters
centers.foreach(println)

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/BisectingKMeansExample.scala" in the Spark repo.

Gaussian Mixture Model (GMM)

A Gaussian Mixture Model represents a composite distribution whereby points are drawn from one of k Gaussian sub-distributions, each with its own probability. The spark.ml implementation uses the expectation-maximization algorithm to induce the maximum-likelihood model given a set of samples.

GaussianMixture is implemented as an Estimator and generates a GaussianMixtureModel as the base model.
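
The expectation-maximization loop can be illustrated with a one-dimensional mixture in plain Scala. This is a sketch under simplifying assumptions, not the spark.ml implementation: GmmSketch and Component are hypothetical names, the data is univariate rather than multivariate, the starting components are supplied by the caller, and a fixed number of iterations replaces a convergence test.

```scala
object GmmSketch {
  // One mixture component: mixing weight, mean, and variance.
  case class Component(weight: Double, mean: Double, variance: Double)

  // Gaussian probability density of component c at x.
  def pdf(x: Double, c: Component): Double =
    math.exp(-(x - c.mean) * (x - c.mean) / (2 * c.variance)) /
      math.sqrt(2 * math.Pi * c.variance)

  // E-step: for each sample, the probability that each component produced it.
  def responsibilities(xs: Seq[Double], cs: Seq[Component]): Seq[Seq[Double]] =
    xs.map { x =>
      val joint = cs.map(c => c.weight * pdf(x, c))
      val total = joint.sum
      joint.map(_ / total)
    }

  // M-step: re-estimate each component from its responsibilities.
  def update(xs: Seq[Double], resp: Seq[Seq[Double]]): Seq[Component] =
    resp.transpose.map { r =>
      val n = r.sum
      val mu = r.zip(xs).map { case (ri, x) => ri * x }.sum / n
      val v = r.zip(xs).map { case (ri, x) => ri * (x - mu) * (x - mu) }.sum / n
      Component(n / xs.size, mu, math.max(v, 1e-6)) // floor avoids zero variance
    }

  // Alternate E and M steps for a fixed number of iterations.
  def fit(xs: Seq[Double], init: Seq[Component], iters: Int = 50): Seq[Component] =
    (1 to iters).foldLeft(init)((cs, _) => update(xs, responsibilities(xs, cs)))
}
```

The fitted weights, means, and variances correspond loosely to the weights, mean, and cov values that the example prints for each Gaussian.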

Examples
Refer to the Scala API docs for more details.

import org.apache.spark.ml.clustering.GaussianMixture

// Loads data
val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

// Trains Gaussian Mixture Model
val gmm = new GaussianMixture()
  .setK(2)
val model = gmm.fit(dataset)

// Output the parameters of the mixture model
for (i <- 0 until model.getK) {
  println(s"Gaussian $i: weight=${model.weights(i)} " +
    s"mu=${model.gaussians(i).mean} sigma= ${model.gaussians(i).cov} ")
}

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/GaussianMixtureExample.scala" in the Spark repo.

Power Iteration Clustering (PIC)

Power Iteration Clustering (PIC) is a scalable graph clustering algorithm developed by Lin and Cohen. From the abstract: PIC finds a very low-dimensional embedding of a dataset using truncated power iteration on a normalized pair-wise similarity matrix of the data.

spark.ml’s PowerIterationClustering implementation takes the following parameters:

- k: the number of clusters to create
- initMode: the initialization algorithm
- maxIter: the maximum number of iterations
- srcCol: the name of the input column for source vertex IDs
- dstCol: the name of the input column for destination vertex IDs
- weightCol: the name of the weight column
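
The truncated power iteration behind PIC can be sketched in plain Scala on a small dense affinity matrix. This is an illustration under assumptions, not Spark's distributed implementation: PicSketch.embed is a hypothetical helper, the matrix is assumed symmetric and non-negative with no all-zero row, and the starting vector is proportional to each vertex's total affinity, in the spirit of a degree-based initialization. In the method described by Lin and Cohen, the resulting one-dimensional embedding is then clustered with k-means, a step omitted here.

```scala
object PicSketch {
  // Runs `iters` steps of power iteration on the row-normalized affinity
  // matrix and returns the resulting one-dimensional embedding.
  def embed(affinity: Vector[Vector[Double]], iters: Int): Vector[Double] = {
    val rowSums = affinity.map(_.sum)
    // Row-normalize: every row of w sums to 1.
    val w = affinity.zip(rowSums).map { case (row, s) => row.map(_ / s) }
    // Start from each vertex's share of the total affinity.
    val total = rowSums.sum
    var v = rowSums.map(_ / total)
    for (_ <- 1 to iters) {
      // Multiply by w, then rescale so the absolute values sum to 1.
      val next = w.map(row => row.zip(v).map { case (a, b) => a * b }.sum)
      val norm = next.map(x => math.abs(x)).sum
      v = next.map(_ / norm)
    }
    v
  }
}
```

Stopping after a small number of iterations is the point of the technique: run to convergence, every entry would approach the same value, while the truncated result still keeps vertices from the same tightly connected group close together.
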
Examples
Refer to the Scala API docs for more details.

import org.apache.spark.ml.clustering.PowerIterationClustering

val dataset = spark.createDataFrame(Seq(
  (0L, 1L, 1.0),
  (0L, 2L, 1.0),
  (1L, 2L, 1.0),
  (3L, 4L, 1.0),
  (4L, 0L, 0.1)
)).toDF("src", "dst", "weight")

val model = new PowerIterationClustering().
  setK(2).
  setMaxIter(20).
  setInitMode("degree").
  setWeightCol("weight")

val prediction = model.assignClusters(dataset).select("id", "cluster")

// Shows the cluster assignment
prediction.show(false)

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/PowerIterationClusteringExample.scala" in the Spark repo.

Original article: https://www.cnblogs.com/wujianming-110117/p/14595178.html
