  • Spark GraphX Graph Computation Core Source Code Analysis [Graph Builders, Vertices, Edges]

    1. Graph Builders

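      GraphX provides several ways to build a graph from a collection of vertices and edges held in an RDD or on disk. By default, none of the graph builders repartitions the graph's edges; instead, edges are left in their default partitions. Graph.groupEdges requires the graph to be repartitioned because it assumes that identical edges are colocated on the same partition, so you must call Graph.partitionBy before calling groupEdges.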
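
      The ordering requirement can be sketched as follows: partitionBy runs first, then groupEdges. This is a minimal sketch, assuming Spark with GraphX on the classpath and a local master; the object name GroupEdgesSketch, the sample edges, and the choice of EdgePartition2D are illustrative assumptions, not part of the original article.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy}

// Hypothetical helper object, used only for this illustration.
object GroupEdgesSketch {
  // Builds a small graph containing one duplicated edge and merges the duplicates.
  def mergedEdges(sc: SparkContext): Set[(Long, Long, Int)] = {
    val edges = sc.parallelize(
      Seq(Edge(1L, 2L, 1), Edge(1L, 2L, 1), Edge(2L, 3L, 1)), numSlices = 2)
    val graph = Graph.fromEdges(edges, defaultValue = 0)
    graph
      .partitionBy(PartitionStrategy.EdgePartition2D) // colocate identical edges first
      .groupEdges(_ + _)                              // then merge them by summing attributes
      .edges
      .map(e => (e.srcId, e.dstId, e.attr))
      .collect()
      .toSet
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("group-edges-sketch")
    val sc = new SparkContext(conf)
    try println(mergedEdges(sc)) finally sc.stop()
  }
}
```

      Without the partitionBy step, the two copies of the edge (1, 2) may sit in different partitions, in which case groupEdges would leave them unmerged.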

    Source code:

      package org.apache.spark.graphx

      import org.apache.spark.SparkContext
      import org.apache.spark.graphx.impl.{EdgePartitionBuilder, GraphImpl}
      import org.apache.spark.internal.Logging
      import org.apache.spark.storage.StorageLevel

      /**
       * Provides utilities for loading [[Graph]]s from files.
       */
      object GraphLoader extends Logging {

        /**
         * Loads a graph from an edge list formatted file where each line contains two integers: a source
         * id and a target id. Skips lines that begin with `#`.
         */
        def edgeListFile(
            sc: SparkContext,
            path: String,
            canonicalOrientation: Boolean = false,
            numEdgePartitions: Int = -1,
            edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY, // storage level used to cache edges
            vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY)
          : Graph[Int, Int] =
        {
          val startTime = System.currentTimeMillis

          // Parse the edge data table directly into edge partitions
          val lines =
            if (numEdgePartitions > 0) { // load the file data
              sc.textFile(path, numEdgePartitions).coalesce(numEdgePartitions)
            } else {
              sc.textFile(path)
            } // build the graph partition by partition
          val edges = lines.mapPartitionsWithIndex { (pid, iter) =>
            val builder = new EdgePartitionBuilder[Int, Int]
            iter.foreach { line =>
              if (!line.isEmpty && line(0) != '#') { // skip comment lines
                val lineArray = line.split("\\s+")
                if (lineArray.length < 2) { // reject malformed lines
                  throw new IllegalArgumentException("Invalid line: " + line)
                }
                val srcId = lineArray(0).toLong
                val dstId = lineArray(1).toLong
                if (canonicalOrientation && srcId > dstId) {
                  builder.add(dstId, srcId, 1) // add each edge with attribute 1
                } else {
                  builder.add(srcId, dstId, 1)
                }
              }
            }
            Iterator((pid, builder.toEdgePartition))
          }.persist(edgeStorageLevel).setName("GraphLoader.edgeListFile - edges (%s)".format(path))
          edges.count() // force evaluation

          logInfo("It took %d ms to load the edges".format(System.currentTimeMillis - startTime))

          GraphImpl.fromEdgePartitions(edges, defaultVertexAttr = 1, edgeStorageLevel = edgeStorageLevel,
            vertexStorageLevel = vertexStorageLevel)
        } // end of edgeListFile

      }

    Source code analysis:

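      GraphLoader.edgeListFile loads graph data from an edge list stored on disk or on a file system such as HDFS. It parses the file as an adjacency list of (source vertex ID, destination vertex ID) pairs and skips comment lines that begin with #. The graph is built from the specified edges, and any vertices mentioned by those edges are created automatically. All vertex and edge attributes default to 1. The canonicalOrientation parameter reorients edges in the positive direction (srcId < dstId), which the connected components algorithm requires.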
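
      The effect of canonicalOrientation can be sketched with a tiny edge list. This is a minimal sketch, assuming Spark with GraphX on the classpath and a local file system; the object name EdgeListFileSketch, the temporary file, and its contents are illustrative assumptions.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader

// Hypothetical helper object, used only for this illustration.
object EdgeListFileSketch {
  // Writes a three-line edge list to a temporary file and loads it as a graph.
  def load(sc: SparkContext): Set[(Long, Long)] = {
    val path = Files.createTempFile("edges", ".txt")
    // One comment line, one edge with srcId > dstId, one edge with srcId < dstId.
    Files.write(path, "# src dst\n2 1\n1 3\n".getBytes("UTF-8"))
    val graph = GraphLoader.edgeListFile(
      sc, path.toUri.toString, canonicalOrientation = true)
    // The edge "2 1" is stored as (1, 2) because canonicalOrientation is true.
    graph.edges.map(e => (e.srcId, e.dstId)).collect().toSet
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("edge-list-file-sketch")
    val sc = new SparkContext(conf)
    try println(load(sc)) finally sc.stop()
  }
}
```

      With canonicalOrientation left at its default of false, the first edge would stay as (2, 1).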

    Source code:

      /**
       * The Graph object contains a collection of routines used to construct graphs from RDDs.
       */
      object Graph {

        /**
         * Construct a graph from a collection of edges encoded as vertex id pairs.
         *
         * @param rawEdges a collection of edges in (src, dst) form
         * @param defaultValue the vertex attributes with which to create vertices referenced by the edges
         * @param uniqueEdges if multiple identical edges are found they are combined and the edge
         * attribute is set to the sum.  Otherwise duplicate edges are treated as separate. To enable
         * `uniqueEdges`, a [[PartitionStrategy]] must be provided.
         * @param edgeStorageLevel the desired storage level at which to cache the edges if necessary
         * @param vertexStorageLevel the desired storage level at which to cache the vertices if necessary
         *
         * @return a graph with edge attributes containing either the count of duplicate edges or 1
         * (if `uniqueEdges` is `None`) and vertex attributes containing the total degree of each vertex.
         */
        def fromEdgeTuples[VD: ClassTag](
            rawEdges: RDD[(VertexId, VertexId)],
            defaultValue: VD,
            uniqueEdges: Option[PartitionStrategy] = None,
            edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
            vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, Int] =
        {
          val edges = rawEdges.map(p => Edge(p._1, p._2, 1))
          val graph = GraphImpl(edges, defaultValue, edgeStorageLevel, vertexStorageLevel)
          uniqueEdges match {
            case Some(p) => graph.partitionBy(p).groupEdges((a, b) => a + b)
            case None => graph
          }
        }

        /**
         * Construct a graph from a collection of edges.
         *
         * @param edges the RDD containing the set of edges in the graph
         * @param defaultValue the default vertex attribute to use for each vertex
         * @param edgeStorageLevel the desired storage level at which to cache the edges if necessary
         * @param vertexStorageLevel the desired storage level at which to cache the vertices if necessary
         *
         * @return a graph with edge attributes described by `edges` and vertices
         *         given by all vertices in `edges` with value `defaultValue`
         */
        def fromEdges[VD: ClassTag, ED: ClassTag](
            edges: RDD[Edge[ED]],
            defaultValue: VD,
            edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
            vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED] = {
          GraphImpl(edges, defaultValue, edgeStorageLevel, vertexStorageLevel)
        }

        /**
         * Construct a graph from a collection of vertices and
         * edges with attributes.  Duplicate vertices are picked arbitrarily and
         * vertices found in the edge collection but not in the input
         * vertices are assigned the default attribute.
         *
         * @tparam VD the vertex attribute type
         * @tparam ED the edge attribute type
         * @param vertices the "set" of vertices and their attributes
         * @param edges the collection of edges in the graph
         * @param defaultVertexAttr the default vertex attribute to use for vertices that are
         *                          mentioned in edges but not in vertices
         * @param edgeStorageLevel the desired storage level at which to cache the edges if necessary
         * @param vertexStorageLevel the desired storage level at which to cache the vertices if necessary
         */
        def apply[VD: ClassTag, ED: ClassTag](
            vertices: RDD[(VertexId, VD)],
            edges: RDD[Edge[ED]],
            defaultVertexAttr: VD = null.asInstanceOf[VD],
            edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
            vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED] = {
          GraphImpl(vertices, edges, defaultVertexAttr, edgeStorageLevel, vertexStorageLevel)
        }

        /**
         * Implicitly extracts the [[GraphOps]] member from a graph.
         *
         * To improve modularity the Graph type only contains a small set of basic operations.
         * All the convenience operations are defined in the [[GraphOps]] class which may be
         * shared across multiple graph implementations.
         */
        implicit def graphToGraphOps[VD: ClassTag, ED: ClassTag]
            (g: Graph[VD, ED]): GraphOps[VD, ED] = g.ops
      } // end of Graph

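    Source code analysis:

      Graph.apply creates a graph from an RDD of vertices and an RDD of edges. Duplicate vertices are picked arbitrarily, and vertices that appear in the edge RDD but not in the vertex RDD are assigned the default attribute.

      Graph.fromEdges creates a graph from an RDD of edges alone. The vertices mentioned by the edges are created automatically and assigned the default value.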

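      Graph.fromEdgeTuples creates a graph from an RDD of edge tuples alone. It assigns each edge the value 1 and automatically creates the vertices mentioned by the edges, assigning them the default value. It also supports deduplicating edges: to enable this, pass Some of a PartitionStrategy as the uniqueEdges parameter (for example, uniqueEdges = Some(PartitionStrategy.RandomVertexCut)). Duplicates are then merged and the edge attribute becomes their sum, that is, the number of copies. A partition strategy is required so that identical edges land on the same partition, where they can be deduplicated.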
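
      The builders described above can be compared side by side. This is a minimal sketch, assuming Spark with GraphX on the classpath; the object name GraphBuildersSketch, the vertex names, and the "unknown" default are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy}

// Hypothetical helper object, used only for this illustration.
object GraphBuildersSketch {
  // Graph.apply: vertex 3 appears only in the edges, so it gets the default attribute.
  def vertexAttrs(sc: SparkContext): Map[Long, String] = {
    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
    Graph(vertices, edges, defaultVertexAttr = "unknown").vertices.collect().toMap
  }

  // Graph.fromEdgeTuples with uniqueEdges: the duplicated (1, 2) tuple becomes
  // a single edge whose attribute is the number of copies.
  def dedupedEdges(sc: SparkContext): Set[(Long, Long, Int)] = {
    val raw = sc.parallelize(Seq((1L, 2L), (1L, 2L), (2L, 3L)))
    Graph
      .fromEdgeTuples(raw, defaultValue = 0,
        uniqueEdges = Some(PartitionStrategy.RandomVertexCut))
      .edges
      .map(e => (e.srcId, e.dstId, e.attr))
      .collect()
      .toSet
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("graph-builders-sketch")
    val sc = new SparkContext(conf)
    try {
      println(vertexAttrs(sc))
      println(dedupedEdges(sc))
    } finally sc.stop()
  }
}
```

      Graph.fromEdges sits between the two: it takes Edge objects with arbitrary attributes, like Graph.apply, but creates every vertex with the default value, like Graph.fromEdgeTuples.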

    2. Vertex RDD

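      VertexRDD[A] extends RDD[(VertexId, A)] and adds the constraint that each VertexId occurs only once. A VertexRDD[A] therefore represents a set of vertices, each with an attribute of type A. Internally, this is achieved by storing the vertex attributes in a reusable hash-map data structure. As a consequence, two VertexRDDs derived from the same base VertexRDD can be joined in constant time without hash evaluations.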
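
      Index sharing between derived VertexRDDs can be sketched as follows. This is a minimal sketch, assuming Spark with GraphX on the classpath; the object name VertexIndexSketch and the sample data are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{VertexId, VertexRDD}

// Hypothetical helper object, used only for this illustration.
object VertexIndexSketch {
  def joined(sc: SparkContext): Map[Long, Int] = {
    // Base set: vertices 1..5, each carrying its own id as an Int attribute.
    val base: VertexRDD[Int] =
      VertexRDD(sc.parallelize(1L to 5L).map(id => (id, id.toInt)))
    // Both derived sets keep the index of `base`.
    val evens: VertexRDD[Int] = base.filter { case (_, v) => v % 2 == 0 }
    val doubled: VertexRDD[Int] = base.mapValues((v: Int) => v * 2)
    // Only vertices present on both sides survive the inner join.
    doubled
      .innerJoin(evens)((id: VertexId, d: Int, e: Int) => d + e)
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("vertex-index-sketch")
    val sc = new SparkContext(conf)
    try println(joined(sc)) finally sc.stop()
  }
}
```

      Vertices 2 and 4 pass the filter, so the join yields 4 + 2 = 6 and 8 + 4 = 12 for them.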

    Source code:

      /**
       * @tparam VD the vertex attribute associated with each vertex in the set.
       */
      abstract class VertexRDD[VD](
          sc: SparkContext,
          deps: Seq[Dependency[_]]) extends RDD[(VertexId, VD)](sc, deps) {

        implicit protected def vdTag: ClassTag[VD]

        private[graphx] def partitionsRDD: RDD[ShippableVertexPartition[VD]]

        override protected def getPartitions: Array[Partition] = partitionsRDD.partitions

        /**
         * Provides the `RDD[(VertexId, VD)]` equivalent output.
         */
        override def compute(part: Partition, context: TaskContext): Iterator[(VertexId, VD)] = {
          firstParent[ShippableVertexPartition[VD]].iterator(part, context).next().iterator
        }

        /**
         * Construct a new VertexRDD that is indexed by only the visible vertices. The resulting
         * VertexRDD will be based on a different index and can no longer be quickly joined with this
         * RDD.
         */
        def reindex(): VertexRDD[VD]

        /**
         * Applies a function to each `VertexPartition` of this RDD and returns a new VertexRDD.
         */
        private[graphx] def mapVertexPartitions[VD2: ClassTag](
            f: ShippableVertexPartition[VD] => ShippableVertexPartition[VD2])
          : VertexRDD[VD2]

        /**
         * Restricts the vertex set to the set of vertices satisfying the given predicate. This operation
         * preserves the index for efficient joins with the original RDD, and it sets bits in the bitmask
         * rather than allocating new memory.
         *
         * It is declared and defined here to allow refining the return type from `RDD[(VertexId, VD)]` to
         * `VertexRDD[VD]`.
         *
         * @param pred the user defined predicate, which takes a tuple to conform to the
         * `RDD[(VertexId, VD)]` interface
         */
        override def filter(pred: Tuple2[VertexId, VD] => Boolean): VertexRDD[VD] =
          this.mapVertexPartitions(_.filter(Function.untupled(pred)))

        /**
         * Maps each vertex attribute, preserving the index.
         *
         * @tparam VD2 the type returned by the map function
         *
         * @param f the function applied to each value in the RDD
         * @return a new VertexRDD with values obtained by applying `f` to each of the entries in the
         * original VertexRDD
         */
        def mapValues[VD2: ClassTag](f: VD => VD2): VertexRDD[VD2]

        /**
         * Maps each vertex attribute, additionally supplying the vertex ID.
         *
         * @tparam VD2 the type returned by the map function
         *
         * @param f the function applied to each ID-value pair in the RDD
         * @return a new VertexRDD with values obtained by applying `f` to each of the entries in the
         * original VertexRDD.  The resulting VertexRDD retains the same index.
         */
        def mapValues[VD2: ClassTag](f: (VertexId, VD) => VD2): VertexRDD[VD2]

        /**
         * For each VertexId present in both `this` and `other`, minus will act as a set difference
         * operation returning only those unique VertexId's present in `this`.
         *
         * @param other an RDD to run the set operation against
         */
        def minus(other: RDD[(VertexId, VD)]): VertexRDD[VD]

        /**
         * For each VertexId present in both `this` and `other`, minus will act as a set difference
         * operation returning only those unique VertexId's present in `this`.
         *
         * @param other a VertexRDD to run the set operation against
         */
        def minus(other: VertexRDD[VD]): VertexRDD[VD]

        /**
         * For each vertex present in both `this` and `other`, `diff` returns only those vertices with
         * differing values; for values that are different, keeps the values from `other`. This is
         * only guaranteed to work if the VertexRDDs share a common ancestor.
         *
         * @param other the other RDD[(VertexId, VD)] with which to diff against.
         */
        def diff(other: RDD[(VertexId, VD)]): VertexRDD[VD]

        /**
         * For each vertex present in both `this` and `other`, `diff` returns only those vertices with
         * differing values; for values that are different, keeps the values from `other`. This is
         * only guaranteed to work if the VertexRDDs share a common ancestor.
         *
         * @param other the other VertexRDD with which to diff against.
         */
        def diff(other: VertexRDD[VD]): VertexRDD[VD]

        /**
         * Left joins this RDD with another VertexRDD with the same index. This function will fail if
         * both VertexRDDs do not share the same index. The resulting vertex set contains an entry for
         * each vertex in `this`.
         * If `other` is missing any vertex in this VertexRDD, `f` is passed `None`.
         *
         * @tparam VD2 the attribute type of the other VertexRDD
         * @tparam VD3 the attribute type of the resulting VertexRDD
         *
         * @param other the other VertexRDD with which to join.
         * @param f the function mapping a vertex id and its attributes in this and the other vertex set
         * to a new vertex attribute.
         * @return a VertexRDD containing the results of `f`
         */
        def leftZipJoin[VD2: ClassTag, VD3: ClassTag]
            (other: VertexRDD[VD2])(f: (VertexId, VD, Option[VD2]) => VD3): VertexRDD[VD3]

        /**
         * Left joins this VertexRDD with an RDD containing vertex attribute pairs. If the other RDD is
         * backed by a VertexRDD with the same index then the efficient [[leftZipJoin]] implementation is
         * used. The resulting VertexRDD contains an entry for each vertex in `this`. If `other` is
         * missing any vertex in this VertexRDD, `f` is passed `None`. If there are duplicates,
         * the vertex is picked arbitrarily.
         *
         * @tparam VD2 the attribute type of the other VertexRDD
         * @tparam VD3 the attribute type of the resulting VertexRDD
         *
         * @param other the other VertexRDD with which to join
         * @param f the function mapping a vertex id and its attributes in this and the other vertex set
         * to a new vertex attribute.
         * @return a VertexRDD containing all the vertices in this VertexRDD with the attributes emitted
         * by `f`.
         */
        def leftJoin[VD2: ClassTag, VD3: ClassTag]
            (other: RDD[(VertexId, VD2)])
            (f: (VertexId, VD, Option[VD2]) => VD3)
          : VertexRDD[VD3]

        /**
         * Efficiently inner joins this VertexRDD with another VertexRDD sharing the same index. See
         * [[innerJoin]] for the behavior of the join.
         */
        def innerZipJoin[U: ClassTag, VD2: ClassTag](other: VertexRDD[U])
            (f: (VertexId, VD, U) => VD2): VertexRDD[VD2]

        /**
         * Inner joins this VertexRDD with an RDD containing vertex attribute pairs. If the other RDD is
         * backed by a VertexRDD with the same index then the efficient [[innerZipJoin]] implementation
         * is used.
         *
         * @param other an RDD containing vertices to join. If there are multiple entries for the same
         * vertex, one is picked arbitrarily. Use [[aggregateUsingIndex]] to merge multiple entries.
         * @param f the join function applied to corresponding values of `this` and `other`
         * @return a VertexRDD co-indexed with `this`, containing only vertices that appear in both
         *         `this` and `other`, with values supplied by `f`
         */
        def innerJoin[U: ClassTag, VD2: ClassTag](other: RDD[(VertexId, U)])
            (f: (VertexId, VD, U) => VD2): VertexRDD[VD2]

        /**
         * Aggregates vertices in `messages` that have the same ids using `reduceFunc`, returning a
         * VertexRDD co-indexed with `this`.
         *
         * @param messages an RDD containing messages to aggregate, where each message is a pair of its
         * target vertex ID and the message data
         * @param reduceFunc the associative aggregation function for merging messages to the same vertex
         * @return a VertexRDD co-indexed with `this`, containing only vertices that received messages.
         * For those vertices, their values are the result of applying `reduceFunc` to all received
         * messages.
         */
        def aggregateUsingIndex[VD2: ClassTag](
            messages: RDD[(VertexId, VD2)], reduceFunc: (VD2, VD2) => VD2): VertexRDD[VD2]

        /**
         * Returns a new `VertexRDD` reflecting a reversal of all edge directions in the corresponding
         * [[EdgeRDD]].
         */
        def reverseRoutingTables(): VertexRDD[VD]

        /** Prepares this VertexRDD for efficient joins with the given EdgeRDD. */
        def withEdges(edges: EdgeRDD[_]): VertexRDD[VD]

        /** Replaces the vertex partitions while preserving all other properties of the VertexRDD. */
        private[graphx] def withPartitionsRDD[VD2: ClassTag](
            partitionsRDD: RDD[ShippableVertexPartition[VD2]]): VertexRDD[VD2]

        /**
         * Changes the target storage level while preserving all other properties of the
         * VertexRDD. Operations on the returned VertexRDD will preserve this storage level.
         *
         * This does not actually trigger a cache; to do this, call
         * [[org.apache.spark.graphx.VertexRDD#cache]] on the returned VertexRDD.
         */
        private[graphx] def withTargetStorageLevel(
            targetStorageLevel: StorageLevel): VertexRDD[VD]

        /** Generates an RDD of vertex attributes suitable for shipping to the edge partitions. */
        private[graphx] def shipVertexAttributes(
            shipSrc: Boolean, shipDst: Boolean): RDD[(PartitionID, VertexAttributeBlock[VD])]

        /** Generates an RDD of vertex IDs suitable for shipping to the edge partitions. */
        private[graphx] def shipVertexIds(): RDD[(PartitionID, Array[VertexId])]

      } // end of VertexRDD

    Source code analysis:

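      Basic operations such as filter, leftJoin, and innerJoin work much like their Spark SQL counterparts and are used the same way; only the shape of the data differs. VertexRDD also has operators of its own. For example, aggregateUsingIndex builds a new VertexRDD efficiently. Conceptually, if a VertexRDD[B] has been built over a set of vertices that is a superset of the vertices in some RDD[(VertexId, A)], its index can be reused both to aggregate and then to index that RDD[(VertexId, A)], which greatly improves efficiency.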
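
      Index reuse through aggregateUsingIndex can be sketched as follows, in the spirit of the example in the GraphX programming guide. This is a minimal sketch, assuming Spark with GraphX on the classpath; the object name AggregateUsingIndexSketch and the sample data are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{VertexId, VertexRDD}
import org.apache.spark.rdd.RDD

// Hypothetical helper object, used only for this illustration.
object AggregateUsingIndexSketch {
  def aggregated(sc: SparkContext): Map[Long, Double] = {
    // setA provides the index: vertices 0..3, each with attribute 1.
    val setA: VertexRDD[Int] =
      VertexRDD(sc.parallelize(0L until 4L).map(id => (id, 1)))
    // rddB holds two messages per vertex, so it is not a valid vertex set by itself.
    val rddB: RDD[(VertexId, Double)] =
      sc.parallelize(0L until 4L).flatMap(id => List((id, 1.0), (id, 2.0)))
    // Reuse setA's index to reduce the duplicates: each vertex gets 1.0 + 2.0.
    val setB: VertexRDD[Double] = setA.aggregateUsingIndex(rddB, _ + _)
    // setA and setB are co-indexed, so this join can take the fast path.
    val setC: VertexRDD[Double] = setA.innerJoin(setB)((id, a, b) => a + b)
    setC.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("aggregate-using-index-sketch")
    val sc = new SparkContext(conf)
    try println(aggregated(sc)) finally sc.stop()
  }
}
```

      Every vertex ends up with 1 + 3.0 = 4.0: aggregateUsingIndex folds the two messages into 3.0, and the join adds the attribute from setA.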

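    3. Edge RDD

      EdgeRDD[ED] extends RDD[Edge[ED]] and organizes the edges in blocks partitioned using one of the partitioning strategies defined in PartitionStrategy. Within each partition, edge attributes and adjacency structure are stored separately, which allows maximum reuse when attribute values change.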

    Source code:

      abstract class EdgeRDD[ED](
          sc: SparkContext,
          deps: Seq[Dependency[_]]) extends RDD[Edge[ED]](sc, deps) {

        // scalastyle:off structural.type
        private[graphx] def partitionsRDD: RDD[(PartitionID, EdgePartition[ED, VD])] forSome { type VD }
        // scalastyle:on structural.type

        override protected def getPartitions: Array[Partition] = partitionsRDD.partitions

        override def compute(part: Partition, context: TaskContext): Iterator[Edge[ED]] = {
          val p = firstParent[(PartitionID, EdgePartition[ED, _])].iterator(part, context)
          if (p.hasNext) {
            p.next()._2.iterator.map(_.copy())
          } else {
            Iterator.empty
          }
        }

        /**
         * Map the values in an edge partitioning preserving the structure but changing the values.
         *
         * @tparam ED2 the new edge value type
         * @param f the function from an edge to a new edge value
         * @return a new EdgeRDD containing the new edge values
         */
        def mapValues[ED2: ClassTag](f: Edge[ED] => ED2): EdgeRDD[ED2]

        /**
         * Reverse all the edges in this RDD.
         *
         * @return a new EdgeRDD containing all the edges reversed
         */
        def reverse: EdgeRDD[ED]

        /**
         * Inner joins this EdgeRDD with another EdgeRDD, assuming both are partitioned using the same
         * [[PartitionStrategy]].
         *
         * @param other the EdgeRDD to join with
         * @param f the join function applied to corresponding values of `this` and `other`
         * @return a new EdgeRDD containing only edges that appear in both `this` and `other`,
         *         with values supplied by `f`
         */
        def innerJoin[ED2: ClassTag, ED3: ClassTag]
            (other: EdgeRDD[ED2])
            (f: (VertexId, VertexId, ED, ED2) => ED3): EdgeRDD[ED3]

        /**
         * Changes the target storage level while preserving all other properties of the
         * EdgeRDD. Operations on the returned EdgeRDD will preserve this storage level.
         *
         * This does not actually trigger a cache; to do this, call
         * [[org.apache.spark.graphx.EdgeRDD#cache]] on the returned EdgeRDD.
         */
        private[graphx] def withTargetStorageLevel(targetStorageLevel: StorageLevel): EdgeRDD[ED]
      }

    Source code analysis:

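      EdgeRDD is rarely used on its own. In most cases, operations on an EdgeRDD are performed through the graph operators or rely on operations defined in the base RDD class.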
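
      The two routes can be compared on a small graph. This is a minimal sketch, assuming Spark with GraphX on the classpath; the object name EdgeRddSketch and the sample edges are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, EdgeRDD, Graph}

// Hypothetical helper object, used only for this illustration.
object EdgeRddSketch {
  private def asSet(edges: EdgeRDD[Int]): Set[(Long, Long, Int)] =
    edges.map(e => (e.srcId, e.dstId, e.attr)).collect().toSet

  // Returns the same transformation done via graph operators and via EdgeRDD directly.
  def bothRoutes(sc: SparkContext): (Set[(Long, Long, Int)], Set[(Long, Long, Int)]) = {
    val graph = Graph.fromEdges(
      sc.parallelize(Seq(Edge(1L, 2L, 10), Edge(2L, 3L, 20))), defaultValue = 0)
    // Usual route: graph operators, then read the resulting edges.
    val viaGraph = graph.mapEdges(e => e.attr * 2).reverse.edges
    // Direct route: the EdgeRDD methods shown in the listing above.
    val direct = graph.edges.mapValues(e => e.attr * 2).reverse
    (asSet(viaGraph), asSet(direct))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("edge-rdd-sketch")
    val sc = new SparkContext(conf)
    try println(bothRoutes(sc)) finally sc.stop()
  }
}
```

      Both routes double each attribute and swap source and destination, so they produce the same edge set; the graph operators additionally keep the vertex side of the graph consistent.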

  • Original article: https://www.cnblogs.com/yszd/p/11823153.html