union, intersection, and subtract are all transformation operators.
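The examples below assume a SparkSession named spark is already in scope, as it is in spark-shell. A minimal sketch of creating one for a standalone app (the app name and local master are arbitrary choices, not part of the original examples):

import org.apache.spark.sql.SparkSession

// A sketch: a local SparkSession so the snippets below can run
// outside spark-shell; appName and master are illustrative.
val spark = SparkSession.builder()
  .appName("UnionIntersectionSubtract")
  .master("local[*]")
  .getOrCreate()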
1. union merges two RDDs. The two RDDs must have the same element type, and the new RDD's partition count is the sum of the two parents' partition counts.
val kzc=spark.sparkContext.parallelize(List(("hive",8),("apache",8),("hive",30),("hadoop",18)),2) val bd=spark.sparkContext.parallelize(List(("hive",18),("test",2),("spark",20)),1) val result=bd.union(kzc) println(result.partitions.size) println("*******************") result.collect().foreach(println(_))
Result:
3
*******************
(hive,18)
(test,2)
(spark,20)
(hive,8)
(apache,8)
(hive,30)
(hadoop,18)
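Note that union does not deduplicate; it behaves like SQL UNION ALL. A minimal sketch (reusing kzc and bd from above) of getting set-style semantics by chaining distinct:

// union keeps duplicates, so chain distinct() when set semantics are
// wanted; distinct() shuffles, so the partition count changes as well.
val deduped = bd.union(kzc).distinct()
deduped.collect().foreach(println(_))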
2. intersection returns the elements common to both RDDs (the output contains no duplicates). The new RDD's partition count matches whichever parent has more partitions.
spark.sparkContext.setLogLevel("error") val kzc=spark.sparkContext.parallelize(List(("hive",8),("apache",8),("hive",30),("hadoop",18)),2) val bd=spark.sparkContext.parallelize(List(("hive",8),("test",2),("spark",20)),1) val result=bd.intersection(kzc) println(result.partitions.size) println("*******************") result.collect().foreach(println(_))
Result:
2
*******************
(hive,8)
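The default can also be overridden: the RDD API offers intersection(other, numPartitions), which fixes the result's partition count explicitly. A minimal sketch reusing the RDDs above:

// intersection(other, numPartitions) overrides the default of taking
// the larger parent's partition count.
val fixed = bd.intersection(kzc, 4)
println(fixed.partitions.size)   // 4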
3. subtract removes from the first RDD every element that also appears in the second, i.e. it subtracts their intersection. The new RDD's partition count matches the RDD on which subtract is called.
spark.sparkContext.setLogLevel("error") val kzc=spark.sparkContext.parallelize(List(("hive",8),("apache",8),("hive",30),("hadoop",18)),2) val bd=spark.sparkContext.parallelize(List(("hive",8),("test",2),("spark",20)),1) val result=bd.subtract(kzc) println(result.partitions.size) println("*******************") result.collect().foreach(println(_))
Result:
1
*******************
(test,2)
(spark,20)
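subtract likewise accepts an explicit partition count, and for key-value RDDs there is also subtractByKey, which matches on keys alone. A minimal sketch reusing kzc and bd from above:

// subtract(other, numPartitions) overrides the default of inheriting
// this RDD's partition count.
val repartitioned = bd.subtract(kzc, 3)
println(repartitioned.partitions.size)   // 3

// subtractByKey drops any pair whose key appears in kzc, even when the
// values differ; here it happens to give the same result as subtract,
// but a pair like ("hive",18) in bd would also be removed, since kzc
// contains the key "hive".
val byKey = bd.subtractByKey(kzc)
byKey.collect().foreach(println(_))   // (test,2) and (spark,20)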