  • PySpark select, filter, aggregation, and table joins

    PySpark column selection with select, row filtering with filter, aggregation, group by, and table joins (inner join, left join, right join, full outer join), as shown below:

    from __future__ import print_function, division
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SparkSession

    ## Start Spark (skip this if a session is already running)
    spark = SparkSession.builder.master("local[2]").appName("test").enableHiveSupport().getOrCreate()
    sc = spark.sparkContext

    ## Read the CSV dataset
    df = spark.read.csv('../data/rating.csv', sep=',', header=True)  # pass inferSchema=True to infer column types automatically
    df.show()

    ## Select columns --- select
    # select userid from data   -- SQL equivalent
    df.select('userid').show()  # select just the userid column
    # select userid, movieid from data   -- SQL equivalent
    df.select('userid', 'movieid').show()

    ## Transform columns --- selectExpr
    # select userid as id from data
    df.selectExpr('userid as id').show()
    # select movieid, rating * 2 as rating_2 from data
    df.selectExpr('movieid', 'rating * 2 as rating_2').show()
    df.printSchema()
    df.selectExpr('cast(rating as DOUBLE)').printSchema()  # type conversion with cast

    ## Filter rows --- filter
    #select * from data where rating > 3
    df.filter('rating > 3').show()
    #select * from data where userid = 2 and rating > 3
    df.filter('userid == 2 and rating > 3').show()
    #select userid, rating from data where userid = 2 and rating > 3
    df.filter('userid == 2 and rating > 3').select('userid', 'rating').show()
    df.select("userID", "rating").filter("userID = 2 and rating > 3").show()
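    The same predicate can also be run as plain SQL by registering the DataFrame as a temporary view. A minimal, self-contained sketch (the tiny in-memory ratings data here is made up for illustration, standing in for rating.csv):

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("filter-demo").getOrCreate()

    # Toy ratings data in place of rating.csv
    df = spark.createDataFrame(
        [(1, 10, 4.0), (2, 11, 5.0), (2, 12, 2.0), (3, 13, 4.5)],
        ['userid', 'movieid', 'rating'],
    )
    df.createOrReplaceTempView('data')

    # Three equivalent spellings of the same filter
    a = df.filter('userid = 2 and rating > 3')                 # SQL expression string
    b = df.filter((df.userid == 2) & (df.rating > 3))          # Column expressions
    c = spark.sql('select * from data where userid = 2 and rating > 3')  # plain SQL

    rows = c.collect()
    ```

    Note that the Column-expression form needs parentheses around each comparison, because `&` binds more tightly than `==` in Python.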

    ## Aggregation
    # select count(*) from data
    df.count()
    df.agg({'userid': 'count'}).show()
    # select count(*) from data where userid = 1
    df.filter('userid = 1').count()
    # select count(userid) from data; select avg(rating) from data
    df.agg({'userid': 'count', 'rating': 'avg'}).show()

    ## group by
    ## How many movies did each user rate, and what is their average score?
    # select userid, count(*), avg(rating) from data group by userid
    df.groupBy('userid').agg({'movieid': 'count', 'rating': 'avg'}).show()
    from pyspark.sql.functions import count, avg, round
    df.groupBy('userid').agg(count('movieid'), round(avg(df.rating), 2)).show()
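    The functions-API form above produces awkward column names such as `count(movieid)`; `alias()` lets you name the aggregates. A small self-contained sketch (the toy data and the names `n_movies` / `avg_rating` are made up for illustration):

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.master("local[1]").appName("agg-demo").getOrCreate()

    df = spark.createDataFrame(
        [(1, 10, 4.0), (1, 11, 2.0), (2, 12, 5.0)],
        ['userid', 'movieid', 'rating'],
    )

    # alias() renames the aggregate columns to something readable
    summary = (df.groupBy('userid')
                 .agg(F.count('movieid').alias('n_movies'),
                      F.round(F.avg('rating'), 2).alias('avg_rating')))

    result = {r['userid']: (r['n_movies'], r['avg_rating']) for r in summary.collect()}
    ```

    Importing the module as `F` (rather than `from pyspark.sql.functions import *`) also avoids shadowing Python built-ins like `round`.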

    ## join table: inner join, left join, right join, full outer join
    # Create the DataFrame df_profile
    d = [{'name': 'Alice', 'age': 1}, {'name': 'Bryan', 'age': 3}, {'name': 'Cool', 'age': 2}]
    df_profile = spark.createDataFrame(d)  # convert to a DataFrame
    df_profile.show()
    # Create the DataFrame df_parents
    d = [{'name': 'Jason', 'child': 'Alice'},
         {'name': 'Bill', 'child': 'Bryan'},
         {'name': 'Sera', 'child': 'Bryan'},
         {'name': 'Jill', 'child': 'Ken'}]
    df_parents = spark.createDataFrame(d)  # convert to a DataFrame
    df_parents.show()
    #inner join 
    df_profile.join(df_parents, df_profile.name == df_parents.child).show()
    #left join
    df_profile.join(df_parents, df_profile.name == df_parents.child, 'left').show()
    #right join
    df_profile.join(df_parents, df_profile.name == df_parents.child, 'right').show()
    #full outer join
    df_profile.join(df_parents, df_profile.name == df_parents.child, 'outer').show()
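    One wrinkle with these joins: both inputs carry a `name` column, so the joined result has two columns called `name` and referring to plain `'name'` afterwards is ambiguous. A sketch of one way to disambiguate, selecting through the source DataFrames and renaming (the `child_name` / `parent_name` labels are made up for illustration):

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("join-demo").getOrCreate()

    df_profile = spark.createDataFrame([('Alice', 1), ('Bryan', 3)], ['name', 'age'])
    df_parents = spark.createDataFrame([('Jason', 'Alice'), ('Bill', 'Bryan')], ['name', 'child'])

    # Inner join; the result contains name (from profile), age, name (from parents), child
    joined = df_profile.join(df_parents, df_profile.name == df_parents.child)

    # Qualify each 'name' through its source DataFrame and rename with alias()
    tidy = joined.select(df_profile.name.alias('child_name'),
                         'age',
                         df_parents.name.alias('parent_name'))

    rows = {r['child_name']: r['parent_name'] for r in tidy.collect()}
    ```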

  • Original post: https://www.cnblogs.com/jeasonit/p/10075538.html