zoukankan      html  css  js  c++  java
  • spark之客户流失预测

    一.数据

      这是一个Udacity纳米学位项目(数据科学的顶点)。这个项目使用来自Sparkify的用户事件数据来建立一个模型来预测用户的流失。Sparkify是一个类似于Spotify或Pandora的数字化音乐服务(类似于网易云音乐和QQ音乐的音乐平台)。使用Spark分析探索某数字音乐服务平台Sparkify(类似于网易云音乐和QQ音乐的音乐平台)2016年10月1日-2016年12月1日期间用户在该平台上的行为数据。通过对用户行为和用户信息的分析,提取可能对预测用户是否流失有帮助的相关特征,从而建立流失用户预测模型。

    以下是APP的日志数据,包含226个不同用户的信息,从为一首歌的点赞到更改账户设置,都有详细的操作。

    数据集名称:mini_sparkify_event_data.json,原始数据12G,这里使用子集123M,可自行网上搜索下载。
    数据描述:

    artist    音乐信息    歌手名称
    auth    网页信息    用户进入平台的方式
    firstName    用户信息    用户的名
    gender    用户信息    用户性别:F为女,M为男
    itemInSession    网页信息    会话顺序
    lastName    用户信息    用户的姓氏
    length    音乐信息    音乐时长(秒)
    level    Event    用户等级:free为免费用户,paid为付费用户
    location    会话信息    用户在会话期间所属位置
    method    网页信息    HTTP method ,GET 或者 PUT
    page    网页信息    用户行为类型
    registration    用户信息    用户注册时间
    sessionId    会话信息    会话编号
    song    音乐信息    歌曲名称
    status    网页信息    HTTP状态编码. 2xx=Successful, 3xx=Redirection, 4xx=Client Error.
    ts    网页信息    用户行为发生的时间
    userAgent    会话信息    网络环境,所属浏览器
    userId    用户信息    用户编码,具有唯一性

    查看属性:

             +----------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
             |          artist|     auth|firstName|gender|itemInSession|lastName|   length|level|            location|method|    page| registration|sessionId|                song|status|           ts|           userAgent|userId|
             +----------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
             |  Martha Tilston|Logged In|    Colin|     M|           50| Freeman|277.89016| paid|     Bakersfield, CA|   PUT|NextSong|1538173362000|       29|           Rockpools|   200|1538352117000|Mozilla/5.0 (Wind...|    30|
             |Five Iron Frenzy|Logged In|    Micah|     M|           79|    Long|236.09424| free|Boston-Cambridge-...|   PUT|NextSong|1538331630000|        8|              Canada|   200|1538352180000|"Mozilla/5.0 (Win...|     9|
             |    Adam Lambert|Logged In|    Colin|     M|           51| Freeman| 282.8273| paid|     Bakersfield, CA|   PUT|NextSong|1538173362000|       29|   Time For Miracles|   200|1538352394000|Mozilla/5.0 (Wind...|    30|
             |          Enigma|Logged In|    Micah|     M|           80|    Long|262.71302| free|Boston-Cambridge-...|   PUT|NextSong|1538331630000|        8|Knocking On Forbi...|   200|1538352416000|"Mozilla/5.0 (Win...|     9|
             |       Daft Punk|Logged In|    Colin|     M|           52| Freeman|223.60771| paid|     Bakersfield, CA|   PUT|NextSong|1538173362000|       29|Harder Better Fas...|   200|1538352676000|Mozilla/5.0 (Wind...|    30|
             +----------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
             

    二.代码

    详细代码见:RF用户流失预测(https://github.com/jiangnanboy/spark_tutorial)

    public static void churnAnalysis(SparkSession session) {
            /**
             * 加载数据
             * 这是日志数据,包含226个不同用户的信息,从为一首歌点赞到更改账户设置,都有详细的操作。
             *
             * 数据描述
             * artist: Artist name (ex. Daft Punk)
             * auth: User authentication status (ex. Logged)
             * firstName: User first name (ex. Colin)
             * gender: Gender (ex. F or M)
             * itemInSession: Item count in a session (ex. 52)
             * lastName: User last name (ex. Freeman)
             * length: Length of song (ex. 223.60771)
             * level: User plan (ex. paid)
             * location: User's location (ex. Bakersfield)
             * method: HTTP method (ex. PUT)
             * page: Page name (ex. NextSong)
             * registration: Registration timestamp (unix timestamp) (ex. 1538173362000)
             * sessionId: Session ID (ex. 29)
             * song: Song (ex. Harder Better Faster Stronger)
             * status: HTTP status (ex. 200)
             * ts: Event timestamp(unix timestamp) (ex. 1538352676000)
             * userAgent: User's browswer agent (ex. Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0)
             * userId: User ID (ex. 30)
             *
             * +----------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
             * |          artist|     auth|firstName|gender|itemInSession|lastName|   length|level|            location|method|    page| registration|sessionId|                song|status|           ts|           userAgent|userId|
             * +----------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
             * |  Martha Tilston|Logged In|    Colin|     M|           50| Freeman|277.89016| paid|     Bakersfield, CA|   PUT|NextSong|1538173362000|       29|           Rockpools|   200|1538352117000|Mozilla/5.0 (Wind...|    30|
             * |Five Iron Frenzy|Logged In|    Micah|     M|           79|    Long|236.09424| free|Boston-Cambridge-...|   PUT|NextSong|1538331630000|        8|              Canada|   200|1538352180000|"Mozilla/5.0 (Win...|     9|
             * |    Adam Lambert|Logged In|    Colin|     M|           51| Freeman| 282.8273| paid|     Bakersfield, CA|   PUT|NextSong|1538173362000|       29|   Time For Miracles|   200|1538352394000|Mozilla/5.0 (Wind...|    30|
             * |          Enigma|Logged In|    Micah|     M|           80|    Long|262.71302| free|Boston-Cambridge-...|   PUT|NextSong|1538331630000|        8|Knocking On Forbi...|   200|1538352416000|"Mozilla/5.0 (Win...|     9|
             * |       Daft Punk|Logged In|    Colin|     M|           52| Freeman|223.60771| paid|     Bakersfield, CA|   PUT|NextSong|1538173362000|       29|Harder Better Fas...|   200|1538352676000|Mozilla/5.0 (Wind...|    30|
             * +----------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
             */
            String path = PropertiesReader.get("mini_sparkify_event_data"); //数据集自行下载
            Dataset<Row> dataset = session.read().json(path);
            dataset.persist(StorageLevel.MEMORY_AND_DISK());
            dataset.show(5);
            /**
             *
             */
    
            //创建视图
            dataset.createOrReplaceTempView("userlogs");
    
            String[] columnsName = dataset.columns();
            Column[] columns = new Column[columnsName.length];
            for(int index = 0;index < columnsName.length; index++) {
                columns[index] = functions.count(functions.when(functions.isnull(col(columnsName[index])), columnsName[index])).as(columnsName[index]);
            }
            /**
             * 查看每个列为null的数量
             *
             * +------+----+---------+------+-------------+--------+------+-----+--------+------+----+------------+---------+-----+------+---+---------+------+
             * |artist|auth|firstName|gender|itemInSession|lastName|length|level|location|method|page|registration|sessionId| song|status| ts|userAgent|userId|
             * +------+----+---------+------+-------------+--------+------+-----+--------+------+----+------------+---------+-----+------+---+---------+------+
             * | 58392|   0|     8346|  8346|            0|    8346| 58392|    0|    8346|     0|   0|        8346|        0|58392|     0|  0|     8346|     0|
             * +------+----+---------+------+-------------+--------+------+-----+--------+------+----+------------+---------+-----+------+---+---------+------+
             */
            dataset.select(columns);
    
            /**
             * 列“firstName”中的缺失值
             * +------+----------+---------+------+-------------+--------+------+-----+--------+------+-----+------------+---------+----+------+-------------+---------+------+
             * |artist|      auth|firstName|gender|itemInSession|lastName|length|level|location|method| page|registration|sessionId|song|status|           ts|userAgent|userId|
             * +------+----------+---------+------+-------------+--------+------+-----+--------+------+-----+------------+---------+----+------+-------------+---------+------+
             * |  null|Logged Out|     null|  null|          100|    null|  null| free|    null|   GET| Home|        null|        8|null|   200|1538355745000|     null|      |
             * |  null|Logged Out|     null|  null|          101|    null|  null| free|    null|   GET| Help|        null|        8|null|   200|1538355807000|     null|      |
             * |  null|Logged Out|     null|  null|          102|    null|  null| free|    null|   GET| Home|        null|        8|null|   200|1538355841000|     null|      |
             * |  null|Logged Out|     null|  null|          103|    null|  null| free|    null|   PUT|Login|        null|        8|null|   307|1538355842000|     null|      |
             * |  null|Logged Out|     null|  null|            2|    null|  null| free|    null|   GET| Home|        null|      240|null|   200|1538356678000|     null|      |
             * |  null|Logged Out|     null|  null|            3|    null|  null| free|    null|   PUT|Login|        null|      240|null|   307|1538356679000|     null|      |
             * |  null|Logged Out|     null|  null|            0|    null|  null| free|    null|   PUT|Login|        null|      100|null|   307|1538358102000|     null|      |
             * |  null|Logged Out|     null|  null|            0|    null|  null| free|    null|   PUT|Login|        null|      241|null|   307|1538360117000|     null|      |
             * |  null|Logged Out|     null|  null|           14|    null|  null| free|    null|   GET| Home|        null|      187|null|   200|1538361527000|     null|      |
             * |  null|Logged Out|     null|  null|           15|    null|  null| free|    null|   PUT|Login|        null|      187|null|   307|1538361528000|     null|      |
             * +------+----------+---------+------+-------------+--------+------+-----+--------+------+-----+------------+---------+----+------+-------------+---------+------+
             */
            dataset.where(col("firstName").isNaN()).show(10);
    
            /**
             * 列"artist"中的缺失值
             * +------+----------+---------+------+-------------+--------+------+-----+--------------------+------+---------------+-------------+---------+----+------+-------------+--------------------+------+
             * |artist|      auth|firstName|gender|itemInSession|lastName|length|level|            location|method|           page| registration|sessionId|song|status|           ts|           userAgent|userId|
             * +------+----------+---------+------+-------------+--------+------+-----+--------------------+------+---------------+-------------+---------+----+------+-------------+--------------------+------+
             * |  null| Logged In|    Colin|     M|           54| Freeman|  null| paid|     Bakersfield, CA|   PUT|Add to Playlist|1538173362000|       29|null|   200|1538352905000|Mozilla/5.0 (Wind...|    30|
             * |  null| Logged In|    Micah|     M|           84|    Long|  null| free|Boston-Cambridge-...|   GET|    Roll Advert|1538331630000|        8|null|   200|1538353150000|"Mozilla/5.0 (Win...|     9|
             * |  null| Logged In|    Micah|     M|           86|    Long|  null| free|Boston-Cambridge-...|   PUT|      Thumbs Up|1538331630000|        8|null|   307|1538353376000|"Mozilla/5.0 (Win...|     9|
             * |  null| Logged In|    Alexi|     F|            4|  Warren|  null| paid|Spokane-Spokane V...|   GET|      Downgrade|1532482662000|       53|null|   200|1538354749000|Mozilla/5.0 (Wind...|    54|
             * |  null| Logged In|    Alexi|     F|            7|  Warren|  null| paid|Spokane-Spokane V...|   PUT|      Thumbs Up|1532482662000|       53|null|   307|1538355255000|Mozilla/5.0 (Wind...|    54|
             * |  null| Logged In|    Micah|     M|           95|    Long|  null| free|Boston-Cambridge-...|   PUT|    Thumbs Down|1538331630000|        8|null|   307|1538355306000|"Mozilla/5.0 (Win...|     9|
             * |  null| Logged In|    Micah|     M|           97|    Long|  null| free|Boston-Cambridge-...|   GET|           Home|1538331630000|        8|null|   200|1538355504000|"Mozilla/5.0 (Win...|     9|
             * |  null| Logged In|    Micah|     M|           99|    Long|  null| free|Boston-Cambridge-...|   PUT|         Logout|1538331630000|        8|null|   307|1538355687000|"Mozilla/5.0 (Win...|     9|
             * |  null| Logged In|  Ashlynn|     F|            9|Williams|  null| free|     Tallahassee, FL|   PUT|      Thumbs Up|1537365219000|      217|null|   307|1538355711000|"Mozilla/5.0 (Mac...|    74|
             * |  null|Logged Out|     null|  null|          100|    null|  null| free|                null|   GET|           Home|         null|        8|null|   200|1538355745000|                null|      |
             * +------+----------+---------+------+-------------+--------+------+-----+--------------------+------+---------------+-------------+---------+----+------+-------------+--------------------+------+
             */
            dataset.where(col("artist").isNaN()).show(10);
    
            /**
             * 过滤注销的用户
             */
            dataset = dataset.where(col("auth").notEqual("Logged Out"));
    
            /**通过以上分析可知:
             * 虽然在较高的级别上userId或sessionId列中没有丢失的值,但进一步查看firstName列中丢失的值,可以发现注销的用户拥有空(但不是null)的用户ID。排除这些用户。
             * 列artist中也有缺失的值,但这些值对应于与音乐无关的动作的日志(如“Add to Playlist”、“Roll Advert”等)。只要这些用户是登录的,我们就希望保留这个activity,因为他们可能是分类的重要行为标记。
             */
    
    
            /**
             *  |-- artist: string (nullable = true)
             *  |-- auth: string (nullable = true)
             *  |-- firstName: string (nullable = true)
             *  |-- gender: string (nullable = true)
             *  |-- itemInSession: long (nullable = true)
             *  |-- lastName: string (nullable = true)
             *  |-- length: double (nullable = true)
             *  |-- level: string (nullable = true)
             *  |-- location: string (nullable = true)
             *  |-- method: string (nullable = true)
             *  |-- page: string (nullable = true)
             *  |-- registration: long (nullable = true)
             *  |-- sessionId: long (nullable = true)
             *  |-- song: string (nullable = true)
             *  |-- status: long (nullable = true)
             *  |-- ts: long (nullable = true)
             *  |-- userAgent: string (nullable = true)
             *  |-- userId: string (nullable = true)
             */
            dataset.printSchema();
    
            /**
             * 对用户事件动作group,count
             *
             * +--------------------+------+
             * |                Page| count|
             * +--------------------+------+
             * |              Cancel|    52|
             * |    Submit Downgrade|    63|
             * |         Thumbs Down|  2546|
             * |                Home| 10118|
             * |           Downgrade|  2055|
             * |         Roll Advert|  3933|
             * |              Logout|  3226|
             * |       Save Settings|   310|
             * |Cancellation Conf...|    52|
             * |               About|   509|
             * | Submit Registration|     5|
             * |            Settings|  1514|
             * |            Register|    18|
             * |     Add to Playlist|  6526|
             * |          Add Friend|  4277|
             * |            NextSong|228108|
             * |           Thumbs Up| 12551|
             * |                Help|  1477|
             * |             Upgrade|   499|
             * |               Error|   253|
             * +--------------------+------+
             */
            dataset.groupBy("Page").count().show();
    
            /**
             * 不同的用户数量
             *+--------+
             * |nb_users|
             * +--------+
             * |     226|
             * +--------+
             */
            session.sql("select count(distinct userId) as nb_users from userlogs").show();
    
            /**
             * 创建数据集,包括用户id和标签(是否流失)
             *
             * 定义:列Page中值为"Cancellation Confirmation"为流失,其它非流失
             */
    
            Dataset<Row> churnDataset = session.sql("select distinct userId, 1 as churn from userlogs where Page='Cancellation Confirmation'");
            Dataset<Row> noChurnDataset = session.sql("select distinct userId, 0 as churn from userlogs where userId not in (select distinct userId from userlogs where Page='Cancellation Confirmation')");
    
            /**
             * union churnDataset  noChurnDataset,shuffling the rows
             */
            Dataset<Row> unionDataset = churnDataset.union(noChurnDataset);
            unionDataset.createOrReplaceTempView("churn");
            unionDataset = session.sql("select * from churn order by rand()");
            unionDataset.createOrReplaceTempView("churn");
    
            /**
             * churn    userId
             *  0         174
             *  1         52
             */
            unionDataset.groupBy(col("churn")).count().show();
    
            /**
             * 以上通过定义page为“Cancellation Confirmation”为可以流失用户。
             * 这种方法使我们能够在用户流失之前研究其行为,尝试建立预测模型并提取表明将来有流失风险的行为。
             * 从以上可知建立的数据集不平衡,在建立模型前可进行采样缓解这种不平衡带来的不准确问题。
             */
          ......
  • 相关阅读:
    SpringBoot2.x入门教程:引入jdbc模块与JdbcTemplate简单使用
    2.5万字长文简单总结SpringMVC请求参数接收
    SpringBoot2.x入门:使用CommandLineRunner钩子接口
    百万级别数据Excel导出优化
    SpringBoot2.x入门:应用打包与启动
    谈谈对分布式事务的一点理解和解决方案
    3W字干货深入分析基于Micrometer和Prometheus实现度量和监控的方案
    SpringBoot2.x入门教程:理解配置文件
    SpringBoot2.x入门:引入web模块
    SpringBoot2.x入门:快速创建一个SpringBoot应用
  • 原文地址:https://www.cnblogs.com/little-horse/p/14018469.html
Copyright © 2011-2022 走看看