【Spark】Clickstream Log Analysis with Spark


    Sample data and its format

    194.237.142.21 - - [18/Sep/2013:06:49:18 +0000] "GET /wp-content/uploads/2013/07/rstudio-git3.png HTTP/1.1" 304 0 "-" "Mozilla/4.0 (compatible;)"
    183.49.46.228 - - [18/Sep/2013:06:49:23 +0000] "-" 400 0 "-" "-"
    163.177.71.12 - - [18/Sep/2013:06:49:33 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
    163.177.71.12 - - [18/Sep/2013:06:49:36 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
    101.226.68.137 - - [18/Sep/2013:06:49:42 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
    101.226.68.137 - - [18/Sep/2013:06:49:45 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
    60.208.6.156 - - [18/Sep/2013:06:49:48 +0000] "GET /wp-content/uploads/2013/07/rcassandra.png HTTP/1.0" 200 185524 "http://cos.name/category/software/packages/" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
    222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] "GET /images/my.jpg HTTP/1.1" 200 19939 "http://www.angularjs.cn/A00n" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
    ……
    ……
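
    Each line appears to be in the standard Nginx/Apache combined log format. To make the field positions concrete before the Spark jobs below (they all split lines on spaces), here is a minimal plain-Scala sketch; the LogLineDemo object is just for illustration, and the indices it prints are exactly the ones the UV and TopN jobs read:

    object LogLineDemo {
      def main(args: Array[String]): Unit = {
        //one sample line copied from the data above
        val line = "194.237.142.21 - - [18/Sep/2013:06:49:18 +0000] " +
          "\"GET /wp-content/uploads/2013/07/rstudio-git3.png HTTP/1.1\" " +
          "304 0 \"-\" \"Mozilla/4.0 (compatible;)\""
        val fields = line.split(" ")
        println("ip      = " + fields(0))   //index 0: client IP, used for the UV count
        println("status  = " + fields(8))   //index 8: HTTP status code
        println("referer = " + fields(10))  //index 10: HTTP referer, used for the TopN
      }
    }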
    

    Counting PV (PageViews)

    PV is simply the number of records in the log file.
    For a rundown of the various clickstream metrics, see 【Hadoop离线基础总结】网站流量日志数据分析系统 (an earlier post on a website traffic log analysis system).

    import org.apache.spark.rdd.RDD
    import org.apache.spark.{SparkConf, SparkContext}
    
    object PvCount {
      def main(args: Array[String]): Unit = {
    
        //create the SparkConf
        val sparkConf = new SparkConf().setMaster("local[2]").setAppName("PV-Count").set("spark.driver.host", "localhost")
        //create the SparkContext
        val sparkContext = new SparkContext(sparkConf)
        //read the log file
        val fileRDD: RDD[String] = sparkContext.textFile("/Users/zhaozhuang/Desktop/4、Spark/Spark第二天/第二天教案/资料/运营商日志/access.log")
        //count the records
        val count = fileRDD.count()
    
        println("There are " + count + " lines of data")
    
        sparkContext.stop()
      }
    }
    

    The job reports 14,619 records, i.e. a PV count of 14,619.


    Counting UV (Unique Visitors)

    In practice it is usually better to count UV by cookie rather than by IP address, but this dataset only contains IPs, so we count distinct IPs here.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.{SparkConf, SparkContext}
    
    object UvCount {
      def main(args: Array[String]): Unit = {
        //create the SparkConf
        val sparkConf = new SparkConf().setAppName("UV-Count").setMaster("local[2]").set("spark.driver.host","localhost")
        //create the SparkContext
        val sparkContext = new SparkContext(sparkConf)
        //suppress INFO logging so the console output stays readable
        sparkContext.setLogLevel("WARN")
        //read the log file
        val fileRDD: RDD[String] = sparkContext.textFile("/Users/zhaozhuang/Desktop/4、Spark/Spark第二天/第二天教案/资料/运营商日志/access.log")
        //drop everything we don't need and keep only the IP address (the first space-separated field)
        val getIpRDD: RDD[String] = fileRDD.map(_.split(" ")(0))
        //deduplicate the IPs; the deduplicated data is much smaller, so shrink it to a single partition
        val distinctedRDD: RDD[String] = getIpRDD.distinct(1)
        //count the deduplicated records
        val count: Long = distinctedRDD.count()
    
        println(count)
    
        sparkContext.stop()
      }
    }
    

    The UV count comes out to 1050.
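
    As an aside: when an exact UV is not required, RDD.countApproxDistinct (a HyperLogLog-based estimate) avoids the shuffle that distinct triggers. A minimal sketch under the same setup as above (the UvApproxCount object is just for illustration):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.{SparkConf, SparkContext}
    
    object UvApproxCount {
      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setAppName("UV-Approx").setMaster("local[2]").set("spark.driver.host", "localhost")
        val sparkContext = new SparkContext(sparkConf)
        sparkContext.setLogLevel("WARN")
        val fileRDD: RDD[String] = sparkContext.textFile("/Users/zhaozhuang/Desktop/4、Spark/Spark第二天/第二天教案/资料/运营商日志/access.log")
        //countApproxDistinct estimates the number of distinct elements;
        //the argument is the target relative standard deviation of the estimate
        val approxUv: Long = fileRDD.map(_.split(" ")(0)).countApproxDistinct(0.01)
        println("approximate UV: " + approxUv)
        sparkContext.stop()
      }
    }

    On a dataset this small the exact count is cheap anyway; the approximate variant only pays off at scale.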


    Computing the TopN

    Two methods work here: either take() or top(). The code below uses take() on a sorted RDD; a top() sketch follows the console output.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.{SparkConf, SparkContext}
    
    object GetTopN {
      def main(args: Array[String]): Unit = {
        //create the SparkConf
        val sparkConf = new SparkConf().setMaster("local[2]").set("spark.driver.host", "localhost").setAppName("getTopN")
        //create the SparkContext
        val sparkContext: SparkContext = new SparkContext(sparkConf)
        //read the log file
        val fileRDD: RDD[String] = sparkContext.textFile("/Users/zhaozhuang/Desktop/4、Spark/Spark第二天/第二天教案/资料/运营商日志/access.log")
        //suppress INFO logging so the console output stays readable
        sparkContext.setLogLevel("WARN")
    
        //194.237.142.21 - - [18/Sep/2013:06:49:18 +0000] "GET /wp-content/uploads/2013/07/rstudio-git3.png HTTP/1.1" 304 0 "-" "Mozilla/4.0 (compatible;)"
        //the line above shows the data format; first split each line on spaces
        val valueRDD: RDD[Array[String]] = fileRDD.map(x => x.split(" "))
        /*
        a line after splitting:
        194.237.142.21
        -
        -
        [18/Sep/2013:06:49:18
        +0000]
        "GET
        /wp-content/uploads/2013/07/rstudio-git3.png
        HTTP/1.1"
        304
        0
        "-"
        "Mozilla/4.0
        (compatible;)"
        */
        //index 10 of the split array holds the field we want (the HTTP referer),
        //so any array with fewer than 11 elements is an invalid record
        //filter the invalid records out first
        val filterRDD: RDD[Array[String]] = valueRDD.filter(arr => arr.length > 10)
        //map each referer URL to a count of one
        val urlAndOne: RDD[(String, Int)] = filterRDD.map(x => (x(10), 1))
        //sum the counts of identical URLs
        val reduceRDD: RDD[(String, Int)] = urlAndOne.reduceByKey(_ + _)
        //sort the (url, count) pairs by count; false sorts descending, true (the default) ascending
        val sortRDD: RDD[(String, Int)] = reduceRDD.sortBy(x => x._2, false)
        //take the TopN; either take(N) on the sorted RDD, or top(N) as in the sketch after the console output below
        val topRDD: Array[(String, Int)] = sortRDD.take(10)
    
        println(topRDD.toBuffer)
        sparkContext.stop()
      }
    }
    

    Console output:
    ArrayBuffer(("-",5205), ("http://blog.fens.me/category/hadoop-action/",547), ("http://blog.fens.me/",377), ("http://blog.fens.me/wp-admin/post.php?post=2445&action=edit&message=10",360), ("http://blog.fens.me/r-json-rjson/",274), ("http://blog.fens.me/angularjs-webstorm-ide/",271), ("http://blog.fens.me/wp-content/themes/silesia/style.css",228), ("http://blog.fens.me/nodejs-express3/",198), ("http://blog.fens.me/hadoop-mahout-roadmap/",182), ("http://blog.fens.me/vps-ip-dns/",176))
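
    For the top(N) route mentioned above: top pushes the "keep the N largest" work into each partition and then merges the partial results, so the full sort-and-shuffle done by sortBy is unnecessary. It needs an Ordering for the (url, count) pairs. A sketch of the same job rewritten around top (the GetTopNWithTop object is just for illustration):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.{SparkConf, SparkContext}
    
    object GetTopNWithTop {
      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setMaster("local[2]").set("spark.driver.host", "localhost").setAppName("getTopN-top")
        val sparkContext: SparkContext = new SparkContext(sparkConf)
        sparkContext.setLogLevel("WARN")
        val fileRDD: RDD[String] = sparkContext.textFile("/Users/zhaozhuang/Desktop/4、Spark/Spark第二天/第二天教案/资料/运营商日志/access.log")
        //same extraction and aggregation as above: (referer, count) pairs
        val reduceRDD: RDD[(String, Int)] = fileRDD
          .map(_.split(" "))
          .filter(_.length > 10)
          .map(arr => (arr(10), 1))
          .reduceByKey(_ + _)
        //top(10) with an Ordering on the count field; no full sort of the RDD is needed
        val top10: Array[(String, Int)] = reduceRDD.top(10)(Ordering.by[(String, Int), Int](_._2))
        println(top10.toBuffer)
        sparkContext.stop()
      }
    }

    Both versions should report the same ten referers (ties aside); the difference is only in how much data gets shuffled.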
