  • Log Cleaning with Hadoop-MR (Part 1)

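    1. Log content formats
    The logs I have worked with so far fall into three kinds: web request logs, event-tracking logs, and backend system logs.
    1.1 Request logs
    A request log records the resource requests sent or submitted to the server when a user visits a site, either by opening a URL or by clicking an element on a page.
    (Forum log)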
    27.38.53.84 - - [30/May/2013:23:37:57 +0800] "GET /uc_server/data/avatar/000/00/50/90_avatar_small.jpg HTTP/1.1" 200 1828
    218.28.247.140 - - [30/May/2013:23:37:57 +0800] "GET /static/image/common/swfupload.swf?preventswfcaching=1369928282717 HTTP/1.1" 200 13333
    123.147.245.79 - - [30/May/2013:23:37:57 +0800] "GET /static/js/swfupload.queue.js?y7a HTTP/1.1" 304 -
    182.242.227.232 - - [30/May/2013:23:37:56 +0800] "GET /misc.php?mod=patch&action=ipnotice&inajax=1&ajaxtarget=ip_notice HTTP/1.1" 200 65
    183.67.254.204 - - [30/May/2013:23:37:56 +0800] "POST /forum.php?mod=post&action=newthread&fid=72&extra=&topicsubmit=yes&inajax=1 HTTP/1.1" 200 425
    110.255.113.85 - - [30/May/2013:23:37:59 +0800] "GET /uc_server/avatar.php?uid=26294&size=middle HTTP/1.1" 301 -
    111.37.4.243 - - [30/May/2013:23:37:58 +0800] "POST /source/plugin/pcmgr_url_safeguard/url_api.inc.php HTTP/1.1" 200 1300
    125.82.229.229 - - [30/May/2013:23:38:05 +0800] "GET /uc_server/data/avatar/000/07/18/34_avatar_middle.jpg HTTP/1.1" 200 3790
    122.70.237.247 - - [30/May/2013:23:38:03 +0800] "GET /forum.php?mod=image&aid=18696&size=300x300&key=3e12991ed5ff7ecd&nocache=yes&type=fixnone&ramdom=dZqQb HTTP/1.1" 200 39594
    111.37.4.243 - - [30/May/2013:23:38:04 +0800] "GET /forum.php?mod=misc&action=postreview&do=support&tid=11228&pid=44989&hash=29c64660&infloat=yes&handlekey=login&referer=http%3A%2F%2Fbbs.itcast.cn%2Fforum.php%3Fmod%3Dviewthread%26tid%3D11228&inajax=1&ajaxtarget=fwin_content_login HTTP/1.1" 302 -
    49.5.1.14 - - [30/May/2013:23:38:09 +0800] "GET /api/connect/like.php HTTP/1.1" 200 722
    (Shop site log)
    183.49.46.228 - - [18/Sep/2013:06:49:23 +0000] "-" 400 0 "-" "-"
    163.177.71.12 - - [18/Sep/2013:06:49:33 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
    163.177.71.12 - - [18/Sep/2013:06:49:36 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
    60.208.6.156 - - [18/Sep/2013:06:49:48 +0000] "GET /wp-content/uploads/2013/07/rcassandra.png HTTP/1.0" 200 185524 "http://cos.name/category/software/packages/" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
    222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] "GET /images/my.jpg HTTP/1.1" 200 19939 "http://www.angularjs.cn/A00n" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
    222.68.172.190 - - [18/Sep/2013:06:50:08 +0000] "-" 400 0 "-" "-"
    58.215.204.118 - - [18/Sep/2013:06:51:35 +0000] "GET /nodejs-socketio-chat/ HTTP/1.1" 200 10818 "http://www.google.com/url?sa=t&rct=j&q=nodejs%20%E5%BC%82%E6%AD%A5%E5%B9%BF%E6%92%AD&source=web&cd=1&cad=rja&ved=0CCgQFjAA&url=%68%74%74%70%3a%2f%2f%62%6c%6f%67%2e%66%65%6e%73%2e%6d%65%2f%6e%6f%64%65%6a%73%2d%73%6f%63%6b%65%74%69%6f%2d%63%68%61%74%2f&ei=rko5UrylAefOiAe7_IGQBw&usg=AFQjCNG6YWoZsJ_bSj8kTnMHcH51hYQkAA&bvm=bv.52288139,d.aGc" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
    58.215.204.118 - - [18/Sep/2013:06:51:36 +0000] "GET /wp-includes/js/jquery/jquery-migrate.min.js?ver=1.2.1 HTTP/1.1" 304 0 "http://blog.fens.me/nodejs-socketio-chat/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
    58.215.204.118 - - [18/Sep/2013:06:51:35 +0000] "GET /wp-includes/js/jquery/jquery.js?ver=1.10.2 HTTP/1.1" 304 0 "http://blog.fens.me/nodejs-socketio-chat/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
    58.248.178.212 - - [18/Sep/2013:06:51:40 +0000] "GET /wp-includes/js/comment-reply.min.js?ver=3.6 HTTP/1.1" 200 786 "http://blog.fens.me/nodejs-grunt-intro/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; MDDR; InfoPath.2; .NET4.0C)"
    180.168.34.26 - - [18/Sep/2013:07:11:08 +0000] "-" 400 0 "-" "-"
    180.168.34.26 - - [18/Sep/2013:07:11:08 +0000] "-" 400 0 "-" "-"
    50.116.27.194 - - [18/Sep/2013:07:11:29 +0000] "POST /wp-cron.php?doing_wp_cron=1379488288.8893849849700927734375 HTTP/1.0" 200 0 "-" "WordPress/3.6; http://blog.fens.me"
    222.35.232.69 - - [18/Sep/2013:16:14:17 +0000] "GET /wp-content/uploads/2013/05/favicon.ico HTTP/1.1" 200 1150 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
    114.252.89.91 - - [18/Sep/2013:16:14:20 +0000] "POST /wp-admin/admin-ajax.php HTTP/1.1" 200 58 "http://blog.fens.me/wp-admin/post.php?post=2445&action=edit&message=10" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36"
    58.209.132.183 - - [18/Sep/2013:16:29:17 +0000] "GET /images/2.jpg HTTP/1.1" 200 105089 "http://image.baidu.com/i?ct=503316480&z=&tn=baiduimagedetail&ipn=d&word=%E6%B5%99%E6%B1%9F%E5%AE%89%E5%90%89&step_word=&ie=utf-8&in=17038&cl=2&lm=-1&st=&pn=0&rn=1&di=47839122900&ln=1998&fr=&&fmq=1379521091792_R&ic=&s=&se=&sme=0&tab=&width=&height=&face=&is=&istype=&ist=&jit=&objurl=http%3A%2F%2Fnews.eastday.com%2Feastday%2F06news%2Fchina%2Fzh2green%2Fanji%2Fnode327399%2Fimages%2F01517676.jpg" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729)"
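    Both samples above share the same leading fields (IP, two placeholder fields, bracketed time, quoted request, status, bytes); the shop log adds a quoted referer and user agent. As a rough sketch of how one line could be split into fields, the following uses java.util.regex. The class name AccessLogParser, the field names, and the pattern are illustrative assumptions, not code from this post.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class AccessLogParser {
    // ip, identity, user, [time], "request", status, bytes,
    // optionally followed by "referer" "user agent" (shop log format).
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)"
        + "(?: \"([^\"]*)\" \"([^\"]*)\")?$");

    /** Returns the fields of one log line, or null when the line does not match. */
    static Map<String, String> parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("ip", m.group(1));
        fields.put("time", m.group(4));
        fields.put("request", m.group(5));
        fields.put("status", m.group(6));
        fields.put("bytes", m.group(7));      // "-" when no body was sent
        fields.put("referer", m.group(8));    // null for the forum format
        fields.put("agent", m.group(9));      // null for the forum format
        return fields;
    }

    public static void main(String[] args) {
        String forum = "27.38.53.84 - - [30/May/2013:23:37:57 +0800] "
            + "\"GET /uc_server/data/avatar/000/00/50/90_avatar_small.jpg HTTP/1.1\" 200 1828";
        System.out.println(parse(forum));
    }
}
```

    The pattern assumes that the request, referer, and user agent never contain a double quote, which holds for the sample lines but is not guaranteed for arbitrary traffic.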
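    1.2 Event-tracking logs
    Event tracking is a technique used by e-commerce sites: when a user browses the products on display, the site proactively records the list of exposed products, the dwell time, the products clicked, the components clicked, and similar information. The data serves operations and helps optimize the shop layout. Common event-tracking logs are browse, click, and exposure logs.
    (Browse)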
    2018-08-28 11:59:58,263 - site: leeyk99, ip: 188.133.207.46, refer: https://m.leeyk99.com/ru/user/login?redirection=%2Fru%2FSneakers-c-1913.html%3Ficn%3Dsneakers%26ici%3Dmru_navbar15menu01dir02&prot=1, agent: Mozilla/5.0 (Linux; Android 5.1.1; SM-G531H Build/LMY48B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.91 Mobile Safari/537.36, body: {"device_type":"m","home_site":"leeyk99","sub_site":"mru","language":"ru","money_type":"RUB","device_country":"","app_versions":"","network_type":"","ip":"","screen_pixel":"360X640","screen_size":"","device_class":0,"device_brand":"","device_name":"","device_model":"","os_type":0,"os_name":"Android","os_versions":"5.1.1","browser_name":"Chrome","browser_versions":"68.0.3440.91","session_id":"","timestamp":1535428798994,"local_time":"2018/8/28 10:59:58","device_id":"","cookie_id":"5BCE0E1F_DAFD_2E64_F24E_B3B6D5D6BAC5","member_id":"","login":0,"page_id":3,"page_name":"page_real_class","page_param":{"category_id":"1913","source_category_id":"1745"},"start_time":1535428764401,"end_time":1535428798994,"tab_page_id":"page_real_class1535428764401"}
    2018-08-28 11:59:58,272 - site: leeyk99, ip: 74.205.199.213, refer: https://m.leeyk99.com/us/Watermelon-Print-Round-Beach-Blanket-p-365584-cat-1866.html, agent: Mozilla/5.0 (Linux; Android 6.0; HUAWEI CAM-L21 Build/HUAWEICAM-L21) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.91 Mobile Safari/537.36, body: {"device_type":"m","home_site":"leeyk99","sub_site":"mus","language":"en","money_type":"USD","device_country":"","app_versions":"","network_type":"","ip":"","screen_pixel":"360X640","screen_size":"","device_class":0,"device_brand":"","device_name":"","device_model":"","os_type":0,"os_name":"Android","os_versions":"6.0","browser_name":"Chrome","browser_versions":"68.0.3440.91","session_id":"","timestamp":1535428797165,"local_time":"2018/8/27 20:59:57","device_id":"","cookie_id":"B66A47CF_5522_DC84_F221_F0848C812BCA","member_id":"","login":0,"page_id":7,"page_name":"page_goods_detail","page_param":{"goods_id":365584,"traceid":"sm`1535428371336`B66A47CF_5522_DC84_F221_F0848C812BCA"},"start_time":1535428797165,"end_time":"","tab_page_id":"page_goods_detail1535428797165"}
    2018-08-28 11:59:58,274 - site: leeyk99, ip: 99.174.207.56, refer: https://m.leeyk99.com/us/Striped-Ringer-Tee-p-469810-cat-1738.html?rrec=true, agent: Mozilla/5.0 (iPhone; CPU iPhone OS 11_4_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.0 Mobile/15E148 Safari/604.1, body: {"device_type":"m","home_site":"leeyk99","sub_site":"mus","language":"en","money_type":"USD","device_country":"","app_versions":"","network_type":"","ip":"","screen_pixel":"375X667","screen_size":"","device_class":0,"device_brand":"","device_name":"","device_model":"","os_type":0,"os_name":"iOS","os_versions":"11.4.1","browser_name":"Mobile Safari","browser_versions":"11.0","session_id":"","timestamp":1535428797977,"local_time":"2018/8/27 22:59:57","device_id":"","cookie_id":"D56B15A4_37D3_9164_CA60_3B4CDB382F2D","member_id":"","login":0,"page_id":7,"page_name":"page_goods_detail","page_param":{"goods_id":469810,"traceid":"sm`1535428730780`D56B15A4_37D3_9164_CA60_3B4CDB382F2D"},"start_time":1535428797977,"end_time":"","tab_page_id":"page_goods_detail1535428797977"}
    2018-08-28 11:59:58,293 - site: leeyk99, ip: 172.56.35.21, refer: https://m.leeyk99.com/us/FB-US-Striped-20180402-A-D7-vc-64042.html?utm_source=facebook.com&utm_medium=cpc&utm_campaign=fbadsus_20180408_mobmpa_Food_FB-US-Striped-20180402-A-D7-vc-64042_3554_&url_from=fbadsus_20180408_mobmpa_Food_FB-US-Striped-20180402-A-D7-vc-64042_3554_, agent: Mozilla/5.0 (iPhone; CPU iPhone OS 11_4_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15G77 Instagram 24.0.0.14.205 (iPhone7,2; iOS 11_4_1; en_US; en-US; scale=2.00; gamut=normal; 750x1334), body: {"device_type":"m","home_site":"leeyk99","sub_site":"mus","language":"en","money_type":"USD","device_country":"","app_versions":"","network_type":"","ip":"","screen_pixel":"375X667","screen_size":"","device_class":0,"device_brand":"","device_name":"","device_model":"","os_type":0,"os_name":"iOS","os_versions":"11.4.1","browser_name":"WebKit","browser_versions":"605.1.15","session_id":"","timestamp":1535428797377,"local_time":"2018/8/27 23:59:57","device_id":"","cookie_id":"3BFCB287_A97B_AA24_DBCC_86DC346D3100","member_id":"","login":0,"page_id":2,"page_name":"page_virtual_class","page_param":{"category_id":"64042"},"start_time":1535428797377,"end_time":"","tab_page_id":"page_virtual_class1535428797377"}
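    Click and exposure logs look much like the browse log. The fields collected vary slightly with the tracking requirements, and the core of each record is the content of body.
    Event-tracking logs are designed to a requirement, so their format is tidy and their content regular. A Hive regex is usually enough to filter and load them. For this browse log, it is enough to create one table and specify the following regex; the log can then be loaded and queried: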
    'input.regex'='([0-9\.\- :,]+) \- site: ([\w]+), ip: ([0-9\.\- :,]+), refer: (.*), agent: (.*), body: ([\[\{].*[\}\]])'
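    The same pattern can be tried outside Hive before creating the table. Below is a small sketch using java.util.regex, assuming the whole line has to match; the class name BrowseLogRegex and the shortened sample line are made up for illustration.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class BrowseLogRegex {
    // The input.regex value from above, with backslashes doubled for a Java string literal.
    static final Pattern BROWSE = Pattern.compile(
        "([0-9\\.\\- :,]+) \\- site: ([\\w]+), ip: ([0-9\\.\\- :,]+), "
        + "refer: (.*), agent: (.*), body: ([\\[\\{].*[\\}\\]])");

    /** Returns {time, site, ip, refer, agent, body}, or null when the line does not match. */
    static String[] split(String line) {
        Matcher m = BROWSE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] cols = new String[m.groupCount()];
        for (int i = 0; i < cols.length; i++) {
            cols[i] = m.group(i + 1);   // capture groups are 1-based
        }
        return cols;
    }

    public static void main(String[] args) {
        // Shortened version of a browse line; the real body carries many more keys.
        String line = "2018-08-28 11:59:58,263 - site: leeyk99, ip: 188.133.207.46, "
            + "refer: https://m.leeyk99.com/ru/user/login, "
            + "agent: Mozilla/5.0 (Linux; Android 5.1.1) AppleWebKit/537.36 (KHTML, like Gecko), "
            + "body: {\"device_type\":\"m\",\"page_id\":3}";
        System.out.println(Arrays.toString(split(line)));
    }
}
```

    Each capture group corresponds to one column, so the table needs six columns for this pattern.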

     
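    1.3 Backend system logs
    Backend system logs are written by the system itself. Typically the front end or another system requests data from a backend API, and the backend records the request information or the returned result. The format is usually agreed between the systems, so these logs are very regular and can also be processed directly with Hive regex.
    For example:
    Format 1: (result information)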
    2018-07-03 06:50:00,142 [XNIO-2 task-28] INFO  com.leeyk99.bi.abt.rest.CoreApiController - 1A42F7C6_B904_A334_AB87_5A69A7034DA0  leeyk99PcRealClass 66 158
    Format 2: (API request information)
    2018-07-03 20:39:46,043 [XNIO-2 task-211] INFO  com.leeyk99.bi.abt.filter.LogFilter - GET  /api/v1/bi/abt?cid=973EA838_E20E_74E4_41AB_E218DA91D73E&uid=&site=mtw&terminal=leeyk99-M&lan=zh-tw took 1ms and returned 200
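    To show how regular this format is, here is a sketch that pulls the method, URI, elapsed time, and status code out of a Format 2 line. The class name ApiLogLine and the pattern are assumptions derived from the single sample line above, so other log lines from the same system may need adjustments.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class ApiLogLine {
    // time [thread] LEVEL logger - METHOD uri took Nms and returned STATUS
    private static final Pattern FORMAT_2 = Pattern.compile(
        "^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}) \\[([^\\]]+)\\] (\\w+)\\s+(\\S+) - "
        + "(\\w+)\\s+(\\S+) took (\\d+)ms and returned (\\d{3})$");

    /** Returns {time, thread, level, logger, method, uri, tookMs, status}, or null on no match. */
    static String[] parse(String line) {
        Matcher m = FORMAT_2.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        String[] cols = new String[m.groupCount()];
        for (int i = 0; i < cols.length; i++) {
            cols[i] = m.group(i + 1);
        }
        return cols;
    }

    public static void main(String[] args) {
        String line = "2018-07-03 20:39:46,043 [XNIO-2 task-211] INFO  "
            + "com.leeyk99.bi.abt.filter.LogFilter - GET  "
            + "/api/v1/bi/abt?cid=973EA838_E20E_74E4_41AB_E218DA91D73E&uid=&site=mtw"
            + "&terminal=leeyk99-M&lan=zh-tw took 1ms and returned 200";
        System.out.println(Arrays.toString(parse(line)));
    }
}
```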
     
    (In sections 1.2 and 1.3, leeyk99 replaces a company brand name in the source data.)
     
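    For processing fairly regular log data with Hive regex, see https://www.cnblogs.com/leeyuki/p/9548811.html (cnblogs) or the note "ABT log loading record" (Evernote).
    This post studies how to clean request logs with Hadoop-MR.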
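    The cleaning rules themselves are not spelled out at this point, so the following is only a guess at the kind of per-line logic a map step might call. It assumes two rules: normalize the bracketed timestamp to yyyy-MM-dd HH:mm:ss, and drop lines whose request is "-" (the 400 lines in the shop log). The class and method names are hypothetical.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper for illustration only; the rules are assumptions.
class RequestLogCleaning {
    /**
     * Converts a timestamp such as "30/May/2013:23:37:57 +0800" to
     * "yyyy-MM-dd HH:mm:ss" in the given time zone, or returns null
     * when the input cannot be parsed.
     */
    static String normalizeTime(String raw, String zoneId) {
        // Locale.ENGLISH is needed so that "May" or "Sep" parses on any machine.
        SimpleDateFormat in = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);
        SimpleDateFormat out = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.ENGLISH);
        out.setTimeZone(TimeZone.getTimeZone(zoneId));
        try {
            return out.format(in.parse(raw));
        } catch (ParseException e) {
            return null;
        }
    }

    /** Assumed rule: keep only requests of the form "METHOD uri PROTOCOL". */
    static boolean hasRequest(String request) {
        return request != null && request.split(" ").length == 3;
    }

    public static void main(String[] args) {
        System.out.println(normalizeTime("30/May/2013:23:37:57 +0800", "GMT+08:00"));
        System.out.println(hasRequest("-"));
    }
}
```

    SimpleDateFormat is not thread-safe, which is why the sketch creates the formatters inside the method instead of sharing them in static fields.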
     
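    2. Collecting and loading request logs
    Our data warehouse team generally does not collect log files directly from production systems. Operations or a dedicated team handles collection, usually landing the files on HDFS, on S3, or on an interface machine. The warehouse then ingests those files and cleans and processes them.
    The ELK stack (Elasticsearch, Logstash, Kibana) provides a complete solution. All three are open source and work together well, which covers many use cases efficiently. The stack is aimed at platform or system users who need to view and monitor logs and track how a system is running.
    Flume, originally developed by Cloudera, is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive volumes of log data.
    • Flume + Kafka + Storm + MySQL: building a real-time big data system
    • Flume + HDFS + Kafka + Storm: real-time recommendation, anti-crawler, and similar services
    • Flume + Hadoop + Hive: offline analysis of users' browsing paths on a site
    • Flume + Logstash + Kafka + Spark Streaming: real-time log processing and analysis
    • Flume + Spark + ELK: real-time monitoring platform for data systems
    FTP file transfer is also an important way to serve files, but it may not suit large volumes of logs, unless the logs are archived and collected offline and then transferred to an interface machine for third parties to fetch.
    I have no experience yet with real-time collection and similar modes.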
     
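    3. Configuring the Maven-Hadoop environment
    3.1. Project initialization
    <groupId>com.leeyk99.udp</groupId>
    <artifactId>hadoop-mapreduce</artifactId>
    <version>1.0-SNAPSHOT</version>
    Goal: create a Maven project and configure the JAR files that the Hadoop runtime needs.
     
    3.2. Configuring pom.xml
    Configure the JAR files Hadoop needs to run: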
    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
     
        <groupId>com.leeyk99.udp</groupId>
        <artifactId>hadoop-mapreduce</artifactId>
        <version>1.0-SNAPSHOT</version>
     
       <!-- <packaging>jar</packaging>-->
     
        <dependencies>
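            <!-- hadoop-core is the Hadoop 1.x artifact. It is commented out because it
                 duplicates classes shipped by the 2.7.6 artifacts below.
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-core</artifactId>
                <version>1.2.1</version>
            </dependency>
            -->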
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-common</artifactId>
                <version>2.7.6</version>
            </dependency>
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-hdfs</artifactId>
                <version>2.7.6</version>
            </dependency>
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-client</artifactId>
                <version>2.7.6</version>
            </dependency>
            <dependency>
                <groupId>log4j</groupId>
                <artifactId>log4j</artifactId>
                <version>1.2.17</version>
            </dependency>
        </dependencies>
        <!--<build>
            <plugins>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.3</version>
                    <configuration>
                        <source>1.7</source>
                        <target>1.7</target>
                    </configuration>
                </plugin>
            </plugins>
        </build>-->
    </project>
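    For configuring automatic JAR download for Maven projects in IDEA, see my note "Maven".
    After the automatic download, IDEA has fetched many JAR files for the project (listed under External Libraries). Besides the core artifacts we declared, the dependencies they require were downloaded too, which saves us from downloading them one by one.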
     
     
     
     
     
  • Original post: https://www.cnblogs.com/leeyuki/p/9560793.html