  • Web Log Collection in Practice

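    To collect website access logs, I built a log collection system that gathers request data with a JS probe. This avoids the large volume of useless entries that comes with collecting the web server's own access log (requests for js, css and similar files, which make up about 70% of it).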

     First, the overall flow chart: [figure: overall flow of the log collection system]

    1: Setting up the application server

    Install nginx and edit the configuration file (/etc/nginx/conf.d/default.conf):

    server {
      listen 80;
      server_name spark2;

      location / {
        root /data/nginx/app;
        index index.html index.htm;
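        # access_log takes a file path, not "on" ("on" would be treated as a file name);
        # the path below is the packaged default and may need adjusting
        access_log /var/log/nginx/access.log;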
      }
    }

    Add the HTML pages index.html and content.html (index.html first, then content.html):
    <!DOCTYPE html>
    <html lang="en">
    <head>
    <meta charset="UTF-8">
    <title>首页</title>
    </head>
    <body>
    <a href="content.html">hello nginx</a>
    
    <script type="text/javascript" src="track.js"></script>
    </body>
    </html>
    <!DOCTYPE html>
    <html lang="en">
    <head>
    <meta charset="UTF-8">
    <title>内容</title>
    </head>
    <body>
    
    <h3>Come and see the content</h3>
    
    <script type="text/javascript" src="track.js"></script>
    </body>
    </html>
    Start nginx (service nginx start)
     
    2: Implementing the JS probe

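    JS embedded in the page; it defines the _maq configuration queue and loads the probe script asynchronously: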

    <script type="text/javascript">
        var _maq = _maq || [];
        _maq.push(['_setAccount', 'zx5352']);
     
        (function() {
            var ma = document.createElement('script'); 
            ma.type = 'text/javascript';
            ma.async = true;
            ma.src = 'http://flow.itcast.zx/ma.js';
            var s = document.getElementsByTagName('script')[0]; 
            s.parentNode.insertBefore(ma, s);
        })();
    </script>

    track.js

    (function () {
        var params = {};
        // Data from the document object
        if (document) {
            params.domain = document.domain || '';
            params.url = document.URL || '';
            params.title = document.title || '';
            params.referrer = document.referrer || '';
        }
        // Data from the window object
        if (window && window.screen) {
            params.sh = window.screen.height || 0;
            params.sw = window.screen.width || 0;
            params.cd = window.screen.colorDepth || 0;
        }
        // Data from the navigator object
        if (navigator) {
            params.lang = navigator.language || '';
        }
        // Parse the _maq configuration; the typeof guard avoids a ReferenceError
        // on pages that never define _maq
        if (typeof _maq !== 'undefined' && _maq) {
            // Index loop rather than for...in, because _maq is an array
            for (var i = 0; i < _maq.length; i++) {
                switch (_maq[i][0]) {
                    case '_setAccount':
                        params.account = _maq[i][1];
                        break;
                    default:
                        break;
                }
            }
        }
        // Build the query string
        var args = '';
        for (var key in params) {
            if (args != '') {
                args += '&';
            }
            args += key + '=' + encodeURIComponent(params[key]);
        }

        // Request the backend through an Image object
        var img = new Image(1, 1);
        img.src = 'http://spark3/log.gif?' + args;
    })();

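     The URL requested by the JS (shown decoded for readability; the actual request carries URL-encoded values):

    http://spark3/log.gif?domain=spark2&url=http://spark2/content.html&title=内容&referrer=http://spark2/&sh=768&sw=1366&cd=24&lang=zh-CN&account=zx5352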
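The parameter-joining step of track.js can be tried outside the browser. The following is a minimal sketch, assuming a hypothetical helper named buildBeaconUrl (not part of the original script) that takes a plain object instead of reading document and window:

```javascript
// Hypothetical helper (not in the original track.js): builds the beacon URL
// from a plain object so the encoding step can be checked in Node.js.
function buildBeaconUrl(endpoint, params) {
    var pairs = [];
    Object.keys(params).forEach(function (key) {
        // encodeURIComponent escapes ':', '/', '&', '=' and non-ASCII text,
        // so each value survives as a single query parameter.
        pairs.push(key + '=' + encodeURIComponent(params[key]));
    });
    return endpoint + '?' + pairs.join('&');
}

var url = buildBeaconUrl('http://spark3/log.gif', {
    domain: 'spark2',
    url: 'http://spark2/content.html',
    title: '内容',
    account: 'zx5352'
});
console.log(url);
```

The encoded title comes out as %E5%86%85%E5%AE%B9, the same byte sequence that later shows up in the nginx log as \xE5\x86\x85\xE5\xAE\xB9.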
    

      

    3: Setting up the log server

    1. Install the dependencies

    yum -y install gcc perl pcre-devel openssl openssl-devel

    2. Upload LuaJIT-2.0.4.tar.gz and install LuaJIT

    tar -zxvf LuaJIT-2.0.4.tar.gz -C /usr/local/src/

    cd /usr/local/src/LuaJIT-2.0.4/

    make && make install PREFIX=/usr/local/luajit

    3. Set the environment variables

    export LUAJIT_LIB=/usr/local/luajit/lib

    export LUAJIT_INC=/usr/local/luajit/include/luajit-2.0

    4. Create a modules directory to hold the nginx modules

    mkdir -p /usr/local/nginx/modules

    5. Upload openresty-1.9.7.3.tar.gz and the dependency modules lua-nginx-module-0.10.0.tar.gz, set-misc-nginx-module-0.29.tar.gz, ngx_devel_kit-0.2.19.tar.gz and echo-nginx-module-0.58.tar.gz

    6. Extract the dependency modules directly into /usr/local/nginx/modules; they do not need to be compiled or installed separately

    tar -zxvf lua-nginx-module-0.10.0.tar.gz -C /usr/local/nginx/modules/

    tar -zxvf set-misc-nginx-module-0.29.tar.gz -C /usr/local/nginx/modules/

    tar -zxvf ngx_devel_kit-0.2.19.tar.gz -C /usr/local/nginx/modules/

    tar -zxvf echo-nginx-module-0.58.tar.gz -C /usr/local/nginx/modules/

    7. Extract openresty-1.9.7.3.tar.gz

    tar -zxvf openresty-1.9.7.3.tar.gz -C /usr/local/src/

    cd /usr/local/src/openresty-1.9.7.3/

    8. Compile and install openresty

    ./configure --prefix=/usr/local/openresty --with-luajit && make && make install

    9. Upload and extract nginx

    tar -zxvf nginx-1.8.1.tar.gz -C /usr/local/src/

    cd /usr/local/src/nginx-1.8.1/

    10. Compile nginx with the additional modules

    ./configure --prefix=/usr/local/nginx \
        --with-ld-opt="-Wl,-rpath,/usr/local/luajit/lib" \
        --add-module=/usr/local/nginx/modules/ngx_devel_kit-0.2.19 \
        --add-module=/usr/local/nginx/modules/lua-nginx-module-0.10.0 \
        --add-module=/usr/local/nginx/modules/set-misc-nginx-module-0.29 \
        --add-module=/usr/local/nginx/modules/echo-nginx-module-0.58

    make -j2

    make install

    11. Edit the nginx configuration file

    worker_processes  2;
    
    events {
        worker_connections  1024;
    }
    
    http {
        include       mime.types;
        default_type  application/octet-stream;
    
        log_format tick "$msec^A$remote_addr^A$u_domain^A$u_url^A$u_title^A$u_referrer^A$u_sh^A$u_sw^A$u_cd^A$u_lang^A$http_user_agent^A$u_utrace^A$u_account";
        
        access_log  logs/access.log  tick;
    
        sendfile        on;
    
        keepalive_timeout  65;
    
        server {
            listen       80;
            server_name  localhost;
            location /log.gif {
                # Disguise the response as a gif file
                default_type image/gif;    
                # Turn off access_log here; the log is written through a subrequest
                access_log off;
            
                access_by_lua "
                    -- The user-tracking cookie is named __utrace
                    local uid = ngx.var.cookie___utrace        
                    if not uid then
                        -- If it is missing, generate one: md5(timestamp + IP + client info)
                        uid = ngx.md5(ngx.now() .. ngx.var.remote_addr .. (ngx.var.http_user_agent or ''))
                    end 
                    ngx.header['Set-Cookie'] = {'__utrace=' .. uid .. '; path=/'}
                    if ngx.var.arg_domain then
                    -- Log through a subrequest to /i-log, passing the parameters and the tracking cookie along
                        ngx.location.capture('/i-log?' .. ngx.var.args .. '&utrace=' .. uid)
                    end 
                ";  
            
                # This request must not be cached
                add_header Expires "Fri, 01 Jan 1980 00:00:00 GMT";
                add_header Pragma "no-cache";
                add_header Cache-Control "no-cache, max-age=0, must-revalidate";
            
                # Return an empty 1×1 gif image
                empty_gif;
            }   
        
            location /i-log {
                # Internal location; direct external access is not allowed
                internal;
            
                # Set the variables; note that the values must be unescaped
                set_unescape_uri $u_domain $arg_domain;
                set_unescape_uri $u_url $arg_url;
                set_unescape_uri $u_title $arg_title;
                set_unescape_uri $u_referrer $arg_referrer;
                set_unescape_uri $u_sh $arg_sh;
                set_unescape_uri $u_sw $arg_sw;
                set_unescape_uri $u_cd $arg_cd;
                set_unescape_uri $u_lang $arg_lang;
                set_unescape_uri $u_utrace $arg_utrace;
                set_unescape_uri $u_account $arg_account;
            
                # Enable logging of subrequests
                log_subrequest on;
                # Write the log to track.log in the tick format; in production it is best to add a buffer
                access_log /var/nginx_logs/track.log tick;
            
                # Output an empty string
                echo '';
            }
        }
    }
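The tracking-cookie rule in the Lua block, md5(timestamp + IP + client info), can be sketched in Node.js to show what kind of value ends up in the utrace field. This is only an illustration under assumptions: makeTraceId is a hypothetical name, and the string that ngx.now() produces when concatenated in Lua may be formatted differently from a JavaScript number, so the digests are not guaranteed to match what nginx generates.

```javascript
const crypto = require('crypto');

// Hypothetical helper mirroring ngx.md5(ngx.now() .. remote_addr .. user_agent).
function makeTraceId(nowSeconds, remoteAddr, userAgent) {
    // A missing User-Agent header is treated as an empty string.
    const input = String(nowSeconds) + remoteAddr + (userAgent || '');
    return crypto.createHash('md5').update(input, 'utf8').digest('hex');
}

const id = makeTraceId(1489718383.17, '192.168.154.2', 'Mozilla/5.0');
console.log(id); // 32 lowercase hex characters
```

Because the timestamp is part of the input, two first visits from the same IP and browser still receive different IDs; the cookie then keeps the ID stable across later requests.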

    Check the log:

    1489718383.170^A192.168.154.2^Aspark2^Ahttp://spark2/^A\xE6\xA3\xA3\xE6\xA0\xAD\xE3\x80\x89^A^A768^A1366^A24^Azh-CN^AMozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36^A0f21f45cf2c1ba459e9812ee3de17d8a^Azx5352
    1489718385.448^A192.168.154.2^Aspark2^Ahttp://spark2/content.html^A\xE5\x86\x85\xE5\xAE\xB9^Ahttp://spark2/^A768^A1366^A24^Azh-CN^AMozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36^A0f21f45cf2c1ba459e9812ee3de17d8a^Azx5352
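Each record holds thirteen fields in the order given by log_format tick. A minimal parsing sketch follows. It assumes the ^A shown above is caret notation for the control character U+0001 (the usual meaning of ^A), and the names TICK_FIELDS and parseTickLine are hypothetical. Pass '^A' as the separator if the file really contains the two literal characters.

```javascript
// Field order taken from the log_format tick definition.
var TICK_FIELDS = ['msec', 'remote_addr', 'domain', 'url', 'title', 'referrer',
                   'sh', 'sw', 'cd', 'lang', 'user_agent', 'utrace', 'account'];

// Hypothetical helper: splits one record into a named object.
// Returns null when the number of fields does not match the format.
function parseTickLine(line, separator) {
    var sep = separator === undefined ? '\u0001' : separator;
    var parts = line.split(sep);
    if (parts.length !== TICK_FIELDS.length) {
        return null;
    }
    var record = {};
    TICK_FIELDS.forEach(function (name, index) {
        record[name] = parts[index];
    });
    return record;
}

// Sample modelled on the first record above, with the title and user agent shortened.
var sample = ['1489718383.170', '192.168.154.2', 'spark2', 'http://spark2/', 'x', '',
              '768', '1366', '24', 'zh-CN', 'Mozilla/5.0',
              '0f21f45cf2c1ba459e9812ee3de17d8a', 'zx5352'].join('\u0001');
var record = parseTickLine(sample);
console.log(record.url, record.account);
```

Rejecting lines with the wrong field count is a cheap guard against truncated writes or requests that did not come from the probe.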

    4: Log collection

     Logstash configuration file

    input {
      file {
        type => "syslog"
        path => "/var/nginx_logs/track.log"
        discover_interval => 10
        start_position => "beginning" 
      }
        
    }
    output { stdout { codec => rubydebug } }

    [root@spark3 logstash]# bin/logstash -f config/log.conf

    Events printed to the screen by logstash

    {
           "message" => "1489718383.170^A192.168.154.2^Aspark2^Ahttp://spark2/^A\\xE6\\xA3\\xA3\\xE6\\xA0\\xAD\\xE3\\x80\\x89^A^A768^A1366^A24^Azh-CN^AMozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36^A0f21f45cf2c1ba459e9812ee3de17d8a^Azx5352",
          "@version" => "1",
        "@timestamp" => "2017-03-17T03:12:34.380Z",
              "path" => "/var/nginx_logs/track.log",
              "host" => "spark3",
              "type" => "syslog"
    }
    {
           "message" => "1489718385.448^A192.168.154.2^Aspark2^Ahttp://spark2/content.html^A\\xE5\\x86\\x85\\xE5\\xAE\\xB9^Ahttp://spark2/^A768^A1366^A24^Azh-CN^AMozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36^A0f21f45cf2c1ba459e9812ee3de17d8a^Azx5352",
          "@version" => "1",
        "@timestamp" => "2017-03-17T03:12:34.906Z",
              "path" => "/var/nginx_logs/track.log",
              "host" => "spark3",
              "type" => "syslog"
    }
    • Logstash filters can be used to filter the logs, and an output plugin can write them to storage such as Kafka or Elasticsearch for later processing.
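As an illustration of the kind of filtering meant here, the sketch below (plain JavaScript, not an actual logstash filter) drops records without an account and converts the $msec timestamp to ISO 8601 before the record would be shipped on. The function name toEvent and the output field names are assumptions, not part of the original setup.

```javascript
// Hypothetical helper: turns a parsed record into an event for downstream storage.
// Returns null for records that should be filtered out.
function toEvent(record) {
    if (!record.account) {
        return null; // no account id: cannot be attributed to a site
    }
    // $msec is seconds with millisecond resolution, e.g. "1489718383.170".
    var millis = Math.round(parseFloat(record.msec) * 1000);
    if (isNaN(millis)) {
        return null; // unparseable timestamp
    }
    return {
        timestamp: new Date(millis).toISOString(),
        account: record.account,
        url: record.url,
        referrer: record.referrer || null,
        screen: record.sw + 'x' + record.sh
    };
}

var event = toEvent({
    msec: '1489718383.170', account: 'zx5352', url: 'http://spark2/',
    referrer: '', sw: '1366', sh: '768'
});
console.log(JSON.stringify(event));
```

Math.round is used because multiplying a decimal fraction by 1000 in floating point can land just below the intended millisecond.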
  • Original article: https://www.cnblogs.com/huangll99/p/6564016.html