  • Ganglia + Nagios: monitoring Hadoop resources and alerting

    This article draws mainly on the following two posts:

    http://quenlang.blog.51cto.com/4813803/1571635

    http://www.cnblogs.com/mchina/archive/2013/02/20/2883404.html#!comments

    1 Downloads

    ganglia-3.6.0.tar.gz

    ganglia-web-3.6.2.tar.gz

    nagios : http://sourceforge.net/projects/nagios/files/nagios-4.x/nagios-4.1.1/nagios-4.1.1.tar.gz/download

    nagios-plugins : http://www.nagios-plugins.org/download/nagios-plugins-2.1.1.tar.gz

    nrpe : http://sourceforge.net/projects/nagios/files/nrpe-2.x/nrpe-2.15/nrpe-2.15.tar.gz/download

    php-5.4.10.tar.gz

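    2 Installing Ganglia

    On hadoop1, install Ganglia's gmetad, gmond, and ganglia-web.

    2.1 Checking and installing dependencies

    Create a file named ganglia.rpm that lists the following dependency packages: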

    $ vim ganglia.rpm
    apr-devel
    apr-util
    check-devel
    cairo-devel
    pango-devel
    libxml2-devel
    glib2-devel
    dbus-devel
    freetype-devel
    fontconfig-devel
    gcc-c++
    expat-devel
    python-devel
    rrdtool
    rrdtool-devel libXrender-devel zlib libart_lgpl libpng dejavu-lgc-sans-mono-fonts dejavu-sans-mono-fonts perl-ExtUtils-CBuilder perl-ExtUtils-MakeMaker

    Check which of these packages are already installed:

    $ rpm -q `cat ganglia.rpm`
    package apr-devel is not installed
    apr-util-1.3.9-3.el6_0.1.x86_64
    check-devel-0.9.8-1.1.el6.x86_64
    cairo-devel-1.8.8-3.1.el6.x86_64
    pango-devel-1.28.1-10.el6.x86_64
    libxml2-devel-2.7.6-14.el6_5.2.x86_64
    glib2-devel-2.28.8-4.el6.x86_64
    dbus-devel-1.2.24-7.el6_3.x86_64
    freetype-devel-2.3.11-14.el6_3.1.x86_64
    fontconfig-devel-2.8.0-5.el6.x86_64
    gcc-c++-4.4.7-11.el6.x86_64
    package expat-devel is not installed
    python-devel-2.6.6-52.el6.x86_64
    libXrender-devel-0.9.8-2.1.el6.x86_64
    zlib-1.2.3-29.el6.x86_64
    libart_lgpl-2.3.20-5.1.el6.x86_64
    libpng-1.2.49-1.el6_2.x86_64
    package dejavu-lgc-sans-mono-fonts is not installed
    package dejavu-sans-mono-fonts is not installed
    perl-ExtUtils-CBuilder-0.27-136.el6.x86_64
    perl-ExtUtils-MakeMaker-6.55-136.el6.x86_64

    Use yum install to install whichever packages are missing from the machine.
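
    Picking the missing packages out of the rpm -q output can also be scripted. The following is a minimal Python 3 sketch, not part of the original setup; the function name missing_packages is made up for illustration, and the sample text is a shortened copy of the output above.

```python
import re


def missing_packages(rpm_output):
    """Return the package names that `rpm -q` reported as not installed."""
    pattern = re.compile(r"^package (\S+) is not installed$")
    missing = []
    for line in rpm_output.splitlines():
        match = pattern.match(line.strip())
        if match:
            missing.append(match.group(1))
    return missing


# Shortened sample of the rpm -q output shown above.
sample = """\
package apr-devel is not installed
apr-util-1.3.9-3.el6_0.1.x86_64
package expat-devel is not installed
"""

# Print a ready-to-run yum command for the missing packages.
print("yum install -y " + " ".join(missing_packages(sample)))
```

    The printed line can be pasted into a root shell, or the list can be fed to whatever provisioning tool is in use.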

    confuse also has to be installed.

    Download it from http://www.nongnu.org/confuse/

    $ tar -zxf confuse-2.7.tar.gz
    $ cd confuse-2.7
    $ ./configure CFLAGS=-fPIC --disable-nls
    $ make && make install

    2.2 Installing Ganglia

    Install on hadoop1:

    $ tar -xvf /home/hadoop/ganglia-3.6.0.tar.gz -C /opt/soft/
    $ cd /opt/soft/ganglia-3.6.0
    ## Install gmetad
    $ ./configure --prefix=/usr/local/ganglia --with-gmetad --with-libpcre=no --enable-gexec --enable-status --sysconfdir=/etc/ganglia
    $ make && make install
    $ cp gmetad/gmetad.init /etc/init.d/gmetad
    $ cp /usr/local/ganglia/sbin/gmetad /usr/sbin/
    $ chkconfig --add gmetad
    ## Install gmond
    $ cp gmond/gmond.init /etc/init.d/gmond
    $ cp /usr/local/ganglia/sbin/gmond /usr/sbin/
    $ gmond --default_config>/etc/ganglia/gmond.conf
    $ chkconfig --add gmond

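    gmetad and gmond are now installed. Next comes ganglia-web, which first needs php and httpd:

    yum install php httpd -y

    In the httpd configuration file /etc/httpd/conf/httpd.conf, change only the listening port, to 8080:

    Listen 8080

    Install ganglia-web: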

    $ tar xf ganglia-web-3.6.2.tar.gz  -C /opt/soft/
    $ cd /opt/soft/
    $ chmod -R 777 ganglia-web-3.6.2/
    $ mv ganglia-web-3.6.2/ /var/www/html/ganglia
    $ cd /var/www/html/ganglia
    $ useradd www-data
    $ make install
    $ chmod 777 /var/lib/ganglia-web/dwoo/cache/
    $ chmod 777 /var/lib/ganglia-web/dwoo/compiled/

    ganglia-web is now installed. Edit conf_default.php to set the ganglia-web directory and the rrds data directory by changing the following two lines:

    # Where gmetad stores the rrd archives.
    $conf['gmetad_root'] = "/var/www/html/ganglia"; ## change to the web application's install directory
    $conf['rrds'] = "/var/lib/ganglia/rrds";        ## path where the rrd data is stored

    Create the rrd data directory and set its ownership:

    $ mkdir /var/lib/ganglia/rrds -p
    $ chown nobody:nobody /var/lib/ganglia/rrds/ -R

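    That completes all the Ganglia installation work on hadoop1. The next step is to install Ganglia's gmond client on every other node.

    Installing gmond on the other nodes

    Here too the dependencies go in first, then gmond. Every node is installed the same way, so a script does the job: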

    $ vim install_ganglia.sh
    
    #!/bin/sh
    
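    # Install dependencies. I already knew which ones my machines lacked, so only those are installed here; list whatever your own environment needs.
    yum install -y apr-devel expat-devel rrdtool rrdtool-devel
    
    mkdir -p /opt/soft && cd /opt/soft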
    tar -xvf /home/hadoop/confuse-2.7.tar.gz
    cd confuse-2.7
    ./configure CFLAGS=-fPIC --disable-nls
    make && make install
    cd /opt/soft
    # Install ganglia gmond
    tar -xvf /home/hadoop/ganglia-3.6.0.tar.gz
    cd ganglia-3.6.0/
    ./configure --prefix=/usr/local/ganglia --with-libpcre=no --enable-gexec --enable-status --sysconfdir=/etc/ganglia
    make && make install
    cp gmond/gmond.init /etc/init.d/gmond
    cp /usr/local/ganglia/sbin/gmond /usr/sbin/
    gmond --default_config>/etc/ganglia/gmond.conf
    chkconfig --add gmond

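    Copy this script to every node and run it.

    2.3 Configuring Ganglia

    Configuration is split between the server and the clients: the server's configuration file is gmetad.conf and the clients' is gmond.conf.

    Start with gmetad.conf on hadoop1; this file exists only on hadoop1.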

    $ vi  /etc/ganglia/gmetad.conf
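    ## Define the data source's name and the address to poll. gmetad pulls the data that gmond collects from this address and stores it in the rrd data directory.
    ## "hadoop cluster" is a name of your own choosing.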
    data_source "hadoop cluster" 192.168.0.101:8649

    Next, configure gmond.conf:

    $ head -n 80 /etc/ganglia/gmond.conf
    
    /* This configuration is as close to 2.5.x default behavior as possible
       The values closely match ./gmond/metric.h definitions in 2.5.x */
    globals {
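      daemonize = yes        ## run as a daemon
      setuid = yes
      user = nobody          ## user that gmond runs as
      debug_level = 0        ## set to 1 to print debug output at startup
      max_udp_msg_len = 1472
      mute = no              ## "mute": if yes, this node no longer sends the data it collects onto the network
      deaf = no              ## "deaf": if yes, this node no longer receives packets sent by other nodes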
      allow_extra_data = yes
      host_dmax = 86400 /*secs. Expires (removes from web interface) hosts in 1 day */
      host_tmax = 20 /*secs */
      cleanup_threshold = 300 /*secs */
      gexec = no
      # By default gmond will use reverse DNS resolution when displaying your hostname
      # Uncommeting following value will override that value.
      # override_hostname = "mywebserver.domain.com"
      # If you are not using multicast this value should be set to something other than 0.
      # Otherwise if you restart aggregator gmond you will get empty graphs. 60 seconds is reasonable
      send_metadata_interval = 0 /*secs */
     
    }
     
    /*
     * The cluster attributes specified will be used as part of the <CLUSTER>
     * tag that will wrap all hosts collected by this instance.
     */
    cluster {
      name = "hadoop cluster"    ## name of the cluster
      owner = "nobody"           ## owner of the cluster
      latlong = "unspecified"
      url = "unspecified"
    }
     
    /* The host section describes attributes of the host, like the location */
    host {
      location = "unspecified"
    }
     
    /* Feel free to specify as many udp_send_channels as you like.  Gmond
       used to only support having a single channel */
    udp_send_channel {
      #bind_hostname = yes # Highly recommended, soon to be default.
                           # This option tells gmond to use a source address
                           # that resolves to the machine's hostname.  Without
                           # this, the metrics may appear to come from any
                           # interface and the DNS names associated with
                           # those IPs will be used to create the RRDs.
    #  mcast_join = 239.2.11.71    ## comment this line out in unicast mode
      host = 192.168.0.101    ## unicast mode: the host that receives the data
      port = 8649             ## port the receiving host listens on
      ttl = 1
    }
     
    /* You can specify as many udp_recv_channels as you like as well. */
    udp_recv_channel {
      #mcast_join = 239.2.11.71    ## comment this line out in unicast mode
      port = 8649
      #bind = 239.2.11.71          ## comment this line out in unicast mode
      retry_bind = true
      # Size of the UDP buffer. If you are handling lots of metrics you really
      # should bump it up to e.g. 10MB or even higher.
      # buffer = 10485760
    }
     
    /* You can specify as many tcp_accept_channels as you like to share
       an xml description of the state of the cluster */
    tcp_accept_channel {
      port = 8649
      # If you want to gzip XML output
      gzip_output = no
    }
     
    /* Channel to receive sFlow datagrams */
    #udp_recv_channel {
    #  port = 6343
    #}
     
    /* Optional sFlow settings */

    The gmetad.conf and gmond.conf files on hadoop1 are now done. Simply scp hadoop1's gmond.conf to the same path on every other node, overwriting the gmond.conf already there.

    2.4 Starting Ganglia

    Start the gmond service on all nodes:

    /etc/init.d/gmond start

    Start the gmetad and httpd services on hadoop1:

    /etc/init.d/gmetad start
    /etc/init.d/httpd start

    2.5 Open hadoop1:8080/ganglia in a browser; the Ganglia web interface should appear.

    Ganglia is now configured.
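
    Besides the web page, the tcp_accept_channel on port 8649 offers a quick way to see which hosts are reporting, because gmond writes an XML description of the cluster to any client that connects. The following Python 3 sketch is an illustration only: the function names and the hand-written sample (including the address 192.168.0.102) are assumptions, and real dumps carry many more attributes.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host, port=8649, timeout=5.0):
    """Read the XML dump that gmond writes on its tcp_accept_channel."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        while True:
            data = conn.recv(65536)
            if not data:  # the daemon closes the connection after the dump
                break
            chunks.append(data)
    return b"".join(chunks)


def hosts_reporting(xml_data):
    """Map each cluster name to the sorted host names found in the dump."""
    root = ET.fromstring(xml_data)
    return {
        cluster.get("NAME"): sorted(h.get("NAME") for h in cluster.iter("HOST"))
        for cluster in root.iter("CLUSTER")
    }


# Hand-written sample in the general shape of a gmond dump (not real output).
SAMPLE = """<GANGLIA_XML VERSION="3.6.0" SOURCE="gmond">
<CLUSTER NAME="hadoop cluster" OWNER="nobody">
<HOST NAME="a02" IP="192.168.0.102"><METRIC NAME="mem_free" VAL="1024"/></HOST>
<HOST NAME="a01" IP="192.168.0.101"><METRIC NAME="mem_free" VAL="2048"/></HOST>
</CLUSTER>
</GANGLIA_XML>"""

print(hosts_reporting(SAMPLE))
# Against a live daemon: hosts_reporting(fetch_gmond_xml("192.168.0.101"))
```

    A host missing from the result usually points at its gmond.conf or at a firewall between it and hadoop1.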

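    3 Configuring Hadoop

    At this point Ganglia only monitors basic host performance, not Hadoop itself. Hadoop's configuration files have to be changed next. The example below uses the files on hadoop1; the other nodes should get their copies from hadoop1. The first file to edit is hadoop-metrics2.properties in the Hadoop configuration directory: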

    $ cd /usr/local/hadoop-2.6.0/etc/hadoop/
    $ vim hadoop-metrics2.properties
    # for Ganglia 3.1 support
     *.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
    
     *.sink.ganglia.period=10
    
    # default for supportsparse is false
     *.sink.ganglia.supportsparse=true
    
    *.sink.ganglia.slope=jvm.metrics.gcCount=zero,jvm.metrics.memHeapUsedM=both
    *.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40
    
    # Tag values to use for the ganglia prefix. If not defined no tags are used.
    # If '*' all tags are used. If specifiying multiple tags separate them with 
    # commas. Note that the last segment of the property name is the context name.
    #
    #*.sink.ganglia.tagsForPrefix.jvm=ProcesName
    #*.sink.ganglia.tagsForPrefix.dfs=
    #*.sink.ganglia.tagsForPrefix.rpc=
    #*.sink.ganglia.tagsForPrefix.mapred=
    
    namenode.sink.ganglia.servers=192.168.0.101:8649
    datanode.sink.ganglia.servers=192.168.0.101:8649
    resourcemanager.sink.ganglia.servers=192.168.0.101:8649
    nodemanager.sink.ganglia.servers=192.168.0.101:8649
    mrappmaster.sink.ganglia.servers=192.168.0.101:8649
    jobhistoryserver.sink.ganglia.servers=192.168.0.101:8649

    Copy the file to all nodes and restart the Hadoop cluster.

    The Hadoop metrics should now show up in the Ganglia monitoring pages.
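
    If the metrics of one daemon do not appear, a mistyped key in hadoop-metrics2.properties is a likely suspect, since a key such as "serve" instead of "servers" may go unnoticed. The following Python 3 sketch checks that every daemon prefix has a servers entry; the function name and the sample text are made up for illustration.

```python
def sinks_missing_servers(properties_text, prefixes):
    """Return the daemon prefixes without a '<prefix>.sink.ganglia.servers' key."""
    keys = set()
    for raw in properties_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        keys.add(line.split("=", 1)[0].strip())
    return [p for p in prefixes if p + ".sink.ganglia.servers" not in keys]


PREFIXES = ["namenode", "datanode", "resourcemanager",
            "nodemanager", "mrappmaster", "jobhistoryserver"]

# Sample with a deliberate typo in the last key.
sample = """\
namenode.sink.ganglia.servers=192.168.0.101:8649
datanode.sink.ganglia.servers=192.168.0.101:8649
resourcemanager.sink.ganglia.servers=192.168.0.101:8649
nodemanager.sink.ganglia.servers=192.168.0.101:8649
mrappmaster.sink.ganglia.servers=192.168.0.101:8649
jobhistoryserver.sink.ganglia.serve=192.168.0.101:8649
"""

print(sinks_missing_servers(sample, PREFIXES))
```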

    4 Installing Nagios

    4.1 The hadoop1 machine

    Create the nagios user:

    # useradd -s /sbin/nologin nagios
    # mkdir /usr/local/nagios
    # chown -R nagios.nagios /usr/local/nagios

    4.1.1 Building and installing Nagios

    $ cd /opt/soft
    $ tar zxvf nagios-4.1.1.tar.gz
    $ cd nagios-4.1.1
    $ ./configure --prefix=/usr/local/nagios
    $ make all
    $ make install
    $ make install-init
    $ make install-config
    $ make install-commandmode
    $ make install-webconf

    Change to the install path (/usr/local/nagios here) and check that the five directories etc, bin, sbin, share, and var exist. If they do, the program has been installed correctly.

    4.1.2 Building and installing nagios-plugins

    $ cd /opt/soft
    $ tar zxvf nagios-plugins-2.1.1.tar.gz
    $ cd nagios-plugins-2.1.1
    $ mkdir -p /usr/local/nagios
    $ ./configure --prefix=/usr/local/nagios
    $ make && make install

    4.1.3 Installing the check_nrpe plugin

    $ cd /opt/soft/
    $ tar -xvf /home/hadoop/nrpe-2.15.tar.gz
    $ cd nrpe-2.15/
    $ ./configure
    $ make all
    $ make install-plugin

    4.2 The datanode nodes

    The datanodes only need nagios-plugins and nrpe.

    Since every node is set up the same way, a script does the job:

    #!/bin/sh
    
    adduser nagios
    
    cd /opt/soft
    tar xvf /home/hadoop/nagios-plugins-2.1.1.tar.gz
    cd nagios-plugins-2.1.1
    mkdir /usr/local/nagios
    ./configure --prefix=/usr/local/nagios
    make && make install
    
    chown nagios.nagios /usr/local/nagios
    chown -R nagios.nagios /usr/local/nagios/libexec
    
    # Install xinetd. Skip this if your machine already has it.
    yum install xinetd -y
    cd ../
    tar xvf /home/hadoop/nrpe-2.15.tar.gz
    cd nrpe-2.15
    ./configure
    make all
    make install-daemon
    make install-daemon-config
    make install-xinetd

    After the installation finishes, edit nrpe.cfg:

    $ vim /usr/local/nagios/etc/nrpe.cfg 
    log_facility=daemon
    pid_file=/var/run/nrpe.pid
    ## Port that the NRPE daemon listens on
    server_port=5666
    nrpe_user=nagios
    nrpe_group=nagios
    ## Address of the Nagios server host
    allowed_hosts=xx.xxx.x.xx
    dont_blame_nrpe=0
    allow_bash_command_substitution=0
    debug=0
    command_timeout=60
    connection_timeout=300
     
    ## Load average
    command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
    ## Number of users currently logged in
    command[check_users]=/usr/local/nagios/libexec/check_users -w 5 -c 10
    ## Free space on the root partition
    command[check_sda2]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/sda2
    ## MySQL status
    command[check_mysql]=/usr/local/nagios/libexec/check_mysql -H localhost -P 3306 -d kora -u kora -p upbjsxt
    ## Whether the host is alive
    command[check_ping]=/usr/local/nagios/libexec/check_ping -H localhost -w  100.0,20% -c  500.0,60%
    ## Total number of processes on the system
    command[check_total_procs]=/usr/local/nagios/libexec/check_procs -w 150 -c 200
    ## Swap usage (thresholds are percentages of free swap; without the % sign check_swap reads them as bytes)
    command[check_swap]=/usr/local/nagios/libexec/check_swap -w 20% -c 10%

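    Only commands defined in this file on the monitored machine can be queried from the monitoring machine (hadoop1) through the nrpe plugin. In other words, any metric you want to monitor on a machine must first be defined here.

    Sync the file to all the other datanode nodes.

    The installation also created the file /etc/xinetd.d/nrpe.

    Edit that file and add the monitoring host's IP address after only_from.

    Edit /etc/services and add the NRPE service.

    Restart the xinetd service:

    # service xinetd restart

    Check whether NRPE has started.

    Port 5666 should now be listening.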
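
    The listening check can be run from hadoop1 against every datanode with a few lines of Python 3. This is a minimal sketch under the assumption that a plain TCP connect is enough to tell whether xinetd accepts connections on the port; port_is_open is a made-up helper name, and the demo uses a throwaway local listener so that it runs anywhere.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or name resolution failed
        return False


# Demo against a temporary local listener instead of a real NRPE daemon.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]
print(port_is_open("127.0.0.1", demo_port))
listener.close()
```

    For the cluster itself, call port_is_open(host, 5666) for each datanode host name. A True result only shows that the port accepts connections; whether the Nagios server is allowed to query NRPE still depends on allowed_hosts and only_from.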

    4.3 Configuration

    On hadoop1:

    To integrate Nagios with Ganglia, copy the Ganglia plugin shipped in the Ganglia source package into the Nagios plugin directory on hadoop1:

    $ cd /opt/soft/ganglia-3.6.0
    $ cp contrib/check_ganglia.py /usr/local/nagios/libexec/

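    By default check_ganglia.py only handles the case where a metric's actual value rises above the critical threshold. The case where the value falls below the critical threshold has to be added, which is the final else branch in the code below: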

    $ vim  /usr/local/nagios/libexec/check_ganglia.py
    
      if critical > warning:
        if value >= critical:
          print "CHECKGANGLIA CRITICAL: %s is %.2f" % (metric, value)
          sys.exit(2)
        elif value >= warning:
          print "CHECKGANGLIA WARNING: %s is %.2f" % (metric, value)
          sys.exit(1)
        else:
          print "CHECKGANGLIA OK: %s is %.2f" % (metric, value)
          sys.exit(0)
      else:
        if critical >=value:
          print "CHECKGANGLIA CRITICAL: %s is %.2f" % (metric, value)
          sys.exit(2)
        elif warning >=value:
          print "CHECKGANGLIA WARNING: %s is %.2f" % (metric, value)
          sys.exit(1)
        else:
          print "CHECKGANGLIA OK: %s is %.2f" % (metric, value)
          sys.exit(0)

    The end result should look like the code above.
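
    The rule the patched plugin follows is that the order of the two thresholds decides the direction of the check: critical above warning means high values are bad, otherwise low values are bad. The following standalone Python 3 sketch restates that rule as a function for experimenting with threshold pairs; it is not the plugin itself, and the name classify is made up.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # Nagios plugin exit codes


def classify(value, warning, critical):
    """Return the Nagios status for value given the two thresholds."""
    if critical > warning:
        # Higher is worse, e.g. an average latency.
        if value >= critical:
            return CRITICAL
        if value >= warning:
            return WARNING
        return OK
    # Lower is worse, e.g. free memory.
    if value <= critical:
        return CRITICAL
    if value <= warning:
        return WARNING
    return OK


# mem_free style check: warn at 200 or less, critical at 50 or less.
print(classify(61840868.0, 200, 50))
# Latency style check: warn at 10 or more, critical at 50 or more.
print(classify(60, 10, 50))
```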

    Configuring the hosts and their checks on hadoop1

    Before any configuration, the directory looks like this:

    $ cd /usr/local/nagios/etc/objects/
    $ ll
    total 48
    -rw-rw-r-- 1 nagios nagios  8010 Sep 11 14:59 commands.cfg
    -rw-rw-r-- 1 nagios nagios  2138 Sep 11 11:35 contacts.cfg
    -rw-rw-r-- 1 nagios nagios  5375 Sep 11 11:35 localhost.cfg
    -rw-rw-r-- 1 nagios nagios  3096 Sep 11 11:35 printer.cfg
    -rw-rw-r-- 1 nagios nagios  3265 Sep 11 11:35 switch.cfg
    -rw-rw-r-- 1 nagios nagios 10621 Sep 11 11:35 templates.cfg
    -rw-rw-r-- 1 nagios nagios  3180 Sep 11 11:35 timeperiods.cfg
    -rw-rw-r-- 1 nagios nagios  3991 Sep 11 11:35 windows.cfg

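    Note: in .cfg files, a comment that follows a directive on the same line must start with a semicolon (;), not #. I used # at first and spent a long time failing to find out why things kept breaking.

    Editing commands.cfg

    Append the following to the end of the file: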

    # 'check_ganglia' command definition
    define command{
            command_name    check_ganglia
            command_line    $USER1$/check_ganglia.py -h $HOSTADDRESS$ -m $ARG1$ -w $ARG2$ -c $ARG3$
            }
    
    # 'check_nrpe' command definition
    define command{
            command_name    check_nrpe
            command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
            }

    Editing templates.cfg

    I have 18 datanode machines; to save space only 5 are shown here. Add the rest in the same way.

    define service { 
            use generic-service 
            name ganglia-service1     ; referenced in service1.cfg
            hostgroup_name a01    ; referenced in hadoop1.cfg
            service_groups ganglia-metrics1    ; referenced in service1.cfg
            register        0
    }
     
    define service { 
            use generic-service    
            name ganglia-service2    ; referenced in service2.cfg
            hostgroup_name a02    ; referenced in hadoop2.cfg
            service_groups ganglia-metrics2    ; referenced in service2.cfg
            register        0
    }
    define service { 
            use generic-service 
            name ganglia-service3    ; referenced in service3.cfg
            hostgroup_name a03    ; referenced in hadoop3.cfg
            service_groups ganglia-metrics3    ; referenced in service3.cfg
            register        0
    }
    define service { 
            use generic-service 
            name ganglia-service4    ; referenced in service4.cfg
            hostgroup_name a04    ; referenced in hadoop4.cfg
            service_groups ganglia-metrics4    ; referenced in service4.cfg
            register        0
    }
    define service { 
            use generic-service     
            name ganglia-service5    ; referenced in service5.cfg
            hostgroup_name a05    ; referenced in hadoop5.cfg
            service_groups ganglia-metrics5    ; referenced in service5.cfg
            register        0
    }
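
    Writing 18 near-identical template blocks by hand invites typos, so they can be generated instead. The following Python 3 sketch assumes the naming pattern seen above continues (ganglia-serviceN, ganglia-metricsN, and host groups a01 through a18); the function name service_templates is made up for illustration.

```python
TEMPLATE = """define service {
        use generic-service
        name ganglia-service%(n)d
        hostgroup_name a%(n)02d
        service_groups ganglia-metrics%(n)d
        register        0
}
"""


def service_templates(count):
    """Return `count` service template blocks, numbered from 1."""
    return "\n".join(TEMPLATE % {"n": n} for n in range(1, count + 1))


# Print the first two blocks as a preview; use 18 for the full cluster.
print(service_templates(2))
```

    The output can be appended to templates.cfg after checking that the host group names match the real ones.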

    Configuring hadoop1.cfg

    This file does not exist by default; copy it from localhost.cfg:

    $ cp localhost.cfg hadoop1.cfg
    $ vim hadoop1.cfg
    define host{   
            use                     linux-server 
            host_name               a01
            alias                   a01
            address                a01
            }
     
    define hostgroup { 
            hostgroup_name  a01
            alias  a01
            members a01
            }
    define service{
            use                             local-service
            host_name                       a01
            service_description             PING
            check_command                   check_ping!100,20%!500,60%
            }
     
    define service{
            use                             local-service
            host_name                      a01
            service_description             Root Partition
            check_command                   check_local_disk!20%!10%!/
    #       contact_groups                  admins
            }
     
    define service{
            use                             local-service
            host_name                       a01
            service_description             Current Users
            check_command                   check_local_users!20!50
            }
     
    define service{
            use                             local-service
            host_name                       a01
            service_description             Total Processes
            check_command                   check_local_procs!550!650!RSZDT
            }
     
    define service{ 
            use                             local-service         
            host_name                       a01
            service_description             Current Load
            check_command                   check_local_load!5.0,4.0,3.0!10.0,6.0,4.0
    } 

    Configuring service1.cfg

    service1.cfg does not exist by default; create it:

    $ vim service1.cfg
    
    define servicegroup { 
            servicegroup_name ganglia-metrics1
            alias Ganglia Metrics1
    } 
    ## check_ganglia here is the check_ganglia command declared in commands.cfg
    define service{ 
            use                             ganglia-service1
            service_description             Free Memory
            check_command                   check_ganglia!mem_free!200!50
    } 
     
    define service{
            use                             ganglia-service1
            service_description             NameNode Sync
            check_command                   check_ganglia!dfs.namenode.SyncsAvgTime!10!50
    }

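    Configuring hadoop2.cfg

    Note that every check run through the check_nrpe plugin must be declared in nrpe.cfg on hadoop2.

    In other words, the command named in each service's check_command only works if it is declared in that machine's nrpe.cfg, and the names must match exactly.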

    $ cp localhost.cfg hadoop2.cfg
    $ vim hadoop2.cfg
    define host{
            use                     linux-server            ; Name of host template to use
                                                            ; This host definition will inherit all variables that are defined
                                                            ; in (or inherited by) the linux-server host template definition.
            host_name               a02
            alias                   a02
            address                 a02
            }
    
    # Define an optional hostgroup for Linux machines
    
    define hostgroup{
            hostgroup_name  a02; The name of the hostgroup
            alias           a02 ; Long name of the group
            members         a02    ; Comma separated list of hosts that belong to this group
            }
    
    # Define a service to "ping" the local machine
    
    define service{
            use                             local-service         ; Name of service template to use
            host_name                       a02
            service_description             PING
            check_command                   check_nrpe!check_ping
            }
    
    
    # Define a service to check the disk space of the root partition
    # on the local machine.  Warning if < 20% free, critical if
    # < 10% free space on partition.
    
    define service{
            use                             local-service         ; Name of service template to use
            host_name                       a02
            service_description             Root Partition
            check_command                   check_nrpe!check_sda2
            }
    
    
    
    # Define a service to check the number of currently logged in
    # users on the local machine.  Warning if > 20 users, critical
    # if > 50 users.
    
    define service{
            use                             local-service         ; Name of service template to use
            host_name                       a02
            service_description             Current Users
            check_command                   check_nrpe!check_users
            }
    
    
    # Define a service to check the number of currently running procs
    # on the local machine.  Warning if > 250 processes, critical if
    # > 400 users.
    
    define service{
            use                             local-service         ; Name of service template to use
            host_name                       a02
            service_description             Total Processes
            check_command                   check_nrpe!check_total_procs
            }
    
    define service{
            use                             local-service         ; Name of service template to use
            host_name                       a02
            service_description             Current Load
            check_command                   check_nrpe!check_load
            }
    
    
    
    # Define a service to check the swap usage the local machine. 
    # Critical if less than 10% of swap is free, warning if less than 20% is free
    
    define service{
            use                             local-service         ; Name of service template to use
            host_name                       a02
            service_description             Swap Usage
            check_command                   check_nrpe!check_swap
            }

    Once hadoop2 is set up, make 16 copies. The datanode configurations are essentially identical; only the host name differs.

    $ for i in {3..18};do cp hadoop2.cfg hadoop$i.cfg;done

    Then change the host name inside each copy. This step is not repeated in the rest of the article.

    Configuring service2.cfg

    Create the file and fill it in:

    $ vim service2.cfg 
    define servicegroup {
            servicegroup_name ganglia-metrics2
            alias Ganglia Metrics2
    }
    
    define service{
            use     ganglia-service2
            service_description     Free Memory
            check_command   check_ganglia!mem_free!200!50
    }
    
    define service{
            use     ganglia-service2
            service_description     RegionServer_Get
            check_command   check_ganglia!yarn.NodeManagerMetrics.AvailableVCores!!7
    }
    
    define service{
            use     ganglia-service2
            service_description     DataNode_Heartbeat
            check_command   check_ganglia!dfs.datanode.HeartbeatsAvgTime!15!40
    }

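    Once service2 is set up, make 16 copies. The datanode configurations are essentially identical; only servicegroup_name and use differ.

    $ for i in {3..18};do cp service2.cfg service$i.cfg;done

    Then change the numbers in each copy to match.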
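
    The renumbering can be done in the same pass as the copy. The following Python 3 sketch rewrites the three numbered names that appear in service2.cfg; the function name renumber and the shortened sample text are assumptions made for illustration.

```python
import re

# The three names in service2.cfg that end in the host number 2.
NUMBERED = re.compile(r"\b(ganglia-service|ganglia-metrics|Ganglia Metrics)2\b")


def renumber(text, n):
    """Return text with the trailing 2 of each numbered name replaced by n."""
    return NUMBERED.sub(lambda m: m.group(1) + str(n), text)


sample = """\
define servicegroup {
        servicegroup_name ganglia-metrics2
        alias Ganglia Metrics2
}
define service{
        use     ganglia-service2
}
"""

print(renumber(sample, 3))
```

    To produce the real files, read service2.cfg once and write renumber(text, i) to service<i>.cfg for i from 3 to 18. Thresholds such as the 200 and 50 in the check commands are left untouched because only the three names are matched.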

    Editing nagios.cfg

    $ vim  ../nagios.cfg
    cfg_file=/usr/local/nagios/etc/objects/commands.cfg
    cfg_file=/usr/local/nagios/etc/objects/contacts.cfg
    cfg_file=/usr/local/nagios/etc/objects/timeperiods.cfg
    cfg_file=/usr/local/nagios/etc/objects/templates.cfg
    
    # Include the host files
    cfg_file=/usr/local/nagios/etc/objects/hadoop1.cfg
    cfg_file=/usr/local/nagios/etc/objects/hadoop2.cfg
    cfg_file=/usr/local/nagios/etc/objects/hadoop3.cfg
    cfg_file=/usr/local/nagios/etc/objects/hadoop4.cfg
    cfg_file=/usr/local/nagios/etc/objects/hadoop5.cfg
    cfg_file=/usr/local/nagios/etc/objects/hadoop6.cfg
    cfg_file=/usr/local/nagios/etc/objects/hadoop7.cfg
    cfg_file=/usr/local/nagios/etc/objects/hadoop8.cfg
    cfg_file=/usr/local/nagios/etc/objects/hadoop9.cfg
    cfg_file=/usr/local/nagios/etc/objects/hadoop10.cfg
    cfg_file=/usr/local/nagios/etc/objects/hadoop11.cfg
    cfg_file=/usr/local/nagios/etc/objects/hadoop12.cfg
    cfg_file=/usr/local/nagios/etc/objects/hadoop13.cfg
    cfg_file=/usr/local/nagios/etc/objects/hadoop14.cfg
    cfg_file=/usr/local/nagios/etc/objects/hadoop15.cfg
    cfg_file=/usr/local/nagios/etc/objects/hadoop16.cfg
    cfg_file=/usr/local/nagios/etc/objects/hadoop17.cfg
    cfg_file=/usr/local/nagios/etc/objects/hadoop18.cfg
    
    # Include the service check files
    cfg_file=/usr/local/nagios/etc/objects/service1.cfg
    cfg_file=/usr/local/nagios/etc/objects/service2.cfg
    cfg_file=/usr/local/nagios/etc/objects/service3.cfg
    cfg_file=/usr/local/nagios/etc/objects/service4.cfg
    cfg_file=/usr/local/nagios/etc/objects/service5.cfg
    cfg_file=/usr/local/nagios/etc/objects/service6.cfg
    cfg_file=/usr/local/nagios/etc/objects/service7.cfg
    cfg_file=/usr/local/nagios/etc/objects/service8.cfg
    cfg_file=/usr/local/nagios/etc/objects/service9.cfg
    cfg_file=/usr/local/nagios/etc/objects/service10.cfg
    cfg_file=/usr/local/nagios/etc/objects/service11.cfg
    cfg_file=/usr/local/nagios/etc/objects/service12.cfg
    cfg_file=/usr/local/nagios/etc/objects/service13.cfg
    cfg_file=/usr/local/nagios/etc/objects/service14.cfg
    cfg_file=/usr/local/nagios/etc/objects/service15.cfg
    cfg_file=/usr/local/nagios/etc/objects/service16.cfg
    cfg_file=/usr/local/nagios/etc/objects/service17.cfg
    cfg_file=/usr/local/nagios/etc/objects/service18.cfg
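
    With 36 cfg_file entries it is easy to reference a file that was never created. The following Python 3 sketch lists such entries before Nagios is asked to validate anything; the function name missing_cfg_files is made up, and the demo uses a temporary directory as a stand-in for /usr/local/nagios/etc/objects.

```python
import os
import tempfile


def missing_cfg_files(nagios_cfg_text):
    """Return every cfg_file path in the text that is not a file on disk."""
    missing = []
    for raw in nagios_cfg_text.splitlines():
        line = raw.strip()
        if line.startswith("cfg_file="):
            path = line.split("=", 1)[1].strip()
            if not os.path.isfile(path):
                missing.append(path)
    return missing


# Demo: one file exists, one does not.
with tempfile.TemporaryDirectory() as objects_dir:
    present = os.path.join(objects_dir, "hadoop1.cfg")
    absent = os.path.join(objects_dir, "service18.cfg")
    open(present, "w").close()
    demo_cfg = "# hosts\ncfg_file=%s\ncfg_file=%s\n" % (present, absent)
    print(missing_cfg_files(demo_cfg))
```

    On hadoop1 the text would come from reading /usr/local/nagios/etc/nagios.cfg.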

    Verify that the configuration is correct:

    $ pwd
    /usr/local/nagios/etc
    
    $ ../bin/nagios -v nagios.cfg 
    
    Nagios Core 4.1.1
    Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
    Copyright (c) 1999-2009 Ethan Galstad
    Last Modified: 08-19-2015
    License: GPL
    
    Website: https://www.nagios.org
    Reading configuration data...
       Read main config file okay...
       Read object config files okay...
    
    Running pre-flight check on configuration data...
    
    Checking objects...
        Checked 161 services.
        Checked 18 hosts.
        Checked 18 host groups.
        Checked 18 service groups.
        Checked 1 contacts.
        Checked 1 contact groups.
        Checked 26 commands.
        Checked 5 time periods.
        Checked 0 host escalations.
        Checked 0 service escalations.
    Checking for circular paths...
        Checked 18 hosts
        Checked 0 service dependencies
        Checked 0 host dependencies
        Checked 5 timeperiods
    Checking global event handlers...
    Checking obsessive compulsive processor commands...
    Checking misc settings...
    
    Total Warnings: 0
    Total Errors:   0
    
    Things look okay - No serious problems were detected during the pre-flight check

    There are no errors, so the nagios service on hadoop1 can now be started:

    $ /etc/init.d/nagios start
    Starting nagios: done.

    nrpe was already started on the datanodes earlier, so test whether hadoop1 can communicate with nrpe on them:

    $ for i in {10..28};do /usr/local/nagios/libexec/check_nrpe -H xx.xxx.x.$i;done
    NRPE v2.15
    NRPE v2.15
    NRPE v2.15
    NRPE v2.15
    NRPE v2.15
    NRPE v2.15
    NRPE v2.15
    NRPE v2.15
    NRPE v2.15
    NRPE v2.15
    NRPE v2.15
    NRPE v2.15
    NRPE v2.15
    NRPE v2.15
    NRPE v2.15
    NRPE v2.15

    Communication works. Next, verify that the check_ganglia.py plugin works:

    $ /usr/local/nagios/libexec/check_ganglia.py -h a01 -m mem_free -w 200 -c 50
    CHECKGANGLIA OK: mem_free is 61840868.00

    It works. Now open the Nagios web page to see whether monitoring is working:

    localhost:8080/nagios

    4.4 Configuring email alerts

    First check whether sendmail is installed on the server:

    $ rpm -q sendmail
    $ yum install sendmail # install sendmail if it is missing
    $ service sendmail restart # restart sendmail

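    Sending mail to external addresses would require the server to run its own mail server, which is a hassle and uses a lot of resources. Instead, configure it to use an existing SMTP server.

    The configuration lives in /etc/mail.rc: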

    $ vim /etc/mail.rc
    
    set from=systeminformation@xxx.com
    set smtp=mail.xxx.com smtp-auth-user=systeminformation smtp-auth-password=111111 smtp-auth=login

    Once that is configured, test from the command line whether mail can be sent:

    $ echo "hello world" |mail -s "test" pingjie@xxx.com

    If the message arrives in your mailbox, mail delivery is working.
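
    The same relay can be exercised from Python 3, which helps when the mail command fails without a useful message. This is a sketch only: the helper names are made up, the addresses are the placeholders from the text above, and it assumes the relay accepts a plain login on port 25 without TLS, which may not hold for your server.

```python
import smtplib
from email.message import EmailMessage


def build_alert(sender, recipient, subject, body):
    """Build a plain-text message equivalent to the mail test above."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_via_relay(msg, host, user, password, port=25):
    """Log in to the existing SMTP server and hand the message over."""
    with smtplib.SMTP(host, port, timeout=10) as smtp:
        smtp.login(user, password)
        smtp.send_message(msg)


msg = build_alert("systeminformation@xxx.com", "pingjie@xxx.com",
                  "test", "hello world")
print(msg["Subject"])
# To actually send (values as in /etc/mail.rc):
# send_via_relay(msg, "mail.xxx.com", "systeminformation", "111111")
```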

    Next, configure email alerts in Nagios:

    $ vim /usr/local/nagios/etc/objects/contacts.cfg
    define contact{
            contact_name                    nagiosadmin             ; Short name of user
            use                             generic-contact         ; Inherit default values from generic-contact template (defined above)
            alias                           Nagios Admin            ; Full name of user
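            ## Time periods during which notifications are sent
            service_notification_period     24x7
            host_notification_period        24x7
            ## States and events that trigger a notification
            service_notification_options    w,u,c,r,f,s
            host_notification_options       d,u,r,f,s
            ## Send notifications by email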
            service_notification_commands   notify-service-by-email
            host_notification_commands      notify-host-by-email
            email                           pingjie@xxx.com       ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
            }
    
    
    # We only have one contact in this simple configuration file, so there is
    # no need to create more than one contact group.
    
    define contactgroup{
            contactgroup_name       admins
            alias                   Nagios Administrators
            members                 nagiosadmin
            }

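    That completes the configuration.

    Monitoring Hadoop processes with scripts

    1. Script to monitor the datanodes

    It uses Python to read the HDFS status page and a regular expression to pick out the Live Nodes section.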

    #!/usr/bin/env python
    # Nagios check (Python 2): alert when the number of live datanodes drops.
    
    import sys
    from optparse import OptionParser
    import urllib
    import re
    
    def get_value():
        # Fetch the NameNode status page and extract the "Live Nodes" count.
        urlItem = urllib.urlopen("http://namenode:50070/dfshealth.jsp")
        html = urlItem.read()
        urlItem.close()
        return float(re.findall(r'.+Live Nodes</a> <td id="col2"> :<td id="col3">\s+(\d+)\s+\(Decommissioned: \d+\)<tr class="rowNormal">.+', html)[0])
    
    if __name__ == '__main__':
    
        parser = OptionParser(usage="%prog [-w] [-c]", version="%prog 1.0")
        parser.add_option("-w", "--warning", type="int", dest="w", default=16)
        parser.add_option("-c", "--critical", type="int", dest="c", default=15)
        (options, args) = parser.parse_args()
    
        # Fewer live nodes is worse, so -w must be the higher threshold.
        if(options.c >= options.w):
            print '-w must be greater than -c'
            sys.exit(3)  # UNKNOWN in Nagios: invalid arguments
    
        value = get_value()
    
        if(value <= options.c ) :
            print 'CRITICAL - Live Nodes %d' %(value)
            sys.exit(2)
        elif(value <= options.w):
            print 'WARNING - Live Nodes %d' %(value)
            sys.exit(1)
        else:
            print 'OK - Live Nodes %d' %(value)
            sys.exit(0)

    2. Script to monitor DFS space:

    #!/usr/bin/env python
    
    import commands
    import sys
    from optparse import OptionParser
    import urllib
    import re
    
    def get_dfs_free_percent():
        urlItem = urllib.urlopen("http://namenode:50070/dfshealth.jsp")
        html = urlItem.read()
        urlItem.close()
        return float(re.findall(r'.+<td id="col1"> DFS Remaining%<td id="col2"> :<td id="col3">\s+(\d+\.\d+)%<tr class="rowAlt">.+', html)[0])
    
    if __name__ == '__main__':
    
        parser = OptionParser(usage="%prog [-w] [-c]", version="%prog 1.0")
        parser.add_option("-w", "--warning", type="int", dest="w", default=30, help="warn when DFS remaining percent is at or below this")
        parser.add_option("-c", "--critical", type="int", dest="c", default=20, help="critical when DFS remaining percent is at or below this")
        (options, args) = parser.parse_args()
    
        # Less free space is worse, so -w must be the higher threshold.
        if(options.c >= options.w):
            print '-w must be greater than -c'
            sys.exit(3)  # UNKNOWN in Nagios: invalid arguments
    
        dfs_free_percent = get_dfs_free_percent()
    
        if(dfs_free_percent <= options.c ) :
            print 'CRITICAL - DFS free %d%%' %(dfs_free_percent)
            sys.exit(2)
        elif(dfs_free_percent <= options.w):
            print 'WARNING - DFS free %d%%' %(dfs_free_percent)
            sys.exit(1)
        else:
            print 'OK - DFS free %d%%' %(dfs_free_percent)
            sys.exit(0)

    If a script fails, open a Python shell and debug the regular expression against the actual HTML.
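
    Such a debugging session might look like the following Python 3 sketch. The HTML fragment is made up so that it matches the Live Nodes pattern; it is not copied from a real dfshealth.jsp page, so paste the real page source in its place when debugging.

```python
import re

# The core of the Live Nodes pattern, with the digit classes written as \d.
LIVE_NODES = re.compile(
    r'Live Nodes</a> <td id="col2"> :<td id="col3">\s+(\d+)\s+'
    r'\(Decommissioned: \d+\)')

# Made-up fragment shaped to match the pattern (an assumption, not real output).
sample_html = ('<tr class="rowAlt"><td id="col1"> <a href="#">Live Nodes</a> '
               '<td id="col2"> :<td id="col3"> 18 (Decommissioned: 0)'
               '<tr class="rowNormal">')

match = LIVE_NODES.search(sample_html)
if match:
    print("live nodes:", match.group(1))
else:
    print("pattern did not match; compare it with the HTML piece by piece")
```

    When the pattern does not match, shortening it from the right until it matches again shows which part of the page differs.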

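    Copy these two scripts to /usr/local/nagios/libexec/, the directory that $USER1$ points to by default, because the command definitions below invoke them as $USER1$/check_hadoop_datanode.py and $USER1$/check_hadoop_dfs.py.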

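    Try running each script on its own from the command line, for example ./check_hadoop_datanode.py. If it fails with this error:

    : No such file or directory

    the file has Windows (CRLF) line endings. Open it in vim, run :set ff=unix in command mode, and save.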
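
    The error appears because a trailing carriage return on the first line becomes part of the interpreter name, so the system looks for "python\r". The conversion can also be done without an editor; the following Python 3 sketch shows the idea on an in-memory example, with to_unix_newlines as a made-up helper name.

```python
def to_unix_newlines(data):
    """Replace Windows CRLF line endings with Unix LF in a bytes object."""
    return data.replace(b"\r\n", b"\n")


# A two-line script saved with CRLF line endings.
script = b"#!/usr/bin/env python\r\nprint 'ok'\r\n"
print(to_unix_newlines(script))
```

    For a real file, read it in binary mode, convert it, and write it back in binary mode so that nothing else in the file changes.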

    3. Updating the Nagios configuration

    Add the following two commands to commands.cfg:

    $ vim /usr/local/nagios/etc/objects/commands.cfg
    define command{
            command_name    check_datanode
            command_line    $USER1$/check_hadoop_datanode.py -w $ARG1$ -c $ARG2$
            }
    
    define command{
            command_name    check_dfs
            command_line    $USER1$/check_hadoop_dfs.py -w $ARG1$ -c $ARG2$
            }

    Edit service1.cfg and add the following two services:

    $ vim service1.cfg 
    define service{
            use     ganglia-service1
            service_description     Live DataNodes
            check_command   check_datanode!16!15
    }
    
    
    define service{
            use     ganglia-service1
            service_description     DFS Free Space
            check_command   check_dfs!30!20
    }

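    Done.

    5 Problems encountered

    5.1 A Ganglia metric that looks wrong

    Problem: to test Nagios alerting I killed the datanode process on one node, yet Nagios kept showing that datanode as healthy. Since Nagios takes these metrics from Ganglia, I checked Ganglia and found it also looked normal. This is strange: why would a killed datanode keep reporting heartbeats?

    Solution: none so far; if you know the cause, please tell me. As a workaround, Nagios monitors the processes through the scripts above instead.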

  • Original article: https://www.cnblogs.com/pingjie/p/4809489.html