  • BlueKing (蓝鲸): Diagnosing and fixing a failed install of the SaaS component bk_monitor

    While installing with ./bk_install saas-o, the bk_monitor (BlueKing Monitor) component failed with "ERROR deploy failed: timeout".

    I then tried installing each component individually:

    # Fault self-healing
    [root@rbtnode1 install]# ./bk_install saas-o bk_fta_solutions
    
    # Log search
    [root@rbtnode1 install]# ./bk_install saas-o bk_log_search
    
    # Node management
    [root@rbtnode1 install]# ./bk_install saas-o bk_nodeman
    
    # Standard O&M (SOPS)
    [root@rbtnode1 install]# ./bk_install saas-o bk_sops
    
    # BlueKing Monitor
    [root@rbtnode1 install]# ./bk_install saas-o bk_monitor
    

    The first four (bk_fta_solutions, bk_log_search, bk_nodeman, bk_sops) all installed successfully. Only bk_monitor still failed, with the following error:

    [root@rbtnode1 install]# ./bk_install saas-o bk_monitor
    (output omitted)..
    2020-03-09 13:27:36 125  INFO    check deploy result. retry 132
    2020-03-09 13:27:39 125  INFO    check deploy result. retry 133
    2020-03-09 13:27:39 134  ERROR  deploy failed: timeout
    [192.168.1.6]20200309-132739 153   Deploy saas bk_monitor failed.
    [192.168.1.6]20200309-132739 47   Abort
    

    Looking further at the agent log (/data/bkce/logs/paas_agent/agent.log), the deployment task was simply terminated on timeout, with no other obvious error:

    2020/03/09 13:24:57 job.go:279: Building wheels for collected packages: gevent, netifaces, arrow, msgpack-python, wrapt, itypes, backports.shutil-get-terminal-size, simplegeneric, scandir
    
    2020/03/09 13:24:57 job.go:279:   Running setup.py bdist_wheel for gevent: started
    
    2020/03/09 13:27:32 job.go:279:   Running setup.py bdist_wheel for gevent: still running...
    
    2020/03/09 13:27:38 job.go:297: Deployment task execution timeout
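
    When a deploy ends in a bare timeout like the one above, one quick way to see what it was stuck on is to print the last log line before each timeout message. The following is a minimal sketch, assuming the agent log format shown in this post; the function name last_step_before_timeout is made up for illustration and is not part of BlueKing:

```shell
# Hypothetical helper: reads an agent log on stdin and prints the last
# non-empty line that preceded each "Deployment task execution timeout".
last_step_before_timeout() {
  awk '
    /Deployment task execution timeout/ { if (prev != "") print prev; next }
    NF { prev = $0 }   # remember the most recent non-blank line
  '
}
```

    Usage: last_step_before_timeout < /data/bkce/logs/paas_agent/agent.log. On the excerpt above it prints the "bdist_wheel for gevent: still running..." line, which shows that the build was still compiling gevent when the deadline hit.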
    

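    Some posts online blame insufficient machine resources and say that raising the CPU core count to 6 fixes it. In my tests that made no difference and the error was unchanged.
    I then asked in the official BlueKing user group, where support offered a solution from an earlier case.

    That case did not actually match my situation, though, because I never saw the error it describes.
    That evening I rethought the problem, borrowed the environment-cleanup steps from that case, and redeployed. This time agent.log showed a real error: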

    2020/03/10 02:29:54 job.go:279:   File "/opt/py27_e/lib/python2.7/site-packages/pymysql/connections.py", line 906, in _read_packet
    
    2020/03/10 02:29:54 job.go:279:     packet.check_error()
    
    2020/03/10 02:29:54 job.go:279:   File "/opt/py27_e/lib/python2.7/site-packages/pymysql/connections.py", line 367, in check_error
    
    2020/03/10 02:29:54 job.go:279:     err.raise_mysql_exception(self._data)
    
    2020/03/10 02:29:54 job.go:279:   File "/opt/py27_e/lib/python2.7/site-packages/pymysql/err.py", line 120, in raise_mysql_exception
    
    2020/03/10 02:29:54 job.go:279:     _check_mysql_exception(errinfo)
    
    2020/03/10 02:29:54 job.go:279:   File "/opt/py27_e/lib/python2.7/site-packages/pymysql/err.py", line 115, in _check_mysql_exception
    
    2020/03/10 02:29:54 job.go:279:     raise InternalError(errno, errorvalue)
    
    2020/03/10 02:29:54 job.go:279: django.db.utils.InternalError: (1049, u"Unknown database 'bkdata_monitor_alert'")
    
    2020/03/10 02:29:55 job.go:304: error waiting for Cmd exit status 1
    

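    Surprisingly, the error says that no database named bkdata_monitor_alert exists.
    Earlier agent logs confirm that table creation had succeeded, so the environment cleanup most likely dropped this component's database as well.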
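
    In a long traceback the database name is easy to miss. As a rough sketch, assuming the error text looks like the line above, the names can be pulled out of the log directly; missing_databases is a hypothetical helper name:

```shell
# Hypothetical helper: reads log text on stdin and prints each distinct
# database name reported by MySQL error 1049 ("Unknown database").
missing_databases() {
  grep -oE "Unknown database '[^']+'" \
    | sed -E "s/^Unknown database '([^']+)'$/\1/" \
    | sort -u
}
```

    Usage: missing_databases < /data/bkce/logs/paas_agent/agent.log. For the traceback above it prints bkdata_monitor_alert.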

    I won't dig into the cause here. Instead, list the current databases:

    MySQL [(none)]> show databases;
    +--------------------+
    | Database           |
    +--------------------+
    | information_schema |
    | bk_fta_solutions   |
    | bk_log_search      |
    | bk_monitor         |
    | bk_nodeman         |
    | bk_sops            |
    | bksuite_common     |
    | job                |
    | jobLog             |
    | mysql              |
    | open_paas          |
    | performance_schema |
    | sys                |
    +--------------------+
    13 rows in set (0.00 sec)
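
    The same check can be scripted instead of read by eye. The sketch below assumes the listing is either the boxed table shown here or one name per line; has_database is a hypothetical helper name:

```shell
# Hypothetical helper: succeeds if the name in $1 appears in a database
# listing on stdin (either the boxed table or one name per line).
has_database() {
  tr -d '| ' | grep -Fxq -- "$1"
}
```

    Usage, assuming the mysql client can already connect with your site-specific host and credentials: mysql -e 'show databases' | has_database bkdata_monitor_alert || echo missing. The match is on the whole name, so bk_monitor does not count as a hit for bkdata_monitor_alert.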
    

    Sure enough, bkdata_monitor_alert is missing. As a first attempt, create an empty database with that name:

    MySQL [(none)]> create database bkdata_monitor_alert character set utf8;
    Query OK, 1 row affected (0.01 sec)
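
    If this workaround has to be repeated, a variant that is safe to re-run may help. The sketch below only builds the SQL text; it uses MySQL's IF NOT EXISTS clause so a second run is a no-op, and it rejects names containing anything other than letters, digits, and underscores. The function name create_db_sql is hypothetical:

```shell
# Hypothetical helper: prints an idempotent CREATE DATABASE statement for
# the name in $1, rejecting names with characters outside [A-Za-z0-9_].
create_db_sql() {
  case ${1-} in
    ''|*[!A-Za-z0-9_]*)
      echo "invalid database name: ${1-}" >&2
      return 1
      ;;
  esac
  printf 'CREATE DATABASE IF NOT EXISTS `%s` CHARACTER SET utf8;\n' "$1"
}
```

    Usage, again assuming a mysql client that can already connect: create_db_sql bkdata_monitor_alert | mysql. The character set matches the statement used above.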
    

    Then retry the bk_monitor install:

    # Install bk_monitor again
    [root@rbtnode1 install]# ./bk_install saas-o bk_monitor
    
    # Watch agent.log
    [root@rbtnode1 paas_agent]# pwd
    /data/bkce/logs/paas_agent
    [root@rbtnode1 paas_agent]# tail -20f agent.log 
    

    This time agent.log ends with the job completing normally:

    (some log lines omitted)..
    
    2020/03/10 02:45:38 job.go:279:   Applying sessions.0001_initial... OK
    
    2020/03/10 02:45:38 job.go:279: ------change db success------
    
    2020/03/10 02:47:25 job.go:279: ------ start app server ------
    
    2020/03/10 02:47:25 job.go:279: su: ignore --preserve-environment, it's mutually exclusive to --login.
    
    2020/03/10 02:47:25 job.go:279: /etc/profile: line 77: ulimit: open files: cannot modify limit: Operation not permitted
    
    2020/03/10 02:47:25 job.go:279: /etc/profile: line 78: ulimit: open files: cannot modify limit: Operation not permitted
    
    2020/03/10 02:47:25 job.go:279: /etc/profile: line 79: ulimit: open files: cannot modify limit: Operation not permitted
    
    2020/03/10 02:47:25 job.go:279: /etc/profile: line 80: ulimit: open files: cannot modify limit: Operation not permitted
    
    2020/03/10 02:47:26 job.go:279: Last login: Mon Mar  9 14:01:54 CST 2020
    
    2020/03/10 02:47:28 job.go:279: Job Done
    
    2020/03/10 02:47:28 job.go:306: RunJob end ... ...
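
    The three end states seen in this post ("Job Done", "Deployment task execution timeout", and "error waiting for Cmd exit status") can be told apart mechanically. The following is a small sketch that assumes exactly those strings; other agent versions may word them differently, and deploy_outcome is a hypothetical name:

```shell
# Hypothetical helper: reads an agent log on stdin and prints the outcome
# of the most recent deploy run: success, timeout, failed, or unknown.
deploy_outcome() {
  awk '
    /Job Done/                          { state = "success" }
    /Deployment task execution timeout/ { state = "timeout" }
    /error waiting for Cmd exit status/ { state = "failed" }
    END { print (state == "" ? "unknown" : state) }
  '
}
```

    Usage: deploy_outcome < /data/bkce/logs/paas_agent/agent.log. The last marker in the file wins, so a log that contains an old timeout followed by a later "Job Done" reports success.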
    

    Back in the install terminal, bk_monitor had finally installed successfully:

    [root@rbtnode1 install]# ./bk_install saas-o bk_monitor
    (some log lines omitted)..
    
    2020-03-10 02:47:24 125  INFO    check deploy result. retry 107
    2020-03-10 02:47:26 125  INFO    check deploy result. retry 108
    2020-03-10 02:47:29 125  INFO    check deploy result. retry 109
    2020-03-10 02:47:30 131  INFO   bk_monitor have been deployed successfully
    [192.168.1.6]20200310-024730 151   SaaS application bk_monitor has been deployed successfully
    [192.168.1.6]20200310-024730 56   install saas-o(bk_monitor) done
    

    Logging in to the BlueKing workbench confirms that the BlueKing Monitor component is now installed and works normally.

  • Original post: https://www.cnblogs.com/jyzhao/p/12453294.html