  • Handling a Vertica Node Outage: A Case Study

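    Handling a Vertica node outage, in four steps:

    1. Check the database version and the state of each node
    2. Restarting the downed node the usual way fails
    3. Examine the downed node's detailed logs
    4. Locate the problem and fix it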

    1. Check the database version and the state of each node

    ```
    dbadmin=> select version();
    version
    ------------------------------------
    Vertica Analytic Database v6.1.3-7
    (1 row)

    dbadmin=> select node_name, node_id, node_state, node_address from nodes;
    node_name | node_id | node_state | node_address
    --------------------+-------------------+------------+---------------
    v_xxxxxxx_node0001 | 45035996273704980 | UP | 192.168.xx.xx
    v_xxxxxxx_node0002 | 45035996273719008 | DOWN | 192.168.xx.xx
    v_xxxxxxx_node0003 | 45035996273719012 | UP | 192.168.xx.xx
    v_xxxxxxx_node0004 | 45035996273719016 | UP | 192.168.xx.xx
    v_xxxxxxx_node0005 | 45035996273719020 | UP | 192.168.xx.xx
    (5 rows)
    ```

    
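    <h1 id="2"> 2. Restarting the downed node the usual way fails</h1>
    [Restarting the downed node the usual way](http://www.cnblogs.com/jyzhao/p/3855601.html) failed: admintools returned to the main menu immediately and reported the following error: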
    

    *** Restarting hosts for database xxxxxxx ***
    restart host 192.168.xx.xx with catalog v_xxxxxxx_node0002_catalog and data v_xxxxxxx_node0002_data
    issuing multi-node restart
    Spread does not seem to be running on 192.168.xx.xx. The database will not be started on this host.

    The following host(s) are not available: 192.168.xx.xx.
    You should get them running first. Operation can not be completed.
    result of multi-node restart: K-safe parameters not met.
    Restart Hosts result: K-safe parameters not met.

    
    <h1 id="3"> 3. Examine the downed node's detailed logs</h1>
    /opt/vertica/log/adminTools-dbadmin.log contains the following error entries:
    

    Apr 16 10:55:23 Error code 1 []
    Apr 16 10:56:19 dbadmin@192.168.xx.xx: /opt/vertica/bin/vertica --status -D /Vertica/xxxxxxx/v_xxxxxxx_node0001_catalog
    Apr 16 10:56:19 Error code 1 ['vertica process is not running']
    Apr 16 10:56:19 dbadmin@192.168.xx.xx: ps -aef | grep /opt/vertica/bin/vertica | grep "-D /Vertica/xxxxxxx/v_xxxxxxx_node0001_catalog" | grep -v "ps -aef"
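
    Entries with an empty message list (`Error code 1 []`) carry little information. As a rough sketch, the hypothetical helper `admintools_errors` below filters an admintools log down to the `Error code` lines that actually include a message. The path in the usage comment is the one from this case.

    ```shell
    # Print only the "Error code" lines whose bracketed message list is non-empty.
    admintools_errors() {
        grep 'Error code' "$1" | grep -v 'Error code [0-9]* \[]$'
    }

    # Usage (log path as in this case):
    #   admintools_errors /opt/vertica/log/adminTools-dbadmin.log | tail -n 20
    ```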

    
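    <h1 id="4"> 4. Locate the problem and fix it </h1>
    The evidence points to the spread process on the downed node not running properly; the restart output in step 2 says as much ("Spread does not seem to be running").
    So how is the spread process started?
    On Linux, spread runs as a service, controlled through its init script: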
    

    /etc/init.d/spreadd status
    /etc/init.d/spreadd start
    /etc/init.d/spreadd stop

    ## 4.1 Status of the spread process ##
    

    [root@Vertica02 log]# /etc/init.d/spreadd status
    spread dead but pid file exists
    Try using 'spreadd stop' to clear state

    On a healthy node, by contrast, the spread service is running:
    

    [root@Vertica01 ~]# /etc/init.d/spreadd status
    spread (pid 19256) is running...

    
    ## 4.2 Try to start the spread process ##
    

    [root@Vertica02 log]# /etc/init.d/spreadd start
    Starting spread daemon: [FAILED]

    Following the hint, try stop:
    [root@Vertica02 log]# /etc/init.d/spreadd stop
    Stopping spread daemon: [FAILED]

    [root@Vertica02 log]# /etc/init.d/spreadd help
    Usage: /etc/init.d/spreadd {start|stop|status|restart|condrestart}

    [root@Vertica02 log]# /etc/init.d/spreadd restart
    Stopping spread daemon: [FAILED]

    Starting spread daemon: spread (pid 53230) is running...
    [OK]
    [root@Vertica02 log]#
    [root@Vertica02 log]# /etc/init.d/spreadd status
    spread (pid 53230) is running...
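
    The sequence above (status reports a dead daemon, `restart` brings it back) can be wrapped in a small helper. This is only a sketch: `ensure_spread_running` is a hypothetical name, and it assumes that the init script's `status` action exits non-zero when spread is not running, which the transcript suggests but does not prove. The command used to invoke the script is passed as arguments, so it can be prefixed with `sudo` if needed.

    ```shell
    ensure_spread_running() {
        # Default to the init script used in this case when no command is given.
        [ "$#" -gt 0 ] || set -- /etc/init.d/spreadd

        if "$@" status >/dev/null 2>&1; then
            echo "spread already running"
            return 0
        fi

        # In this case 'restart' worked even though 'start' and 'stop' both failed.
        "$@" restart >/dev/null 2>&1

        if "$@" status >/dev/null 2>&1; then
            echo "spread restarted"
        else
            echo "spread still not running" >&2
            return 1
        fi
    }

    # Usage, as root on the downed node:
    #   ensure_spread_running
    ```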

    
    ## 4.3 Verify that the spread process is running normally ##
    

    [root@Vertica02 log]# ps -ef|grep spread|grep -v grep
    spread 53230 1 0 09:43 ? 00:00:00 /opt/vertica/spread/sbin/spread -n N192168062089 -c /opt/vertica/config/vspread.conf

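    With spread up again, the [usual restart and recovery of the downed node](http://www.cnblogs.com/jyzhao/p/3855601.html) can be attempted once more.
    
    Confirm that the downed node is now in the RECOVERING state.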
    

    dbadmin=> select node_name, node_id, node_state, node_address from nodes;
    node_name | node_id | node_state | node_address
    --------------------+-------------------+------------+---------------
    v_xxxxxxx_node0001 | 45035996273704980 | UP | 192.168.xx.xx
    v_xxxxxxx_node0002 | 45035996273719008 | RECOVERING | 192.168.xx.xx
    v_xxxxxxx_node0003 | 45035996273719012 | UP | 192.168.xx.xx
    v_xxxxxxx_node0004 | 45035996273719016 | UP | 192.168.xx.xx
    v_xxxxxxx_node0005 | 45035996273719020 | UP | 192.168.xx.xx
    (5 rows)

    Once the node's state changes from RECOVERING to UP, the recovery is complete.
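
    Watching for the RECOVERING to UP transition can be scripted. The following is a minimal sketch, not part of the original procedure: `node_state` is a hypothetical helper that reads the `|`-separated table printed by the query above on standard input and prints the state of one node. It assumes `node_state` is the third column, as in that query.

    ```shell
    node_state() {
        # $1: node name to look up; stdin: '|'-separated query output.
        awk -F'|' -v node="$1" '
            {
                name = $1; state = $3
                # Trim the padding around each column.
                gsub(/^[ \t]+|[ \t]+$/, "", name)
                gsub(/^[ \t]+|[ \t]+$/, "", state)
                if (name == node) { print state; found = 1 }
            }
            END { exit (found ? 0 : 1) }
        '
    }
    ```

    It could be fed from something like `vsql -c "select node_name, node_id, node_state, node_address from nodes"` in a loop with `sleep`, stopping once it prints `UP`.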
    
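    ## 4.4 Switch to the second recovery approach ##
    Unfortunately, the first (regular) recovery approach did not succeed: it ran all night, more than 10 hours, without finishing.
    To estimate how long recovery should take, I used dstat to watch the downed node's network receive rate, along with the growth rate of the data directory.
    At an average copy rate of roughly 100 MB/s, recovering the full 1.3 TB should take only about 4 hours.
    I therefore switched to the second approach: wipe every file on the downed node and recover it from scratch. My earlier write-up only described the idea, so here is a brief record of the actual procedure.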
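
    The arithmetic behind that estimate, as a small sketch. `estimate_hours` is a hypothetical helper; it takes the data size in TB and the average copy rate in MB/s, and treats 1 TB as 1024 * 1024 MB.

    ```shell
    estimate_hours() {
        # $1: data size in TB, $2: average copy rate in MB/s
        awk -v tb="$1" -v rate="$2" \
            'BEGIN { printf "%.1f\n", tb * 1024 * 1024 / rate / 3600 }'
    }

    estimate_hours 1.3 100   # prints 3.8
    ```

    A recovery that is still running after 10 hours is therefore well outside what the observed copy rate would predict.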
    ### 1. Stop the RECOVERING node. ###
    If a normal stop does not work, kill the node; both operations are available in admintools.
    

    dbadmin=> select node_name, node_id, node_state, node_address from nodes;
    node_name | node_id | node_state | node_address
    --------------------+-------------------+------------+---------------
    v_xxxxxxx_node0001 | 45035996273704980 | UP | 192.168.xx.xx
    v_xxxxxxx_node0002 | 45035996273719008 | DOWN | 192.168.xx.xx
    v_xxxxxxx_node0003 | 45035996273719012 | UP | 192.168.xx.xx
    v_xxxxxxx_node0004 | 45035996273719016 | UP | 192.168.xx.xx
    v_xxxxxxx_node0005 | 45035996273719020 | UP | 192.168.xx.xx
    (5 rows)

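    ### 2. On the downed node, rename the original Vertica directory to xxxxxxx_old with mv, then delete it in the background (so that recovery can start as soon as possible). ###
    `nohup rm -rf /Vertica/xxxxxxx_old &`
    ### 3. Recreate the directories (mind the permissions) and copy vertica.conf into the catalog directory. ###
    `mkdir -p /Vertica/xxxxxxx/v_xxxxxxx_node0002_catalog && mkdir -p /Vertica/xxxxxxx/v_xxxxxxx_node0002_data`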
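
    Steps 2 and 3 can be put together as in the sketch below. It is an illustration under assumptions rather than the exact commands used: `reset_node_dirs` is a hypothetical helper, the directory layout follows the paths shown in this article, and `vertica.conf` is assumed to be taken from the old catalog directory. It deletes data, so it should only ever be run on the node being rebuilt, with Vertica stopped on it.

    ```shell
    reset_node_dirs() {
        base="$1"   # parent directory, e.g. /Vertica
        db="$2"     # database directory name
        node="$3"   # node name, e.g. v_xxxxxxx_node0002

        # Step 2: move the old directory aside so recovery can start right away.
        mv "$base/$db" "$base/${db}_old" || return 1

        # Step 3: recreate empty catalog and data directories.
        mkdir -p "$base/$db/${node}_catalog" "$base/$db/${node}_data" || return 1

        # Assumption: vertica.conf is copied from the old catalog directory.
        cp "$base/${db}_old/${node}_catalog/vertica.conf" \
           "$base/$db/${node}_catalog/" || return 1

        # Ownership must match the database administrator account, for example:
        #   chown -R dbadmin "$base/$db"    (the user name is an assumption)

        # Delete the old copy in the background once the config has been saved.
        nohup rm -rf "$base/${db}_old" >/dev/null 2>&1 &
    }
    ```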
    
    ### 4. From node 1, use admintools to start the downed node so that it enters recovery. ###
    

    *** Restarting hosts for database xxxxxxx ***
    restart host 192.168.xx.xx with catalog v_xxxxxxx_node0002_catalog and data v_xxxxxxx_node0002_data
    issuing multi-node restart
    Node Status: v_xxxxxxx_node0002: (DOWN)
    Node Status: v_xxxxxxx_node0002: (DOWN)
    Node Status: v_xxxxxxx_node0002: (INITIALIZING)
    Node Status: v_xxxxxxx_node0002: (RECOVERING)
    Node Status: v_xxxxxxx_node0002: (RECOVERING)
    Node Status: v_xxxxxxx_node0002: (RECOVERING)
    Node Status: v_xxxxxxx_node0002: (RECOVERING)
    Node Status: v_xxxxxxx_node0002: (RECOVERING)
    Node Status: v_xxxxxxx_node0002: (RECOVERING)
    Node Status: v_xxxxxxx_node0002: (RECOVERING)
    Nodes UP: v_xxxxxxx_node0001, v_xxxxxxx_node0003, v_xxxxxxx_node0005, v_xxxxxxx_node0004
    Nodes DOWN: v_xxxxxxx_node0002 (may be still initializing).
    result of multi-node restart: 7
    Restart Hosts result: 7
    Vertica Analytic Database 6.1.3-7 Administration Tools

    ### 5. Monitor the recovery state. ###
    

    dbadmin=> select node_name, node_id, node_state, node_address from nodes;
    node_name | node_id | node_state | node_address
    --------------------+-------------------+------------+---------------
    v_xxxxxxx_node0001 | 45035996273704980 | UP | 192.168.xx.xx
    v_xxxxxxx_node0002 | 45035996273719008 | RECOVERING | 192.168.xx.xx
    v_xxxxxxx_node0003 | 45035996273719012 | UP | 192.168.xx.xx
    v_xxxxxxx_node0004 | 45035996273719016 | UP | 192.168.xx.xx
    v_xxxxxxx_node0005 | 45035996273719020 | UP | 192.168.xx.xx
    (5 rows)

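    Likewise, once the node's state changes from RECOVERING to UP, the recovery is complete.
    
    Then came another hiccup: of the 1.3 TB on this single node, recovery reached 1.2 TB and then stopped making progress.
    

    $ df -h /Vertica/
    Filesystem Size Used Avail Use% Mounted on
    /dev/mapper/vg_vertica02-LogVol00
    3.6T 1.2T 2.3T 34% /Vertica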
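
    A stall like this can be told apart from slow progress by sampling the used space twice. A minimal sketch, using the hypothetical helper `used_kb_growth`: it relies on `df -P`, which keeps each filesystem on a single line even when the device name is long (unlike the wrapped output above).

    ```shell
    used_kb_growth() {
        # $1: mount point or directory, $2: seconds between the two samples
        before=$(df -Pk "$1" | awk 'NR == 2 { print $3 }')
        sleep "$2"
        after=$(df -Pk "$1" | awk 'NR == 2 { print $3 }')
        # Average growth of used space in KB per second.
        echo $(( (after - before) / $2 ))
    }

    # Usage: a value near 0 over a full minute suggests the copy has stalled.
    #   used_kb_growth /Vertica 60
    ```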

    
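    Meanwhile, dstat showed that the network copy traffic had also dropped to almost nothing.
    It turned out that a data-loading job was running during the recovery; I stopped the job and restarted the recovery.
    Given the data volume, I also dropped some historical partitions of large tables before recovering, to shorten the time. The recovery finally succeeded.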
    

    dbadmin=> select node_name, node_id, node_state, node_address from nodes;
    node_name | node_id | node_state | node_address
    --------------------+-------------------+------------+---------------
    v_xxxxxxx_node0001 | 45035996273704980 | UP | 192.168.xx.xx
    v_xxxxxxx_node0002 | 45035996273719008 | UP | 192.168.xx.xx
    v_xxxxxxx_node0003 | 45035996273719012 | UP | 192.168.xx.xx
    v_xxxxxxx_node0004 | 45035996273719016 | UP | 192.168.xx.xx
    v_xxxxxxx_node0005 | 45035996273719020 | UP | 192.168.xx.xx
    (5 rows)

  • Original article: https://www.cnblogs.com/jyzhao/p/4543555.html