  • A Case Study: Handling a Vertica Node Outage

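    A case study in handling a Vertica node outage:

    1. Check the database version and the state of each node
    2. Starting the down node the regular way fails
    3. Examine the detailed logs for the down node
    4. Locate and fix the problem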

    1. Check the database version and the state of each node

    dbadmin=> select version();
    version
    ------------------------------------
    Vertica Analytic Database v6.1.3-7
    (1 row)

    dbadmin=> select node_name, node_id, node_state, node_address from nodes;
    node_name | node_id | node_state | node_address
    --------------------+-------------------+------------+---------------
    v_xxxxxxx_node0001 | 45035996273704980 | UP | 192.168.xx.xx
    v_xxxxxxx_node0002 | 45035996273719008 | DOWN | 192.168.xx.xx
    v_xxxxxxx_node0003 | 45035996273719012 | UP | 192.168.xx.xx
    v_xxxxxxx_node0004 | 45035996273719016 | UP | 192.168.xx.xx
    v_xxxxxxx_node0005 | 45035996273719020 | UP | 192.168.xx.xx
    (5 rows)
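The same check can be scripted so that any node that is not UP stands out. The sketch below is only illustrative: `list_not_up` is a hypothetical helper name, and it assumes the query result arrives as unaligned `node_name|node_state` lines (for example from `vsql -At`, whose flags are assumed here to behave like their psql counterparts).

```shell
# list_not_up: read "node_name|node_state" lines on stdin and print the
# names of all nodes whose state is anything other than UP.
# Assumed usage (not from the original post):
#   vsql -At -c "select node_name, node_state from nodes" | list_not_up
list_not_up() {
    awk -F'|' 'NF >= 2 && $2 != "UP" { print $1 }'
}
```

An empty result means every node reported UP.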

    
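    <h1 id="2"> 2. Starting the down node the regular way fails</h1>
    [Starting the down node the regular way](http://www.cnblogs.com/jyzhao/p/3855601.html) failed: admintools returned to the main menu immediately, and the following error was reported: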
    

    *** Restarting hosts for database xxxxxxx ***
    restart host 192.168.xx.xx with catalog v_xxxxxxx_node0002_catalog and data v_xxxxxxx_node0002_data
    issuing multi-node restart
    Spread does not seem to be running on 192.168.xx.xx. The database will not be started on this host.

    The following host(s) are not available: 192.168.xx.xx.
    You should get them running first. Operation can not be completed.
    result of multi-node restart: K-safe parameters not met.
    Restart Hosts result: K-safe parameters not met.

    
    <h1 id="3"> 3. Examine the detailed logs for the down node</h1>
    The file /opt/vertica/log/adminTools-dbadmin.log contained the following error entries:
    

    Apr 16 10:55:23 Error code 1 []
    Apr 16 10:56:19 dbadmin@192.168.xx.xx: /opt/vertica/bin/vertica --status -D /Vertica/xxxxxxx/v_xxxxxxx_node0001_catalog
    Apr 16 10:56:19 Error code 1 ['vertica process is not running']
    Apr 16 10:56:19 dbadmin@192.168.xx.xx: ps -aef | grep /opt/vertica/bin/vertica | grep "-D /Vertica/xxxxxxx/v_xxxxxxx_node0001_catalog" | grep -v "ps -aef"
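To pull such entries out of a long log quickly, something like the following can help. This is a minimal sketch: `recent_errors` is a hypothetical helper, and it simply assumes that failures are logged on lines containing the text `Error code`, as in the excerpt above.

```shell
# recent_errors [LOGFILE] [COUNT]
# Print the last COUNT lines containing "Error code", prefixed with their
# line numbers so the surrounding context can be located in the file.
recent_errors() {
    log=${1:-/opt/vertica/log/adminTools-dbadmin.log}
    count=${2:-20}
    grep -n 'Error code' "$log" | tail -n "$count"
}
```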

    
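    <h1 id="4"> 4. Locate and fix the problem </h1>
    At this point it is fairly clear that the spread process on the down node is not running properly.
    So how do you start the spread process?
    On Linux, spread runs as a service: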
    

    /etc/init.d/spreadd status
    /etc/init.d/spreadd start
    /etc/init.d/spreadd stop

    ## 4.1 Status of the spread process ##
    

    [root@Vertica02 log]# /etc/init.d/spreadd status
    spread dead but pid file exists
    Try using 'spreadd stop' to clear state

    On a healthy node, by contrast, the spread service is running normally:
    

    [root@Vertica01 ~]# /etc/init.d/spreadd status
    spread (pid 19256) is running...

    
    ## 4.2 Try to start the spread process ##
    

    [root@Vertica02 log]# /etc/init.d/spreadd start
    Starting spread daemon: [FAILED]

    Following the hint, try stop:
    [root@Vertica02 log]# /etc/init.d/spreadd stop
    Stopping spread daemon: [FAILED]

    [root@Vertica02 log]# /etc/init.d/spreadd help
    Usage: /etc/init.d/spreadd {start|stop|status|restart|condrestart}

    [root@Vertica02 log]# /etc/init.d/spreadd restart
    Stopping spread daemon: [FAILED]

    Starting spread daemon: spread (pid 53230) is running...
    [  OK  ]
    [root@Vertica02 log]#
    [root@Vertica02 log]# /etc/init.d/spreadd status
    spread (pid 53230) is running...
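The sequence that worked here (check status, then `restart` rather than `start` or `stop`, then check status again) can be wrapped in a small function. This is a sketch under assumptions, not a definitive procedure: `ensure_spread_running` is a hypothetical name, and it assumes the init script's `status` action exits with 0 only when the daemon is running, as init scripts conventionally do.

```shell
# ensure_spread_running [INIT_SCRIPT]
# Restart spread only if "status" reports that it is not running.
# Returns 0 when spread is running at the end, 1 otherwise.
ensure_spread_running() {
    svc=${1:-/etc/init.d/spreadd}
    if "$svc" status >/dev/null 2>&1; then
        echo "spread already running"
        return 0
    fi
    # The stop half of restart may report a failure when only a stale pid
    # file is left behind, so ignore its exit code and re-check status.
    "$svc" restart >/dev/null 2>&1 || true
    if "$svc" status >/dev/null 2>&1; then
        echo "spread restarted"
        return 0
    fi
    echo "spread still not running" >&2
    return 1
}
```

Run it as root on the down node; with no argument it uses `/etc/init.d/spreadd`, the path shown above.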

    
    ## 4.3 Verify that the spread process is running normally ##
    

    [root@Vertica02 log]# ps -ef|grep spread|grep -v grep
    spread 53230 1 0 09:43 ? 00:00:00 /opt/vertica/spread/sbin/spread -n N192168062089 -c /opt/vertica/config/vspread.conf

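    Once the spread process is up, you can retry [starting and recovering the down node the regular way](http://www.cnblogs.com/jyzhao/p/3855601.html).
    
    Confirm that the down node is now in the RECOVERING state: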
    

    dbadmin=> select node_name, node_id, node_state, node_address from nodes;
    node_name | node_id | node_state | node_address
    --------------------+-------------------+------------+---------------
    v_xxxxxxx_node0001 | 45035996273704980 | UP | 192.168.xx.xx
    v_xxxxxxx_node0002 | 45035996273719008 | RECOVERING | 192.168.xx.xx
    v_xxxxxxx_node0003 | 45035996273719012 | UP | 192.168.xx.xx
    v_xxxxxxx_node0004 | 45035996273719016 | UP | 192.168.xx.xx
    v_xxxxxxx_node0005 | 45035996273719020 | UP | 192.168.xx.xx
    (5 rows)

    When the down node's state changes from RECOVERING to UP, the recovery is complete.
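Rather than re-running the query by hand, the wait for RECOVERING to become UP can be polled. The following is a minimal sketch: `wait_until_up` is a hypothetical helper that takes the name of any command or function printing the node's current state, so the actual query is left to the caller.

```shell
# wait_until_up STATE_CMD [INTERVAL_SECONDS] [MAX_POLLS]
# Calls STATE_CMD repeatedly until it prints "UP".
# Returns 0 once the node is UP, 1 if MAX_POLLS is reached first.
# Assumed example of a STATE_CMD (vsql flags are an assumption):
#   node_state() { vsql -At -c "select node_state from nodes where node_name = 'v_xxxxxxx_node0002'"; }
wait_until_up() {
    get_state=$1
    interval=${2:-60}
    max_polls=${3:-720}
    i=0
    while [ "$i" -lt "$max_polls" ]; do
        state=$("$get_state")
        echo "poll $i: $state"
        if [ "$state" = "UP" ]; then
            return 0
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    return 1
}
```

The defaults (one poll per minute, 720 polls) give up after 12 hours, which is an arbitrary choice for illustration.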
    
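    ## 4.4 Switch to the second recovery approach ##
    Unfortunately, the first (regular) recovery approach did not succeed: after running all night, more than 10 hours, the node had still not recovered.
    To estimate how long recovery should take, I used dstat to monitor the inbound network traffic rate on the down node, and also watched how fast the data directory was growing.
    A rough estimate: at an average copy rate of 100 MB/s, recovering the full 1.3 TB should take only about 4 hours.
    So I switched to the second approach: wipe all files on the down node and recover it from scratch. My earlier write-up only outlined the idea; here is a brief record of the actual procedure.
    ### 1. Stop the node that is in RECOVERING state. ###
    If a regular stop does not work, kill the node; both operations are available in admintools.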
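The 4-hour estimate is simple arithmetic: data size divided by copy rate. As a small sketch (`eta_hours` is a hypothetical helper, and it assumes 1 T = 1024 × 1024 M, which the original does not specify):

```shell
# eta_hours SIZE_TB RATE_MB_PER_S
# Print the estimated transfer time in hours, to one decimal place.
eta_hours() {
    awk -v size_tb="$1" -v rate="$2" \
        'BEGIN { printf "%.1f\n", size_tb * 1024 * 1024 / rate / 3600 }'
}
```

`eta_hours 1.3 100` prints `3.8`, consistent with the rough figure of 4 hours and far below the 10+ hours the first attempt had already consumed.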
    

    dbadmin=> select node_name, node_id, node_state, node_address from nodes;
    node_name | node_id | node_state | node_address
    --------------------+-------------------+------------+---------------
    v_xxxxxxx_node0001 | 45035996273704980 | UP | 192.168.xx.xx
    v_xxxxxxx_node0002 | 45035996273719008 | DOWN | 192.168.xx.xx
    v_xxxxxxx_node0003 | 45035996273719012 | UP | 192.168.xx.xx
    v_xxxxxxx_node0004 | 45035996273719016 | UP | 192.168.xx.xx
    v_xxxxxxx_node0005 | 45035996273719020 | UP | 192.168.xx.xx
    (5 rows)

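    ### 2. On the down node, rename the original Vertica directory to xxxxxxx_old with mv, then delete it in the background (this is to get into the recovery phase as quickly as possible). ###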
    `nohup rm -rf /Vertica/xxxxxxx_old &`
    ### 3. Recreate the directories (mind the permissions) and copy vertica.conf into the catalog directory. ###
    `mkdir -p /Vertica/xxxxxxx/v_xxxxxxx_node0002_catalog && mkdir -p /Vertica/xxxxxxx/v_xxxxxxx_node0002_data`
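Steps 2 and 3 can be combined into one helper so that nothing is deleted before vertica.conf has been saved. This is only a sketch with several assumptions that are not in the original: `rebuild_node_dirs` is a hypothetical name, vertica.conf is taken from the renamed old catalog directory (the original does not say where the copy comes from), and the owner is passed in by the caller (something like `dbadmin:verticadba` is assumed to be typical).

```shell
# rebuild_node_dirs BASE DB NODE [OWNER]
# Renames BASE/DB to BASE/DB_old, recreates empty catalog and data
# directories for NODE, copies vertica.conf back into the new catalog and,
# if OWNER is non-empty, chowns the new tree. Prints the path of the old
# directory so the caller can delete it in the background.
rebuild_node_dirs() {
    base=$1
    db=$2
    node=$3
    owner=$4
    old="$base/${db}_old"
    if [ -e "$old" ]; then
        echo "refusing to overwrite existing $old" >&2
        return 1
    fi
    mv "$base/$db" "$old" || return 1
    mkdir -p "$base/$db/${node}_catalog" "$base/$db/${node}_data" || return 1
    # Assumption: the old catalog still holds a usable vertica.conf.
    cp -p "$old/${node}_catalog/vertica.conf" "$base/$db/${node}_catalog/" || return 1
    if [ -n "$owner" ]; then
        chown -R "$owner" "$base/$db" || return 1
    fi
    echo "$old"
}
```

The caller can then remove the printed directory with the same `nohup rm -rf ... &` command used in step 2.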
    
    ### 4. From node 1, use admintools to start the down node; it enters the recovering state. ###
    

    *** Restarting hosts for database xxxxxxx ***
    restart host 192.168.xx.xx with catalog v_xxxxxxx_node0002_catalog and data v_xxxxxxx_node0002_data
    issuing multi-node restart
    Node Status: v_xxxxxxx_node0002: (DOWN)
    Node Status: v_xxxxxxx_node0002: (DOWN)
    Node Status: v_xxxxxxx_node0002: (INITIALIZING)
    Node Status: v_xxxxxxx_node0002: (RECOVERING)
    Node Status: v_xxxxxxx_node0002: (RECOVERING)
    Node Status: v_xxxxxxx_node0002: (RECOVERING)
    Node Status: v_xxxxxxx_node0002: (RECOVERING)
    Node Status: v_xxxxxxx_node0002: (RECOVERING)
    Node Status: v_xxxxxxx_node0002: (RECOVERING)
    Node Status: v_xxxxxxx_node0002: (RECOVERING)
    Nodes UP: v_xxxxxxx_node0001, v_xxxxxxx_node0003, v_xxxxxxx_node0005, v_xxxxxxx_node0004
    Nodes DOWN: v_xxxxxxx_node0002 (may be still initializing).
    result of multi-node restart: 7
    Restart Hosts result: 7
    Vertica Analytic Database 6.1.3-7 Administration Tools

    ### 5. Monitor the recovery status. ###
    

    dbadmin=> select node_name, node_id, node_state, node_address from nodes;
    node_name | node_id | node_state | node_address
    --------------------+-------------------+------------+---------------
    v_xxxxxxx_node0001 | 45035996273704980 | UP | 192.168.xx.xx
    v_xxxxxxx_node0002 | 45035996273719008 | RECOVERING | 192.168.xx.xx
    v_xxxxxxx_node0003 | 45035996273719012 | UP | 192.168.xx.xx
    v_xxxxxxx_node0004 | 45035996273719016 | UP | 192.168.xx.xx
    v_xxxxxxx_node0005 | 45035996273719020 | UP | 192.168.xx.xx
    (5 rows)

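    As before, when the down node's state changes from RECOVERING to UP, the recovery is complete.
    
    Then came another hiccup: of the 1.3 TB of data on this single node, recovery stalled at 1.2 TB.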
    

    $ df -h /Vertica/
    Filesystem Size Used Avail Use% Mounted on
    /dev/mapper/vg_vertica02-LogVol00
    3.6T 1.2T 2.3T 34% /Vertica
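A stall like this can be confirmed by sampling the used space twice and looking at the difference. A minimal sketch follows; `used_kb` and `growth_kb` are hypothetical helpers, and `df -P` is used so that each filesystem stays on one line instead of wrapping as in the output above.

```shell
# used_kb PATH: used space, in KB, of the filesystem holding PATH.
used_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $3 }'
}

# growth_kb PATH [SECONDS]: change in used space over the interval.
# A value at or near 0 while the node is still RECOVERING suggests
# that the copy has stalled.
growth_kb() {
    before=$(used_kb "$1")
    sleep "${2:-60}"
    after=$(used_kb "$1")
    echo $((after - before))
}
```

For example, `growth_kb /Vertica 60` reports how many KB were written to that filesystem in one minute.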

    
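    At the same time, dstat showed that the network copy traffic had dropped to almost nothing.
    It turned out that a data-loading job was running during the recovery, so I stopped the loading job and started the recovery again.
    In addition, given the data volume, I dropped some historical partitions of large tables before recovering in order to shorten the time. The recovery finally succeeded.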
    

    dbadmin=> select node_name, node_id, node_state, node_address from nodes;
    node_name | node_id | node_state | node_address
    --------------------+-------------------+------------+---------------
    v_xxxxxxx_node0001 | 45035996273704980 | UP | 192.168.xx.xx
    v_xxxxxxx_node0002 | 45035996273719008 | UP | 192.168.xx.xx
    v_xxxxxxx_node0003 | 45035996273719012 | UP | 192.168.xx.xx
    v_xxxxxxx_node0004 | 45035996273719016 | UP | 192.168.xx.xx
    v_xxxxxxx_node0005 | 45035996273719020 | UP | 192.168.xx.xx
    (5 rows)

  • Original article: https://www.cnblogs.com/jyzhao/p/4543555.html