  • Fixing an unstable Ceph cluster network after an SSH session disconnects (the node's Mon and OSD daemons go down on their own)

    Symptom: as soon as the SSH connection to a Ceph node is closed, the mon and osd daemons on that node go down.

    Key excerpts from /var/log/ceph/ceph.log:

    2020-07-27 17:49:01.395696 mon.ceph-node1 (mon.0) 381808 : cluster [WRN] Health check update: Reduced data availability: 1 pg inactive, 5 pgs peering (PG_AVAILABILITY)
    2020-07-27 17:49:03.369683 mon.ceph-node1 (mon.0) 381809 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg inactive, 5 pgs peering)
    2020-07-27 17:48:55.313287 mgr.ceph-node1 (mgr.6352) 266574 : cluster [DBG] pgmap v298025: 320 pgs: 22 active+undersized, 47 active+undersized+degraded, 9 peering, 242 active+clean; 53 GiB data, 759 GiB used, 11 TiB / 12 TiB avail; 0 B/s wr, 0 op/s; 2669/40779 objects degraded (6.545%); 0 B/s, 0 objects/s recovering
    2020-07-27 17:48:57.314405 mgr.ceph-node1 (mgr.6352) 266575 : cluster [DBG] pgmap v298027: 320 pgs: 44 stale+active+clean, 27 active+undersized, 51 active+undersized+degraded, 20 peering, 178 active+clean; 53 GiB data, 759 GiB used, 11 TiB / 12 TiB avail; 0 B/s wr, 0 op/s; 3051/40779 objects degraded (7.482%); 0 B/s, 0 objects/s recovering
    
    
    2020-07-27 17:51:02.089931 mon.ceph-node1 (mon.0) 382017 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum ceph-node1,ceph-node2)
    
    
    2020-07-27 17:51:02.579862 mon.ceph-node1 (mon.0) 382026 : cluster [WRN] overall HEALTH_WARN 4 osds down; 1 host (4 osds) down; Long heartbeat ping times on back interface seen, longest is 2171.403 msec; Long heartbeat ping times on front interface seen, longest is 2171.434 msec; Degraded data redundancy: 11649/40770 objects degraded (28.572%), 190 pgs degraded, 181 pgs undersized
    
    
    2020-07-27 17:52:32.565545 osd.9 (osd.9) 59 : cluster [WRN] slow request osd_op(client.6400.0:370569 3.20 3:06380552:::rbd_header.172d226df4f8:head [watch unwatch cookie 140360537903920] snapc 0=[] ondisk+write+known_if_redirected e31947) initiated 2020-07-27 17:52:01.830706 currently started
    
    
    2020-07-27 17:55:06.335968 mon.ceph-node1 (mon.0) 382428 : cluster [WRN] Health check failed: 2 slow ops, oldest one blocked for 31 sec, mon.ceph-node1 has slow ops (SLOW_OPS)
    
    2020-07-27 17:56:03.133399 osd.8 (osd.8) 25 : cluster [WRN] Monitor daemon marked osd.8 down, but it is still running
    
    [WRN]
    Health check update: Long heartbeat ping times on front interface seen, longest is 21297.249 msec (OSD_SLOW_PING_TIME_FRONT)
    
    2020-07-28 10:02:39.045969
    [WRN]
    Health check update: Long heartbeat ping times on back interface seen, longest is 21297.238 msec (OSD_SLOW_PING_TIME_BACK)
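
When the cluster log is long, it can help to see which health checks fire most often before reading individual entries. The following is a minimal sketch, assuming each entry sits on one line and ends with the health-check code in parentheses, as in the excerpts above. The function name summarize_warnings is made up for this example.

```shell
# Hypothetical helper: count cluster [WRN] entries per health-check code.
# Assumes one log entry per line, ending in a code such as (SLOW_OPS).
summarize_warnings() {
    log_file="${1:-/var/log/ceph/ceph.log}"
    grep -F '[WRN]' "$log_file" \
        | grep -oE '\([A-Z_]+\)$' \
        | tr -d '()' \
        | sort | uniq -c \
        | awk '{ print $2, $1 }'   # print "<CODE> <count>"
}
```

Calling summarize_warnings with no argument reads /var/log/ceph/ceph.log. Warnings without a trailing code, such as the "slow request" lines, are not counted by this sketch.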
    

    Running dmesg on the faulty node shows kernel messages about the hardware, which is generally useful when diagnosing device faults:

    [root@ceph-node3 ~]# dmesg -T | tail
    [Tue Jul 28 09:59:55 2020] IPv6: ADDRCONF(NETDEV_UP): ib0: link is not ready
    [Tue Jul 28 09:59:55 2020] IPv6: ADDRCONF(NETDEV_UP): ib0: link is not ready
    [Tue Jul 28 09:59:55 2020] IPv6: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
    [Tue Jul 28 10:06:34 2020] IPv6: ADDRCONF(NETDEV_UP): em2: link is not ready
    [Tue Jul 28 10:06:34 2020] IPv6: ADDRCONF(NETDEV_UP): em3: link is not ready
    [Tue Jul 28 10:06:34 2020] IPv6: ADDRCONF(NETDEV_UP): em4: link is not ready
    [Tue Jul 28 10:06:34 2020] IPv6: ADDRCONF(NETDEV_UP): ib1: link is not ready
    [Tue Jul 28 10:10:29 2020] IPv6: ADDRCONF(NETDEV_UP): ib0: link is not ready
    [Tue Jul 28 10:10:29 2020] IPv6: ADDRCONF(NETDEV_UP): ib0: link is not ready
    [Tue Jul 28 10:10:29 2020] IPv6: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
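
To see at a glance which interface is flapping, the "link is not ready" messages can be counted per interface. This is a rough sketch that assumes the message layout shown in the dmesg output above. The function name link_not_ready_counts is invented for illustration.

```shell
# Hypothetical helper: read dmesg-style lines on stdin and print
# "<interface> <count>" for every "link is not ready" message.
link_not_ready_counts() {
    grep -F 'link is not ready' \
        | sed -E 's/.*\): ([^:]+): link is not ready.*/\1/' \
        | sort | uniq -c \
        | awk '{ print $2, $1 }'
}
```

It would be used as `dmesg -T | link_not_ready_counts`. In the ten lines shown above, ib0 is the only interface reported more than once.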
    

    Comparing this node's interface configuration file with the one on the other Ceph nodes showed that one parameter did not match:

    vim /etc/sysconfig/network-scripts/ifcfg-ib0

    CONNECTED_MODE=no
    TYPE=InfiniBand
    PROXY_METHOD=none
    BROWSER_ONLY=no
    BOOTPROTO=static
    DEFROUTE=yes
    IPV4_FAILURE_FATAL=no
    IPV6INIT=yes
    IPV6_AUTOCONF=yes
    IPV6_DEFROUTE=yes
    IPV6_FAILURE_FATAL=no
    IPV6_ADDR_GEN_MODE=stable-privacy
    NAME=ib0
    UUID=2ab4abde-b8a5-6cbc-19b1-2bfb193e4e89
    DEVICE=ib0
    ONBOOT=yes
    IPADDR=10.0.0.20
    NETMASK=255.255.255.0
    #USERS=ROOT	// extra parameter that the other nodes do not have, so it was removed
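
The comparison between nodes can be done by eye, but a small script makes stray keys easier to spot. The sketch below prints the keys that are set in one ifcfg file and missing from another. The function name ifcfg_extra_keys and the idea of first copying a healthy node's file to a local path are assumptions made for this example.

```shell
# Hypothetical helper: print keys set in the suspect file ($1)
# that are not set in the reference file ($2).
ifcfg_extra_keys() {
    awk -F= '
        # Ignore comments and anything that is not a KEY=value line
        !/^[A-Za-z_][A-Za-z0-9_]*=/ { next }
        # First file on the command line is the reference: remember its keys
        FILENAME == ARGV[1] { reference[$1] = 1; next }
        # Second file is the suspect: report keys the reference lacks
        !($1 in reference) { print $1 }
    ' "$2" "$1"
}
```

For example, after copying ifcfg-ib0 from a healthy node to /tmp/ifcfg-ib0.ref, running `ifcfg_extra_keys /etc/sysconfig/network-scripts/ifcfg-ib0 /tmp/ifcfg-ib0.ref` lists only the keys unique to the local file. Values such as IPADDR and UUID are expected to differ between nodes, which is why only key names are compared.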
    

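    After the change, restarting the network service and the NetworkManager service cleared the fault described above, and dmesg no longer shows new errors. What the USERS=ROOT parameter actually does is still unclear. One plausible explanation, not verified here: NetworkManager's ifcfg-rh plugin maps USERS to the connection.permissions property, which restricts a connection to the listed users and keeps it active only while one of them has an active login session. That would match the link dropping as soon as the SSH session ends.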
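
The restart step could be wrapped in a small script so that it runs the same way on every node. This is only a sketch: the service names network and NetworkManager come from the description above, while the run and restart_networking helpers and the DRY_RUN variable are made up for this example.

```shell
# Hypothetical wrapper: with DRY_RUN=1 the command is printed, not executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Restart both services in the order mentioned in the text.
restart_networking() {
    run systemctl restart network
    run systemctl restart NetworkManager
}
```

Running `DRY_RUN=1 restart_networking` first shows what would be executed. Restarting networking over SSH can drop the session, so a console connection is the safer place to run it. Afterwards, `dmesg -T | tail` and `ceph -s` can be used to confirm that the link and the cluster have settled.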

  • Original article: https://www.cnblogs.com/ashjo009/p/13391587.html