  • Handling PG inconsistency errors in a Ceph cluster

    PG inconsistency error handling

    1 scrub errors; Possible data damage: 1 pg inconsistent

     HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
    OSD_SCRUB_ERRORS 1 scrub errors
    PG_DAMAGED Possible data damage: 1 pg inconsistent
        pg 1.7fff is active+clean+scrubbing+deep+inconsistent+repair, acting [184,229]
    

    Summary of the error

    • Problem PG: 1.7fff
    • OSD IDs: 184, 229
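
    The PG ID and its acting OSDs can also be pulled straight out of the cluster status with a couple of shell one-liners (a sketch, assuming GNU grep and the health detail / pg map output formats shown in this post):

      # list the PGs currently flagged as inconsistent (assumes the "pg <id> is ... inconsistent" wording above)
      ceph health detail | grep -oP 'pg \K\S+(?= is .*inconsistent)'
      # show the up/acting OSD sets of the problem PG
      ceph pg map 1.7fff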

    Repair steps

    1. Run the standard repair

      ceph pg repair 1.7fff

    2. Check the repair result

      ceph health detail

      HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
      OSD_SCRUB_ERRORS 1 scrub errors
      PG_DAMAGED Possible data damage: 1 pg inconsistent
      pg 1.7fff is active+clean+scrubbing+deep+inconsistent+repair, acting [184,229]

      The error is still reported. Note that the PG state above still includes scrubbing+deep+...+repair, so the repair scrub has been queued but has not finished yet.
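
      While the repair is in flight, you can list the objects that triggered the scrub error from the results of the last deep scrub (a sketch; rados list-inconsistent-obj only has data once a deep scrub of the PG has completed):

      # show which objects/shards disagree between the replicas of this PG
      rados list-inconsistent-obj 1.7fff --format=json-pretty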

    3. Watch the cluster activity

      ceph -w

      2020-09-05 09:13:25.818257 osd.184 [ERR] 1.7fff repair : stat mismatch, got 9855/9856 objects, 0/0 clones, 9855/9856 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 41285080957/41289275261 bytes, 0/0 hit_set_archive bytes.
      2020-09-05 09:13:25.818757 osd.184 [ERR] 1.7fff repair 1 errors, 1 fixed
      2020-09-05 09:13:31.318617 mon.cb-mon-38 [INF] Health check cleared: OSD_SCRUB_ERRORS (was: 1 scrub errors)
      2020-09-05 09:13:31.321338 mon.cb-mon-38 [INF] Health check cleared: PG_DAMAGED (was: Possible data damage: 1 pg inconsistent)
      2020-09-05 09:13:31.321983 mon.cb-mon-38 [INF] Cluster is now healthy
      2020-09-05 10:00:00.001158 mon.cb-mon-38 [INF] overall HEALTH_OK
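
      ceph -w streams every cluster log entry, which is noisy on a large cluster; a small filter keeps only the lines relevant to this incident (a sketch, assuming GNU grep):

      # follow only the problem PG and the two health checks involved
      ceph -w | grep --line-buffered -E '1\.7fff|OSD_SCRUB_ERRORS|PG_DAMAGED'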
      

    Other repair methods

    1. Scrub the PG, then repair it:

    ceph pg scrub 1.7fff  
    ceph pg deep-scrub  1.7fff
    ceph pg repair 1.7fff
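
    Note that these commands only queue the operations; scrub, deep-scrub and repair all run asynchronously. One way to check whether the deep scrub has actually run is to compare the scrub timestamps in the PG query output (a sketch; the field names are those found in a typical ceph pg query dump):

    # timestamps of the last scrub / deep scrub of the PG
    ceph pg 1.7fff query | grep -E 'last_(deep_)?scrub_stamp'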
    

    2. Repair the associated OSDs

    ceph osd repair 184
    ceph osd repair 229
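
    Instead of typing the OSD IDs by hand, the acting set can be read from ceph pg map and looped over (a minimal sketch, assuming GNU grep and the "acting [184,229]" format printed by ceph pg map):

    # repair every OSD in the acting set of the problem PG
    # (assumes ceph pg map prints "... acting [184,229]")
    for osd in $(ceph pg map 1.7fff | grep -oP 'acting \[\K[0-9,]+' | tr ',' ' '); do
        ceph osd repair "$osd"
    done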
    

    3. Stop the PG's primary OSD

    • Find the PG's primary OSD
    root@manager1:~# ceph pg 1.7fff query|grep primary
                "same_primary_since": 1070,
                    "num_objects_missing_on_primary": 0,
                "up_primary": 184,
                "acting_primary": 184
                    "same_primary_since": 0,
                        "num_objects_missing_on_primary": 0,
                    "up_primary": -1,
                    "acting_primary": -1
    
    • Find the host the OSD lives on

      root@manager1:~# ceph osd tree|grep -B25 184
      -41        218.29431     host cc-d-19
       19   hdd    9.09560         osd.19       up  1.00000 1.00000 
       39   hdd    9.09560         osd.39       up  1.00000 1.00000 
       52   hdd    9.09560         osd.52       up  1.00000 1.00000 
       70   hdd    9.09560         osd.70       up  1.00000 1.00000 
       87   hdd    9.09560         osd.87       up  1.00000 1.00000 
      106   hdd    9.09560         osd.106      up  1.00000 1.00000 
      130   hdd    9.09560         osd.130      up  1.00000 1.00000 
      151   hdd    9.09560         osd.151      up  1.00000 1.00000 
      164   hdd    9.09560         osd.164      up  1.00000 1.00000 
      184   hdd    9.09560         osd.184      up  1.00000 1.00000 
      
    • Stop the corresponding OSD service [data recovery will be slow, and cluster performance will also be affected]

      systemctl stop ceph-osd@184
    
    • Once recovery has finished, run the repair on the PG again (one way to wait for recovery to complete is sketched below).

      ceph pg repair 1.7fff
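
    A rough way to wait for the recovery triggered by the stopped OSD to drain before re-issuing the repair (a sketch that simply polls the PG summary until no degraded, recovering or backfilling PGs remain):

      # poll the cluster until backfill/recovery has finished, then repair the PG
      # (assumes the degraded/recovery states show up in the ceph pg stat summary line)
      while ceph pg stat | grep -qE 'degraded|recover|backfill'; do
          sleep 60
      done
      ceph pg repair 1.7fff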
