  • Handling a Ceph PG with unfound objects: a walkthrough

    While checking the Ceph cluster today I found a PG with a lost object, which prompted this write-up.

    1. Check the cluster status

    [root@k8snode001 ~]# ceph health detail
    HEALTH_ERR 1/973013 objects unfound (0.000%); 17 scrub errors; Possible data damage: 1 pg recovery_unfound, 8 pgs inconsistent, 1 pg repair; Degraded data redundancy: 1/2919039 objects degraded (0.000%), 1 pg degraded
    OBJECT_UNFOUND 1/973013 objects unfound (0.000%)
        pg 2.2b has 1 unfound objects
    OSD_SCRUB_ERRORS 17 scrub errors
    PG_DAMAGED Possible data damage: 1 pg recovery_unfound, 8 pgs inconsistent, 1 pg repair
        pg 2.2b is active+recovery_unfound+degraded, acting [14,22,4], 1 unfound
        pg 2.44 is active+clean+inconsistent, acting [14,8,21]
        pg 2.73 is active+clean+inconsistent, acting [25,14,8]
        pg 2.80 is active+clean+scrubbing+deep+inconsistent+repair, acting [4,8,14]
        pg 2.83 is active+clean+inconsistent, acting [14,13,6]
        pg 2.ae is active+clean+inconsistent, acting [14,3,2]
        pg 2.c4 is active+clean+inconsistent, acting [8,21,14]
        pg 2.da is active+clean+inconsistent, acting [23,14,15]
        pg 2.fa is active+clean+inconsistent, acting [14,23,25]
    PG_DEGRADED Degraded data redundancy: 1/2919039 objects degraded (0.000%), 1 pg degraded
        pg 2.2b is active+recovery_unfound+degraded, acting [14,22,4], 1 unfound
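
    On a long health report it can help to filter out the affected PGs rather than read them by eye. The following is a minimal sketch that assumes the ceph health detail line format shown above; unfound_pgs is a hypothetical helper name, not a Ceph command.

```shell
# Print the id of every PG that reports unfound objects.
# Reads `ceph health detail` text on stdin and matches lines such as
#   "pg 2.2b has 1 unfound objects"
unfound_pgs() {
    awk '$1 == "pg" && $3 == "has" && $5 == "unfound" { print $2 }'
}

# Usage (on a node with the ceph CLI):
#   ceph health detail | unfound_pgs
```

    The filter only looks at the OBJECT_UNFOUND lines, so the PGs listed under PG_DAMAGED are not printed twice.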
    

    The output shows that pg 2.2b is active+recovery_unfound+degraded, acting [14,22,4], with 1 unfound object.
    Next, look at pg 2.2b in more detail.

    [root@k8snode001 ~]# ceph pg dump_json pools    |grep 2.2b
    dumped all
    2.2b       2487                  1        1         0       1  9533198403 3048     3048                active+recovery_unfound+degraded 2020-07-23 08:56:07.669903  10373'5448370  10373:7312614  [14,22,4]         14  [14,22,4]             14  10371'5437258 2020-07-23 08:56:06.637012   10371'5437258 2020-07-23 08:56:06.637012             0
    

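    The counters in this row show that one object in the PG is degraded and unfound, so it no longer has its full set of replicas.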
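
    The tabular row above is hard to read without its header. As a rough sketch, the helper below labels the first few counters for one PG. The column order (objects, missing on primary, degraded, misplaced, unfound in fields 2 to 6) is an assumption based on the sample row and may differ between Ceph releases; pg_counters is a hypothetical name. Matching on the first field also avoids the loose match of grep 2.2b, where the dot matches any character.

```shell
# Label the object counters for one PG.
# Reads `ceph pg dump`-style rows on stdin; $1 is the PG id to look up.
# ASSUMPTION: fields 2..6 are OBJECTS, MISSING_ON_PRIMARY, DEGRADED,
# MISPLACED, UNFOUND, as in the sample row; check the header on your release.
pg_counters() {
    awk -v pg="$1" '$1 == pg {
        printf "objects=%s missing_on_primary=%s degraded=%s misplaced=%s unfound=%s\n",
               $2, $3, $4, $5, $6
    }'
}

# Usage:
#   ceph pg dump 2>/dev/null | pg_counters 2.2b
```

    For the row shown above this prints objects=2487 missing_on_primary=1 degraded=1 misplaced=0 unfound=1.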

    2. Check the PG map

    [root@k8snode001 ~]# ceph pg map 2.2b
    osdmap e10373 pg 2.2b (2.2b) -> up [14,22,4] acting [14,22,4]
    

    The PG map shows that pg 2.2b is mapped to OSDs [14,22,4].
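
    If the OSD ids are needed for follow-up checks, they can be split out of the map line. This is a small sketch assuming the ceph pg map output format shown above; acting_osds is a hypothetical helper name.

```shell
# Print the acting OSD ids of a PG, one per line.
# Reads a `ceph pg map <pgid>` line on stdin, for example:
#   osdmap e10373 pg 2.2b (2.2b) -> up [14,22,4] acting [14,22,4]
acting_osds() {
    sed -n 's/.*acting \[\([0-9,]*\)].*/\1/p' | tr ',' '\n'
}

# Usage:
#   ceph pg map 2.2b | acting_osds
```

    The first id printed is the primary OSD of the acting set, 14 in this case.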

    3. Check the pool status

    [root@k8snode001 ~]# ceph osd pool stats k8s-1
    pool k8s-1 id 2
      1/1955664 objects degraded (0.000%)
      1/651888 objects unfound (0.000%)
      client io 271 KiB/s wr, 0 op/s rd, 52 op/s wr
    
    [root@k8snode001 ~]# ceph osd pool ls detail|grep k8s-1
    pool 2 'k8s-1' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 88 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
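
    The pool line carries the replication settings: size is the number of replicas and min_size is the number that must be available for I/O to continue. The sketch below pulls both values out of a ceph osd pool ls detail line; pool_replication is a hypothetical helper name, and the parsing assumes the key/value layout shown above.

```shell
# Print the replica count and min_size of each pool line read on stdin.
# Expects lines like:
#   pool 2 'k8s-1' replicated size 3 min_size 1 crush_rule 0 ...
pool_replication() {
    awk '{
        size = ""; min = ""
        for (i = 1; i < NF; i++) {
            if ($i == "size")     size = $(i + 1)
            if ($i == "min_size") min  = $(i + 1)
        }
        print "size=" size " min_size=" min
    }'
}

# Usage:
#   ceph osd pool ls detail | grep k8s-1 | pool_replication
```

    For the pool above this prints size=3 min_size=1, meaning the pool keeps accepting writes with a single replica online, which is worth keeping in mind when working out how an object lost its copies.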
    

    4. Try to recover the lost object in pg 2.2b

    [root@k8snode001 ~]# ceph pg repair 2.2b
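
    ceph pg repair is also the command Ceph documents for PGs that scrubbing has flagged as inconsistent, and the health report in step 1 lists several of those. As a sketch, assuming the same ceph health detail line format, the hypothetical helper below lists them so each one can be passed to ceph pg repair. The original write-up does not say how those PGs were handled, so treat this as an illustration only.

```shell
# Print the ids of PGs whose state includes "inconsistent".
# Reads `ceph health detail` text on stdin and matches lines such as
#   "pg 2.44 is active+clean+inconsistent, acting [14,8,21]"
inconsistent_pgs() {
    awk '$1 == "pg" && $3 == "is" && $4 ~ /inconsistent/ { print $2 }' | sort -u
}

# Usage:
#   ceph health detail | inconsistent_pgs
```

    sort -u removes duplicates in case the same PG appears under more than one health check.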
    

    If repairing pg 2.2b keeps failing, query the stuck PG for details, paying particular attention to recovery_state:

    [root@k8snode001 ~]# ceph pg 2.2b  query
    {
        "......
        "recovery_state": [
            {
                "name": "Started/Primary/Active",
                "enter_time": "2020-07-21 14:17:05.855923",
                "might_have_unfound": [],
                "recovery_progress": {
                    "backfill_targets": [],
                    "waiting_on_backfill": [],
                    "last_backfill_started": "MIN",
                    "backfill_info": {
                        "begin": "MIN",
                        "end": "MIN",
                        "objects": []
                    },
                    "peer_backfill_info": [],
                    "backfills_in_flight": [],
                    "recovering": [],
                    "pg_backend": {
                        "pull_from_peer": [],
                        "pushing": []
                    }
                },
                "scrub": {
                    "scrubber.epoch_start": "10370",
                    "scrubber.active": false,
                    "scrubber.state": "INACTIVE",
                    "scrubber.start": "MIN",
                    "scrubber.end": "MIN",
                    "scrubber.max_end": "MIN",
                    "scrubber.subset_last_update": "0'0",
                    "scrubber.deep": false,
                    "scrubber.waiting_on_whom": []
                }
            },
            {
                "name": "Started",
                "enter_time": "2020-07-21 14:17:04.814061"
            }
        ],
        "agent_state": {}
    }
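
    Within recovery_state, the might_have_unfound list names the OSDs that Ceph thinks may still hold the missing object. In the output above it is empty, which suggests there is nowhere left to look. The sketch below is a crude text check for that case. It assumes the field sits on a single line as in the output above, and might_have_unfound_status is a hypothetical helper name; a JSON-aware tool would be more robust.

```shell
# Report whether "might_have_unfound" in `ceph pg <pgid> query` output is empty.
# Prints one of: empty, non-empty, absent.
# ASSUMPTION: an empty list is printed on one line as  "might_have_unfound": [],
might_have_unfound_status() {
    awk '/"might_have_unfound"/ {
             if ($0 ~ /\[\][ \t]*,?[ \t]*$/) print "empty"; else print "non-empty"
             found = 1
             exit
         }
         END { if (!found) print "absent" }'
}

# Usage:
#   ceph pg 2.2b query | might_have_unfound_status
```

    A non-empty result would be a reason to inspect the listed OSDs before giving up on the object.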
    

    If repair cannot fix it, there are two options: revert the object to its previous version, or delete it outright.

    5. Solution

    # Revert to the previous version
    [root@k8snode001 ~]# ceph pg  2.2b  mark_unfound_lost revert
    # Delete outright
    [root@k8snode001 ~]# ceph pg  2.2b  mark_unfound_lost delete
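
    Both commands give up on the unfound object and cannot be undone. Per the Ceph documentation, revert rolls the object back to a previous version (or forgets it if it was newly created) and is not available on erasure-coded pools, while delete forgets the object entirely. Because of that, a dry-run wrapper can be a useful guard. The sketch below is one way to write it; mark_unfound and the APPLY variable are hypothetical names introduced for illustration.

```shell
# Guarded wrapper around `ceph pg <pgid> mark_unfound_lost <mode>`.
# Prints the command by default; runs it only when APPLY=1 is set.
mark_unfound() {
    pg="$1"
    mode="$2"
    case "$mode" in
        revert|delete) ;;
        *) echo "mode must be revert or delete" >&2; return 2 ;;
    esac
    if [ "${APPLY:-0}" = "1" ]; then
        ceph pg "$pg" mark_unfound_lost "$mode"
    else
        echo "dry run: ceph pg $pg mark_unfound_lost $mode"
    fi
}

# Usage:
#   mark_unfound 2.2b revert            # only prints the command
#   APPLY=1 mark_unfound 2.2b revert    # actually runs it
```

    Validating the mode first means a typo fails before anything is sent to the cluster.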
    

    6. Verify

    I chose to delete the object. Ceph then rebuilt the PG, and after a short wait its state changed to active+clean.

    [root@k8snode001 ~]#  ceph pg  2.2b query
    {
        "state": "active+clean",
        "snap_trimq": "[]",
        "snap_trimq_len": 0,
        "epoch": 11069,
        "up": [
            12,
            22,
            4
        ],
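
    Rather than re-running the query by hand, the state can be extracted and polled. The sketch below assumes the top-level "state" field is the first "state" key in the query output, as in the excerpt above; pg_state is a hypothetical helper name.

```shell
# Print the first "state" value found in `ceph pg <pgid> query` output.
# ASSUMPTION: the PG's own state is the first "state" key in the JSON.
pg_state() {
    sed -n 's/^[[:space:]]*"state": "\([^"]*\)".*/\1/p' | head -n 1
}

# Usage: wait until the PG is clean again, checking every 10 seconds.
#   until [ "$(ceph pg 2.2b query | pg_state)" = "active+clean" ]; do
#       sleep 10
#   done
```

    The loop has no timeout, so in a script it would be sensible to cap the number of attempts.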
    

    Check the cluster status again:

    [root@k8snode001 ~]# ceph health detail
    HEALTH_OK
    
  • Original article: https://www.cnblogs.com/scofield666/p/14330174.html