  • An experiment in modifying the CRUSH map

    https://www.cnblogs.com/sisimi/p/7799980.html

    CRUSH stands for Controlled Replication Under Scalable Hashing. It is the distributed placement algorithm behind Ceph's data storage and the core of the Ceph storage engine. When a Ceph client reads or writes data, it computes the storage location dynamically, so Ceph does not have to maintain a central metadata lookup service, which improves performance.

    Ceph's distributed storage revolves around three key Rs: Replication, Recovery, and Rebalancing. When a component fails, Ceph marks the affected OSD down, waits 300 seconds by default, then marks it out and initiates recovery. This grace period is set by the mon_osd_down_out_interval parameter in the cluster configuration file.
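    For example, the grace period could be shortened to 120 seconds (an illustrative value, not a recommendation) in the [mon] section of ceph.conf:

```ini
[mon]
# seconds a down OSD waits before being marked out and recovery starts
mon_osd_down_out_interval = 120
```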

    When a new host or disk joins the cluster, CRUSH starts rebalancing: it migrates data from the existing hosts and disks onto the new ones, spreading load across all disks to improve cluster performance. If the cluster is in heavy use, the recommended practice is to add the new disk with a CRUSH weight of 0 and raise the weight gradually, so that data migrates slowly and performance is not hurt. This is good practice when expanding any distributed storage system.
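    The gradual-weight approach can be sketched as a small helper (the OSD id, the weight steps, and the pause length below are all illustrative; in practice you would tune them to how fast your cluster rebalances):

```shell
# Raise a new OSD's CRUSH weight in small steps so rebalancing stays gentle.
gradual_reweight() {
    local osd="$1"; shift
    local w
    for w in "$@"; do
        ceph osd crush reweight "$osd" "$w"   # move a little more data each step
        sleep 600                             # let the cluster settle in between
    done
}
# usage on a live cluster (hypothetical new osd.6):
# gradual_reweight osd.6 0.002 0.005 0.008 0.00980
```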

    In practice you will often need to adjust the cluster layout. The default CRUSH layout is minimal: running ceph osd tree shows only the host and osd bucket types under root. That default is poor for failure isolation, because it has no notion of rack, row, or room. Below we introduce one more bucket type, rack, and place every host under a rack.

    (1) Run ceph osd tree to see the current cluster layout:

    [root@node3 ~]# ceph osd tree
    ID  CLASS WEIGHT  TYPE NAME      STATUS REWEIGHT PRI-AFF 
     -1       0.05878 root default                           
     -3       0.01959     host node1                         
      0   hdd 0.00980         osd.0      up  1.00000 1.00000 
      3   hdd 0.00980         osd.3      up  1.00000 1.00000 
     -5       0.01959     host node2                         
      1   hdd 0.00980         osd.1      up  1.00000 1.00000 
      4   hdd 0.00980         osd.4      up  1.00000 1.00000 
     -7       0.01959     host node3                         
      2   hdd 0.00980         osd.2      up  1.00000 1.00000 
      5   hdd 0.00980         osd.5      up  1.00000 1.00000 

    (2) Add the racks:

    [root@node3 ~]# ceph osd crush add-bucket rack03 rack
    added bucket rack03 type rack to crush map
    [root@node3 ~]# ceph osd crush add-bucket rack01 rack
    added bucket rack01 type rack to crush map
    [root@node3 ~]# ceph osd crush add-bucket rack02 rack
    added bucket rack02 type rack to crush map

    (3) Move the hosts under the racks:

    [root@node3 ~]# ceph osd crush move node1 rack=rack01
    moved item id -3 name 'node1' to location {rack=rack01} in crush map
    [root@node3 ~]# ceph osd crush move node2 rack=rack02
    moved item id -5 name 'node2' to location {rack=rack02} in crush map
    [root@node3 ~]# ceph osd crush move node3 rack=rack03
    moved item id -7 name 'node3' to location {rack=rack03} in crush map

    (4) Move the racks under the default root:

    [root@node3 ~]# ceph osd crush move rack01 root=default
    moved item id -9 name 'rack01' to location {root=default} in crush map
    [root@node3 ~]# ceph osd crush move rack02 root=default
    moved item id -10 name 'rack02' to location {root=default} in crush map
    [root@node3 ~]# ceph osd crush move rack03 root=default
    moved item id -11 name 'rack03' to location {root=default} in crush map

    (5) Run ceph osd tree again:

    [root@node3 ~]# ceph osd tree
    ID  CLASS WEIGHT  TYPE NAME          STATUS REWEIGHT PRI-AFF 
     -1       0.05878 root default                               
     -9       0.01959     rack rack01                            
     -3       0.01959         host node1                         
      0   hdd 0.00980             osd.0      up  1.00000 1.00000 
      3   hdd 0.00980             osd.3      up  1.00000 1.00000 
    -10       0.01959     rack rack02                            
     -5       0.01959         host node2                         
      1   hdd 0.00980             osd.1      up  1.00000 1.00000 
      4   hdd 0.00980             osd.4      up  1.00000 1.00000 
    -11       0.01959     rack rack03                            
     -7       0.01959         host node3                         
      2   hdd 0.00980             osd.2      up  1.00000 1.00000 
      5   hdd 0.00980             osd.5      up  1.00000 1.00000 

    The new layout is in place: every host now sits under a specific rack. That completes the adjustment of the CRUSH layout.

    For any given object, you can use the CRUSH mapping to find out where it is stored. For example, store a file test.txt in the data pool:

    [root@node3 ~]# echo "this is test! ">>test.txt
    [root@node3 ~]# rados -p data ls
    [root@node3 ~]# rados -p data put test.txt test.txt 
    [root@node3 ~]# rados -p data ls
    test.txt

    Show where it is stored:

    [root@node3 ~]# ceph osd map data test.txt 
    osdmap e42 pool 'data' (1) object 'test.txt' -> pg 1.8b0b6108 (1.8) -> up ([3,4,2], p3) acting ([3,4,2], p3)
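    Conceptually, the pg id shown here comes from hashing the object name and folding the hash into one of the pool's pg_num placement groups. Ceph uses the rjenkins1 hash for this step; the sketch below substitutes cksum purely to illustrate the name-to-hash-to-pg idea, so its pg numbers will not match Ceph's:

```shell
# Illustration only: Ceph hashes the object name with rjenkins1 and folds the
# hash into one of the pool's placement groups; cksum stands in for the hash.
object_to_pg() {
    local pool_id="$1" name="$2" pg_num="$3"
    local h
    h=$(printf '%s' "$name" | cksum | cut -d' ' -f1)
    printf '%s.%x\n' "$pool_id" "$((h % pg_num))"
}
object_to_pg 1 test.txt 8   # pool 1, one of pgs 1.0 .. 1.7
```

The real mapping then goes one step further: CRUSH maps the pg onto a set of OSDs, the [3,4,2] shown in the output above.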

    The CRUSH map reflects the cluster's storage architecture, and in practice it often needs adjusting. Below we dump it from the cluster and decompile it into plain text for inspection. (Note that the decompile step here overwrites the binary dump with the text version; writing to a separate file such as crushmap.txt avoids that.)

    [root@node3 ~]# ceph osd getcrushmap -o crushmap
    22
    [root@node3 ~]# crushtool -d crushmap -o crushmap
    [root@node3 ~]# cat crushmap 
    # begin crush map
    tunable choose_local_tries 0
    tunable choose_local_fallback_tries 0
    tunable choose_total_tries 50
    tunable chooseleaf_descend_once 1
    tunable chooseleaf_vary_r 1
    tunable chooseleaf_stable 1
    tunable straw_calc_version 1
    tunable allowed_bucket_algs 54
    
    # devices
    device 0 osd.0 class hdd
    device 1 osd.1 class hdd
    device 2 osd.2 class hdd
    device 3 osd.3 class hdd
    device 4 osd.4 class hdd
    device 5 osd.5 class hdd
    
    # types
    type 0 osd
    type 1 host
    type 2 chassis
    type 3 rack
    type 4 row
    type 5 pdu
    type 6 pod
    type 7 room
    type 8 datacenter
    type 9 region
    type 10 root
    
    # buckets
    host node1 {
            id -3       # do not change unnecessarily
            id -4 class hdd         # do not change unnecessarily
            # weight 0.020
            alg straw2
            hash 0  # rjenkins1
            item osd.0 weight 0.010
            item osd.3 weight 0.010
    }
    rack rack01 {
            id -9       # do not change unnecessarily
            id -14 class hdd                # do not change unnecessarily
            # weight 0.020
            alg straw2
            hash 0  # rjenkins1
            item node1 weight 0.020
    }
    host node2 {
            id -5       # do not change unnecessarily
            id -6 class hdd         # do not change unnecessarily
            # weight 0.020
            alg straw2
            hash 0  # rjenkins1
            item osd.1 weight 0.010
            item osd.4 weight 0.010
    }
    rack rack02 {
            id -10      # do not change unnecessarily
            id -13 class hdd                # do not change unnecessarily
            # weight 0.020
            alg straw2
            hash 0  # rjenkins1
            item node2 weight 0.020
    }
    host node3 {
            id -7       # do not change unnecessarily
            id -8 class hdd         # do not change unnecessarily
            # weight 0.020
            alg straw2
            hash 0  # rjenkins1
            item osd.2 weight 0.010
            item osd.5 weight 0.010
    }
    rack rack03 {
            id -11      # do not change unnecessarily
            id -12 class hdd                # do not change unnecessarily
            # weight 0.020
            alg straw2
            hash 0  # rjenkins1
            item node3 weight 0.020
    }
    root default {
            id -1       # do not change unnecessarily
            id -2 class hdd         # do not change unnecessarily
            # weight 0.059
            alg straw2
            hash 0  # rjenkins1
            item rack01 weight 0.020
            item rack02 weight 0.020
            item rack03 weight 0.020
    }
    
    # rules
    rule replicated_rule {
            id 0
            type replicated
            min_size 1
            max_size 10
            step take default
            step chooseleaf firstn 0 type host
            step emit
    }
    
    # end crush map

    The file consists of several sections; briefly:

    • Devices: the content after # devices. This lists the cluster's OSDs. The list updates automatically whenever OSDs are added or removed; you normally do not need to edit it, since Ceph maintains it for you.

    • Bucket types: the content after # types. This defines the bucket types, including root, datacenter, room, row, rack, host, and osd. The default types are sufficient for most Ceph clusters, but you can add your own.

    • Bucket definitions: the content after # buckets. This defines the bucket hierarchy, as well as the selection algorithm each bucket uses.

    • Rules: the content after # rules. These decide which buckets a pool's data is placed in. Larger clusters typically have many pools, each with its own placement rule.
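    If you do add a custom bucket type, it is appended in the # types section of the decompiled map. For example, a hypothetical zone level could look like this (the type id just needs to be unused; buckets and rules refer to the type by name):

```
# types
type 0 osd
type 1 host
...
type 10 root
type 11 zone    # custom type added after the defaults
```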

    A practical scenario for CRUSH map editing: define a pool named ssd that uses SSD disks for better performance, and a pool named sata that uses SATA disks for lower cost. Assume three Ceph storage nodes, each running its own OSDs. (In this lab all OSDs are actually hdd-class; rack02 stands in for the SSD devices and rack03 for the SATA devices.)

    First, modify root default in the CRUSH map file to:

    root default {
            id -1           # do not change unnecessarily
            id -2 class hdd         # do not change unnecessarily
            # weight 0.059
            alg straw2
            hash 0  # rjenkins1
            item rack01 weight 0.020
    }

    The key change is the item list: the lines item rack02 weight 0.020 and item rack03 weight 0.020 are removed from root default.

    Then add the following:

    root ssd {
            id -15
            alg straw
            hash 0
            item rack02 weight 0.020
    
    } 
    root sata {
            id -16
            alg straw
            hash 0
            item rack03 weight 0.020
    
    }
    
    # rules
    rule replicated_rule {
            id 0
            type replicated
            min_size 1
            max_size 10
            step take default
            step chooseleaf firstn 0 type host
            step emit
    }
    rule ssd-pool {
            ruleset 1
            type replicated
            min_size 1
            max_size 10
            step take ssd
            step chooseleaf firstn 0 type osd
            step emit
    }
    
    rule sata-pool {
            ruleset 2
            type replicated
            min_size 1
            max_size 10
            step take sata
            step chooseleaf firstn 0 type osd
            step emit

    In the ruleset 2 rule, step take sata means placement starts from the sata bucket, so only OSDs under that root are selected; likewise, in the ruleset 1 rule, step take ssd starts from the ssd bucket. The main thing to watch when hand-editing is that bucket ids must not collide.
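    Before uploading an edited map, it can be worth simulating placements for each rule with crushtool --test, which computes mappings offline without touching the cluster (the rule number and replica count below are examples):

```shell
# Simulate placements for one CRUSH rule against a compiled map (offline).
check_rule() {
    local map="$1" rule="$2" replicas="$3"
    crushtool -i "$map" --test --rule "$rule" --num-rep "$replicas" --show-mappings
}
# usage: check_rule crushmap.new 1 2
```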

    Compile the file and upload it to the cluster:

    [root@node3 ~]# crushtool -c crushmap -o crushmap.new
    [root@node3 ~]# ceph osd setcrushmap -i crushmap.new
    23

    Check the cluster layout again:

    [root@node3 ~]# ceph osd tree
    ID  CLASS WEIGHT  TYPE NAME          STATUS REWEIGHT PRI-AFF 
    -16       0.01999 root sata                                  
    -11       0.01999     rack rack03                            
     -7       0.01999         host node3                         
      2   hdd 0.00999             osd.2      up  1.00000 1.00000 
      5   hdd 0.00999             osd.5      up  1.00000 1.00000 
    -15       0.01999 root ssd                                   
    -10       0.01999     rack rack02                            
     -5       0.01999         host node2                         
      1   hdd 0.00999             osd.1      up  1.00000 1.00000 
      4   hdd 0.00999             osd.4      up  1.00000 1.00000 
     -1       0.01999 root default                               
     -9       0.01999     rack rack01                            
     -3       0.01999         host node1                         
      0   hdd 0.00999             osd.0      up  1.00000 1.00000 
      3   hdd 0.00999             osd.3      up  1.00000 1.00000 

    Next, watch ceph -s until the cluster health is OK. Once it is, create two pools:

    [root@node3 ~]# ceph osd pool create sata 64 64
    pool 'sata' created
    [root@node3 ~]# ceph osd pool create ssd 64 64
    pool 'ssd' created

    Assign CRUSH rules to the two newly created pools:

    [root@node3 ~]# ceph osd pool set sata crush_rule sata-pool
    set pool 2 crush_rule to sata-pool
    [root@node3 ~]# ceph osd pool set ssd crush_rule ssd-pool
    set pool 3 crush_rule to ssd-pool

    Check whether the rules took effect:

    [root@node3 ~]# ceph osd dump |egrep -i "ssd|sata"
    pool 2 'sata' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 55 flags hashpspool stripe_width 0
    pool 3 'ssd' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 60 flags hashpspool stripe_width 0

    From now on, objects written to the sata pool will be stored preferentially on the SATA devices, and objects written to the ssd pool on the SSD devices.

    Test it with rados:

    [root@node3 ~]# touch file.ssd
    [root@node3 ~]# touch file.sata
    [root@node3 ~]# rados -p ssd put filename file.ssd
    [root@node3 ~]# rados -p sata put filename file.sata

    Finally, use ceph osd map to check where each object landed:

    [root@node3 ~]# ceph osd map ssd file.ssd 
    osdmap e69 pool 'ssd' (3) object 'file.ssd' -> pg 3.46b33220 (3.20) -> up ([4,1], p4) acting ([4,1,0], p4)
    [root@node3 ~]# ceph osd map sata file.sata
    osdmap e69 pool 'sata' (2) object 'file.sata' -> pg 2.df856dd1 (2.11) -> up ([5,2], p5) acting ([5,2,0], p5)
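    When scripting against this output, the acting set can be pulled out with a little sed (a helper written for the exact line format shown above):

```shell
# Extract the comma-separated OSD ids from the "acting ([...])" part of a
# `ceph osd map` output line.
acting_set() {
    printf '%s\n' "$1" | sed -n 's/.*acting (\[\([0-9,]*\)\].*/\1/p'
}
acting_set "osdmap e69 pool 'sata' (2) object 'file.sata' -> pg 2.df856dd1 (2.11) -> up ([5,2], p5) acting ([5,2,0], p5)"
# prints: 5,2,0
```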

    Each object is indeed stored preferentially on devices of the matching type. (The up sets hold only two OSDs because each new root contains just two OSDs, fewer than the pools' replica size of 3.)

    References:
    Learning Ceph: CRUSH (ceph学习之CRUSH)
    Understanding OpenStack + Ceph (7): basic Ceph operations and common troubleshooting
    Ceph: mix SATA and SSD within the same box
    The crush class feature introduced in Ceph Luminous

  • Original article: https://www.cnblogs.com/wangmo/p/11430813.html