  • Experiment: modifying the crushmap

    https://www.cnblogs.com/sisimi/p/7799980.html

    CRUSH stands for Controlled Replication Under Scalable Hashing. It is the distributed data-placement algorithm of Ceph and the core of its storage engine. When a Ceph client reads or writes data, it computes the object's storage location on the fly (the ceph osd map command used later in this post performs the same computation). Because placement is computed rather than looked up, Ceph does not need to maintain a central metadata table of object locations, which improves performance.

    Ceph distributed storage revolves around three key "R"s: Replication, Recovery, and Rebalancing. When a component fails, Ceph by default waits 300 seconds after an OSD is reported down before also marking it out and starting recovery. This wait time is controlled by the mon_osd_down_out_interval parameter in the cluster configuration file.
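
    A minimal sketch of where this setting lives, assuming configuration is managed through ceph.conf (600 below is just an example value):

    [mon]
    mon_osd_down_out_interval = 600    # seconds to wait before a down OSD is also marked out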

    When a new host or disk joins the cluster, CRUSH starts rebalancing: it migrates data from the existing hosts and disks onto the new ones. Rebalancing tries to spread data across all disks to improve overall performance. If the cluster is under heavy use, the recommended practice is to add the new disk with a CRUSH weight of 0 and then raise the weight gradually, so data migrates slowly and client performance is not hurt, as sketched below; this is good advice when expanding any distributed storage system.
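
    A minimal sketch of that workflow (osd.6 and node1 are hypothetical names, the weights are example values):

    ceph osd crush add osd.6 0 host=node1    # add the new OSD to the CRUSH map with weight 0, so no data is placed on it yet
    ceph osd crush reweight osd.6 0.005      # raise the CRUSH weight in small steps,
    ceph osd crush reweight osd.6 0.010      # waiting for backfill to finish before each increase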

    In practice the cluster layout often needs adjusting. The default CRUSH layout is very simple: running ceph osd tree shows only the host and osd bucket types under the root. This default layout is poor for failure isolation, since it has no notion of rack, row, or room. Below we add the rack bucket level so that every host sits under a rack.

    (1) Run ceph osd tree to get the current cluster layout:

    [root@node3 ~]# ceph osd tree
    ID  CLASS WEIGHT  TYPE NAME      STATUS REWEIGHT PRI-AFF 
     -1       0.05878 root default                           
     -3       0.01959     host node1                         
      0   hdd 0.00980         osd.0      up  1.00000 1.00000 
      3   hdd 0.00980         osd.3      up  1.00000 1.00000 
     -5       0.01959     host node2                         
      1   hdd 0.00980         osd.1      up  1.00000 1.00000 
      4   hdd 0.00980         osd.4      up  1.00000 1.00000 
     -7       0.01959     host node3                         
      2   hdd 0.00980         osd.2      up  1.00000 1.00000 
      5   hdd 0.00980         osd.5      up  1.00000 1.00000 

    (2) Add the rack buckets:

    [root@node3 ~]# ceph osd crush add-bucket rack03 rack
    added bucket rack03 type rack to crush map
    [root@node3 ~]# ceph osd crush add-bucket rack01 rack
    added bucket rack01 type rack to crush map
    [root@node3 ~]# ceph osd crush add-bucket rack02 rack
    added bucket rack02 type rack to crush map

    (3) Move the hosts under the racks:

    [root@node3 ~]# ceph osd crush move node1 rack=rack01
    moved item id -3 name 'node1' to location {rack=rack01} in crush map
    [root@node3 ~]# ceph osd crush move node2 rack=rack02
    moved item id -5 name 'node2' to location {rack=rack02} in crush map
    [root@node3 ~]# ceph osd crush move node3 rack=rack03
    moved item id -7 name 'node3' to location {rack=rack03} in crush map

    (4) Move the racks under the default root:

    [root@node3 ~]# ceph osd crush move rack01 root=default
    moved item id -9 name 'rack01' to location {root=default} in crush map
    [root@node3 ~]# ceph osd crush move rack02 root=default
    moved item id -10 name 'rack02' to location {root=default} in crush map
    [root@node3 ~]# ceph osd crush move rack03 root=default
    moved item id -11 name 'rack03' to location {root=default} in crush map

    (5) Run ceph osd tree again:

    [root@node3 ~]# ceph osd tree
    ID  CLASS WEIGHT  TYPE NAME          STATUS REWEIGHT PRI-AFF 
     -1       0.05878 root default                               
     -9       0.01959     rack rack01                            
     -3       0.01959         host node1                         
      0   hdd 0.00980             osd.0      up  1.00000 1.00000 
      3   hdd 0.00980             osd.3      up  1.00000 1.00000 
    -10       0.01959     rack rack02                            
     -5       0.01959         host node2                         
      1   hdd 0.00980             osd.1      up  1.00000 1.00000 
      4   hdd 0.00980             osd.4      up  1.00000 1.00000 
    -11       0.01959     rack rack03                            
     -7       0.01959         host node3                         
      2   hdd 0.00980             osd.2      up  1.00000 1.00000 
      5   hdd 0.00980             osd.5      up  1.00000 1.00000 

    The new layout is in place: every host now sits under a specific rack. That completes the adjustment of the CRUSH layout.
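
    Note that moving the hosts under racks only changes the hierarchy; the default replicated rule (visible in the decompiled map later in this post) still uses host as its failure domain, so replicas are not yet forced onto different racks. A sketch of how a rack-level rule could be created and assigned (replicated_rack is just a name chosen here):

    ceph osd crush rule create-replicated replicated_rack default rack    # new replicated rule: one replica per rack under root default
    ceph osd pool set data crush_rule replicated_rack                     # point an existing pool (e.g. data) at the new rule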

    For any known object, you can look up where CRUSH places it. For example, put a file test.txt into the pool named data:

    [root@node3 ~]# echo "this is test! ">>test.txt
    [root@node3 ~]# rados -p data ls
    [root@node3 ~]# rados -p data put test.txt test.txt 
    [root@node3 ~]# rados -p data ls
    test.txt

    Show its placement:

    [root@node3 ~]# ceph osd map data test.txt 
    osdmap e42 pool 'data' (1) object 'test.txt' -> pg 1.8b0b6108 (1.8) -> up ([3,4,2], p3) acting ([3,4,2], p3)
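
    How to read that line (the annotations below are added here, they are not command output):

    # osdmap e42            -> epoch of the OSD map used for the calculation
    # pool 'data' (1)       -> pool name and pool id
    # pg 1.8b0b6108 (1.8)   -> the object's full hash and the placement group it falls into (pool 1, PG 0x8)
    # up ([3,4,2], p3)      -> the up set of OSDs for that PG; p3 marks osd.3 as the primary
    # acting ([3,4,2], p3)  -> the acting set of OSDs currently serving the PG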

    The crushmap reflects the storage architecture of the cluster, and in practice it may need frequent adjustment. First dump it from the cluster, then decompile it into plain text for inspection (note that here the decompiled text is written back over the same file name):

    [root@node3 ~]# ceph osd getcrushmap -o crushmap
    22
    [root@node3 ~]# crushtool -d crushmap -o crushmap
    [root@node3 ~]# cat crushmap 
    # begin crush map
    tunable choose_local_tries 0
    tunable choose_local_fallback_tries 0
    tunable choose_total_tries 50
    tunable chooseleaf_descend_once 1
    tunable chooseleaf_vary_r 1
    tunable chooseleaf_stable 1
    tunable straw_calc_version 1
    tunable allowed_bucket_algs 54
    
    # devices
    device 0 osd.0 class hdd
    device 1 osd.1 class hdd
    device 2 osd.2 class hdd
    device 3 osd.3 class hdd
    device 4 osd.4 class hdd
    device 5 osd.5 class hdd
    
    # types
    type 0 osd
    type 1 host
    type 2 chassis
    type 3 rack
    type 4 row
    type 5 pdu
    type 6 pod
    type 7 room
    type 8 datacenter
    type 9 region
    type 10 root
    
    # buckets
    host node1 {
            id -3       # do not change unnecessarily
            id -4 class hdd         # do not change unnecessarily
            # weight 0.020
            alg straw2
            hash 0  # rjenkins1
            item osd.0 weight 0.010
            item osd.3 weight 0.010
    }
    rack rack01 {
            id -9       # do not change unnecessarily
            id -14 class hdd                # do not change unnecessarily
            # weight 0.020
            alg straw2
            hash 0  # rjenkins1
            item node1 weight 0.020
    }
    host node2 {
            id -5       # do not change unnecessarily
            id -6 class hdd         # do not change unnecessarily
            # weight 0.020
            alg straw2
            hash 0  # rjenkins1
            item osd.1 weight 0.010
            item osd.4 weight 0.010
    }
    rack rack02 {
            id -10      # do not change unnecessarily
            id -13 class hdd                # do not change unnecessarily
            # weight 0.020
            alg straw2
            hash 0  # rjenkins1
            item node2 weight 0.020
    }
    host node3 {
            id -7       # do not change unnecessarily
            id -8 class hdd         # do not change unnecessarily
            # weight 0.020
            alg straw2
            hash 0  # rjenkins1
            item osd.2 weight 0.010
            item osd.5 weight 0.010
    }
    rack rack03 {
            id -11      # do not change unnecessarily
            id -12 class hdd                # do not change unnecessarily
            # weight 0.020
            alg straw2
            hash 0  # rjenkins1
            item node3 weight 0.020
    }
    root default {
            id -1       # do not change unnecessarily
            id -2 class hdd         # do not change unnecessarily
            # weight 0.059
            alg straw2
            hash 0  # rjenkins1
            item rack01 weight 0.020
            item rack02 weight 0.020
            item rack03 weight 0.020
    }
    
    # rules
    rule replicated_rule {
            id 0
            type replicated
            min_size 1
            max_size 10
            step take default
            step chooseleaf firstn 0 type host
            step emit
    }
    
    # end crush map

    The file consists of several sections; briefly:

    • crushmap devices: the content after # devices in the file above. It lists the cluster's OSDs; whenever OSDs are added or removed, the list is updated automatically. You normally do not need to edit this section, Ceph maintains it for you.

    • crushmap bucket types: the content after # types. It defines the available bucket types, including root, datacenter, room, row, rack, host, osd, and so on. The default types are enough for most Ceph clusters, but you can add your own.

    • crushmap bucket definitions: the content after # buckets. This is where the hierarchical bucket structure is defined, along with the algorithm (and hash) each bucket uses.

    • crushmap rules: the content after # rules. They define which buckets the data stored in a pool should be drawn from. Larger clusters have multiple pools, each with its own placement rule; an annotated copy of the default rule follows below.
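
    For reference, here is the default rule from the dump above with comments added to explain each step (the comments are annotations, not part of the compiled map):

    rule replicated_rule {
            id 0
            type replicated                       # applies to replicated (not erasure-coded) pools
            min_size 1                            # the rule is used for pools with a replica count of at least 1
            max_size 10                           # ...and at most 10
            step take default                     # start descending from the bucket named "default"
            step chooseleaf firstn 0 type host    # pick one leaf (OSD) under each of N distinct hosts, N = the pool's replica count
            step emit                             # output the selected OSDs as the placement result
    }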

    As a practical crushmap scenario, we can define a pool named ssd that uses SSD disks for higher performance and a pool named sata that uses SATA disks for better economics. Assume there are 3 Ceph storage nodes, each running its own OSD daemons. (All the disks in this lab are actually HDD-class; the two roots defined below simply stand in for the two tiers.)

    First, edit the decompiled crushmap file and change root default to:

    root default {
            id -1           # do not change unnecessarily
            id -2 class hdd         # do not change unnecessarily
            # weight 0.059
            alg straw2
            hash 0  # rjenkins1
            item rack01 weight 0.020
    }

    The main change is to its item list: the lines item rack02 weight 0.020 and item rack03 weight 0.020 are removed,

    and the following is added:

    root ssd {
            id -15
            alg straw
            hash 0
            item rack02 weight 0.020
    }
    root sata {
            id -16
            alg straw
            hash 0
            item rack03 weight 0.020
    }
    
    # rules
    rule replicated_rule {
            id 0
            type replicated
            min_size 1
            max_size 10
            step take default
            step chooseleaf firstn 0 type host
            step emit
    }
    rule ssd-pool {
            ruleset 1
            type replicated
            min_size 1
            max_size 10
            step take ssd
            step chooseleaf firstn 0 type osd
            step emit
    }
    
    rule sata-pool {
            ruleset 2
            type replicated
            min_size 1
            max_size 10
            step take sata
            step chooseleaf firstn 0 type osd
            step emit

    In the ruleset 2 rule, step take sata means placement descends from the sata bucket; in the ruleset 1 rule, step take ssd means placement descends from the ssd bucket. Both rules use chooseleaf firstn 0 type osd rather than type host because each of the new roots contains only one host, so replicas can only be spread across OSDs. The other thing to watch is that the new bucket ids (-15 and -16 here) must not clash with ids already used in the map.

    Compile the file and inject it into the cluster:

    [root@node3 ~]# crushtool -c crushmap -o crushmap.new
    [root@node3 ~]# ceph osd setcrushmap -i crushmap.new
    23
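
    The compiled map can also be dry-run offline with crushtool --test before it is injected, for example (the rule numbers and replica count are example values):

    crushtool -i crushmap.new --test --rule 1 --num-rep 2 --show-mappings    # placements the ssd-pool rule would produce
    crushtool -i crushmap.new --test --rule 2 --num-rep 2 --show-statistics  # summary statistics for the sata-pool rule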

    Check the cluster layout again:

    [root@node3 ~]# ceph osd tree
    ID  CLASS WEIGHT  TYPE NAME          STATUS REWEIGHT PRI-AFF 
    -16       0.01999 root sata                                  
    -11       0.01999     rack rack03                            
     -7       0.01999         host node3                         
      2   hdd 0.00999             osd.2      up  1.00000 1.00000 
      5   hdd 0.00999             osd.5      up  1.00000 1.00000 
    -15       0.01999 root ssd                                   
    -10       0.01999     rack rack02                            
     -5       0.01999         host node2                         
      1   hdd 0.00999             osd.1      up  1.00000 1.00000 
      4   hdd 0.00999             osd.4      up  1.00000 1.00000 
     -1       0.01999 root default                               
     -9       0.01999     rack rack01                            
     -3       0.01999         host node1                         
      0   hdd 0.00999             osd.0      up  1.00000 1.00000 
      3   hdd 0.00999             osd.3      up  1.00000 1.00000 

    Next, check with ceph -s that the cluster health is OK. If it is, create two pools, each with 64 placement groups:

    [root@node3 ~]# ceph osd pool create sata 64 64
    pool 'sata' created
    [root@node3 ~]# ceph osd pool create ssd 64 64
    pool 'ssd' created

    Assign CRUSH rules to the two newly created pools:

    [root@node3 ~]# ceph osd pool set sata crush_rule sata-pool
    set pool 2 crush_rule to sata-pool
    [root@node3 ~]# ceph osd pool set ssd crush_rule ssd-pool
    set pool 3 crush_rule to ssd-pool

    Check whether the rules took effect:

    [root@node3 ~]# ceph osd dump |egrep -i "ssd|sata"
    pool 2 'sata' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 55 flags hashpspool stripe_width 0
    pool 3 'ssd' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 60 flags hashpspool stripe_width 0

    Objects written to the sata pool will now be placed on the devices under root sata (the SATA tier), and objects written to the ssd pool on the devices under root ssd (the SSD tier).

    Test with the rados command:

    [root@node3 ~]# touch file.ssd
    [root@node3 ~]# touch file.sata
    [root@node3 ~]# rados -p ssd put filename file.ssd
    [root@node3 ~]# rados -p sata put filename file.sata

    Finally, check where they were placed with the ceph osd map command:

    [root@node3 ~]# ceph osd map ssd file.ssd 
    osdmap e69 pool 'ssd' (3) object 'file.ssd' -> pg 3.46b33220 (3.20) -> up ([4,1], p4) acting ([4,1,0], p4)
    [root@node3 ~]# ceph osd map sata file.sata
    osdmap e69 pool 'sata' (2) object 'file.sata' -> pg 2.df856dd1 (2.11) -> up ([5,2], p5) acting ([5,2,0], p5)

    As the up sets show, file.ssd landed on osd.4/osd.1 (node2, under root ssd) and file.sata on osd.5/osd.2 (node3, under root sata): objects written to each pool were placed on the devices of the corresponding tier.
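
    To look beyond a single object, the placement of every PG in a pool can also be listed (assuming a Ceph release that provides this subcommand):

    ceph pg ls-by-pool ssd     # list all PGs of the ssd pool with their up/acting OSDs
    ceph pg ls-by-pool sata    # same for the sata pool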

  • Original article: https://www.cnblogs.com/wangmo/p/11418168.html