  • [ext4] Space management

    

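    The block allocation mechanism relies on a few key data structures.

    A block request is described by ext4_allocation_request; whether an allocation is actually performed is then decided from the result of the block lookup and the needs of the upper layer.

    During allocation, some bookkeeping is needed to carry the allocation through, so the allocation itself is described by the struct ext4_allocation_context.

    While searching for free space the allocator may use preallocated space, so a further descriptor, ext4_prealloc_space, records the size and other attributes of a preallocated region.

    Each of these structures is examined in detail below.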

    1. Block request descriptor: ext4_allocation_request

    The attributes of a block allocation request are described by the request descriptor ext4_allocation_request:

    struct ext4_allocation_request {
            /* target inode for block we're allocating */
            struct inode *inode;
            /* how many blocks we want to allocate */
            unsigned int len;
            /* logical block in target inode */
            ext4_lblk_t logical;
            /* the closest logical allocated block to the left */
            ext4_lblk_t lleft;
            /* the closest logical allocated block to the right */
            ext4_lblk_t lright;
            /* phys. target (a hint) */
            ext4_fsblk_t goal;
            /* phys. block for the closest logical allocated block to the left */
            ext4_fsblk_t pleft;
            /* phys. block for the closest logical allocated block to the right */
            ext4_fsblk_t pright;
            /* flags. see above EXT4_MB_HINT_* */
            unsigned int flags;
    };

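    This request descriptor is initialized in ext4_ext_map_blocks() (note: ext4_ext_map_blocks() looks up or allocates the requested blocks and completes their mapping to the buffer cache).

    Of the fields above, only goal needs a closer look. goal holds nothing more than a physical block number, but its choice matters a great deal: a well-chosen goal largely determines the file's locality and the contiguity of its physical blocks.

    goal is computed by ext4_ext_find_goal():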

    static ext4_fsblk_t ext4_ext_find_goal(struct inode *inode,
                                           struct ext4_ext_path *path,
                                           ext4_lblk_t block)
    {
            if (path) {
                    int depth = path->p_depth;
                    struct ext4_extent *ex;

                    /*
                     * Try to predict block placement assuming that we are
                     * filling in a file which will eventually be
                     * non-sparse --- i.e., in the case of libbfd writing
                     * an ELF object sections out-of-order but in a way
                     * the eventually results in a contiguous object or
                     * executable file, or some database extending a table
                     * space file. However, this is actually somewhat
                     * non-ideal if we are writing a sparse file such as
                     * qemu or KVM writing a raw image file that is going
                     * to stay fairly sparse, since it will end up
                     * fragmenting the file system's free space. Maybe we
                     * should have some heuristics or some way to allow
                     * userspace to pass a hint to file system,
                     * especially if the latter case turns out to be
                     * common.
                     */
                    ex = path[depth].p_ext;
                    if (ex) {
                            ext4_fsblk_t ext_pblk = ext4_ext_pblock(ex);
                            ext4_lblk_t ext_block = le32_to_cpu(ex->ee_block);

                            if (block > ext_block)
                                    return ext_pblk + (block - ext_block);
                            else
                                    return ext_pblk - (ext_block - block);
                    }

                    /* it looks like index is empty;
                     * try to find starting block from index itself */
                    if (path[depth].p_bh)
                            return path[depth].p_bh->b_blocknr;
            }

            /* OK. use inode's group */
            return ext4_inode_to_goal_block(inode);
    }

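    A closer look at ext4_ext_find_goal(): if a path from the root of the extent tree to the requested logical block exists, the goal physical block is computed from that path.

    (1) If the path ends at a data extent, the path runs from the root to a leaf. When the requested block number is greater than the starting logical block ext_block of the leaf extent (whose physical block is pblk), the logical distance is (block - ext_block); to keep the physical layout as contiguous as possible, the allocator should return the free physical block closest to pblk + (block - ext_block). When the requested block number is smaller than ext_block, it should likewise look for the free physical block closest to pblk - (ext_block - block). The goal is therefore pblk + (block - ext_block) in the first case and pblk - (ext_block - block) in the second.

    (2) If the path exists but has no leaf extent, the answer is simple: the goal is the physical block number of the extent-tree block at the end of the path, path[depth].p_bh->b_blocknr.

    (3) Finally, no path may be given at all. In my view this corresponds to inode creation. This case is handled by ext4_inode_to_goal_block():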

    ext4_fsblk_t ext4_inode_to_goal_block(struct inode *inode)
    {
            struct ext4_inode_info *ei = EXT4_I(inode);
            ext4_group_t block_group;
            ext4_grpblk_t colour;
            int flex_size = ext4_flex_bg_size(EXT4_SB(inode->i_sb));
            ext4_fsblk_t bg_start;
            ext4_fsblk_t last_block;

            block_group = ei->i_block_group;
            if (flex_size >= EXT4_FLEX_SIZE_DIR_ALLOC_SCHEME) {
                    /*
                     * If there are at least EXT4_FLEX_SIZE_DIR_ALLOC_SCHEME
                     * block groups per flexgroup, reserve the first block
                     * group for directories and special files. Regular
                     * files will start at the second block group. This
                     * tends to speed up directory access and improves
                     * fsck times.
                     */
                    block_group &= ~(flex_size - 1);
                    if (S_ISREG(inode->i_mode))
                            block_group++;
            }
            bg_start = ext4_group_first_block_no(inode->i_sb, block_group);
            last_block = ext4_blocks_count(EXT4_SB(inode->i_sb)->s_es) - 1;

            /*
             * If we are doing delayed allocation, we don't need take
             * colour into account.
             */
            if (test_opt(inode->i_sb, DELALLOC))
                    return bg_start;

            if (bg_start + EXT4_BLOCKS_PER_GROUP(inode->i_sb) <= last_block)
                    colour = (current->pid % 16) *
                            (EXT4_BLOCKS_PER_GROUP(inode->i_sb) / 16);
            else
                    colour = (current->pid % 16) * ((last_block - bg_start) / 16);
            return bg_start + colour;
    }

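    The idea: if a flex group holds at least EXT4_FLEX_SIZE_DIR_ALLOC_SCHEME block groups, then for a regular file the starting physical block bg_start is the first block of the second block group in the inode's flex group.

    Of course, if every file in that flex group used bg_start as its goal, they would all contend for the same blocks. The colour offset exists to reduce that contention: it is derived from current->pid, so different processes start at different positions inside the group. (With delayed allocation enabled, colour is skipped and bg_start is returned as is.)

    In this last case, then, the goal is a pid-dependent position inside the chosen block group of the inode's flex group.

    [Note: only when flex_size is at least EXT4_FLEX_SIZE_DIR_ALLOC_SCHEME is the first group of the flex group set aside for directories and special files, with regular files allocated from the second group onward. This tends to speed up directory access and fsck.]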

     

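    2. Allocation context descriptor: ext4_allocation_context

    During allocation, some bookkeeping is needed to carry the allocation through, so the allocation itself is described by the struct ext4_allocation_context: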

    struct ext4_allocation_context {
            struct inode *ac_inode;
            struct super_block *ac_sb;

            /* original request */
            struct ext4_free_extent ac_o_ex;

            /* goal request (normalized ac_o_ex) */
            struct ext4_free_extent ac_g_ex;

            /* the best found extent */
            struct ext4_free_extent ac_b_ex;

            /* copy of the best found extent taken before preallocation efforts */
            struct ext4_free_extent ac_f_ex;

            __u16 ac_groups_scanned;
            __u16 ac_found;
            __u16 ac_tail;
            __u16 ac_buddy;
            __u16 ac_flags;         /* allocation hints */
            __u8 ac_status;
            __u8 ac_criteria;
            __u8 ac_2order;         /* if request is to allocate 2^N blocks and
                                     * N > 0, the field stores N, otherwise 0 */
            __u8 ac_op;             /* operation, for history only */
            struct page *ac_bitmap_page;
            struct page *ac_buddy_page;
            struct ext4_prealloc_space *ac_pa;
            struct ext4_locality_group *ac_lg;
    };

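    This structure describes the attributes of the allocation context. It is initialized by ext4_mb_initialize_context() from the ext4_allocation_request.

    The main job of ext4_mb_initialize_context() is to initialize ac->ac_o_ex from the request descriptor: the requested logical block number fe_logical, the group that contains goal, and goal's cluster number (for now, read this as a physical block number). It then sets ac_g_ex to a copy of ac_o_ex.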

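    ext4_mb_normalize_request() then normalizes the ext4_allocation_context:

    1. Compute the file size: size is the larger of i_size_read(ac->ac_inode) and (offset + request length), where offset is derived from the requested block.

    2. Estimate the likely file size from a fixed table: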

    #define NRL_CHECK_SIZE(req, size, max, chunk_size)      \
                    (req <= (size) || max <= (chunk_size))

            /* first, try to predict filesize */
            /* XXX: should this table be tunable? */
            start_off = 0;
            if (size <= 16 * 1024) {
                    size = 16 * 1024;
            } else if (size <= 32 * 1024) {
                    size = 32 * 1024;
            } else if (size <= 64 * 1024) {
                    size = 64 * 1024;
            } else if (size <= 128 * 1024) {
                    size = 128 * 1024;
            } else if (size <= 256 * 1024) {
                    size = 256 * 1024;
            } else if (size <= 512 * 1024) {
                    size = 512 * 1024;
            } else if (size <= 1024 * 1024) {
                    size = 1024 * 1024;
            } else if (NRL_CHECK_SIZE(size, 4 * 1024 * 1024, max, 2 * 1024)) {
                    start_off = ((loff_t)ac->ac_o_ex.fe_logical >>
                                                    (21 - bsbits)) << 21;
                    size = 2 * 1024 * 1024;
            } else if (NRL_CHECK_SIZE(size, 8 * 1024 * 1024, max, 4 * 1024)) {
                    start_off = ((loff_t)ac->ac_o_ex.fe_logical >>
                                                    (22 - bsbits)) << 22;
                    size = 4 * 1024 * 1024;
            } else if (NRL_CHECK_SIZE(ac->ac_o_ex.fe_len,
                                            (8 << 20) >> bsbits, max, 8 * 1024)) {
                    start_off = ((loff_t)ac->ac_o_ex.fe_logical >>
                                                    (23 - bsbits)) << 23;
                    size = 8 * 1024 * 1024;
            } else {
                    start_off = (loff_t)ac->ac_o_ex.fe_logical << bsbits;
                    size = ac->ac_o_ex.fe_len << bsbits;
            }
            size = size >> bsbits;
            start = start_off >> bsbits;

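    So after the prediction, size is usually rounded up past the original request, while start is either left at 0 (small files) or aligned down to the beginning of the chosen chunk; the normalized range therefore generally covers more than was asked for.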

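    3. Check whether the range overlaps existing preallocated space (an overlap here is a BUG).

    4. Update ac_g_ex from the size and start computed in step 2:

            ac->ac_g_ex.fe_logical = start;
            ac->ac_g_ex.fe_len = EXT4_NUM_B2C(sbi, size);

    As shown above, ext4_mb_normalize_request() mainly updates the ac->ac_g_ex member.

    ac->ac_b_ex is initialized in ext4_mb_regular_allocator(); it represents the best extent that can be allocated, in other words the extent the allocation will actually use.

    ac->ac_f_ex is a copy of ac_b_ex saved before the preallocated space is set up; it is assigned in ext4_mb_new_inode_pa() and ext4_mb_new_group_pa().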

     

    3. Preallocated space descriptor: ext4_prealloc_space

    ext4_prealloc_space describes the size and other attributes of a preallocated region:

    struct ext4_prealloc_space {
            struct list_head        pa_inode_list;
            struct list_head        pa_group_list;
            union {
                    struct list_head pa_tmp_list;
                    struct rcu_head pa_rcu;
            } u;
            spinlock_t              pa_lock;
            atomic_t                pa_count;
            unsigned                pa_deleted;
            ext4_fsblk_t            pa_pstart;      /* phys. block */
            ext4_lblk_t             pa_lstart;      /* log. block */
            ext4_grpblk_t           pa_len;         /* len of preallocated chunk */
            ext4_grpblk_t           pa_free;        /* how many blocks are free */
            unsigned short          pa_type;        /* pa type. inode or group */
            spinlock_t              *pa_obj_lock;
            struct inode            *pa_inode;      /* hack, for history only */
    };

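    Four of its fields are especially important:

    pa_lstart -> starting logical address of the preallocated space (relative to the file);

    pa_pstart -> starting physical address of the preallocated space;

    pa_len   -> length of the preallocated space;

    pa_free  -> number of blocks in the preallocated space that are still free.

    This structure is initialized in ext4_mb_new_inode_pa() and ext4_mb_new_group_pa().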

     

    That is all for these structures for now.


    Author: Younger Liu

    This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.



  • Original article: https://www.cnblogs.com/youngerchina/p/5624475.html