  • Understanding "Movable" in Memory Management

    Zones in the kernel

    The kernel defines the following zones:

    enum zone_type {
    #ifdef CONFIG_ZONE_DMA
        /*
         * ZONE_DMA is used when there are devices that are not able
         * to do DMA to all of addressable memory (ZONE_NORMAL). Then we
         * carve out the portion of memory that is needed for these devices.
         * The range is arch specific.
         *
         * Some examples
         *
         * Architecture     Limit
         * ---------------------------
         * parisc, ia64, sparc  <4G
         * s390         <2G
         * arm          Various
         * alpha        Unlimited or 0-16MB.
         *
         * i386, x86_64 and multiple other arches
         *          <16M.
         */
        ZONE_DMA,
    #endif
    #ifdef CONFIG_ZONE_DMA32
        /*
         * x86_64 needs two ZONE_DMAs because it supports devices that are
         * only able to do DMA to the lower 16M but also 32 bit devices that
         * can only do DMA areas below 4G.
         */
        ZONE_DMA32,
    #endif
        /*
         * Normal addressable memory is in ZONE_NORMAL. DMA operations can be
         * performed on pages in ZONE_NORMAL if the DMA devices support
         * transfers to all addressable memory.
         */
        ZONE_NORMAL,
    #ifdef CONFIG_HIGHMEM
        /*
         * A memory area that is only addressable by the kernel through
         * mapping portions into its own address space. This is for example
         * used by i386 to allow the kernel to address the memory beyond
         * 900MB. The kernel will set up special mappings (page
         * table entries on i386) for each page that the kernel needs to
         * access.
         */
        ZONE_HIGHMEM,
    #endif
        ZONE_MOVABLE,
        __MAX_NR_ZONES
    };
    
    
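    • ZONE_DMA
      This zone exists because some devices cannot DMA to the entire address range, so a block of memory is set aside specifically for their DMA allocations. On x86, for example, it covers 0-16M.
    • ZONE_NORMAL
      The NORMAL zone is the directly mapped region.
    • ZONE_HIGHMEM
      The high memory zone. Memory allocated from it can be accessed only after the kernel has mapped it. 64-bit architectures generally do not need a high memory zone, because the address space is large enough to map all physical memory.
    • ZONE_MOVABLE
      This zone is a special case. It exists mainly to support memory hotplug, so MOVABLE here means "removable", although it also means "migratable".

    In short, migratable pages are not necessarily in ZONE_MOVABLE, but every page in ZONE_MOVABLE must be migratable. /proc/pagetypeinfo gives a concrete example: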

    xie:/proc # cat pagetypeinfo                                                 
    Page block order: 10
    Pages per block:  1024
    
    Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10 
    Node    0, zone      DMA, type    Unmovable     76     50     24     20     27     25     19      3      1      2      0 
    Node    0, zone      DMA, type      Movable    117     35     28    172    281     93     49     21      7      4      4 
    Node    0, zone      DMA, type  Reclaimable      0      3      1      0      0      0      0      1      0      1      0 
    Node    0, zone      DMA, type          CMA   3380   1798    856    386    152     55     21      8      4      0      0 
    Node    0, zone      DMA, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0 
    Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0 
    Node    0, zone   Normal, type    Unmovable    521    654    531    286    132     52     15      2      1      4      0 
    Node    0, zone   Normal, type      Movable      1      8     21     21      1      1      5      3      1      0      0 
    Node    0, zone   Normal, type  Reclaimable     18     24      1      1      0      0      1      0      1      0      0 
    Node    0, zone   Normal, type          CMA      9      0      1      6      2      0      1      0      0      0      0 
    Node    0, zone   Normal, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0 
    Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0 
    Node    0, zone  Movable, type    Unmovable      0      0      0      0      0      0      0      0      0      0      0 
    Node    0, zone  Movable, type      Movable    963    649    188     48     24    112     49     21      8      3     50 
    Node    0, zone  Movable, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0 
    Node    0, zone  Movable, type          CMA      0      0      0      0      0      0      0      0      0      0      0 
    Node    0, zone  Movable, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0 
    Node    0, zone  Movable, type      Isolate      0      0      0      0      0      0      0      0      0      0      0 
    
    Number of blocks type     Unmovable      Movable  Reclaimable          CMA   HighAtomic      Isolate 
    Node 0, zone      DMA          123          310           18           61            0            0 
    Node 0, zone   Normal          406          310           43            9            0            0 
    Node 0, zone  Movable            0          256            0            0            0            0 
    
    Number of mixed blocks    Unmovable      Movable  Reclaimable          CMA   HighAtomic      Isolate 
    Node 0, zone      DMA            0           61            0            0            0            0 
    Node 0, zone   Normal            0           11            3            0            0            0 
    Node 0, zone  Movable            0            0            0            0            0            0 
    

    As the output shows, the Movable zone contains no pages of type Unmovable, only pages of type Movable.
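
The same check can be done mechanically. The following is a minimal userspace sketch, assuming the row layout of the listing above (11 orders, zone and type names without spaces). The `pt_` names are hypothetical helpers for illustration, not kernel or libc APIs.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define PT_MAX_ORDERS 11 /* orders 0..10, matching the listing above */

/* One parsed "Node N, zone Z, type T c0 c1 ..." row of /proc/pagetypeinfo. */
struct pt_row {
    int node;
    char zone[16];
    char type[16];
    unsigned long free_blocks[PT_MAX_ORDERS]; /* free blocks per order */
};

/* Parse one free-count row. Returns 0 on success, -1 if the line has another shape. */
static int pt_parse_row(const char *line, struct pt_row *row)
{
    int consumed = 0;
    const char *p;

    assert(line != NULL && row != NULL);
    if (sscanf(line, " Node %d, zone %15[^,], type %15s%n",
               &row->node, row->zone, row->type, &consumed) != 3)
        return -1;

    p = line + consumed;
    for (int order = 0; order < PT_MAX_ORDERS; order++) {
        int n = 0;

        if (sscanf(p, "%lu%n", &row->free_blocks[order], &n) != 1)
            return -1;
        p += n;
    }
    return 0;
}

/* Total free pages in the row: a free block of order k holds 2^k pages. */
static unsigned long pt_free_pages(const struct pt_row *row)
{
    unsigned long pages = 0;

    for (int order = 0; order < PT_MAX_ORDERS; order++)
        pages += row->free_blocks[order] << order;
    return pages;
}

/* True if the row reports free non-Movable pages inside the Movable zone. */
static int pt_is_violation(const struct pt_row *row)
{
    return strcmp(row->zone, "Movable") == 0 &&
           strcmp(row->type, "Movable") != 0 &&
           pt_free_pages(row) > 0;
}
```

Feeding every line of /proc/pagetypeinfo to `pt_parse_row()` and skipping the lines it rejects leaves only the free-count rows; for the listing above `pt_is_violation()` would be false for all of them.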

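    The ZONE_MOVABLE zone

    This zone is mainly tied to the memory hotplug feature. Memory hotplug was designed with the following considerations in mind:
    1. Logical memory hotplug: support for virtual machines, so that usable memory can be assigned to a VM on demand
    2. Physical memory hotplug: support for NUMA servers, where unneeded memory can be set offline to reduce power consumption
    3. Reducing memory fragmentation

    All pages stored in this zone are migratable, and the zone can be used only by allocation requests that carry both __GFP_HIGHMEM and __GFP_MOVABLE, for example: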

    #define GFP_HIGHUSER_MOVABLE    (GFP_HIGHUSER | __GFP_MOVABLE)
    
    #define GFP_USER    (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
    #define GFP_HIGHUSER    (GFP_USER | __GFP_HIGHMEM)
    
    

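    Note that the allocation flag __GFP_MOVABLE and the zone ZONE_MOVABLE must not be confused; the two do not correspond to each other.

    • __GFP_MOVABLE is an attribute of a page allocation and means the page is migratable. Some pages outside ZONE_MOVABLE are migratable too, such as cache pages;
    • ZONE_MOVABLE is a zone and is related to memory hotplug; its pages must of course be migratable for hotplug to work.

    The __GFP_MOVABLE allocation flag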

    #define __GFP_DMA   ((__force gfp_t)___GFP_DMA)
    #define __GFP_HIGHMEM   ((__force gfp_t)___GFP_HIGHMEM)
    #define __GFP_DMA32 ((__force gfp_t)___GFP_DMA32)
    #define __GFP_MOVABLE   ((__force gfp_t)___GFP_MOVABLE)  /* Page is movable */
    #define GFP_ZONEMASK    (__GFP_DMA|__GFP_HIGHMEM|__GFP_DMA32|__GFP_MOVABLE)
    

    These allocation flags are called zone modifiers; they indicate which zone an allocation should be served from preferentially.

    bit       result
    =================
    0x0    => NORMAL
    0x1    => DMA or NORMAL
    0x2    => HIGHMEM or NORMAL
    0x3    => BAD (DMA+HIGHMEM)
    0x4    => DMA32 or DMA or NORMAL
    0x5    => BAD (DMA+DMA32)
    0x6    => BAD (HIGHMEM+DMA32)
    0x7    => BAD (HIGHMEM+DMA32+DMA)
    0x8    => NORMAL (MOVABLE+0)
    0x9    => DMA or NORMAL (MOVABLE+DMA)
    0xa    => MOVABLE (Movable is valid only if HIGHMEM is set too)
    0xb    => BAD (MOVABLE+HIGHMEM+DMA)
    0xc    => DMA32 (MOVABLE+DMA32)
    0xd    => BAD (MOVABLE+DMA32+DMA)
    0xe    => BAD (MOVABLE+DMA32+HIGHMEM)
    0xf    => BAD (MOVABLE+DMA32+HIGHMEM+DMA)
    

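    Four bits encode the combination. At most one of the low three bits (__GFP_DMA, __GFP_HIGHMEM, __GFP_DMA32) may be set, while __GFP_MOVABLE can be combined with any one of them. This gives 16 possible bit patterns, 8 of which are valid. The result for each pattern is stored at a pattern-dependent offset in a table packed into a long.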

    GFP_ZONE_TABLE:
    
    |BAD|BAD|BAD|DMA32|BAD|MOVABLE|......|NORMAL|
    
    

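    Each result is stored in the zone table at an offset derived from the bit pattern above, so the zone to use can be looked up quickly from the flag combination. This shows that __GFP_MOVABLE describes an allocation policy and does not map directly to ZONE_MOVABLE. As the previous section explained, memory is allocated from ZONE_MOVABLE only when __GFP_HIGHMEM and __GFP_MOVABLE are both set.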
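
The lookup can be illustrated with a small userspace model. This is a sketch, not the kernel's implementation: it uses a plain 16-entry array instead of a bit-packed constant, assumes every optional zone is configured, and all `MODEL_`/`MZ_` names are hypothetical. The bit values follow the table above.

```c
#include <assert.h>

/* Stand-ins for the zone-modifier bits (values as in the table above). */
#define MODEL_GFP_DMA      0x1u
#define MODEL_GFP_HIGHMEM  0x2u
#define MODEL_GFP_DMA32    0x4u
#define MODEL_GFP_MOVABLE  0x8u
#define MODEL_GFP_ZONEMASK 0xfu

enum model_zone { MZ_DMA, MZ_DMA32, MZ_NORMAL, MZ_HIGHMEM, MZ_MOVABLE, MZ_BAD };

/* One entry per bit pattern; MZ_BAD marks combinations the table calls BAD. */
static const enum model_zone model_zone_table[16] = {
    [0x0] = MZ_NORMAL,
    [0x1] = MZ_DMA,
    [0x2] = MZ_HIGHMEM,
    [0x3] = MZ_BAD,
    [0x4] = MZ_DMA32,
    [0x5] = MZ_BAD,
    [0x6] = MZ_BAD,
    [0x7] = MZ_BAD,
    [0x8] = MZ_NORMAL,  /* MOVABLE alone does not select ZONE_MOVABLE */
    [0x9] = MZ_DMA,
    [0xa] = MZ_MOVABLE, /* MOVABLE + HIGHMEM */
    [0xb] = MZ_BAD,
    [0xc] = MZ_DMA32,
    [0xd] = MZ_BAD,
    [0xe] = MZ_BAD,
    [0xf] = MZ_BAD,
};

/* Map allocation flags to the preferred zone, or MZ_BAD for invalid input. */
static enum model_zone model_gfp_zone(unsigned int flags)
{
    return model_zone_table[flags & MODEL_GFP_ZONEMASK];
}

/* Variant that aborts on an invalid combination instead of returning MZ_BAD. */
static enum model_zone model_gfp_zone_checked(unsigned int flags)
{
    enum model_zone z = model_gfp_zone(flags);

    assert(z != MZ_BAD);
    return z;
}
```

In this model `model_gfp_zone(MODEL_GFP_MOVABLE)` yields `MZ_NORMAL`, while `model_gfp_zone(MODEL_GFP_HIGHMEM | MODEL_GFP_MOVABLE)` yields `MZ_MOVABLE`, which is the distinction the paragraph above draws.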

    The zone fallback order is MOVABLE=>HIGHMEM=>NORMAL=>DMA32=>DMA
    

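    So an allocation is not necessarily served from the zone that the flags select: if that zone has no suitable memory, the allocator falls back to the remaining zones one by one, in the order above, until it finds memory that meets the request.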
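
The fallback order can be sketched as a walk from the preferred zone down to the lowest one. This is a deliberately simplified model: the real allocator also deals with zonelists across NUMA nodes and with watermarks, none of which is shown here. The `fb_` names and the per-zone free-page array are assumptions made for illustration.

```c
#include <assert.h>

/* Zones in ascending order, so falling back means decreasing the index. */
enum fb_zone { FB_DMA, FB_DMA32, FB_NORMAL, FB_HIGHMEM, FB_MOVABLE, FB_NR_ZONES };

/*
 * Return the first zone, starting at `preferred` and walking
 * MOVABLE => HIGHMEM => NORMAL => DMA32 => DMA, that has at least
 * `wanted` free pages, or -1 if none does.
 */
static int fb_pick_zone(enum fb_zone preferred,
                        const unsigned long free_pages[FB_NR_ZONES],
                        unsigned long wanted)
{
    assert(preferred < FB_NR_ZONES);
    for (int z = (int)preferred; z >= 0; z--) {
        if (free_pages[z] >= wanted)
            return z;
    }
    return -1;
}
```

A request that prefers `FB_MOVABLE` may therefore end up in `FB_NORMAL` or lower, but a request that prefers `FB_NORMAL` never moves up into `FB_MOVABLE`.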

    How to enable ZONE_MOVABLE

    - For all memory hotplug
        Memory model -> Sparse Memory  (CONFIG_SPARSEMEM)
        Allow for memory hot-add       (CONFIG_MEMORY_HOTPLUG)
    
    - To enable memory removal, the following are also necessary
        Allow for memory hot remove    (CONFIG_MEMORY_HOTREMOVE)
        Page Migration                 (CONFIG_MIGRATION)
    
    - For ACPI memory hotplug, the following are also necessary
        Memory hotplug (under ACPI Support menu) (CONFIG_ACPI_HOTPLUG_MEMORY)
        This option can be kernel module.
    
    - As a related configuration, if your box has a feature of NUMA-node hotplug
      via ACPI, then this option is necessary too.
        ACPI0004,PNP0A05 and PNP0A06 Container Driver (under ACPI Support menu)
        (CONFIG_ACPI_CONTAINER).
        This option can be kernel module too.
    
    1) When kernelcore=YYYY boot option is used,
       Size of memory not for movable pages (not for offline) is YYYY.
       Size of memory for movable pages (for offline) is TOTAL-YYYY.
    
    2) When movablecore=ZZZZ boot option is used,
       Size of memory not for movable pages (not for offline) is TOTAL - ZZZZ.
       Size of memory for movable pages (for offline) is ZZZZ.
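
The arithmetic behind the two boot options can be written out as follows. This is only a sketch of the sizes stated above: the kernel additionally distributes the amount across nodes and applies alignment, which is not modelled, and clamping an oversized request to the total is an assumption of this sketch. The `core_split` type and function names are hypothetical.

```c
#include <assert.h>

/* Result of splitting total memory into a kernel part and a movable part. */
struct core_split {
    unsigned long long kernel_bytes;  /* not for movable pages (not for offline) */
    unsigned long long movable_bytes; /* for movable pages (for offline) */
};

/* kernelcore=YYYY: kernel part is YYYY, movable part is TOTAL - YYYY. */
static struct core_split split_by_kernelcore(unsigned long long total,
                                             unsigned long long kernelcore)
{
    struct core_split s;

    if (kernelcore > total) /* assumption: clamp oversized requests */
        kernelcore = total;
    s.kernel_bytes = kernelcore;
    s.movable_bytes = total - kernelcore;
    assert(s.kernel_bytes + s.movable_bytes == total);
    return s;
}

/* movablecore=ZZZZ: movable part is ZZZZ, kernel part is TOTAL - ZZZZ. */
static struct core_split split_by_movablecore(unsigned long long total,
                                              unsigned long long movablecore)
{
    struct core_split s;

    if (movablecore > total) /* assumption: clamp oversized requests */
        movablecore = total;
    s.movable_bytes = movablecore;
    s.kernel_bytes = total - movablecore;
    assert(s.kernel_bytes + s.movable_bytes == total);
    return s;
}
```

For a machine with 8 GiB in total, `split_by_kernelcore(8ULL << 30, 2ULL << 30)` leaves 6 GiB for movable pages, and `split_by_movablecore(8ULL << 30, 3ULL << 30)` leaves 5 GiB for the kernel part.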
    

    The kernel exposes sysfs nodes for controlling memory hotplug:

    % echo online > /sys/devices/system/memory/memoryXXX/state
    

    This brings the memory block online.

    % echo online_movable > /sys/devices/system/memory/memoryXXX/state
    

    This brings the memory block online in ZONE_MOVABLE.

    % echo online_kernel > /sys/devices/system/memory/memoryXXX/state
    

    This brings the memory block online in ZONE_NORMAL.
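
The shell commands above can also be issued from a program. The following is a minimal sketch, assuming the sysfs layout shown in those commands; the `mem_` helper names are hypothetical. Besides the three states used above it accepts `offline`, which the same sysfs file also takes. Writing requires root, and the kernel may reject the request.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if `state` is one of the strings this sketch is willing to write. */
static int mem_state_is_valid(const char *state)
{
    static const char *const valid[] = {
        "online", "online_movable", "online_kernel", "offline",
    };

    for (size_t i = 0; i < sizeof(valid) / sizeof(valid[0]); i++) {
        if (strcmp(state, valid[i]) == 0)
            return 1;
    }
    return 0;
}

/* Build the path of the state file for memory block `block`; 0 on success. */
static int mem_state_path(char *buf, size_t len, unsigned int block)
{
    int n;

    assert(buf != NULL);
    n = snprintf(buf, len, "/sys/devices/system/memory/memory%u/state", block);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write `state` to the block's state file; 0 on success, -1 on any failure. */
static int mem_set_state(unsigned int block, const char *state)
{
    char path[128];
    FILE *f;
    int rc = 0;

    if (!mem_state_is_valid(state) ||
        mem_state_path(path, sizeof(path), block) != 0)
        return -1;

    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    if (fputs(state, f) == EOF)
        rc = -1;
    if (fclose(f) == EOF) /* the kernel's verdict may only surface on flush */
        rc = -1;
    return rc;
}
```

For example, `mem_set_state(32, "online_movable")` is the counterpart of the second command above with `memoryXXX` replaced by `memory32`.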

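    How the size of ZONE_MOVABLE is determined

    Let us first look at what happens when the memory zones are initialized.
    On a NUMA-enabled system the handling is as follows: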

    zone_sizes_init->free_area_init_nodes->find_zone_movable_pfns_for_nodes:
    /*
     * If movable_node is specified, ignore kernelcore and movablecore
     * options.
     */
    if (movable_node_is_enabled()) {
        for_each_memblock(memory, r) {
            if (!memblock_is_hotpluggable(r))
                continue;
    
            nid = r->nid;
    
            usable_startpfn = PFN_DOWN(r->base);
            zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
                min(usable_startpfn, zone_movable_pfn[nid]) :
                usable_startpfn;
        }
    
        goto out2;
    }
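
The effect of this loop is that, on each node, ZONE_MOVABLE starts at the lowest page frame number of any hot-pluggable region. The following userspace model reproduces just that computation. It is a sketch: the `model_` types stand in for memblock regions, 4 KiB pages are assumed, and, as in the excerpt, a value of 0 means that no start has been recorded for the node yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_NODES  4
#define MODEL_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Simplified stand-in for one memblock memory region. */
struct model_region {
    unsigned long long base; /* physical start address */
    unsigned long long size; /* length in bytes (unused by this computation) */
    int nid;                 /* NUMA node id */
    bool hotpluggable;       /* models the MEMBLOCK_HOTPLUG flag */
};

/*
 * For every node, record the lowest start PFN among its hot-pluggable
 * regions. The caller must zero movable_pfn[] first.
 */
static void model_find_movable_start(const struct model_region *regions,
                                     size_t count,
                                     unsigned long movable_pfn[MODEL_MAX_NODES])
{
    for (size_t i = 0; i < count; i++) {
        const struct model_region *r = &regions[i];
        unsigned long start_pfn;

        if (!r->hotpluggable)
            continue;

        assert(r->nid >= 0 && r->nid < MODEL_MAX_NODES);
        start_pfn = (unsigned long)(r->base >> MODEL_PAGE_SHIFT); /* PFN_DOWN */
        if (movable_pfn[r->nid] == 0 || start_pfn < movable_pfn[r->nid])
            movable_pfn[r->nid] = start_pfn;
    }
}
```

A node without any hot-pluggable region keeps the value 0 and therefore gets no ZONE_MOVABLE from this path.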
    
    

    When the corresponding property is set in the device tree (dts), the matching memblock flag is set:

    int __init early_init_dt_scan_memory(unsigned long node, const char *uname,
                         int depth, void *data)
    {
        /* Excerpt: the declarations and lookup of reg/endp are omitted here. */
        bool hotpluggable;

        /* A non-NULL "hotpluggable" property marks the node as hot-pluggable. */
        hotpluggable = of_get_flat_dt_prop(node, "hotpluggable", NULL);

        while ((endp - reg) >= (dt_root_addr_cells + dt_root_size_cells)) {
            u64 base, size;

            base = dt_mem_next_cell(dt_root_addr_cells, &reg);
            size = dt_mem_next_cell(dt_root_size_cells, &reg);

            if (size == 0)
                continue;
            pr_debug(" - %llx ,  %llx\n", (unsigned long long)base,
                     (unsigned long long)size);

            early_init_dt_add_memory_arch(base, size);

            if (!hotpluggable)
                continue;

            /* Ends up setting MEMBLOCK_HOTPLUG on the range, see below. */
            if (early_init_dt_mark_hotplug_memory_arch(base, size))
                pr_warn("failed to mark hotplug range 0x%llx - 0x%llx\n",
                        base, base + size);
        }

        return 0;
    }
    
    int __init __weak early_init_dt_mark_hotplug_memory_arch(u64 base, u64 size)
    {
        return memblock_mark_hotplug(base, size);
    }
    
    int __init_memblock memblock_mark_hotplug(phys_addr_t base, phys_addr_t size)
    {
        return memblock_setclr_flag(base, size, 1, MEMBLOCK_HOTPLUG);
    }  

    from: https://blog.csdn.net/rikeyone/article/details/86498298
  • Original article: https://www.cnblogs.com/aspirs/p/12781693.html