  • Learning cpusets

    1. cpusets

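    1.1 What are cpusets

    The basic job of cpusets is to confine a group of processes to a given set of CPUs and memory nodes. A simple example: a system has 4 processes, 4 memory nodes, and 4 CPUs. With cpusets, processes 1 and 2 can be made to run only on CPUs 1 and 2 and to allocate memory only from memory nodes 1 and 2. cpusets are implemented on top of the cgroup subsystem (for cgroups, see the kernel document Documentation/cgroups/cgroups.txt). This lets a system administrator dynamically adjust which CPUs and memory nodes a process runs on.

    cpusets are one of the subsystems exposed through the cgroup filesystem.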

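    1.2 Why cpusets are needed

    Large computer systems have many CPUs and several memory nodes. On NUMA architectures in particular, a CPU reaches different memory nodes at different speeds, which makes process scheduling and choosing the target node for memory allocation harder to manage. Fairly recent small systems perform well with the scheduler and memory management built into the Linux kernel, but on larger systems, carefully placing each application on specific CPUs and memory nodes can improve performance considerably.

    (Figure: NUMA architecture)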

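    cpusets are most valuable in the following scenarios:

    1. Large web servers running many instances of the same application
    2. Large servers running a mix of applications (for example, web server applications alongside a database)
    3. Large NUMA systems

    cpusets must allow dynamic adjustment without affecting processes running in other, unrelated cpusets. For example, a CPU can be added to or removed from a cpuset at run time, a process can be added to a cpuset, and a process can be migrated from one cpuset to another. The kernel cpuset patch provides the minimal mechanism needed for this. It reuses the existing CPU and memory node placement mechanisms as much as possible and avoids touching the scheduler and the core memory allocation code wherever it can.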

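    1.3 How cpusets are implemented

    cpusets form a hierarchical tree. The root node contains all CPUs and memory nodes in the system. One or more child nodes can branch off the root, children can have their own children, and so on. The resources of every child are a subset of its parent's resources.

    To make the hierarchy concrete: cpuset is implemented as a cgroup subsystem, so it follows the cgroup hierarchy model. For example, given a cpuset /, any number of cpusets can be created under it, such as two children named background and foreground. background and foreground are siblings, and cpuset / is their parent. The CPUs and memory nodes of a child cpuset must be a subset of those of its parent.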
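
    The subset rule can be sketched as a bitmask test. This is a minimal user-space model, not kernel code: mask_t, mask_is_subset() and child_cpuset_valid() are hypothetical names, and one 64-bit word stands in for the kernel's cpumask and nodemask types.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model type: one bit per CPU (or memory node), at most 64. */
typedef uint64_t mask_t;

/* True if every bit set in child is also set in parent. */
static bool mask_is_subset(mask_t child, mask_t parent)
{
    return (child & ~parent) == 0;
}

/* A child cpuset is acceptable only if both its CPUs and its memory
 * nodes are subsets of the parent's. */
static bool child_cpuset_valid(mask_t child_cpus, mask_t child_mems,
                               mask_t parent_cpus, mask_t parent_mems)
{
    return mask_is_subset(child_cpus, parent_cpus) &&
           mask_is_subset(child_mems, parent_mems);
}
```

    For instance, with a parent that owns CPUs 0-7 (0xFF) and node 0 (0x1), a child asking for CPUs 0-1 and node 0 passes the check, while a child asking for node 1 does not.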

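    All cpuset operations go through the cpuset filesystem; the kernel provides no additional system calls for modifying or querying cpusets. On Android, the cpuset filesystem is mounted at: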

    /dev/cpuset

    The following lines in /proc/<pid>/status also show which CPUs a process may run on and which memory nodes it must allocate from:

    Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
    Cpus_allowed_list:      0-127
    Mems_allowed:   ffffffff,ffffffff
    Mems_allowed_list:      0-63

    For example, on an Android platform with only 8 CPU cores and 1 memory node:

    Cpus_allowed:    ff
    Cpus_allowed_list:    0-7
    Mems_allowed:    1
    Mems_allowed_list:    0
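
    The hex mask and the list form carry the same information. The sketch below converts one into the other; hexmask_to_list() and MAX_BITS are hypothetical names, and the input is assumed to be the bare mask value (comma-separated hex words, no trailing newline).

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_BITS 1024   /* arbitrary limit for this sketch */

/* Convert a /proc-style hex mask ("ff", "ffffffff,ffffffff") into a
 * range list ("0-7", "0-63"). Returns 0 on success, -1 on bad input
 * or if the output buffer is too small. */
static int hexmask_to_list(const char *mask, char *out, size_t outlen)
{
    unsigned char bits[MAX_BITS] = {0};
    size_t len = strlen(mask);
    int nbits = 0;

    /* The rightmost hex digit holds bits 0-3, so walk backwards. */
    for (size_t i = len; i > 0; i--) {
        char c = mask[i - 1];
        if (c == ',')
            continue;               /* commas only group 32-bit words */
        if (!isxdigit((unsigned char)c) || nbits + 4 > MAX_BITS)
            return -1;
        int v = isdigit((unsigned char)c)
                    ? c - '0'
                    : tolower((unsigned char)c) - 'a' + 10;
        for (int b = 0; b < 4; b++)
            bits[nbits + b] = (unsigned char)((v >> b) & 1);
        nbits += 4;
    }

    if (outlen == 0)
        return -1;
    out[0] = '\0';

    size_t pos = 0;
    for (int i = 0; i < nbits; i++) {
        if (!bits[i])
            continue;
        int start = i;
        while (i + 1 < nbits && bits[i + 1])
            i++;                    /* extend the run of set bits */
        int n;
        if (start == i)
            n = snprintf(out + pos, outlen - pos, "%s%d",
                         pos ? "," : "", start);
        else
            n = snprintf(out + pos, outlen - pos, "%s%d-%d",
                         pos ? "," : "", start, i);
        if (n < 0 || (size_t)n >= outlen - pos)
            return -1;
        pos += (size_t)n;
    }
    return 0;
}
```

    Passing "ff" yields the list 0-7, matching the Android example above, and "ffffffff,ffffffff" yields 0-63.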

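    Each cpuset is represented by a directory in the cgroup filesystem, and the files in that directory describe the cpuset's attributes. The files are:

    1.cpuset.cpus   //list of CPUs in the cpuset
    2.cpuset.mems   //list of memory nodes in the cpuset
    3.cpuset.memory_migrate  //memory migration for the cpuset, see 1.9
    4.cpuset.cpu_exclusive   //whether the cpuset is CPU exclusive, see 1.4
    5.cpuset.mem_exclusive   //whether the cpuset is memory exclusive, see 1.4
    6.cpuset.mem_hardwall    //whether the cpuset is hardwalled, see 1.4
    7.cpuset.memory_pressure    //how much memory pressure the cpuset is under, see 1.5
    8.cpuset.memory_spread_page   //if set, page cache pages allocated in the context of processes in this cpuset are spread evenly over the cpuset's nodes, see 1.6
    9.cpuset.memory_spread_slab   //if set, slab objects allocated in the context of processes in this cpuset are spread evenly over the cpuset's memory nodes, see 1.6
    10.cpuset.sched_load_balance  //if set, load balancing is performed across the CPUs configured in the cpuset, see 1.7
    11.cpuset.sched_relax_domain_level  //the search range used when migrating a task, see 1.8
    
    The following file exists only in the root cpuset:
    12.cpuset.memory_pressure_enabled  //flag that enables memory pressure measurement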
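
    Since every attribute is a plain file, configuring a cpuset comes down to ordinary file writes. The helper below is a minimal sketch: cpuset_write_attr() is a hypothetical name, and the directory passed in is assumed to already exist (created with mkdir under the mount point) and to be writable by the caller.

```c
#include <stdio.h>

/* Write `value` to <dir>/<attr>. Returns 0 on success, -1 on failure. */
static int cpuset_write_attr(const char *dir, const char *attr,
                             const char *value)
{
    char path[512];
    int n = snprintf(path, sizeof(path), "%s/%s", dir, attr);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;                  /* path would be truncated */

    FILE *f = fopen(path, "w");
    if (!f)
        return -1;

    int ok = fputs(value, f) >= 0;
    /* Buffered data is flushed at close time, so a write error can
     * show up here as well. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

    A call such as cpuset_write_attr("/dev/cpuset/foreground", "cpuset.cpus", "0-3") would then request CPUs 0-3 for a cpuset named foreground. The exact file names depend on how the hierarchy was mounted, so treat the path and attribute name as placeholders.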

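    Before cpusets existed, there were already mechanisms to restrict which CPUs a process may be scheduled on (sched_setaffinity) and which memory nodes its allocations may come from (mbind, set_mempolicy).

    cpusets extend these mechanisms as follows:

    1. cpusets are sets of CPUs and memory nodes that are known to the kernel.
    2. Each task_struct holds a pointer to a cgroup data structure (cpuset is a cgroup subsystem); this pointer is what attaches a process to a particular cpuset.
    3. Calls to sched_setaffinity are filtered to the CPUs in the task's cpuset, and calls to mbind/set_mempolicy are filtered to its memory nodes.
    4. The root cpuset contains all CPUs and memory nodes.
    5. Any cpuset can have child cpusets, whose CPUs and memory nodes are a subset of the parent's.
    6. The cpuset hierarchy can be mounted at /dev/cpuset, where user space can browse and manipulate it.
    7. If a cpuset is marked exclusive, none of its sibling cpusets may contain CPUs or memory nodes that overlap with its own.
    8. The PIDs of all tasks attached to any cpuset can be listed.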

    Implementing cpusets requires a few hooks in the existing kernel. None of them sit on performance-critical hot paths:

      1.3.1 In init/main.c, the root cpuset is initialized at boot:

      start_kernel()--> cpuset_init()

      start_kernel()--> rest_init()--> kernel_thread(kernel_init, NULL, CLONE_FS) --> kernel_init() --> kernel_init_freeable() --> do_basic_setup() --> cpuset_init_smp()

    struct cpuset {
        struct cgroup_subsys_state css;
    
        unsigned long flags;        /* "unsigned long" so bitops work */
    
        /*
         * On default hierarchy:
         *
         * The user-configured masks can only be changed by writing to
         * cpuset.cpus and cpuset.mems, and won't be limited by the
         * parent masks.
         *
         * The effective masks is the real masks that apply to the tasks
         * in the cpuset. They may be changed if the configured masks are
         * changed or hotplug happens.
         *
         * effective_mask == configured_mask & parent's effective_mask,
         * and if it ends up empty, it will inherit the parent's mask.
         *
         *
         * On legacy hierachy:
         *
         * The user-configured masks are always the same with effective masks.
         */
    
        /* user-configured CPUs and Memory Nodes allow to tasks */
        cpumask_var_t cpus_allowed;
        cpumask_var_t cpus_requested;
        nodemask_t mems_allowed;
    
        /* effective CPUs and Memory Nodes allow to tasks */
        cpumask_var_t effective_cpus;
        nodemask_t effective_mems;
    
        /*
         * This is old Memory Nodes tasks took on.
         *
         * - top_cpuset.old_mems_allowed is initialized to mems_allowed.
         * - A new cpuset's old_mems_allowed is initialized when some
         *   task is moved into it.
         * - old_mems_allowed is used in cpuset_migrate_mm() when we change
         *   cpuset.mems_allowed and have tasks' nodemask updated, and
         *   then old_mems_allowed is updated to mems_allowed.
         */
        nodemask_t old_mems_allowed;
    
        struct fmeter fmeter;        /* memory_pressure filter */
    
        /*
         * Tasks are being attached to this cpuset.  Used to prevent
         * zeroing cpus/mems_allowed between ->can_attach() and ->attach().
         */
        int attach_in_progress;
    
        /* partition number for rebuild_sched_domains() */
        int pn;
    
        /* for custom sched domain */
        int relax_domain_level;
    };
    int __init cpuset_init(void)
    {
        int err = 0;
    
        BUG_ON(!alloc_cpumask_var(&top_cpuset.cpus_allowed, GFP_KERNEL));
        BUG_ON(!alloc_cpumask_var(&top_cpuset.effective_cpus, GFP_KERNEL));
        BUG_ON(!alloc_cpumask_var(&top_cpuset.cpus_requested, GFP_KERNEL));
    
        cpumask_setall(top_cpuset.cpus_allowed);
        cpumask_setall(top_cpuset.cpus_requested);
        nodes_setall(top_cpuset.mems_allowed);
        cpumask_setall(top_cpuset.effective_cpus);
        nodes_setall(top_cpuset.effective_mems);
    
        fmeter_init(&top_cpuset.fmeter);
        set_bit(CS_SCHED_LOAD_BALANCE, &top_cpuset.flags);
        top_cpuset.relax_domain_level = -1;            //everything above initializes the fields of the root cpuset
    
        err = register_filesystem(&cpuset_fs_type);       //register the cpuset filesystem
        if (err < 0)
            return err;
    
        BUG_ON(!alloc_cpumask_var(&cpus_attach, GFP_KERNEL));
    
        return 0;
    }
    /**
     * cpuset_init_smp - initialize cpus_allowed
     *
     * Description: Finish top cpuset after cpu, node maps are initialized
     */
    void __init cpuset_init_smp(void)
    {
        cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
        top_cpuset.mems_allowed = node_states[N_MEMORY];                //node_states[N_MEMORY] holds the nodes that currently have memory; it changes on memory hotplug
        top_cpuset.old_mems_allowed = top_cpuset.mems_allowed;
    
        cpumask_copy(top_cpuset.effective_cpus, cpu_active_mask);
        top_cpuset.effective_mems = node_states[N_MEMORY];               //once the CPU and memory maps are initialized, finish setting up the remaining root cpuset fields
    
        register_hotmemory_notifier(&cpuset_track_online_nodes_nb);          //register a notifier that runs when the set of memory nodes changes (memory hotplug)
    
        cpuset_migrate_mm_wq = alloc_ordered_workqueue("cpuset_migrate_mm", 0);  //create an ordered workqueue used for mm migration
        BUG_ON(!cpuset_migrate_mm_wq);
    }
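
    The rule stated in the struct cpuset comment for the default hierarchy (effective_mask == configured_mask & parent's effective_mask, inheriting the parent's mask when the result is empty) can be written as a one-line model. mask_t and effective_mask() are hypothetical names for this sketch; the kernel works on cpumask and nodemask types instead of a single word.

```c
#include <stdint.h>

typedef uint64_t mask_t;   /* hypothetical: one bit per CPU or node */

/* Default-hierarchy rule: effective = configured & parent's effective;
 * an empty result inherits the parent's effective mask. */
static mask_t effective_mask(mask_t configured, mask_t parent_effective)
{
    mask_t eff = configured & parent_effective;
    return eff ? eff : parent_effective;
}
```

    With configured = 0x0F and a parent effective mask of 0x3C the result is 0x0C; with configured = 0x03 and a parent of 0xF0 the intersection is empty, so the child inherits 0xF0.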

      1.3.2 On fork/exit, a process is attached to / detached from its cpuset:

      fork call path 1: _do_fork() -> copy_process() -> cgroup_fork()

    /**
     * cgroup_fork - initialize cgroup related fields during copy_process()
     * @child: pointer to task_struct of forking parent process.
     *
     * A task is associated with the init_css_set until cgroup_post_fork()
     * attaches it to the parent's css_set.  Empty cg_list indicates that
     * @child isn't holding reference to its css_set.
     */
    void cgroup_fork(struct task_struct *child)
    {
        RCU_INIT_POINTER(child->cgroups, &init_css_set);    //initialize the child's css_set (its set of cgroup subsystem states)
        INIT_LIST_HEAD(&child->cg_list);
    }

      fork call path 2: _do_fork() -> copy_process() -> cgroup_post_fork() -> css_set_move_task() -> cgroup_move_task() -> rcu_assign_pointer(task->cgroups, to);

            cgroup_move_task() moves the child into the parent's cgroup by assigning the css_set pointer:

            The to pointer is the parent's css_set, and task->cgroups is the child's css_set pointer. After the assignment, the child is attached to the parent's css_set.

      fork call path 3: _do_fork() -> copy_process() -> cgroup_post_fork() -> the fork callback of each cgroup subsystem: ss->fork(child) -> cpuset_fork()

    /*
     * Make sure the new task conform to the current state of its parent,
     * which could have been changed by cpuset just after it inherits the
     * state from the parent and before it sits on the cgroup's task list.
     */
    static void cpuset_fork(struct task_struct *task)  
    {
        if (task_css_is_root(task, cpuset_cgrp_id))
            return;
    
        set_cpus_allowed_ptr(task, &current->cpus_allowed);
        task->mems_allowed = current->mems_allowed;    //inherit mems_allowed from the parent (current)
    }

             Next: cpuset_fork() -> set_cpus_allowed_ptr() -> __set_cpus_allowed_ptr() (shown below) -> do_set_cpus_allowed() -> p->sched_class->set_cpus_allowed(p, new_mask);

    /*
     * Change a given task's CPU affinity. Migrate the thread to a
     * proper CPU and schedule it away if the CPU it's executing on
     * is removed from the allowed bitmask.
     *
     * NOTE: the caller must have a valid reference to the task, the
     * task must not exit() & deallocate itself prematurely. The
     * call is not atomic; no spinlocks may be held.
     */
    static int __set_cpus_allowed_ptr(struct task_struct *p,
                      const struct cpumask *new_mask, bool check)
    {
        const struct cpumask *cpu_valid_mask = cpu_active_mask;
        unsigned int dest_cpu;
        struct rq_flags rf;
        struct rq *rq;
        int ret = 0;
        cpumask_t allowed_mask;
    
        rq = task_rq_lock(p, &rf);
        update_rq_clock(rq);
    
        if (p->flags & PF_KTHREAD) {        //kernel threads may run on any online CPU, not only on active ones
            /*
             * Kernel threads are allowed on online && !active CPUs
             */
            cpu_valid_mask = cpu_online_mask;
        }
    
        /*
         * Must re-check here, to close a race against __kthread_bind(),
         * sched_setaffinity() is not guaranteed to observe the flag.
         */
        if (check && (p->flags & PF_NO_SETAFFINITY)) {  //this thread is not allowed to change its CPU affinity
            ret = -EINVAL;
            goto out;
        }
    
        if (cpumask_equal(&p->cpus_allowed, new_mask))  //the requested mask equals the current one, so there is nothing to do
            goto out;
    
        cpumask_andnot(&allowed_mask, new_mask, cpu_isolated_mask);  //drop isolated CPUs from new_mask (on the fork path, new_mask is the parent's cpus_allowed)
        cpumask_and(&allowed_mask, &allowed_mask, cpu_valid_mask);   //keep only the CPUs that tasks can be migrated to
    
        dest_cpu = cpumask_any(&allowed_mask);            //pick a dest_cpu from what is left
        if (dest_cpu >= nr_cpu_ids) {                  //no CPU was found (allowed_mask is empty), so retry without excluding isolated CPUs
            cpumask_and(&allowed_mask, cpu_valid_mask, new_mask);
            dest_cpu = cpumask_any(&allowed_mask);
            if (!cpumask_intersects(new_mask, cpu_valid_mask)) {
                ret = -EINVAL;
                goto out;
            }
        }
    
        do_set_cpus_allowed(p, new_mask);      //see below
    
        if (p->flags & PF_KTHREAD) {
            /*
             * For kernel threads that do indeed end up on online &&
             * !active we want to ensure they are strict per-CPU threads.
             */
            WARN_ON(cpumask_intersects(new_mask, cpu_online_mask) &&
                !cpumask_intersects(new_mask, cpu_active_mask) &&
                p->nr_cpus_allowed != 1);
        }
    
        /* Can the task run on the task's current CPU? If so, we're done */
        if (cpumask_test_cpu(task_cpu(p), &allowed_mask))      //the task's current CPU is still allowed, so no migration is needed
            goto out;
    
        if (task_running(rq, p) || p->state == TASK_WAKING) {  //a running or waking task is migrated differently from a queued one
            struct migration_arg arg = { p, dest_cpu };
            /* Need help from migration thread: drop lock and wait. */
            task_rq_unlock(rq, p, &rf);
            stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);    //run migration_cpu_stop(&arg) in the stopper thread of the task's CPU and wait for it to finish
            tlb_migrate_finish(p->mm);
            return 0;
        } else if (task_on_rq_queued(p)) {            //the task is queued on the runqueue but not running
            /*
             * OK, since we're going to drop the lock immediately
             * afterwards anyway.
             */
            rq = move_queued_task(rq, &rf, p, dest_cpu);  //move the task to dest_cpu
        }
    out:
        task_rq_unlock(rq, p, &rf);
    
        return ret;
    }
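
    The dest_cpu selection in __set_cpus_allowed_ptr() boils down to two attempts: first among CPUs that are requested, valid and not isolated, then among CPUs that are merely requested and valid. The following user-space sketch models that with 64-bit masks. mask_t, mask_first() and pick_dest_cpu() are hypothetical names, and -1 stands in for the "dest_cpu >= nr_cpu_ids" result of cpumask_any().

```c
#include <stdint.h>

typedef uint64_t mask_t;   /* hypothetical: one bit per CPU */

/* Index of the lowest set bit, or -1 if the mask is empty. */
static int mask_first(mask_t m)
{
    for (int i = 0; i < 64; i++)
        if (m & ((mask_t)1 << i))
            return i;
    return -1;
}

/* Model of the two-step choice: prefer non-isolated CPUs, fall back to
 * any valid CPU in new_mask, and report -1 if none exists (the kernel
 * returns -EINVAL in that case). */
static int pick_dest_cpu(mask_t new_mask, mask_t isolated, mask_t valid)
{
    int cpu = mask_first(new_mask & ~isolated & valid);
    if (cpu >= 0)
        return cpu;
    return mask_first(new_mask & valid);
}
```

    For example, with new_mask = 0x0F, isolated = 0x03 and valid = 0xFF the sketch picks CPU 2. If every requested CPU is isolated (new_mask = 0x03), it falls back to CPU 0.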

             This ends up in the set_cpus_allowed callback of the task's sched class (every class except the deadline class uses set_cpus_allowed_common()):

    void set_cpus_allowed_common(struct task_struct *p, const struct cpumask *new_mask)
    {
        cpumask_copy(&p->cpus_allowed, new_mask);    //on the fork path, this copies the parent's cpus_allowed
        p->nr_cpus_allowed = cpumask_weight(new_mask);  //number of CPUs the task is allowed to run on
    }

            

    /*
     * migration_cpu_stop - this will be executed by a highprio stopper thread
     * and performs thread migration by bumping thread off CPU then
     * 'pushing' onto another runqueue.
     */
    static int migration_cpu_stop(void *data)
    {
        struct migration_arg *arg = data;
        struct task_struct *p = arg->task;
        struct rq *rq = this_rq();
        struct rq_flags rf;
    
        /*
         * The original target CPU might have gone down and we might
         * be on another CPU but it doesn't matter.
         */
        local_irq_disable();
        /*
         * We need to explicitly wake pending tasks before running
         * __migrate_task() such that we will not miss enforcing cpus_allowed
         * during wakeups, see set_cpus_allowed_ptr()'s TASK_WAKING test.
         */
        sched_ttwu_pending();
    
        raw_spin_lock(&p->pi_lock);
        rq_lock(rq, &rf);
        /*
         * If task_rq(p) != rq, it cannot be migrated here, because we're
         * holding rq->lock, if p->on_rq == 0 it cannot get enqueued because
         * we're holding p->pi_lock.
         */
        if (task_rq(p) == rq) {
            if (task_on_rq_queued(p))
                rq = __migrate_task(rq, &rf, p, arg->dest_cpu);  //migrate to dest_cpu
            else
                p->wake_cpu = arg->dest_cpu;
        }
        rq_unlock(rq, &rf);
        raw_spin_unlock(&p->pi_lock);
    
        local_irq_enable();
        return 0;
    }

             Then, in _do_fork() -> wake_up_new_task() -> select_task_rq(), a CPU that satisfies cpus_allowed is chosen to run the child:

    /*
     * The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
     */
    static inline
    int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags,
               int sibling_count_hint)
    {
        bool allow_isolated = (p->flags & PF_KTHREAD);
    
        lockdep_assert_held(&p->pi_lock);
    
        if (p->nr_cpus_allowed > 1)
            cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags,  //ask the task's sched class to pick the CPU that runs the child
                                 sibling_count_hint);
        else
            cpu = cpumask_any(&p->cpus_allowed);
    
        /*
         * In order not to call set_task_cpu() on a blocking task we need
         * to rely on ttwu() to place the task on a valid ->cpus_allowed
         * CPU.
         *
         * Since this is common to all placement strategies, this lives here.
         *
         * [ this allows ->select_task() to simply return task_cpu(p) and
         *   not worry about this generic constraint ]
         */
        if (unlikely(!is_cpu_allowed(p, cpu)) ||
                (cpu_isolated(cpu) && !allow_isolated))
            cpu = select_fallback_rq(task_cpu(p), p, allow_isolated);    //the chosen CPU is not allowed, so fall back to one that satisfies cpus_allowed
    
        return cpu;
    }

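      1.3.3 When CPU affinity is set (sched_setaffinity), the request is filtered through the cpuset configuration:

      Call path when setting CPU affinity with the sched_setaffinity system call: SYSCALL_DEFINE3(sched_setaffinity, ...) -> get_user_cpu_mask()

                              SYSCALL_DEFINE3(sched_setaffinity, ...) -> sched_setaffinity()

      get_user_cpu_mask() first copies the requested CPU mask from user space;

      sched_setaffinity() then filters it in detail, sets the resulting CPU mask on the task, and picks a suitable CPU to run the task.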

    long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
    {
        cpumask_var_t cpus_allowed, new_mask;
        struct task_struct *p;
        int retval;
        int dest_cpu;
        cpumask_t allowed_mask;
    
        rcu_read_lock();
    
        p = find_process_by_pid(pid);
        if (!p) {
            rcu_read_unlock();
            return -ESRCH;
        }
    
        /* Prevent p going away */
        get_task_struct(p);
        rcu_read_unlock();
    
        if (p->flags & PF_NO_SETAFFINITY) {
            retval = -EINVAL;
            goto out_put_task;
        }
        if (!alloc_cpumask_var(&cpus_allowed, GFP_KERNEL)) {
            retval = -ENOMEM;
            goto out_put_task;
        }
        if (!alloc_cpumask_var(&new_mask, GFP_KERNEL)) {
            retval = -ENOMEM;
            goto out_free_cpus_allowed;
        }
        retval = -EPERM;
        if (!check_same_owner(p)) {
            rcu_read_lock();
            if (!ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE)) {
                rcu_read_unlock();
                goto out_free_new_mask;
            }
            rcu_read_unlock();
        }
    
        retval = security_task_setscheduler(p);
        if (retval)
            goto out_free_new_mask;
    
    
        cpuset_cpus_allowed(p, cpus_allowed);        //get the CPUs allowed by the task's cpuset
        cpumask_and(new_mask, in_mask, cpus_allowed);   //intersect the mask from user space with the task's cpuset; the result is new_mask
    
        /*
         * Since bandwidth control happens on root_domain basis,
         * if admission test is enabled, we only admit -deadline
         * tasks allowed to run on all the CPUs in the task's
         * root_domain.
         */
    #ifdef CONFIG_SMP
        if (task_has_dl_policy(p) && dl_bandwidth_enabled()) {
            rcu_read_lock();
            if (!cpumask_subset(task_rq(p)->rd->span, new_mask)) {
                retval = -EBUSY;
                rcu_read_unlock();
                goto out_free_new_mask;
            }
            rcu_read_unlock();
        }
    #endif
    again:
        cpumask_andnot(&allowed_mask, new_mask, cpu_isolated_mask);  //drop isolated CPUs from new_mask; the result is allowed_mask
        dest_cpu = cpumask_any_and(cpu_active_mask, &allowed_mask);  //pick a dest_cpu that is both active (tasks can be migrated to it) and in allowed_mask
        if (dest_cpu < nr_cpu_ids) {                    //a valid dest_cpu was found
            retval = __set_cpus_allowed_ptr(p, new_mask, true);    //as on the fork path: make new_mask the task's cpus_allowed and move the task to an allowed CPU
            if (!retval) {
                cpuset_cpus_allowed(p, cpus_allowed);          //re-read the CPUs allowed by the task's cpuset
                if (!cpumask_subset(new_mask, cpus_allowed)) {     //new_mask is no longer a subset, so the cpuset was changed in the meantime and the task's cpus_allowed must be updated
                    /*
                     * We must have raced with a concurrent cpuset
                     * update. Just reset the cpus_allowed to the
                     * cpuset's cpus_allowed
                     */
                    cpumask_copy(new_mask, cpus_allowed);        //new_mask = cpus_allowed
                    goto again;                        //repeat the steps above
                }
            }
        } else {
            retval = -EINVAL;
        }
    
        if (!retval && !(p->flags & PF_KTHREAD))              //PF_KTHREAD marks kernel threads
            cpumask_and(&p->cpus_requested, in_mask, cpu_possible_mask);  //intersect the user-space mask with cpu_possible_mask (every CPU that can ever be present, since the platform supports hotplug)
                                               //and store the result in task->cpus_requested
    out_free_new_mask:
        free_cpumask_var(new_mask);
    out_free_cpus_allowed:
        free_cpumask_var(cpus_allowed);
    out_put_task:
        put_task_struct(p);
        return retval;
    }

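      1.3.4 Task migration honors the cpuset configuration:

       Task migration essentially takes a task off one CPU's runqueue and puts it on another CPU's runqueue. In the fork path described above, the child may be migrated right after it is created, subject to the cpuset configuration. Besides fork, the more general migration path through __migrate_task() applies the same cpuset filtering.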

    /*
     * Move (not current) task off this CPU, onto the destination CPU. We're doing
     * this because either it can't run here any more (set_cpus_allowed()
     * away from this CPU, or CPU going down), or because we're
     * attempting to rebalance this task on exec (sched_exec).
     *
     * So we race with normal scheduler movements, but that's OK, as long
     * as the task is no longer on this CPU.
     */
    static struct rq *__migrate_task(struct rq *rq, struct rq_flags *rf,
                     struct task_struct *p, int dest_cpu)
    {
        /* Affinity changed (again). */
        if (!is_cpu_allowed(p, dest_cpu))
            return rq;
    
        update_rq_clock(rq);
        rq = move_queued_task(rq, rf, p, dest_cpu);  //move the task to dest_cpu
    
        return rq;
    }

      It calls is_cpu_allowed() to check whether dest_cpu is in the task's cpus_allowed, followed by a further check that depends on whether the task is a per-CPU kernel thread:

    /*
     * Per-CPU kthreads are allowed to run on !active && online CPUs, see
     * __set_cpus_allowed_ptr() and select_fallback_rq().
     */
    static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
    {
        if (!cpumask_test_cpu(cpu, &p->cpus_allowed))  //dest_cpu is not in cpus_allowed: return false, the task cannot be migrated there
            return false;
    
        if (is_per_cpu_kthread(p))     //per-CPU kernel threads may run on any online CPU
            return cpu_online(cpu);    //is the CPU online (available to the scheduler)?
    
        return cpu_active(cpu);      //is the CPU active (available as a migration target)?
    }

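      1.3.5 The mbind and set_mempolicy system calls filter the requested memory nodes through the cpuset:

      mbind call path: kernel_mbind() -> do_mbind() -> mpol_set_nodemask()

      set_mempolicy call path: kernel_set_mempolicy() -> do_set_mempolicy() -> mpol_set_nodemask()

      mpol_set_nodemask() sets up the memory node mask after a new memory policy has been created: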

    /*
     * mpol_set_nodemask is called after mpol_new() to set up the nodemask, if
     * any, for the new policy.  mpol_new() has already validated the nodes
     * parameter with respect to the policy mode and flags.  But, we need to
     * handle an empty nodemask with MPOL_PREFERRED here.
     *
     * Must be called holding task's alloc_lock to protect task's mems_allowed
     * and mempolicy.  May also be called holding the mmap_semaphore for write.
     */
    static int mpol_set_nodemask(struct mempolicy *pol,
                 const nodemask_t *nodes, struct nodemask_scratch *nsc)
    {
        int ret;
    
        /* if mode is MPOL_DEFAULT, pol is NULL. This is right. */
        if (pol == NULL)
            return 0;
        /* Check N_MEMORY */
        nodes_and(nsc->mask1,
              cpuset_current_mems_allowed, node_states[N_MEMORY]);  //intersect the nodes allowed for the current task with the nodes that have memory; the result goes to nsc->mask1
    
        VM_BUG_ON(!nodes);
        if (pol->mode == MPOL_PREFERRED && nodes_empty(*nodes))
            nodes = NULL;    /* explicit local allocation */
        else {
            if (pol->flags & MPOL_F_RELATIVE_NODES)                //depending on the flag, derive nsc->mask2 from nsc->mask1 in one of two ways
                mpol_relative_nodemask(&nsc->mask2, nodes, &nsc->mask1);
            else
                nodes_and(nsc->mask2, *nodes, nsc->mask1);
    
            if (mpol_store_user_nodemask(pol))            //depending on the policy flags, store one of two node masks
                pol->w.user_nodemask = *nodes;            //store the nodes passed in from user space
            else
                pol->w.cpuset_mems_allowed =
                            cpuset_current_mems_allowed;      //store the nodes allowed for the task
        }
    
        if (nodes)
            ret = mpol_ops[pol->mode].create(pol, &nsc->mask2);  //call the create() method of the policy mode to set up the policy's nodes from mask2
        else
            ret = mpol_ops[pol->mode].create(pol, NULL);
        return ret;
    }
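
    The two ways of deriving mask2 differ in how the user-supplied nodes are interpreted. Without MPOL_F_RELATIVE_NODES the user mask is simply intersected with the allowed nodes. With it, the user mask is treated as positions within the allowed set. The sketch below models both on 64-bit words; it follows my reading of mpol_relative_nodemask() as a fold step followed by an onto step, and all names in it (mask_t, mask_weight, mask_fold, mask_onto, policy_nodes) are hypothetical.

```c
#include <stdint.h>

typedef uint64_t mask_t;   /* hypothetical: one bit per memory node */

/* Number of set bits. */
static int mask_weight(mask_t m)
{
    int n = 0;
    for (; m; m &= m - 1)
        n++;
    return n;
}

/* Fold: bit i of orig lands on bit (i % sz). */
static mask_t mask_fold(mask_t orig, int sz)
{
    mask_t out = 0;
    if (sz <= 0)
        return 0;
    for (int i = 0; i < 64; i++)
        if (orig & ((mask_t)1 << i))
            out |= (mask_t)1 << (i % sz);
    return out;
}

/* Onto: bit n of orig selects the n-th set bit of rel. */
static mask_t mask_onto(mask_t orig, mask_t rel)
{
    mask_t out = 0;
    int n = 0;
    for (int i = 0; i < 64; i++) {
        if (!(rel & ((mask_t)1 << i)))
            continue;
        if (orig & ((mask_t)1 << n))
            out |= (mask_t)1 << i;
        n++;
    }
    return out;
}

/* mask2 in the listing above: relative remap or plain intersection. */
static mask_t policy_nodes(mask_t user, mask_t allowed, int relative)
{
    if (relative)
        return mask_onto(mask_fold(user, mask_weight(allowed)), allowed);
    return user & allowed;
}
```

    With allowed nodes {2, 3, 5} (0x2C), a relative request for nodes {0, 1} maps to the first two allowed nodes, {2, 3}, while the same request without the flag intersects to an empty mask.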

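      1.3.6 In page_alloc.c, allocations are restricted to the allowed memory nodes:

        kmalloc() -> kmalloc_large() -> kmalloc_order_trace() -> kmalloc_order() --> alloc_pages() -> alloc_pages_node() -> __alloc_pages_node() -> __alloc_pages()  ->__alloc_pages_nodemask() -> get_page_from_freelist() -> __cpuset_zone_allowed()-> __cpuset_node_allowed()

       __cpuset_node_allowed() decides whether memory may be allocated on a given node. The checks are made in this order:

    •   Allocation from interrupt context:      ✔
    •   Node is in the task's mems_allowed: ✔
    •   Task is an OOM victim:       ✔
    •   Request has __GFP_HARDWALL set (all user-space allocations do): ✖
    •   Task is exiting:      ✔
    •   Kernel allocation without __GFP_HARDWALL for a node outside mems_allowed: walk up to the nearest ancestor cpuset that is mem_exclusive or mem_hardwall, and allow the allocation only if the node is in that ancestor's mems_allowed.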
    /**
     * cpuset_node_allowed - Can we allocate on a memory node?
     * @node: is this an allowed node?
     * @gfp_mask: memory allocation flags
     *
     * If we're in interrupt, yes, we can always allocate.  If @node is set in
     * current's mems_allowed, yes.  If it's not a __GFP_HARDWALL request and this
     * node is set in the nearest hardwalled cpuset ancestor to current's cpuset,
     * yes.  If current has access to memory reserves as an oom victim, yes.
     * Otherwise, no.
     *
     * GFP_USER allocations are marked with the __GFP_HARDWALL bit,
     * and do not allow allocations outside the current tasks cpuset
     * unless the task has been OOM killed.
     * GFP_KERNEL allocations are not so marked, so can escape to the
     * nearest enclosing hardwalled ancestor cpuset.
     *
     * Scanning up parent cpusets requires callback_lock.  The
     * __alloc_pages() routine only calls here with __GFP_HARDWALL bit
     * _not_ set if it's a GFP_KERNEL allocation, and all nodes in the
     * current tasks mems_allowed came up empty on the first pass over
     * the zonelist.  So only GFP_KERNEL allocations, if all nodes in the
     * cpuset are short of memory, might require taking the callback_lock.
     *
     * The first call here from mm/page_alloc:get_page_from_freelist()
     * has __GFP_HARDWALL set in gfp_mask, enforcing hardwall cpusets,
     * so no allocation on a node outside the cpuset is allowed (unless
     * in interrupt, of course).
     *
     * The second pass through get_page_from_freelist() doesn't even call
     * here for GFP_ATOMIC calls.  For those calls, the __alloc_pages()
     * variable 'wait' is not set, and the bit ALLOC_CPUSET is not set
     * in alloc_flags.  That logic and the checks below have the combined
     * affect that:
     *    in_interrupt - any node ok (current task context irrelevant)
     *    GFP_ATOMIC   - any node ok
     *    tsk_is_oom_victim   - any node ok
     *    GFP_KERNEL   - any node in enclosing hardwalled cpuset ok
     *    GFP_USER     - only nodes in current tasks mems allowed ok.
     */
    bool __cpuset_node_allowed(int node, gfp_t gfp_mask)
    {
        struct cpuset *cs;        /* current cpuset ancestors */
        int allowed;            /* is allocation in zone z allowed? */
        unsigned long flags;
    
        if (in_interrupt())                              //allocations in interrupt context are always allowed
            return true;
        if (node_isset(node, current->mems_allowed))               //the node is in the task's allowed set
            return true;
        /*
         * Allow tasks that have access to memory reserves because they have
         * been OOM killed to get memory anywhere.
         */
        if (unlikely(tsk_is_oom_victim(current)))                  //OOM victims may allocate anywhere
            return true;
        if (gfp_mask & __GFP_HARDWALL)    /* If hardwall request, stop here */  //__GFP_HARDWALL is set (as it is for user-space requests): not allowed
            return false;
    
        if (current->flags & PF_EXITING) /* Let dying task have memory */    //exiting tasks may allocate
            return true;
    
        /* Not hardwall and node outside mems_allowed: scan up cpusets */
        spin_lock_irqsave(&callback_lock, flags);
    
        rcu_read_lock();
        cs = nearest_hardwall_ancestor(task_cs(current));            //find the nearest ancestor cpuset that is mem_exclusive or mem_hardwall
        allowed = node_isset(node, cs->mems_allowed);               //is the node in that ancestor's allowed set?
        rcu_read_unlock();
    
        spin_unlock_irqrestore(&callback_lock, flags);
        return allowed;
    }
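
    The order of the checks matters, so it can help to see them as a pure function over a snapshot of the inputs. In this sketch struct alloc_ctx and node_allowed() are hypothetical: the kernel reads these values from current, the gfp mask and the cpuset tree rather than from a struct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t mask_t;      /* hypothetical: one bit per memory node */

struct alloc_ctx {            /* hypothetical snapshot of the inputs */
    bool in_interrupt;        /* in_interrupt() */
    bool oom_victim;          /* tsk_is_oom_victim(current) */
    bool exiting;             /* current->flags & PF_EXITING */
    bool gfp_hardwall;        /* gfp_mask & __GFP_HARDWALL */
    mask_t task_mems;         /* current->mems_allowed */
    mask_t hardwall_mems;     /* mems_allowed of the nearest hardwalled ancestor */
};

/* Same decision order as __cpuset_node_allowed(). */
static bool node_allowed(int node, const struct alloc_ctx *c)
{
    assert(node >= 0 && node < 64);
    mask_t bit = (mask_t)1 << node;

    if (c->in_interrupt)
        return true;
    if (c->task_mems & bit)
        return true;
    if (c->oom_victim)
        return true;
    if (c->gfp_hardwall)      /* hardwall request: stop here */
        return false;
    if (c->exiting)
        return true;
    return (c->hardwall_mems & bit) != 0;
}
```

    With task_mems = 0x1 and hardwall_mems = 0x3, node 1 is allowed for a request without __GFP_HARDWALL and refused for one with it, unless the task is an OOM victim.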

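      1.3.7 In vmscan.c, page reclaim is restricted to the current cpuset:

       Call path: shrink_zones() -> cpuset_zone_allowed() -> __cpuset_zone_allowed() ->  __cpuset_node_allowed()

       shrink_zones() is the direct reclaim path of Linux page reclaim. Its check of the cpuset's memory nodes uses exactly the same logic as the one in page_alloc.c.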

    /*
     * This is the direct reclaim path, for page-allocating processes.  We only
     * try to reclaim pages from zones which will satisfy the caller's allocation
     * request.
     *
     * If a zone is deemed to be full of pinned pages then just give it a light
     * scan then give up on it.
     */
    static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
    {
        struct zoneref *z;
        struct zone *zone;
        unsigned long nr_soft_reclaimed;
        unsigned long nr_soft_scanned;
        gfp_t orig_mask;
        pg_data_t *last_pgdat = NULL;
    
        /*
         * If the number of buffer_heads in the machine exceeds the maximum
         * allowed level, force direct reclaim to scan the highmem zone as
         * highmem pages could be pinning lowmem pages storing buffer_heads
         */
        orig_mask = sc->gfp_mask;
        if (buffer_heads_over_limit) {
            sc->gfp_mask |= __GFP_HIGHMEM;
            sc->reclaim_idx = gfp_zone(sc->gfp_mask);
        }
    
        for_each_zone_zonelist_nodemask(zone, z, zonelist,
                        sc->reclaim_idx, sc->nodemask) {
            /*
             * Take care memory controller reclaiming has small influence
             * to global LRU.
             */
            if (global_reclaim(sc)) {
                if (!cpuset_zone_allowed(zone,
                             GFP_KERNEL | __GFP_HARDWALL))      //__GFP_HARDWALL is set, so a zone is reclaimed only if its node is in the task's allowed set, or in interrupt context, or if the task is an OOM victim
                    continue;
    
                /*
                 * If we already have plenty of memory free for
                 * compaction in this zone, don't free any more.
                 * Even though compaction is invoked for any
                 * non-zero order, only frequent costly order
                 * reclamation is disruptive enough to become a
                 * noticeable problem, like transparent huge
                 * page allocations.
                 */
                if (IS_ENABLED(CONFIG_COMPACTION) &&
                    sc->order > PAGE_ALLOC_COSTLY_ORDER &&
                    compaction_ready(zone, sc)) {
                    sc->compaction_ready = true;
                    continue;
                }
    
                /*
                 * Shrink each node in the zonelist once. If the
                 * zonelist is ordered by zone (not the default) then a
                 * node may be shrunk multiple times but in that case
                 * the user prefers lower zones being preserved.
                 */
                if (zone->zone_pgdat == last_pgdat)
                    continue;
    
                /*
                 * This steals pages from memory cgroups over softlimit
                 * and returns the number of reclaimed pages and
                 * scanned pages. This works for global memory pressure
                 * and balancing, not for a memcg's limit.
                 */
                nr_soft_scanned = 0;
                nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone->zone_pgdat,
                            sc->order, sc->gfp_mask,
                            &nr_soft_scanned);
                sc->nr_reclaimed += nr_soft_reclaimed;
                sc->nr_scanned += nr_soft_scanned;
                /* need some check for avoid more shrink_zone() */
            }
    
            /* See comment about same check for global reclaim above */
            if (zone->zone_pgdat == last_pgdat)
                continue;
            last_pgdat = zone->zone_pgdat;
            shrink_node(zone->zone_pgdat, sc);
        }
    
        /*
         * Restore to original mask to avoid the impact on the caller if we
         * promoted it to __GFP_HIGHMEM.
         */
        sc->gfp_mask = orig_mask;
    }

    The analysis of the cpuset init code still needs to cover the hotplug part (to be continued).

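    1.4 What is an exclusive cpuset?

    CPU/memory exclusivity

    A cpuset has two boolean parameters, cpuset.mem_exclusive and cpuset.cpu_exclusive, which indicate whether the cpuset is memory exclusive and CPU exclusive, respectively. Once a flag is set, cpusets at the same level of the hierarchy cannot share the exclusive resource.

    We already covered the hierarchy tree above, so CPU exclusivity means: if a cpuset is CPU exclusive, its CPUs cannot be shared with its sibling cpusets, although they can still be shared with its ancestor and descendant cpusets.

    Note: setting an exclusive flag requires the parent to be exclusive as well; otherwise the write fails. In other words, exclusivity has to be in place all the way down from the root. (On the current Android platform, neither CPU nor memory exclusivity is set by default.)

    Memory exclusivity works the same way, except that the exclusive resource is the memory nodes.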

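    mem_hardwall

    A cpuset also has a boolean parameter cpuset.mem_hardwall. A cpuset that is memory exclusive (mem_exclusive) or has cpuset.mem_hardwall set is considered mem_hardwalled.

    For processes in a mem_hardwalled cpuset, memory allocated in kernel mode (mainly directly allocated pages, slab allocations, and file system page cache) is restricted by the cpuset and therefore comes from the cpuset's own memory nodes. By contrast, memory allocated in user mode is always restricted to the cpuset's memory nodes, whether or not the cpuset is mem_hardwalled. It follows that mem_exclusive, besides providing memory exclusivity, also includes the behavior of cpuset.mem_hardwall.

    This leads to the following conclusions:

    1. cpuset.mem_hardwall restricts memory that processes in the cpuset allocate in kernel mode (it must satisfy the cpuset's memory restriction). User-mode allocations are restricted by the cpuset regardless of mem_hardwall.

    2. With mem_hardwall=1, kernel-mode allocations of the processes are confined to the cpuset's memory nodes; otherwise, kernel-mode allocations are not confined to them.

    3. mem_exclusive includes the behavior of mem_hardwall.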
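
    The "mem_hardwalled" rule above can be checked from a shell. The following is only a minimal sketch: the function name is_mem_hardwalled is made up for illustration, and it assumes the flag files carry no "cpuset." prefix, as in the /dev/cpuset listing shown in section 2.

```shell
# Sketch: succeed (exit status 0) if the cpuset directory is "mem_hardwalled",
# that is, if mem_exclusive or mem_hardwall reads 1.
# Assumption: flag files have no "cpuset." prefix (Android-style mount).
is_mem_hardwalled() {
    dir=$1
    excl=$(cat "$dir/mem_exclusive" 2>/dev/null)
    wall=$(cat "$dir/mem_hardwall" 2>/dev/null)
    [ "$excl" = "1" ] || [ "$wall" = "1" ]
}
```

    For example: is_mem_hardwalled /dev/cpuset/foreground && echo "kernel allocations are confined"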

    1.5 memory_pressure

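    memory_pressure provides a way to monitor the memory pressure on a cpuset. It records how often tasks in the cpuset try to free in-use memory on the cpuset's nodes in order to satisfy further allocation requests. It is an integer: the number of reclaim attempts per second in this cpuset multiplied by 1000, kept as a running average with a half-life of 10 seconds.

    This attribute helps monitor and manage the memory impact of the tasks in the current cpuset, so that memory nodes or task placement can be adjusted accordingly.

    To enable the feature, first write 1 to /dev/cpuset/memory_pressure_enabled; only then does memory_pressure start being computed.

    The value is updated by whichever process attached to the cpuset enters page reclaim. Because this avoids scanning the task list on every set of queries, it reduces the system load imposed by a batch scheduler that monitors this metric.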
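
    As a rough illustration of the steps above, the sketch below enables the accounting and converts the raw value back into reclaim attempts per second. The function names are hypothetical, and the file names assume the prefix-less layout seen under /dev/cpuset on Android.

```shell
# Sketch: turn on memory_pressure accounting at the cpuset root ($1).
enable_memory_pressure() {
    echo 1 > "$1/memory_pressure_enabled"
}

# Sketch: print the memory_pressure of cpuset directory $1 as reclaim
# attempts per second (raw value divided by 1000), with three decimals.
memory_pressure_rate() {
    raw=$(cat "$1/memory_pressure") || return 1
    LC_ALL=C awk -v v="$raw" 'BEGIN { printf "%.3f\n", v / 1000 }'
}
```

    For example, enable_memory_pressure /dev/cpuset followed by memory_pressure_rate /dev/cpuset/foreground. Since the value is a decaying average, a single reading only reflects roughly the last few tens of seconds.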

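    1.6 memory spread

    Each cpuset has two flags that control where the kernel allocates pages for file system buffers and related kernel data structures: cpuset.memory_spread_page and cpuset.memory_spread_slab. Writing 1 enables a flag; writing 0 disables it.

    If cpuset.memory_spread_page is set, file system buffers (page cache) are spread over all memory nodes the task is allowed to use (its mems_allowed), instead of preferring the node the task is running on.

    If cpuset.memory_spread_slab is set, slab caches (for example inodes and dentries) are spread over all memory nodes the task is allowed to use (its mems_allowed), instead of preferring the node the task is running on.

    By default both flags are disabled, and allocation follows the NUMA mempolicy and the cpuset configuration. With memory spread enabled, the NUMA mempolicy is ignored for these allocations and they are spread instead. Once the flag is disabled again, the NUMA mempolicy applies again.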

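    1.7 sched_load_balance

    Load balancing

    The kernel scheduler balances task load automatically: if a CPU is underutilized, kernel code running on that CPU looks for overloaded CPUs and moves tasks from them to itself. Task placement respects the cpuset configuration and the CPU affinity system calls (sched_setaffinity).

    Sched domains

    However, the cost of the load balancing algorithm, and its impact on kernel-wide shared data structures such as the task list, grows more than linearly with the number of CPUs being balanced. The scheduler therefore partitions the CPUs into sched domains and balances load only within each domain. In short, balancing within two smaller sched domains costs less than balancing within one large one, but no balancing takes place across sched domains. Sched domains do not overlap, and a CPU may belong to no sched domain at all, in which case it is not load balanced.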

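    cpuset.sched_load_balance

    When cpuset.sched_load_balance is set, it requests that all CPUs in this cpuset's cpuset.cpus be in a single sched domain, so that load is balanced across cpuset.cpus.

    When cpuset.sched_load_balance is disabled, load balancing across cpuset.cpus is no longer requested. However, if these CPUs overlap with another cpuset that has cpuset.sched_load_balance set, they are still balanced as part of that cpuset. For example, if the root cpuset has cpuset.sched_load_balance set, load is balanced across all CPUs.

    Therefore, when the goal is to split load balancing into smaller domains, the top cpuset should leave cpuset.sched_load_balance unset, and only child or other smaller cpusets should set it.

    Also note that cpusets form a nested hierarchy, whereas sched domains are flat: they do not overlap, and each CPU is in at most one sched domain.

    If two cpusets have partially overlapping cpus and both set cpuset.sched_load_balance, a single sched domain covering the superset of both is formed. Tasks are still never moved to a CPU outside their cpuset, but the load balancing code may waste some cycles considering that possibility.

    If two cpusets have partially overlapping cpus and only one of them sets cpuset.sched_load_balance, tasks in the other cpuset are load balanced only on the overlapping CPUs. Tasks that may need a lot of CPU should therefore not be placed in such a cpuset.

    On our current SDM845 Android platform, full load balancing is enabled in the top cpuset. That is, a single sched domain covers the whole system, and the sched_load_balance settings of the other cpusets have no effect.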
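
    To make the "top level off, children on" arrangement concrete, here is a minimal sketch. The function name isolate_sched_domains is invented for this example. It assumes the named children have disjoint cpus and that no other cpuset with the flag set overlaps them; otherwise the domains would be merged again, as described above. CPUs that belong to none of the listed children end up without load balancing.

```shell
# Sketch: request one sched domain per child cpuset.
# $1 is the cpuset root; the remaining arguments are child cpuset names.
# Assumption: prefix-less flag files, as under /dev/cpuset on Android.
isolate_sched_domains() {
    root=$1
    shift
    # Make sure every child asks for balancing across its own cpus ...
    for child in "$@"; do
        echo 1 > "$root/$child/sched_load_balance" || return 1
    done
    # ... then stop balancing across the whole system at the top level.
    echo 0 > "$root/sched_load_balance"
}
```

    A call could look like isolate_sched_domains /dev/cpuset little big, where little and big are hypothetical child cpusets with non-overlapping cpus.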

    1.8 sched_relax_domain_level

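    Within a sched domain, task migration is triggered in two main ways: 1. periodic load balancing on tick; 2. certain scheduling events.

    When a task is woken up, the scheduler tries to move it to an idle CPU. For example, if task A running on CPU X wakes up another task B on the same CPU X, the scheduler migrates task B to CPU Y, an idle sibling of CPU X. Task B can then start on CPU Y right away instead of waiting for task A. Likewise, when a CPU runs out of tasks, it tries to pull some tasks from other busy CPUs before going idle, to take over part of their load.

    Of course, finding tasks that can be moved and CPUs to move them to has a cost, so the scheduler does not search every CPU in the domain each time. In fact, on some architectures the search range for scheduling events is limited to the socket or node the CPU belongs to, while periodic load balancing searches all CPUs. For example, suppose CPU Z is relatively far from CPU X (not a sibling). Even if CPU Z is idle while CPU X and its siblings are all busy, the scheduler does not migrate the just-woken task B to CPU Z, because CPU Z is outside the search range. As a result, task B either runs after task A finishes or is migrated at the next periodic load balance.

    cpuset.sched_relax_domain_level

    The sched_relax_domain_level attribute lets the user change this search range. It takes an int value that indicates the range, as follows: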

    -1 : no request. use system default or follow request of others.
    0 : no search.
    1 : search siblings (hyperthreads in a core).
    2 : search cores in a package.
    3 : search cpus in a node [= system wide on non-NUMA system]
    4 : search nodes in a chunk of node [on NUMA system]
    5 : search system wide [on NUMA system]

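    This attribute affects the sched domain the cpuset belongs to, so the sched_load_balance flag must not be disabled for the cpuset: with the flag disabled there is no sched domain for the attribute to act on.

    If several cpusets overlap and therefore form a single sched domain, the largest sched_relax_domain_level among them is used. Note: if one of them is set to 0 and all the others are -1, the effective value is 0.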
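
    The overlap rule (the largest value wins, and -1 means "no request") can be sketched as follows. This is only an illustration: effective_relax_level is a made-up helper that is handed the cpuset directories already known to share one sched domain; it does not work out the overlap itself.

```shell
# Sketch: print the sched_relax_domain_level that would take effect for a
# group of cpusets sharing one sched domain, i.e. the maximum of their values.
# Arguments: the cpuset directories in that group.
effective_relax_level() {
    max=-1
    for dir in "$@"; do
        level=$(cat "$dir/sched_relax_domain_level") || return 1
        if [ "$level" -gt "$max" ]; then
            max=$level
        fi
    done
    echo "$max"
}
```

    With values -1, 0 and -1 in three overlapping cpusets this prints 0, matching the note above.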

    Changing this attribute has both positive and negative effects; if in doubt, leave it alone.

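    1.9 cpuset.memory_migrate

    Normally, once a page has been allocated, it stays on the node where it was allocated for as long as it remains allocated, even if cpuset.mems changes later.
    When this flag is set and a task moves to a new cpuset, the pages it allocated in the old cpuset are migrated to the new cpuset, keeping their relative position among the nodes.
    For example, if a page was allocated on the second memory node of the old cpuset, it is placed on the second memory node of the new cpuset whenever possible.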
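
    The "same relative position" rule can be illustrated with a small helper. This is a simplified sketch and not how the kernel implements migration: map_node is a hypothetical function, node lists are passed as space-separated ids, and the case where the new cpuset has fewer nodes than the old one is simply reported as a failure rather than modelled.

```shell
# Sketch: map_node "OLD_NODES" "NEW_NODES" NODE
# Prints the node in NEW_NODES that sits at the same position as NODE
# does in OLD_NODES. Returns non-zero if there is no such position.
map_node() {
    old=$1
    new=$2
    node=$3
    pos=0
    found=0
    for n in $old; do
        pos=$((pos + 1))
        if [ "$n" = "$node" ]; then
            found=$pos
            break
        fi
    done
    [ "$found" -gt 0 ] || return 1
    pos=0
    for n in $new; do
        pos=$((pos + 1))
        if [ "$pos" -eq "$found" ]; then
            echo "$n"
            return 0
        fi
    done
    return 1
}
```

    For instance, map_node "0 1 2" "4 5 6" 1 prints 5: the second node of the old list maps to the second node of the new list.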

    2. Usage examples and syntax

    Path of the cpuset file system on Android:

    htc_imedugl:/ # ls /dev/cpuset
    audio-app             effective_cpus          mems                     
    background            effective_mems          notify_on_release        
    camera-daemon         foreground              release_agent            
    camera-video          mem_exclusive           restricted               
    cgroup.clone_children mem_hardwall            sched_load_balance       
    cgroup.procs          memory_migrate          sched_relax_domain_level 
    cgroup.sane_behavior  memory_pressure         system-background        
    cpu_exclusive         memory_pressure_enabled tasks                    
    cpus                  memory_spread_page      top-app                  
    memory_spread_slab

    View the cpus and mems resources:

    htc_imedugl:/dev/cpuset # cat cpus                                             
    0-7
    htc_imedugl:/dev/cpuset # cat mems
    0

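    On current Android there is only one memory node in mems, so cpuset effectively only controls how CPU cores are allocated.

    2.1 Creating a new cpuset

    Under the root, create a new cpuset named cpuset-test:

    mkdir /dev/cpuset/cpuset-test

    After creation, the directory looks like this: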

    htc_imedugl:/ # ls /dev/cpuset/cpuset-test                                     
    cgroup.clone_children mem_exclusive      mems                     
    cgroup.procs          mem_hardwall       notify_on_release        
    cpu_exclusive         memory_migrate     sched_load_balance       
    cpus                  memory_pressure    sched_relax_domain_level 
    effective_cpus        memory_spread_page tasks                    
    effective_mems        memory_spread_slab 

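    However, cpus and mems are both empty and have to be filled in manually.

    Before they are set, both read back empty (no process can be attached at this point):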
    htc_imedugl:/ # cat dev/cpuset/cpuset-test/mems
    
    htc_imedugl:/ # cat dev/cpuset/cpuset-test/cpus

    2.2 Modifying a cpuset

    Set cpus to 2-3, so that processes attached to this cpuset are restricted to CPUs 2-3; mems takes the root's value 0, which imposes no restriction.

    Fill them in:
    htc_imedugl:/dev/cpuset/cpuset-test # echo 2-3 > cpus
    htc_imedugl:/dev/cpuset/cpuset-test # echo 0 > mems
    
    htc_imedugl:/dev/cpuset/cpuset-test # cat cpus                                         
    2-3
    htc_imedugl:/dev/cpuset/cpuset-test # cat mems                                         
    0
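
    The create and populate steps from 2.1 and 2.2 can be wrapped into one helper. This is a minimal sketch; cpuset_create is a name chosen here, and the root directory is passed in so the function makes no assumption about where the cpuset file system is mounted.

```shell
# Sketch: cpuset_create ROOT NAME CPUS MEMS
# Creates ROOT/NAME and fills in cpus and mems so tasks can be attached.
cpuset_create() {
    root=$1
    name=$2
    cpus=$3
    mems=$4
    mkdir "$root/$name" || return 1
    # Both files must be non-empty before any task can be attached.
    echo "$cpus" > "$root/$name/cpus" || return 1
    echo "$mems" > "$root/$name/mems"
}
```

    The session above corresponds to: cpuset_create /dev/cpuset cpuset-test 2-3 0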

    2.3 Attaching a process

    Suppose we start a sh process and note its pid:

    htc_imedugl:/dev/cpuset/ljj # while true; do a=a+1; done &
    [1] 25035

    By default, the process sits in the root cpuset at this point:

    htc_imedugl:/dev/cpuset/cpuset-test # cat /proc/25035/cgroup
    4:cpuset:/
    3:cpu:/
    2:schedtune:/
    1:cpuacct:/uid_0/pid_5099
    0::/

    Now attach the process to cpuset-test.

    Echoing the pid into cpuset-test/tasks completes the attach:

    htc_imedugl:/dev/cpuset/cpuset-test # echo 25035 > tasks
    htc_imedugl:/dev/cpuset/cpuset-test # cat tasks
    25035

    Then check the process information:

    htc_imedugl:/dev/cpuset/cpuset-test # cat /proc/25035/cgroup
    4:cpuset:/cpuset-test
    3:cpu:/
    2:schedtune:/
    1:cpuacct:/uid_0/pid_5099
    0::/
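
    Attaching and then verifying can be scripted along these lines. Both function names are hypothetical. The parser reads text in the /proc/<pid>/cgroup format from standard input and assumes the cpuset path itself contains no colon.

```shell
# Sketch: cpuset_attach DIR PID
# Attaches one task by writing its pid to the cpuset's tasks file.
cpuset_attach() {
    echo "$2" > "$1/tasks"
}

# Sketch: print the cpuset path found in /proc/<pid>/cgroup text on stdin.
# Each line has the form "id:controllers:path".
cpuset_path_from_cgroup() {
    awk -F: '$2 == "cpuset" { print $3 }'
}
```

    For the process above: cpuset_attach /dev/cpuset/cpuset-test 25035, then cpuset_path_from_cgroup < /proc/25035/cgroup should report /cpuset-test.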

    A captured systrace shows that the process indeed runs on CPU2 and CPU3.

    Now change the cpus restriction to core 7:

    htc_imedugl:/dev/cpuset/cpuset-test # echo 7 > cpus
    htc_imedugl:/dev/cpuset/cpuset-test # cat cpus                                 
    7

    Capturing a systrace again shows that process 25035 is now restricted to CPU 7.

    2.4 Deleting a cpuset

    A cpuset that still has processes attached cannot be deleted:

    rmdir: cpuset-test/: Device or resource busy

    So the tasks have to be moved out first; after that, run:

    htc_imedugl:/dev/cpuset # rmdir cpuset-test/

    This deletes the cpuset.
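
    Moving the tasks out can be done by writing each pid to the parent's tasks file, one pid per write. The helper below is a sketch under that assumption; cpuset_drain is a name chosen for this example, and it moves tasks to the direct parent directory.

```shell
# Sketch: cpuset_drain DIR
# Moves every task listed in DIR/tasks to the parent cpuset.
cpuset_drain() {
    dir=${1%/}
    parent=$(dirname "$dir")
    # Snapshot the pid list first, since it changes as tasks leave.
    for pid in $(cat "$dir/tasks"); do
        # One pid per write; the write fails if the task already exited.
        echo "$pid" > "$parent/tasks" 2>/dev/null
    done
    return 0
}
```

    Typical use: cpuset_drain /dev/cpuset/cpuset-test && rmdir /dev/cpuset/cpuset-test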

    2.5 Setting other flags

    All flags are set by writing an int value.

    For example, CPU exclusivity:

    echo 1 > cpu_exclusive
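
    For the boolean flags (cpu_exclusive, mem_exclusive, mem_hardwall, memory_migrate and so on), a small wrapper can validate the value and read it back. This is a sketch only; cpuset_set_flag is a hypothetical name, and integer attributes such as sched_relax_domain_level are outside its scope.

```shell
# Sketch: cpuset_set_flag DIR FLAG VALUE
# Writes 0 or 1 to a boolean flag file and prints the value read back.
cpuset_set_flag() {
    case $3 in
        0|1) ;;
        *) echo "value must be 0 or 1" >&2; return 2 ;;
    esac
    if [ ! -e "$1/$2" ]; then
        echo "no such flag: $2" >&2
        return 1
    fi
    echo "$3" > "$1/$2" && cat "$1/$2"
}
```

    The example above becomes: cpuset_set_flag /dev/cpuset/cpuset-test cpu_exclusive 1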
  • Original article: https://www.cnblogs.com/lingjiajun/p/12378474.html