  • Scheduler 23 - EAS Hello

    Based on Linux-5.10

    I. EAS Overview

    In CPU scheduling, EAS comes into play when selecting a CPU for a task; its goal is to save as much power as possible while still guaranteeing performance. It is built on the Energy Model (EM) framework, a generic interface module that connects drivers supporting different perf levels with any other part of the system that wants to be aware of energy consumption. The EAS discussed here, i.e. the CPU scheduler plus the CPU driver module, is a typical example: the scheduler wants to know the energy cost of the underlying CPUs so that it can make better CPU-selection decisions. For CPU devices, each cluster has its own independent frequency-scaling mechanism, and all CPUs within a cluster run at the same frequency (Qualcomm has modified this so that each CPU can have its own frequency points). Each cluster therefore forms one performance domain. Through the EM framework interfaces, the scheduler can query a CPU's energy consumption at each performance level.

    II. Related Data Structures

    1. struct perf_domain

    struct perf_domain {
        struct em_perf_domain *em_pd;
        struct perf_domain *next; //forms a singly linked list
        struct rcu_head rcu; //RCU protecting this list
    };

    The perf_domain structure represents a CPU performance domain; every performance domain is abstracted by a perf_domain. perf_domain and cpufreq policy correspond one to one, so on a 4+3+1 platform the system has three perf domains in total. They form a linked list whose head is stored in the global root_domain. The other relevant members of root_domain are listed here as well:

    struct root_domain {
        ...
        int        overload; //whether this root domain, i.e. the system, is in the overload state
        int        overutilized; //whether this root domain, i.e. the system, is in the overutilized state
        unsigned long    max_cpu_capacity; //capacity of the most capable cpu in the system
        struct perf_domain __rcu *pd; //head of the perf_domain singly linked list
    };

    To clarify the two concepts overload and overutilized: on top of the per-cpu overload/overutilized states, an overload and an overutilized state are also defined for the root domain (i.e. the whole system).

    (1) A CPU is in the overload state when its rq holds two or more tasks, or holds only one task but that task is a misfit task.
    (2) A CPU is in the overutilized state when its utility exceeds its capacity (by default 20% of the capacity is reserved as headroom; also, capacity here means the capacity available to cfs tasks).
    (3) For the root domain, overload means at least one cpu is in the overload state; overutilized means at least one cpu is in the overutilized state.
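    The ~20% headroom in (2) comes from the kernel's fits_capacity() check, which compares util against capacity scaled by 1280/1024 (i.e. util must stay below roughly 80% of capacity). A minimal user-space sketch; the helper name cpu_overutilized and the values are illustrative:

    ```c
    #include <assert.h>

    /* Sketch of the kernel's fits_capacity() check (~20% headroom):
     * util fits cap iff util * 1280 < cap * 1024, i.e. util < ~80% of cap. */
    static int fits_capacity(unsigned long util, unsigned long cap)
    {
        return util * 1280 < cap * 1024;
    }

    /* A CPU is overutilized when its utilization no longer fits its capacity. */
    static int cpu_overutilized(unsigned long util, unsigned long cap)
    {
        return !fits_capacity(util, cap);
    }

    int main(void)
    {
        /* A CPU of capacity 1024: util 800 still fits, util 820 does not. */
        assert(!cpu_overutilized(800, 1024));
        assert(cpu_overutilized(820, 1024));
        return 0;
    }
    ```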

    The overutilized state matters a great deal: it decides whether the scheduler enables EAS. EAS only takes effect while the system is not overutilized. overload is related to throttling the frequency of newidle balance: newidle balance only runs when the system is in the overload state.

    2. struct em_perf_domain

    struct em_perf_domain {
        struct em_perf_state *table; //performance states; frequencies must be in ascending order, em_cpu_energy() relies on this
        int nr_perf_states; //number of entries in the table
        int milliwatts; //flag indicating whether power values are in milliwatts or some other scale
        unsigned long cpus[]; //which cpus this performance domain contains
    };

    This structure is stored in each cpu's struct device, so it is effectively a per-cpu structure! In the EM framework, em_perf_domain abstracts one performance domain.

    3. struct em_perf_state

    struct em_perf_state {
        unsigned long frequency; //in KHz, consistent with CPUFreq
        unsigned long power; //in milliwatts; the power at this frequency point, which may be the total: static + dynamic
        unsigned long cost; //cost coefficient of this frequency point, used when computing power; equals power * max_frequency / frequency
    };

    Each performance domain has a number of perf levels, and each perf level has a different energy cost; struct em_perf_state describes the energy information of one perf level.
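    How the cost member is pre-computed from the table can be checked numerically. A user-space sketch, assuming cost = power * max_frequency / frequency with integer truncation, as the cost definition above states (the table values are made up; real platforms may apply an extra scale factor):

    ```c
    #include <assert.h>

    /* Sketch: pre-computing em_perf_state::cost from an ascending table,
     * cost = power * max_frequency / frequency. Values are illustrative. */
    struct em_perf_state { unsigned long frequency, power, cost; };

    static void compute_costs(struct em_perf_state *table, int nr)
    {
        unsigned long fmax = table[nr - 1].frequency; /* table is ascending */
        for (int i = 0; i < nr; i++)
            table[i].cost = fmax * table[i].power / table[i].frequency;
    }

    int main(void)
    {
        struct em_perf_state tbl[] = {
            { 1000000, 100, 0 },
            { 1500000, 200, 0 },
            { 2000000, 400, 0 },
        };
        compute_costs(tbl, 3);
        assert(tbl[2].cost == 400); /* at fmax, cost == power */
        assert(tbl[0].cost == 200); /* 2000000 * 100 / 1000000 */
        assert(tbl[1].cost == 266); /* 2000000 * 200 / 1500000, truncated */
        return 0;
    }
    ```

    Note that cost rises monotonically with frequency on real tables, which is why the highest frequency is never the most energy-efficient choice per unit of work.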

    III. Energy Calculation

    1. Overview of the calculation method

    Basic formula: energy = power × time

    For a CPU, the formula needs to be refined (ignoring the energy consumed in idle states): energy consumed by a CPU at a frequency point = power of the CPU at that frequency point × time the CPU runs at that frequency point.

    The EM records the power of each CPU frequency point in em_perf_state::power; these values are computed in advance by the SoC vendor. The running time is expressed through cpu utility. One inconvenience is that cpu utility is a value normalized to 1024, so the absolute running time at a given frequency point is lost, but it can be converted: running time at this frequency point = cpu_util / cpu_current_capacity. Note that energy is computed only to compare magnitudes, so the period is omitted.

    The capacity of a CPU at a perf-state (i.e. a frequency point): ps->cap = scale_cpu * (ps->freq / cpu_max_freq) ----(1). scale_cpu is the cpu's capacity at its maximum frequency, scaled to 1024.

    Ignoring idle-state power, the estimated energy of a CPU at a perf-state: cpu_nrg = ps->power * (cpu_util / ps->cap) ----(2)

    Substituting (1) into (2): cpu_nrg = (ps->power * cpu_max_freq / ps->freq) * (cpu_util / scale_cpu) ----(3)

    The first factor above is a constant, stored in the cost member of em_perf_state, so the estimated energy of a CPU at a perf-state becomes: cpu_nrg = ps->cost * cpu_util / scale_cpu ----(4)

    Since all CPUs in a perf domain share the same micro-architecture, scale_cpu is the same for all of them, and so is cost. Factoring it out gives the energy formula for a whole perf domain (cpu cluster): pd_nrg = ps->cost * \Sum cpu_util / scale_cpu ----(5)
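    A quick numerical check, with made-up values, that forms (2) and (4) of the derivation above agree:

    ```c
    #include <assert.h>

    /* Numerical check of the derivation: estimating energy directly from
     * (2) cpu_nrg = ps->power * cpu_util / ps->cap gives the same result as
     * (4) cpu_nrg = ps->cost * cpu_util / scale_cpu. Values are illustrative. */
    int main(void)
    {
        unsigned long scale_cpu    = 1024;     /* capacity at max frequency */
        unsigned long cpu_max_freq = 2000000;  /* kHz */
        unsigned long ps_freq      = 1000000;  /* current perf state */
        unsigned long ps_power     = 100;      /* mW at this perf state */
        unsigned long cpu_util     = 256;

        unsigned long ps_cap  = scale_cpu * ps_freq / cpu_max_freq; /* (1) */
        unsigned long ps_cost = ps_power * cpu_max_freq / ps_freq;  /* constant factor of (3) */

        unsigned long nrg2 = ps_power * cpu_util / ps_cap;   /* form (2) */
        unsigned long nrg4 = ps_cost * cpu_util / scale_cpu; /* form (4) */

        assert(ps_cap == 512 && ps_cost == 200);
        assert(nrg2 == 50 && nrg4 == 50); /* both estimates agree */
        return 0;
    }
    ```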

    IV. Building the Energy Model

    1. Building the perf domains

    During CPU topology initialization, build_perf_domains() creates each perf domain and attaches them as the perf domain list of the root domain.

    (1) Implementation

    //kernel/sched/topology.c
    static bool build_perf_domains(const struct cpumask *cpu_map)
    {
        int i, nr_pd = 0, nr_ps = 0, nr_cpus = cpumask_weight(cpu_map);
        struct perf_domain *pd = NULL, *tmp;
        int cpu = cpumask_first(cpu_map);
        struct root_domain *rd = cpu_rq(cpu)->rd;
        bool eas_check = false;
    
        if (!sysctl_sched_energy_aware) //if EAS is not enabled, skip the build entirely
            goto free;
        ...
        for_each_cpu(i, cpu_map) {
            /* Skip already covered CPUs. */
            if (find_pd(pd, i)) //skip cpus already covered by some pd, so only the first cpu of each cluster continues below
                continue;
    
            /* Create the new pd and add it to the local list. */
            tmp = pd_init(i);
            tmp->next = pd; //the list holds one element per cluster
            pd = tmp; //head insertion: the cluster probed last sits at the list head, pd points to the head
    
            /* Count performance domains and performance states for the complexity check. */
            nr_pd++;
            //sum of the number of perf states across all pds
            nr_ps += em_pd_nr_perf_states(pd->em_pd); //return pd->nr_perf_states;
        }
    
        /* Bail out if the Energy Model complexity is too high. */
        if (nr_pd * (nr_ps + nr_cpus) > EM_MAX_COMPLEXITY) { //2048; the energy model must not be too complex
            WARN(1, "rd %*pbl: Failed to start EAS, EM complexity is too high\n", cpumask_pr_args(cpu_map));
            goto free;
        }
    
        //dump the whole pd list; only printed with debug enabled
        perf_domain_debug(cpu_map, pd);
    
        /* Attach the new list of performance domains to the root domain. */
        tmp = rd->pd;
        rcu_assign_pointer(rd->pd, pd); //the global root_domain::pd points to the head of the perf_domain list
        if (tmp)
            call_rcu(&tmp->rcu, destroy_perf_domain_rcu); //RCU update: root_domain::pd points to the new list, the old one is freed
        
        pr_info("nr_pd = %d\n", nr_pd); //3 if cpu7 is not isolated, otherwise 2
    
        return !!pd;
    
    free:
        free_pd(pd);
        tmp = rd->pd;
        rcu_assign_pointer(rd->pd, NULL);
        if (tmp)
            call_rcu(&tmp->rcu, destroy_perf_domain_rcu);
    
        return false;
    }
    
    static struct perf_domain *pd_init(int cpu)
    {
        struct em_perf_domain *obj = em_cpu_get(cpu);
        struct perf_domain *pd = kzalloc(sizeof(*pd), GFP_KERNEL);
        pd->em_pd = obj; //just a pointer assignment
        return pd;
    }
    //kernel/power/energy_model.c
    struct em_perf_domain *em_cpu_get(int cpu)
    {
        struct device *cpu_dev = get_cpu_device(cpu); //return per_cpu(cpu_sys_devices, cpu);
        return em_pd_get(cpu_dev); //return dev->em_pd; stored directly in the cpu's device structure
    }

    The perf-domain list pointed to by root_domain's pd member is built by head insertion, so the pds appear on the list in the order root_domain->pd --> cluster2->pd --> cluster1->pd --> cluster0->pd. A pd is per-cluster, not per-cpu. Only when all cpus of a cluster have been isolated is its pd removed from the list.
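    The head insertion described above can be sketched in user space; the cluster field is an illustrative id, not a kernel member:

    ```c
    #include <assert.h>
    #include <stdlib.h>

    /* Sketch of the head insertion used by build_perf_domains(): each newly
     * created pd is linked in front, so the cluster probed last ends up at
     * the list head. */
    struct perf_domain { int cluster; struct perf_domain *next; };

    static struct perf_domain *pd_insert(struct perf_domain *head, int cluster)
    {
        struct perf_domain *tmp = calloc(1, sizeof(*tmp));
        tmp->cluster = cluster;
        tmp->next = head; /* tmp->next = pd; pd = tmp; in the kernel */
        return tmp;
    }

    int main(void)
    {
        struct perf_domain *pd = NULL;
        for (int c = 0; c <= 2; c++) /* probe cluster0, cluster1, cluster2 */
            pd = pd_insert(pd, c);
        /* Head is the last-inserted cluster2, tail is cluster0. */
        assert(pd->cluster == 2);
        assert(pd->next->cluster == 1);
        assert(pd->next->next->cluster == 0);
        return 0;
    }
    ```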

    (2) Call paths:

                init_cpu_capacity_callback //arch_topology.c runs when cpu capacities are updated at init time
                    schedule_work(&update_topology_flags_work);
                        init_cpu_capacity_callback //arch_topology.c
                            update_topology_flags_workfn //arch_topology.c
                            cpuset_hotplug_workfn //cpuset.c call path shown below
                        //handler for /proc/sys/kernel/sched_energy_aware
                            sched_energy_aware_handler //topology.c
                                rebuild_sched_domains //cpuset.c
            pause_cpus //cpu.c called when execution fails
            //cpu.c .startup.single callback of "sched:active" in cpuhp_hp_states[]
        resume_cpus    //cpu.c 
            sched_cpus_activate //core.c
        pause_cpus //cpu.c 
            sched_cpus_deactivate_nosync //core.c
                sched_cpu_activate //core.c
                    cpuset_cpu_active //core.c
        //cpu.c .teardown.single callback of "sched:active" in cpuhp_hp_states[]
    resume_cpus    //cpu.c 
        sched_cpus_activate //core.c
            sched_cpu_deactivate //core.c
        pause_cpus //cpu.c 
            sched_cpus_deactivate_nosync //core.c
                _sched_cpu_deactivate    
                    cpuset_cpu_inactive
                        cpuset_update_active_cpus //cpuset.c
                    cpuset_track_online_nodes_nb.notifier_call //cpu.c
                        cpuset_track_online_nodes //cpuset.c
                            schedule_work(&cpuset_hotplug_work);
                    resume_cpus //cpu.c
                        cpuset_update_active_cpus_affine //cpuset.c
                            schedule_work_on(cpu, &cpuset_hotplug_work); //queued on the specified cpu
                    //write callback for the files /dev/cpuset/[<group>/]cpus and mems
                        cpuset_write_resmask //cpuset.c
                            flush_work(&cpuset_hotplug_work); //waits for the queued work to finish; just a flush
                                cpuset_hotplug_workfn //workqueue handler
                                    rebuild_sched_domains_locked
                                        partition_sched_domains_locked
                                            build_perf_domains  //each call gets cpu0-6 or cpu0-7, not one cluster at a time

    The part below cpuset_hotplug_workfn was obtained by adding dump_stack(); kernel boot, online/offline and isolate/unisolate all take the same path. The part above was worked out from the code.

    Every CPU online/offline or isolate/unisolate triggers the domain rebuild flow.

    Writing the cpus file of a cgroup cpuset group does not trigger the rebuild flow.

    em_pd->cpus only describes which cpus a cluster contains; isolate/unisolate and online/offline of a cpu do not change its value.

    V. The EAS Global Switch

    The sysctl control variable is sysctl_sched_energy_aware, exposed as /proc/sys/kernel/sched_energy_aware.

    //kernel/sched/topology.c
    int sched_energy_aware_handler(struct ctl_table *table, int write, void *buffer, size_t *lenp, loff_t *ppos)
    {
        int ret, state;
    
        if (write && !capable(CAP_SYS_ADMIN))
            return -EPERM;
    
        ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
        if (!ret && write) {
            state = static_branch_unlikely(&sched_energy_present);
            if (state != sysctl_sched_energy_aware) {
                mutex_lock(&sched_energy_mutex);
            sched_energy_update = 1; //only used in partition_sched_domains_locked
                rebuild_sched_domains();
                sched_energy_update = 0;
                mutex_unlock(&sched_energy_mutex);
            }
        }
    
        return ret;
    }

    Updating sched_energy_present:

    /*
     * kernel/sched/topology.c
     * partition_sched_domains_locked --> sched_energy_set
     */
    static void sched_energy_set(bool has_eas)
    {
        if (!has_eas && static_branch_unlikely(&sched_energy_present)) {
            static_branch_disable_cpuslocked(&sched_energy_present);
        } else if (has_eas && !static_branch_unlikely(&sched_energy_present)) {
            static_branch_enable_cpuslocked(&sched_energy_present);
        }
    }

    sched_energy_enabled() tests this static key. It is used in two places:
    (1) In the load-balance path, find_busiest_group() aborts the balance when EAS is enabled and the system is not overutilized.
    (2) In the task wake-up path, select_task_rq_fair() calls find_energy_efficient_cpu() for EAS CPU selection only when EAS is enabled.

    VI. Where EAS Acts: EAS CPU Selection

    1. Conditions for entering EAS selection on wake-up

    For a task in the blocked state, an asynchronous event or another thread calling try_to_wake_up() wakes it; task placement, i.e. selecting a CPU for the woken task, then takes place. If EAS is enabled, EAS selection is tried first. Of course, EAS is used only under light load (the system is not overutilized); under heavy load (any single cpu in the overutilized state) the traditional kernel selection algorithm is used instead.

    static int select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
    {
        int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
        ...
        trace_android_rvh_select_task_rq_fair(p, prev_cpu, sd_flag, wake_flags, &target_cpu);
        if (target_cpu >= 0)
            return target_cpu;
        ...
        //only the wake-up path may take the EAS selection path
        if (sd_flag & SD_BALANCE_WAKE) {
            //the sysctl global switch controls whether EAS is enabled
            if (sched_energy_enabled()) {
                new_cpu = find_energy_efficient_cpu(p, prev_cpu, sync);
                if (new_cpu >= 0)
                    return new_cpu; //as long as EAS picked a cpu, use its result
            }
        }
        ...
    }

    As the code shows, EAS is used only for wakeup; fork and exec balancing never take the EAS selection path. find_energy_efficient_cpu() is the main EAS selection routine; two conditions must hold to use it: this is the wake-up path and the EAS feature is enabled. If EAS picks a suitable CPU, it is returned directly. If EAS selection fails, the candidate falls back to prev_cpu and the traditional selection path runs again.

    2. EAS selection details

    The selection details live in find_energy_efficient_cpu():

    static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu, int sync)
    {
        unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
        struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
        int max_spare_cap_cpu_ls = prev_cpu, best_idle_cpu = -1;
        unsigned long max_spare_cap_ls = 0, target_cap;
        unsigned long cpu_cap, util, base_energy = 0;
        bool boosted, latency_sensitive = false;
        unsigned int min_exit_lat = UINT_MAX;
        int cpu, best_energy_cpu = prev_cpu;
        struct cpuidle_state *idle;
        struct sched_domain *sd;
        struct perf_domain *pd;
        int new_cpu = INT_MAX;
    
        //update the task's load
        sync_entity_load_avg(&p->se);
        //a vendor/ODM hook may be registered here and bypass this function
        trace_android_rvh_find_energy_efficient_cpu(p, prev_cpu, sync, &new_cpu); //hook
        if (new_cpu != INT_MAX)
            return new_cpu;
    
        rcu_read_lock();
        //fetch the perf_domain list from rd
        pd = rcu_dereference(rd->pd);
        if (!pd || READ_ONCE(rd->overutilized)) //if any one cpu in the system is overutilized, abandon EAS selection
            goto fail;
    
        cpu = smp_processor_id(); //the cpu we are currently running on
        //for a sync wakeup, if the current cpu runs only this task, the woken task may run on it,
        //and its capacity satisfies the task, pick the current cpu directly
        if (sync && cpu_rq(cpu)->nr_running == 1 && cpumask_test_cpu(cpu, p->cpus_ptr) && task_fits_capacity(p, capacity_of(cpu))) { //uclamp'ed util below 80% of the cpu's capacity
            rcu_read_unlock();
            return cpu;
        }
    
        /* Energy-aware wake-up happens on the lowest sched_domain starting from sd_asym_cpucapacity spanning over this_cpu and prev_cpu. */
        sd = rcu_dereference(*this_cpu_ptr(&sd_asym_cpucapacity)); //returns this cpu's DIE-level sd
        /*
         * Starting from the lowest-level sd that spans cpus of different capacities, walk up until
         * the sd covers both this_cpu and prev_cpu. On phones this is the DIE-level sd, so when
         * prev_cpu is a valid cpu id the walk is effectively redundant there. An sd with
         * asymmetric cpus is required because symmetric cpus offer no energy saving for EAS.
         */
        while (sd && !cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
            sd = sd->parent;
        if (!sd)
            goto fail;
    
        //max(util, util_est): if task p's util is 0, goto unlock returns prev_cpu directly
        if (!task_util_est(p))
            goto unlock;
    
        //whether the cgroup of the waking task p sets the cpu.uclamp.latency_sensitive flag
        latency_sensitive = uclamp_latency_sensitive(p);
        //whether p's uclamp min, after global and cgroup restrictions, is still greater than 0
        boosted = uclamp_boosted(p);
        target_cap = boosted ? 0 : ULONG_MAX; //initialized according to how it is used below
    
        //iterate from the big cluster: big --> mid --> little. The order does not matter;
        //the decision is made only after everything has been traversed
        for (; pd; pd = pd->next) {
            //per-pd loop variables, fresh for each pd
            unsigned long cur_delta, spare_cap, max_spare_cap = 0;
            unsigned long base_energy_pd;
            int max_spare_cap_cpu = -1;
    
            /* Compute the 'base' energy of the pd, without @p */
            //compute this pd's energy without p as the baseline. Note dst_cpu is -1,
            //so p's util is also removed from the cpu it last ran on
            base_energy_pd = compute_energy(p, -1, pd);
            //total system energy without p
            base_energy += base_energy_pd;
    
            /*
             * Note: there is no check here for active cpus! pd->em_pd->cpus only says which cpus the
             * cluster contains; offline cpus are removed from sd->span, but isolated ones are not.
             * The 'base' energy computed above may have the same problem.
             */
            for_each_cpu_and(cpu, perf_domain_span(pd), sched_domain_span(sd)) {
                if (!cpumask_test_cpu(cpu, p->cpus_ptr)) //filter out cpus p is not allowed to run on
                    continue;
    
                util = cpu_util_next(cpu, p, cpu); //util of this cpu after placing p on it
                cpu_cap = capacity_of(cpu);
                spare_cap = cpu_cap;
                lsub_positive(&spare_cap, util); //capacity left on this cpu after placing p on it
    
                /*
                 * Skip CPUs that cannot satisfy the capacity request. IOW, placing the task there would make the CPU
                 * overutilized. Take uclamp into account to see how much capacity we can get out of the CPU; this is
                 * aligned with schedutil_cpu_util().
                 */
                //apply uclamp to util; if the clamped value no longer fits the cpu's capacity, give up probing this cpu
                util = uclamp_rq_util_with(cpu_rq(cpu), util, p);
                if (!fits_capacity(util, cpu_cap))
                    continue;
    
                /* Always use prev_cpu as a candidate. */
                if (!latency_sensitive && cpu == prev_cpu) { //not latency sensitive and this cpu is where p last ran
                    prev_delta = compute_energy(p, prev_cpu, pd); //energy of the pd with p placed on prev_cpu
                    prev_delta -= base_energy_pd; //energy the pd gains from placing p on prev_cpu
                    best_delta = min(best_delta, prev_delta); //take the minimum again here
                }
    
                /*
                 * Find the CPU with the maximum spare capacity in the performance domain
                 */
                //record the cpu with the most capacity left after placing p, and that capacity
                if (spare_cap > max_spare_cap) {
                    max_spare_cap = spare_cap;
                    max_spare_cap_cpu = cpu;
                }
    
                if (!latency_sensitive) //if not latency sensitive, stop probing this cpu here
                    continue;
    
                /*--- the rest only runs in the latency-sensitive case ---*/
                if (idle_cpu(cpu)) {
                    cpu_cap = capacity_orig_of(cpu);
                    //if boosted, target_cap starts at 0: prefer higher-capacity cpus
                    if (boosted && cpu_cap < target_cap)
                        continue;
                    //if not boosted, target_cap starts at ULONG_MAX: prefer lower-capacity cpus
                    if (!boosted && cpu_cap > target_cap)
                        continue;
                    idle = idle_get_state(cpu_rq(cpu)); //return rq->idle_state;
                    //among cpus of equal capacity, pick the one with the smallest idle exit latency;
                    //using '>=' on exit_latency instead would favor starting from a cluster's first cpu
                    if (idle && idle->exit_latency > min_exit_lat && cpu_cap == target_cap)
                        continue;
                    if (idle) //the NULL check only guards against a crash; this records the chosen idle cpu's exit latency, not necessarily the minimum
                        min_exit_lat = idle->exit_latency;
                    target_cap = cpu_cap; //save the idle cpu's capacity
                    best_idle_cpu = cpu; //record what is considered the best idle cpu
                } else if (spare_cap > max_spare_cap_ls) { //latency sensitive but this cpu is not idle
                    max_spare_cap_ls = spare_cap; //record the largest spare capacity
                    max_spare_cap_cpu_ls = cpu; //and the cpu that has it
                }
            }
    
            /*--- processing after all cpus of one cluster have been traversed ---*/
            /* Evaluate the energy impact of using this CPU. */
            if (!latency_sensitive && max_spare_cap_cpu >= 0 && max_spare_cap_cpu != prev_cpu) {
                //energy delta of the pd when p is placed on this cluster's max-spare-capacity cpu;
                //keep the smallest delta across all candidate cpus
                cur_delta = compute_energy(p, max_spare_cap_cpu, pd);
                cur_delta -= base_energy_pd;
                if (cur_delta < best_delta) {
                    best_delta = cur_delta;
                    best_energy_cpu = max_spare_cap_cpu;
                }
            }
        }
    //after the full traversal:
    unlock:
        rcu_read_unlock();
    
        if (latency_sensitive)
            return best_idle_cpu >= 0 ? best_idle_cpu : max_spare_cap_cpu_ls;
    
        /*
         * Pick the best CPU if prev_cpu cannot be used, or if it saves at least 6% of the energy used by prev_cpu.
         */
        if (prev_delta == ULONG_MAX)
            return best_energy_cpu;
    
        //if the energy delta of prev_cpu exceeds the delta of the best candidate by more than
        //6.25% of the energy consumed with p on prev_cpu, migrate to best_energy_cpu
        if ((prev_delta - best_delta) > ((prev_delta + base_energy) >> 4))
            return best_energy_cpu;
    
        return prev_cpu;
    
    fail:
        rcu_read_unlock();
        return -1;
    }

    With the stock logic, a full traversal calls compute_energy() only a handful of times: once per cluster for the baseline (p on no cpu), once for prev_cpu, and once per cluster for that cluster's max-spare-capacity cpu. In the non-latency_sensitive case the comparison places task p on the max-spare-capacity cpu of each cluster and picks, among those, the one with the smallest energy increase as the candidate cpu.

    To summarize the selection logic of this function:

    (1) If task p is latency_sensitive: return best_idle_cpu if one exists, otherwise return the cpu with the largest spare capacity. best_idle_cpu is chosen by:
    a. It must be an idle cpu.
    b. If p has a non-zero uclamp min it is considered boosted, and higher-capacity CPUs are preferred; otherwise lower-capacity CPUs are preferred.
    c. Among CPUs of equal capacity, pick the one with the shortest exit latency, i.e. the shallowest idle state.
    The max-spare-capacity cpu is chosen by:
    a. It must be a non-idle cpu.
    b. It must have the largest spare capacity left after p is placed on it.

    (2) If task p is not latency_sensitive and prev_cpu is unusable (p's affinity excludes prev_cpu, or prev_cpu's remaining capacity cannot hold p), return best_energy_cpu directly.
    best_energy_cpu is chosen by:
    a. It defaults to prev_cpu.
    b. The max-spare-capacity CPUs of the clusters compete: whichever shows the smallest energy increase with p placed on it wins.

    (3) If task p is not latency_sensitive, prev_cpu is usable, and the energy increase on prev_cpu exceeds that on best_energy_cpu by no more than 6.25% of the energy consumed with p on prev_cpu, pick prev_cpu: the energy saving is limited, and staying on prev_cpu reduces cache misses. A possible optimization is to prefer prev_cpu only when cache_hot() also holds.
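    The final decision of rules (2)/(3) can be sketched as a small user-space function; pick_cpu is a hypothetical name wrapping the function's closing lines:

    ```c
    #include <assert.h>
    #include <limits.h>

    /* Sketch of the tail of find_energy_efficient_cpu() for the
     * non-latency-sensitive case: keep prev_cpu unless best_energy_cpu saves
     * more than 1/16 (~6.25%) of the total estimated energy. */
    static int pick_cpu(unsigned long prev_delta, unsigned long best_delta,
                        unsigned long base_energy, int prev_cpu, int best_energy_cpu)
    {
        if (prev_delta == ULONG_MAX)  /* prev_cpu was not usable */
            return best_energy_cpu;
        if ((prev_delta - best_delta) > ((prev_delta + base_energy) >> 4))
            return best_energy_cpu;   /* saving worth the migration */
        return prev_cpu;              /* saving too small: stay put */
    }

    int main(void)
    {
        /* Total energy with p on prev_cpu would be 100 + 500; the threshold
         * is 600 >> 4 = 37. A saving of 40 migrates, a saving of 30 stays. */
        assert(pick_cpu(100, 60, 500, 1, 5) == 5);
        assert(pick_cpu(100, 70, 500, 1, 5) == 1);
        assert(pick_cpu(ULONG_MAX, 70, 500, 1, 5) == 5);
        return 0;
    }
    ```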

    The compute_energy() function used above:

    /*
     * Purpose: compute the energy of the whole pd, i.e. this cluster, after task p is migrated to
     * dst_cpu. Passing dst_cpu == -1 means p does not run on any cpu of this pd, which yields the
     * base energy.
     */
    static long compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
    {
        struct cpumask *pd_mask = perf_domain_span(pd);
        unsigned long cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask)); //return per_cpu(cpu_scale, cpu); this cpu's capacity
        unsigned long max_util = 0, sum_util = 0;
        unsigned long energy = 0;
        int cpu;
    
        //for every online cpu of this pd
        for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
            //util of each cpu of this pd if p ran on dst_cpu
            unsigned long cpu_util, util_cfs = cpu_util_next(cpu, p, dst_cpu);
            struct task_struct *tsk = cpu == dst_cpu ? p : NULL; //note the argument: this may always be NULL
    
            //sum of the capacity consumed by cfs+irq+rt+dl; note that ENERGY_UTIL is passed
            sum_util += schedutil_cpu_util(cpu, util_cfs, cpu_cap, ENERGY_UTIL, NULL);
            //this computation takes uclamp into account (util is usually clamped upward), and dl is
            //accounted differently here, via its bandwidth
            cpu_util = schedutil_cpu_util(cpu, util_cfs, cpu_cap, FREQUENCY_UTIL, tsk); //whether tsk is NULL only affects the clamp range
            //take the maximum cpu_util across all cpus of this pd
            max_util = max(max_util, cpu_util);
        }
        energy = em_cpu_energy(pd->em_pd, max_util, sum_util); //returns the energy of the whole pd
    
        return energy;
    }

    em_cpu_energy() computes the energy from the sum of the utils of all cpus in the cluster, and uses the util of the busiest cpu to select the frequency.

    /*
     * em_cpu_energy() - Estimates the energy consumed by the CPUs of a performance domain
     * @pd         : performance domain for which energy has to be estimated
     * @max_util : highest utilization among CPUs of the domain
     * @sum_util : sum of the utilization of all CPUs in the domain
     */
    /*
     * Purpose: compute the pd's energy. The max_util argument selects the frequency of this
     * cluster; sum_util is used to compute the energy of the cluster, i.e. the pd.
     */
    static inline unsigned long em_cpu_energy(struct em_perf_domain *pd, unsigned long max_util, unsigned long sum_util)
    {
        unsigned long freq, scale_cpu;
        struct em_perf_state *ps;
        int i, cpu;
    
        if (!sum_util)
            return 0;
    
        cpu = cpumask_first(to_cpumask(pd->cpus));
        scale_cpu = arch_scale_cpu_capacity(cpu); //capacity of the cpus in this pd
        ps = &pd->table[pd->nr_perf_states - 1]; //the table is ascending, so this is the highest perf-state
        freq = map_util_freq(max_util, ps->frequency, scale_cpu); //return (freq + (freq >> 2)) * util / cap = 1.25 * (util / cap) * max_freq;
    
        /*
         * Find the lowest performance state of the Energy Model above the requested frequency.
         */
        //find the em_perf_state whose frequency is just >= the computed freq
        for (i = 0; i < pd->nr_perf_states; i++) {
            ps = &pd->table[i];
            if (ps->frequency >= freq)
                break;
        }
    
        /*
         * The capacity of a CPU in the domain at the performance state (ps)
         * can be computed as:
         *
         *             ps->freq * scale_cpu
         *   ps->cap = --------------------                          (1)
         *                 cpu_max_freq
         *
         * So, ignoring the costs of idle states (which are not available in
         * the EM), the energy consumed by this CPU at that performance state
         * is estimated as:
         *
         *             ps->power * cpu_util
         *   cpu_nrg = --------------------                          (2)
         *                   ps->cap
         *
         * since 'cpu_util / ps->cap' represents its percentage of busy time.
         *
         *   NOTE: Although the result of this computation actually is in
         *         units of power, it can be manipulated as an energy value
         *         over a scheduling period, since it is assumed to be
         *         constant during that interval.
         *
         * By injecting (1) in (2), 'cpu_nrg' can be re-expressed as a product
         * of two terms:
         *
         *             ps->power * cpu_max_freq   cpu_util
         *   cpu_nrg = ------------------------ * ---------          (3)
         *                    ps->freq            scale_cpu
         *
         * The first term is static, and is stored in the em_perf_state struct
         * as 'ps->cost'.
         *
         * Since all CPUs of the domain have the same micro-architecture, they
         * share the same 'ps->cost', and the same CPU capacity. Hence, the
         * total energy of the domain (which is the simple sum of the energy of
         * all of its CPUs) can be factorized as:
         *
         *            ps->cost * \Sum cpu_util
         *   pd_nrg = ------------------------                       (4)
         *                  scale_cpu
         */
        return ps->cost * sum_util / scale_cpu; //as derived earlier: the energy of the whole pd
    }
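    The lookup plus formula (4) can be re-implemented as a user-space sketch with a made-up two-state table. The kernel's loop simply leaves ps at the last state when the mapped frequency overshoots fmax; the explicit clamp below mimics that:

    ```c
    #include <assert.h>

    /* User-space sketch of em_cpu_energy(): map max_util to a frequency with
     * 25% headroom, pick the lowest perf state covering it, apply formula (4). */
    struct em_perf_state { unsigned long frequency, power, cost; };

    static unsigned long map_util_freq(unsigned long util, unsigned long fmax,
                                       unsigned long cap)
    {
        return (fmax + (fmax >> 2)) * util / cap; /* 1.25 * fmax * util / cap */
    }

    static unsigned long em_cpu_energy(struct em_perf_state *table, int nr,
                                       unsigned long max_util, unsigned long sum_util,
                                       unsigned long scale_cpu)
    {
        unsigned long freq;
        int i;

        if (!sum_util)
            return 0;

        freq = map_util_freq(max_util, table[nr - 1].frequency, scale_cpu);
        for (i = 0; i < nr; i++)  /* ascending table: first state >= freq */
            if (table[i].frequency >= freq)
                break;
        if (i == nr)
            i = nr - 1;           /* overshoot: clamp to the highest state */
        return table[i].cost * sum_util / scale_cpu;
    }

    int main(void)
    {
        struct em_perf_state tbl[] = {
            { 1000000, 100, 200 }, /* cost = power * fmax / freq */
            { 2000000, 400, 400 },
        };
        /* max_util 400/1024 maps to ~976 MHz -> state 0 is selected. */
        assert(em_cpu_energy(tbl, 2, 400, 512, 1024) == 100);
        /* max_util 900/1024 overshoots fmax -> clamped to state 1. */
        assert(em_cpu_energy(tbl, 2, 900, 512, 1024) == 200);
        assert(em_cpu_energy(tbl, 2, 400, 0, 1024) == 0);
        return 0;
    }
    ```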

    cpu_util_next() computes the util of each cpu in the pd assuming task p were placed on dst_cpu. Iterating over every cpu of the pd yields the pd's sum_util, from which the pd's energy is computed.

    /*
     * Predicts what cpu_util(@cpu) would return if @p was migrated (and enqueued) to @dst_cpu.
     * Purpose: predict the util of the @cpu argument if task p were migrated to @dst_cpu
     */
    //compute_energy passes (cpu, p, -1): cpu is some cpu of the pd; note dst_cpu is -1, so util can only be subtracted, never added
    static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
    {
        struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
        unsigned long util_est, util = READ_ONCE(cfs_rq->avg.util_avg); //the cfs_rq's util; nothing is written back.
    
        /*
         * If @p migrates from @cpu to another, remove its contribution. Or,
         * if @p migrates from another CPU to @cpu, add its contribution. In
         * the other cases, @cpu is not impacted by the migration, so the
         * util_avg should already be correct.
         */
        //this cpu is where p last ran, but not where p is going to run
        if (task_cpu(p) == cpu && dst_cpu != cpu)
            sub_positive(&util, task_util(p)); //subtract p's util from the cfs_rq's util
        //this cpu is not where p last ran, but is where p is going to run
        else if (task_cpu(p) != cpu && dst_cpu == cpu)
            util += task_util(p); //add p's util to the cfs_rq's util
    
        if (sched_feat(UTIL_EST)) {
            util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
    
            /*
             * During wake-up, the task isn't enqueued yet and doesn't
             * appear in the cfs_rq->avg.util_est.enqueued of any rq,
             * so just add it (if needed) to "simulate" what will be
             * cpu_util() after the task has been enqueued.
             */
            //this cpu is where task p is going to run
            if (dst_cpu == cpu)
                util_est += _task_util_est(p);
    
            util = max(util, util_est);
        }
    
        //return the adjusted util of the cfs_rq
        return min(util, capacity_orig_of(cpu));
    }
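    The migration-delta logic of cpu_util_next() can be sketched in user space; util_est is omitted for brevity, and all quantities are passed in explicitly instead of being read from the rq:

    ```c
    #include <assert.h>

    /* Sketch of cpu_util_next(): predict @cpu's cfs util if @p migrated
     * (and was enqueued) to @dst_cpu. util_est is omitted. */
    static unsigned long cpu_util_next(unsigned long cpu_util, unsigned long task_util,
                                       int cpu, int task_cpu, int dst_cpu,
                                       unsigned long capacity_orig)
    {
        unsigned long util = cpu_util;

        if (task_cpu == cpu && dst_cpu != cpu)      /* p leaves this cpu */
            util = util > task_util ? util - task_util : 0;
        else if (task_cpu != cpu && dst_cpu == cpu) /* p arrives on this cpu */
            util += task_util;

        return util < capacity_orig ? util : capacity_orig;
    }

    int main(void)
    {
        /* cpu1 currently carries p (util 100) inside a total util of 300. */
        assert(cpu_util_next(300, 100, 1, 1, 2, 1024) == 200);  /* p moves away */
        assert(cpu_util_next(300, 100, 2, 1, 2, 1024) == 400);  /* p moves here */
        assert(cpu_util_next(300, 100, 3, 1, 2, 1024) == 300);  /* unaffected */
        assert(cpu_util_next(1000, 100, 2, 1, 2, 1024) == 1024); /* clamped */
        return 0;
    }
    ```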
    
    
    /*
     * For a non-dst cpu, compute_energy passes (cpu, util_cfs, cpu_cap, ENERGY_UTIL, NULL):
     * cpu is a cpu of this pd, util_cfs is that cpu's util after p migrates to dst_cpu,
     * cpu_cap is the capacity of a single cpu of this pd.
     *
     * For the dst cpu, compute_energy passes (cpu, util_cfs, cpu_cap, FREQUENCY_UTIL, tsk).
     *
     * To save space, many comments have been removed from the two functions below.
     */
    /*
     * This function computes an effective utilization for the given CPU, to be
     * used for frequency selection given the linear relation: f = u * f_max.
     */
    //Purpose: compute the effective util of the given cpu
    unsigned long schedutil_cpu_util(int cpu, unsigned long util_cfs, unsigned long max, enum schedutil_type type, struct task_struct *p)
    {
        unsigned long dl_util, util, irq;
        struct rq *rq = cpu_rq(cpu);
    
        if (!uclamp_is_used() && type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
            return max;
        }
    
        irq = cpu_util_irq(rq);
        if (unlikely(irq >= max))
            return max;
    
        util = util_cfs + cpu_util_rt(rq); //return rq->avg_rt.util_avg
        if (type == FREQUENCY_UTIL) //only FREQUENCY_UTIL considers uclamp; the EAS computation does not
            util = uclamp_rq_util_with(rq, util, p);
    
        dl_util = cpu_util_dl(rq); //return rq->avg_dl.util_avg
    
        if (util + dl_util >= max) //CFS+RT+DL already exceed the cpu's capacity
            return max;
    
        /*
         * OTOH, for energy computation we need the estimated running time, so
         * include util_dl and ignore dl_bw.
         */
        if (type == ENERGY_UTIL)
            util += dl_util;
    
        util = scale_irq_capacity(util, irq, max);
        util += irq; //util = util * (1 - irq/max) + irq
    
        if (type == FREQUENCY_UTIL)
            util += cpu_bw_dl(rq);
    
        return min(max, util); //the cpu's consumed capacity after cfs+irq+rt+dl
    }
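    The irq scaling noted in the comment above (util = util * (1 - irq/max) + irq) can be checked with a small sketch; it folds the kernel's scale_irq_capacity() call and the following `util += irq` into one helper:

    ```c
    #include <assert.h>

    /* Sketch of the irq scaling step: time stolen by irq is invisible to the
     * cfs/rt/dl signals, so the remaining util is scaled by the non-irq
     * fraction and the irq util is added back:
     *   util' = util * (max - irq) / max + irq */
    static unsigned long scale_irq_capacity(unsigned long util, unsigned long irq,
                                            unsigned long max)
    {
        return util * (max - irq) / max + irq;
    }

    int main(void)
    {
        /* irq eats 1/4 of the cpu: util 400 becomes 400 * 3/4 + 256 = 556. */
        assert(scale_irq_capacity(400, 256, 1024) == 556);
        /* No irq pressure leaves util unchanged. */
        assert(scale_irq_capacity(400, 0, 1024) == 400);
        return 0;
    }
    ```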

    VII. SoC Vendor Modifications to the Stock Logic

    Note the HOOKs in the code: a vendor may use them in a way that the stock EAS selection logic never runs.

    VIII. Related DEBUG Files

    1. A program to DEBUG the perf_domain list

    /* Place under kernel/sched */
    
    #define pr_fmt(fmt) "perf_domain_debug: " fmt
    
    #include <linux/fs.h>
    #include <linux/sched.h>
    #include <linux/proc_fs.h>
    #include <linux/seq_file.h>
    #include <linux/string.h>
    #include <linux/printk.h>
    #include <asm/topology.h>
    #include <linux/cpumask.h>
    #include <linux/sched/topology.h>
    #include "sched.h"
    
    
    struct perf_domain_debug_t {
        int cmd;
    };
    
    static struct perf_domain_debug_t pdd;
    
    
    static void perf_domain_debug(struct seq_file *m, struct perf_domain *pd)
    {
        int i;
        struct em_perf_domain *em_pd = pd->em_pd;
    
        seq_printf(m, "em_pd->nr_perf_states=%d, em_pd->milliwatts=%d, em_pd->cpus==%*pbl \n",
            em_pd->nr_perf_states, em_pd->milliwatts, cpumask_pr_args(to_cpumask(em_pd->cpus)));
    
        for (i = 0; i < em_pd->nr_perf_states; i++) {
            seq_printf(m, "[%d]: frequency=%lu, power=%lu, cost=%lu\n",
                    i, em_pd->table[i].frequency, em_pd->table[i].power, em_pd->table[i].cost);
        }
    
        seq_printf(m, "-------------------------------------------------------------------\n");
    }
    
    static int perf_domain_debug_show(struct seq_file *m, void *v)
    {
        struct root_domain *rd = cpu_rq(0)->rd;
        struct perf_domain *pd = rd->pd;
    
        while (pd) {
            perf_domain_debug(m, pd);
    
            pd = pd->next;
        }
    
        return 0;
    }
    
    static int perf_domain_debug_open(struct inode *inode, struct file *file)
    {
        return single_open(file, perf_domain_debug_show, NULL);
    }
    
    static ssize_t perf_domain_debug_write(struct file *file, const char __user *buf, size_t count, loff_t *ppos)
    {
    
        int ret, cmd_value;
        char buffer[32] = {0};
    
        if (count >= sizeof(buffer)) {
            count = sizeof(buffer) - 1;
        }
        if (copy_from_user(buffer, buf, count)) {
            pr_info("copy_from_user failed\n");
            return -EFAULT;
        }
        ret = sscanf(buffer, "%d", &cmd_value);
        if(ret <= 0){
            pr_info("sscanf dec failed\n");
            return -EINVAL;
        }
        pr_info("cmd_value=%d\n", cmd_value);
    
        pdd.cmd = cmd_value;
    
        return count;
    }
    
    //Linux 5.10 changed file_operations to proc_ops
    static const struct proc_ops perf_domain_debug_fops = {
        .proc_open    = perf_domain_debug_open,
        .proc_read    = seq_read,
        .proc_write   = perf_domain_debug_write,
        .proc_lseek  = seq_lseek,
        .proc_release = single_release,
    };
    
    
    static int __init perf_domain_debug_init(void)
    {
        proc_create("perf_domain_debug", S_IRUGO | S_IWUGO, NULL, &perf_domain_debug_fops);
    
        pr_info("domain_topo_debug probed\n");
    
        return 0;
    }
    fs_initcall(perf_domain_debug_init);

    2. Test results

    # cat /proc/perf_domain_debug
    em_pd->nr_perf_states=28, em_pd->milliwatts=1, em_pd->cpus==7
    [0]: frequency=1300000, power=308, cost=722615
    [1]: frequency=1400000, power=353, cost=769035
    [2]: frequency=1500000, power=393, cost=799100
    [3]: frequency=1600000, power=444, cost=846375
    [4]: frequency=1700000, power=490, cost=879117
    [5]: frequency=1800000, power=538, cost=911611
    [6]: frequency=1900000, power=588, cost=943894
    [7]: frequency=2000000, power=651, cost=992775
    [8]: frequency=2050000, power=691, cost=1028073
    [9]: frequency=2100000, power=732, cost=1063142
    [10]: frequency=2150000, power=785, cost=1113604
    [11]: frequency=2200000, power=830, cost=1150681
    [12]: frequency=2250000, power=876, cost=1187466
    [13]: frequency=2300000, power=922, cost=1222652
    [14]: frequency=2350000, power=971, cost=1260234
    [15]: frequency=2400000, power=1020, cost=1296250
    [16]: frequency=2450000, power=1088, cost=1354448
    [17]: frequency=2500000, power=1144, cost=1395680
    [18]: frequency=2550000, power=1198, cost=1432901
    [19]: frequency=2600000, power=1239, cost=1453442
    [20]: frequency=2650000, power=1299, cost=1495075
    [21]: frequency=2700000, power=1340, cost=1513703
    [22]: frequency=2750000, power=1403, cost=1556054
    [23]: frequency=2800000, power=1448, cost=1577285
    [24]: frequency=2850000, power=1511, cost=1617035
    [25]: frequency=2900000, power=1559, cost=1639637
    [26]: frequency=3000000, power=1674, cost=1701900
    [27]: frequency=3050000, power=1746, cost=1746000
    -------------------------------------------------------------------
    em_pd->nr_perf_states=32, em_pd->milliwatts=1, em_pd->cpus==4-6
    [0]: frequency=200000, power=21, cost=299250
    [1]: frequency=300000, power=31, cost=294500
    [2]: frequency=400000, power=41, cost=292125
    [3]: frequency=500000, power=55, cost=313500
    [4]: frequency=600000, power=70, cost=332500
    [5]: frequency=700000, power=87, cost=354214
    [6]: frequency=800000, power=104, cost=370500
    [7]: frequency=900000, power=125, cost=395833
    [8]: frequency=1000000, power=145, cost=413250
    [9]: frequency=1100000, power=169, cost=437863
    [10]: frequency=1200000, power=192, cost=456000
    [11]: frequency=1300000, power=215, cost=471346
    [12]: frequency=1400000, power=245, cost=498750
    [13]: frequency=1500000, power=272, cost=516800
    [14]: frequency=1600000, power=300, cost=534375
    [15]: frequency=1700000, power=335, cost=561617
    [16]: frequency=1800000, power=379, cost=600083
    [17]: frequency=1900000, power=420, cost=630000
    [18]: frequency=2000000, power=470, cost=669750
    [19]: frequency=2050000, power=496, cost=689560
    [20]: frequency=2100000, power=523, cost=709785
    [21]: frequency=2150000, power=543, cost=719790
    [22]: frequency=2200000, power=572, cost=741000
    [23]: frequency=2250000, power=602, cost=762533
    [24]: frequency=2300000, power=623, cost=771978
    [25]: frequency=2350000, power=645, cost=782234
    [26]: frequency=2400000, power=666, cost=790875
    [27]: frequency=2450000, power=690, cost=802653
    [28]: frequency=2550000, power=736, cost=822588
    [29]: frequency=2650000, power=783, cost=842094
    [30]: frequency=2750000, power=832, cost=862254
    [31]: frequency=2850000, power=880, cost=880000
    -------------------------------------------------------------------
    em_pd->nr_perf_states=30, em_pd->milliwatts=1, em_pd->cpus==0-3
    [0]: frequency=200000, power=14, cost=126000
    [1]: frequency=250000, power=19, cost=136800
    [2]: frequency=300000, power=23, cost=138000
    [3]: frequency=350000, power=28, cost=144000
    [4]: frequency=400000, power=32, cost=144000
    [5]: frequency=450000, power=37, cost=148000
    [6]: frequency=500000, power=43, cost=154800
    [7]: frequency=550000, power=47, cost=153818
    [8]: frequency=600000, power=53, cost=159000
    [9]: frequency=650000, power=59, cost=163384
    [10]: frequency=700000, power=63, cost=162000
    [11]: frequency=750000, power=70, cost=168000
    [12]: frequency=800000, power=76, cost=171000
    [13]: frequency=850000, power=81, cost=171529
    [14]: frequency=900000, power=87, cost=174000
    [15]: frequency=950000, power=94, cost=178105
    [16]: frequency=1000000, power=99, cost=178200
    [17]: frequency=1050000, power=108, cost=185142
    [18]: frequency=1100000, power=115, cost=188181
    [19]: frequency=1150000, power=125, cost=195652
    [20]: frequency=1200000, power=132, cost=198000
    [21]: frequency=1250000, power=140, cost=201600
    [22]: frequency=1300000, power=150, cost=207692
    [23]: frequency=1350000, power=158, cost=210666
    [24]: frequency=1400000, power=166, cost=213428
    [25]: frequency=1450000, power=177, cost=219724
    [26]: frequency=1500000, power=185, cost=222000
    [27]: frequency=1600000, power=205, cost=230625
    [28]: frequency=1700000, power=222, cost=235058
    [29]: frequency=1800000, power=243, cost=243000
    -------------------------------------------------------------------
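The per-OPP cost values dumped above are exactly what em_cpu_energy() consumes when EAS estimates the energy of a candidate placement. Below is a minimal userspace sketch of that estimate (struct em_ps, em_energy and the scale_cpu value of 1024 are illustrative assumptions, not kernel API; the 1.25x frequency headroom mirrors what map_util_freq() applies):

```c
#include <assert.h>

struct em_ps { unsigned long frequency, power, cost; };

/*
 * Userspace sketch of em_cpu_energy() (include/linux/energy_model.h,
 * Linux 5.10): map the domain's max utilization to a required frequency
 * with ~1.25x headroom (as map_util_freq() does), find the lowest OPP
 * at or above it (the table must be sorted by ascending frequency,
 * which is why em_perf_domain requires that ordering), then estimate
 * energy as cost * sum_util / scale_cpu.
 */
static unsigned long em_energy(const struct em_ps *table, int nr,
                               unsigned long max_util,
                               unsigned long sum_util,
                               unsigned long scale_cpu)
{
    unsigned long fmax = table[nr - 1].frequency;
    unsigned long freq = (fmax + (fmax >> 2)) * max_util / scale_cpu;
    int i;

    /* Stop at the first OPP that covers freq; clamp to the last OPP. */
    for (i = 0; i < nr - 1; i++)
        if (table[i].frequency >= freq)
            break;
    return table[i].cost * sum_util / scale_cpu;
}
```

With three rows of the little-cluster table above ({200000, cost 126000}, {900000, cost 174000}, {1800000, cost 243000}) and max_util = sum_util = 100, the required frequency is about 220 MHz after headroom, so the 900 MHz OPP is selected rather than the 200 MHz one.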

    If cpu7 is isolated, cluster3's performance domain is no longer present; the order of the pd singly-linked list is: cluster3 --> cluster2 --> cluster1.
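That list shape is easy to model; here is a hypothetical userspace mock (struct mock_pd and the cluster names are illustrative, not kernel types) showing how an isolated cluster simply drops out of the walk:

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical mock of root_domain->pd: a singly linked list of
 * performance domains. Isolating the only remaining CPU of a cluster
 * just leaves that cluster's node out of the list; walkers see one
 * fewer domain and need no special handling.
 */
struct mock_pd {
    const char *name;
    struct mock_pd *next;
};

/* Count the domains reachable from the head, as pd-walking loops do. */
static int pd_count(const struct mock_pd *pd)
{
    int n = 0;

    for (; pd; pd = pd->next)
        n++;
    return n;
}
```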

    ps->cost = ps->power * cpu_max_freq / ps->freq. For the little cluster's first OPP that gives 14 * 1800000 / 200000 = 126, yet the dumped cost is 126000, so the value has apparently been scaled up by 1000.
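That scaling is easy to check against the dump with a one-line userspace re-implementation of the cost formula (the function name em_cost is ours; the formula itself is the one used by em_create_perf_domain() in kernel/power/energy_model.c):

```c
#include <assert.h>
#include <stdint.h>

/*
 * cost = fmax * power / freq, as computed when the perf domain is
 * registered (kernel/power/energy_model.c, Linux 5.10). Feeding in the
 * mW power shown by the dump gives 126 for the little cluster's first
 * OPP; the dumped cost of 126000 matches only if power is taken at
 * 1000x that scale.
 */
static uint64_t em_cost(uint64_t power, uint64_t fmax, uint64_t freq)
{
    return fmax * power / freq;
}
```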

    VIII. Summary

    EAS takes effect only on the wakeup CPU-selection path, and only when the system is not overutilized and EAS is enabled; the candidate CPU set must also contain asymmetric-capacity CPUs, which on phone platforms means choosing among the sched groups at the DIE level. The algorithm first picks, within each cluster, the CPU with the most spare capacity as that cluster's candidate. It then computes, for each candidate, the energy increase its perf domain would see if the waking task p were placed there, and takes the CPU with the smallest increase as best_energy_cpu. Finally, best_energy_cpu is compared against prev_cpu: best_energy_cpu is chosen only if its energy saving over prev_cpu exceeds 1/16 of prev_cpu's energy; otherwise prev_cpu is kept.
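That final comparison against prev_cpu can be sketched as a standalone predicate. The function below is a simplified extraction of the check at the end of find_energy_efficient_cpu() in kernel/sched/fair.c (the function name and signature are ours; on 5.10 the 1/16 threshold is taken over prev_delta plus the domains' base energy):

```c
#include <assert.h>
#include <stdint.h>

/*
 * prev_delta / best_delta: estimated energy increases of placing the
 * task on prev_cpu / best_energy_cpu. base_energy: energy of the perf
 * domains without the task. best_energy_cpu wins only when it saves
 * more than 1/16 (~6%) of prev_cpu's estimated total energy, so small
 * savings do not justify migrating away from a likely-warm cache.
 */
static int pick_best_over_prev(uint64_t prev_delta, uint64_t best_delta,
                               uint64_t base_energy)
{
    if (prev_delta == UINT64_MAX)   /* prev_cpu was not a candidate */
        return 1;
    return (prev_delta - best_delta) > ((prev_delta + base_energy) >> 4);
}
```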

    Reference: https://blog.csdn.net/feelabclihu/article/details/122007603?spm=1001.2014.3001.5501

  • Original article: https://www.cnblogs.com/hellokitty2/p/15738144.html