  • CPU Load Balancing: Learning WALT [Repost]

    Reposted from: https://blog.csdn.net/xiaoqiaoq0/article/details/107135747/

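    Preface

    This post continues my notes on CPU scheduling, this time covering WALT. It addresses the following questions:

    1. What is WALT?
    2. How is WALT computed?
    3. How is the data computed by WALT used?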

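    1. What is WALT?

    WALT stands for Window Assisted Load Tracking:
    - Taken literally, it tracks CPU load using windows as an aid;
    - In essence it is a calculation method that expresses the current CPU load as data, which later feeds task scheduling, migration, load balancing and similar functions;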

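    1.1 Why is WALT needed?

    A new technique, and a new calculation method in particular, is usually introduced because the existing approach no longer fits current needs, or because the new one lets people be lazier;

    1.1.1 Where does the PELT calculation fall short?

    When PELT was introduced, Linux was mainly used on servers, where raw performance mattered most and power consumption was not yet a major concern. With the rise of mobile devices, power consumption and responsiveness are what users perceive directly, and they have become the main drivers of current development:

    1. On today's mobile devices, UI-related work must be handled as quickly as possible, otherwise the user clearly notices jank;
    2. Power consumption is unavoidable on mobile devices: a phone that needs frequent charging will not sell well;
    3. Whether a task counts as heavy depends on the user scenario. For example, the importance of a task changes with the content being displayed, so even tasks of the same class must be re-evaluated dynamically;

    PELT-based scheduling, built on decay, is better at capturing continuous trends and handles fast, abrupt changes poorly:

    1. It responds slowly to rapid rises and falls. Because of the decay calculation, a real increase or decrease in load only shows up in the data after a number of periods, which makes the response slow;
    2. Because of the decay mechanism, a task's computed load shrinks after it sleeps for a while. If the task is something like a network transfer that periodically needs CPU time and frequency, PELT cannot respond quickly (the calculation reflects trend and average behaviour better);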
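
    To make the slow-response point concrete, here is a minimal sketch of a PELT-style geometric series. It assumes the commonly documented PELT parameters (periods of roughly 1 ms and a decay factor y with y^32 = 0.5). The function name pelt_periods_to_reach is made up for illustration, and the floating-point loop is not the kernel's fixed-point implementation.

```c
#include <assert.h>

/* Decay factor y with y^32 == 0.5, i.e. 2^(-1/32) (assumed PELT half-life). */
#define PELT_Y 0.978572062087700

/*
 * Hypothetical helper: number of periods a task that has been idle for a
 * long time must run continuously before its normalised load signal,
 * starting from 0, first reaches `target` (0 < target < 1).
 * After n busy periods the signal is 1 - y^n.
 */
static int pelt_periods_to_reach(double target)
{
	double decayed = 1.0;	/* y^n, starting at y^0 */
	int n = 0;

	assert(target > 0.0 && target < 1.0);
	while (1.0 - decayed < target) {
		decayed *= PELT_Y;
		n++;
	}
	return n;
}
```

    Under these assumptions pelt_periods_to_reach(0.8) evaluates to 75, i.e. roughly 75 ms of continuous running before the signal reports 80% load, whereas a window-based scheme can report a fully busy window as soon as that window completes.
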

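    1.2 How WALT addresses this

    From the reasons above, what is needed is a calculation method that builds on PELT (keeping its benefits) while fitting current requirements better:

    1. Data is reported more promptly;
    2. Data directly reflects the current state;
    3. The computational cost does not increase;

    1.2.1 What WALT does

    Here is my summary of what WALT can (and needs to) achieve:

    1. Keep tracking every task entity;
    2. On top of the existing usage (load), additionally record demand, which is used for later prediction;
    3. The overall load of each CPU's runqueue is still the sum over all of its tasks;
    4. The core difference is in the calculation: decay is replaced by dividing time into windows, so the collected data reflects real changes faster (compared with PELT's trend). Some statements from the upstream Linux material:
      1. A task’s demand is the maximum of its contribution to the most recently completed window and its average demand over the past N windows.
      2. WALT “forgets” blocked time entirely: only runnable and running time are counted, which gives a more accurate picture of the time a task actually needs and allows prediction through demand;
      3. CPU busy time - The sum of execution times of all tasks in the most recently completed window;
      4. WALT “forgets” cpu utilization as soon as tasks are taken off of the runqueue;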

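    1.2.2 Where it is applied

    1. Accounting the load of each CPU and task before a task is placed;
    2. Task migration;
    3. Placement on big or little cores;
    4. EAS placement;

    1.3 When it was introduced

    1. Linux: reportedly introduced after 4.8.2, but browsing the code on bootlin shows that the latest release, 5.8, still has no corresponding file;
    2. Android: introduced from 4.4 onward (the Android 4.9 kernel does contain this code);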

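    2. Enabling WALT in the kernel

    The Android kernel code already includes WALT, but depending on the vendor it may not be enabled:

    1. Enable the config option via menuconfig:
      1. menuconfig ==> General setup ==> CPU/Task time and stats accounting ==> support window based load tracking
      2. [Figure: kernel config]
    2. Or edit the defconfig directly:
      1. Add CONFIG_SCHED_WALT=y to kernel/arch/arm64/configs/defconfig
    3. Build the image and verify that the change took effect:
      demo:/sys/kernel/tracing # zcat /proc/config.gz | grep WALT

      CONFIG_SCHED_WALT=y
      CONFIG_HID_WALTOP=y

    4. Testing
      So far I can only confirm in ftrace that WALT statistics are being collected. I have no real workload that shows whether anything actually improves, and no other measurements (the Linux material does include some numbers, but they were not measured locally);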

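    3. WALT calculation

    This section explains, from both the principle and the code, how WALT performs its calculation:

    1. How are windows divided?
    2. How are tasks classified, and how is each class handled?
    3. How is WALT data updated?
    4. How is the updated WALT data used by the scheduler and EAS?

    3.1 Window division

    First, how is the auxiliary quantity, the window, divided?
    Put simply, the time since boot is cut into periods of a fixed length. Each task's load is accounted separately per period and then propagated to the runqueue;

    What else needs to be decided?

    1. How long should one period (window) be? This is tuned per project; the kernel currently uses 20 ms as the default;
    2. How many windows of load are kept? This is also tuned per project; the kernel currently uses 5 windows;

    For a given task and window, the following situations can occur:
    [Figure: task interval relative to window boundaries]
    Note: ms = mark_start (start of the task's interval), ws = window_start (start of the current window), wc = wallclock (current system time)

    1. The task starts in this window and is still in this window when accounting happens, i.e. the task stays within one window;
    2. The task starts in the previous window and accounting happens in the current window, i.e. the task spans two windows;
    3. The task starts in some earlier window and accounting happens in the current window, i.e. the task spans several full windows;
      [Figure: the three window cases]
      These are the only three ways a task can relate to windows, and all calculations are based on this division;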
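
    The three cases can be told apart from ms, ws, wc and the window size alone. The following user-space sketch shows one way to do that; the enum and function names are hypothetical and do not appear in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical labels for the three cases described above. */
enum walt_span {
	SPAN_ONE_WINDOW,	/* case 1: ms lies inside the current window */
	SPAN_TWO_WINDOWS,	/* case 2: ms < ws, no full window in between */
	SPAN_MANY_WINDOWS,	/* case 3: ms < ws, at least one full window in between */
};

static enum walt_span classify_span(uint64_t ms, uint64_t ws, uint64_t wc,
				    uint64_t window_size)
{
	assert(window_size > 0 && ms <= wc && ws <= wc);

	if (ms >= ws)
		return SPAN_ONE_WINDOW;
	if ((ws - ms) / window_size == 0)
		return SPAN_TWO_WINDOWS;
	return SPAN_MANY_WINDOWS;
}
```

    For example, with a 20-unit window, classify_span(15, 60, 65, 20) falls into the third case because (60 - 15) / 20 leaves two full windows between ms and ws.
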

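    3.2 Task classification

    As you might expect, the formula differs for different kinds of task and for tasks in different states. WALT distinguishes the following task events:
    [Figure: task classification]
    The figure above also lists the function from which each task event is raised;

    3.2.1 Deciding whether to update demand

    Before demand is updated, the task event is checked to decide whether an update is needed at all:
    [Figure: how demand accounting differs by event]
    Corresponding function: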

    static int account_busy_for_task_demand(struct task_struct *p, int event)
    {
    	/* No need to bother updating task demand for exiting tasks
    	 * or the idle task. */
    	 // Nothing to compute if the task has exited or is the idle task
    	if (exiting_task(p) || is_idle_task(p))
    		return 0;
    
    	/* When a task is waking up it is completing a segment of non-busy
    	 * time. Likewise, if wait time is not treated as busy time, then
    	 * when a task begins to run or is migrated, it is not running and
    	 * is completing a segment of non-busy time. */
    	// walt_account_wait_time defaults to 1, so by default only TASK_WAKE returns 0 here
    	if (event == TASK_WAKE || (!walt_account_wait_time &&
    			 (event == PICK_NEXT_TASK || event == TASK_MIGRATE)))
    		return 0;
    
    	return 1;
    }
    

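    3.2.2 Deciding whether to update CPU busy time

    Before CPU busy time is updated, the task event is likewise checked to decide whether an update is needed:
    [Figure: how busy time accounting differs by event]
    Corresponding function: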

    static int account_busy_for_cpu_time(struct rq *rq, struct task_struct *p,
    				     u64 irqtime, int event)
    {
    // Is p the idle task or some other task?
    	if (is_idle_task(p)) {
    		/* TASK_WAKE && TASK_MIGRATE is not possible on idle task! */
    		// schedule() picked the idle task as the next task
    		if (event == PICK_NEXT_TASK)
    			return 0;
    	
    		/* PUT_PREV_TASK, TASK_UPDATE && IRQ_UPDATE are left */
    		// For the idle task, busy time is still accounted if there is irq time or the CPU is waiting on IO
    		return irqtime || cpu_is_waiting_on_io(rq);
    	}
    
    	// A wakeup does not contribute busy time
    	if (event == TASK_WAKE)
    		return 0;
    
    	// For a non-idle task, the following event types are accounted
    	if (event == PUT_PREV_TASK || event == IRQ_UPDATE ||
    					 event == TASK_UPDATE)
    		return 1;
    
    	/* Only TASK_MIGRATE && PICK_NEXT_TASK left */
    	// Defaults to 0
    	return walt_freq_account_wait_time;
    }
    

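    3.3 How is the data updated? (call flow)

    The previous two subsections covered how a task is accounted across windows and how the event type decides which statistics are updated. This subsection walks through the core call flow, starting with a diagram:
    [Figure: WALT call flow]
    The diagram was exported from XMind and I am not sure it can be enlarged, so the flow is described in detail here:

    1. The entry function walt_update_task_ravg
    2. The demand update function
    3. The cpu busy time update function

    3.3.1 The entry function

    walt_update_task_ravg
    Corresponding function: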

    /* Reflect task activity on its demand and cpu's busy time statistics */
    void walt_update_task_ravg(struct task_struct *p, struct rq *rq,
    		 int event, u64 wallclock, u64 irqtime)
    {
    	// Bail out early if WALT is disabled or the rq window has not been initialised
    	if (walt_disabled || !rq->window_start)
    		return;
    	lockdep_assert_held(&rq->lock);
    	// Update window_start and cum_window_demand
    	update_window_start(rq, wallclock);
    
    	if (!p->ravg.mark_start)
    		goto done;
    	// Update the statistics: demand and busy time
    	update_task_demand(p, rq, event, wallclock);
    	update_cpu_busy_time(p, rq, event, wallclock, irqtime);
    
    done:
    	// trace
    	trace_walt_update_task_ravg(p, rq, event, wallclock, irqtime);
    	// Update mark_start
    	p->ravg.mark_start = wallclock;
    }
    

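    The function does three main things:

    1. Updates the current window start time, in preparation for the updates that follow;
    2. Updates the task's demand; note that this also updates the corresponding data in the rq;
    3. Updates the cpu busy time attributed to the task;

    This function is the main entry point of the WALT calculation. It is called from many places (the leftmost column of the diagram above). Put simply, load is updated on interrupts, wakeups, migration and scheduling; the call sites are not described one by one here:

    1. task awakened
    2. task start execute
    3. task stop execute
    4. task exit
    5. window rollover
    6. interrupt
    7. scheduler_tick
    8. task migration
    9. freq change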
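    3.3.2 Updating window start

    Before anything is computed, window_start is updated so that the rq's window start value is accurate:
    [Figure: update_window_start flow]
    Corresponding function: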

    static void
    update_window_start(struct rq *rq, u64 wallclock)
    {
    	s64 delta;
    	int nr_windows;
    	// Time elapsed since the start of the current window
    	delta = wallclock - rq->window_start;
    	/* If the MPM global timer is cleared, set delta as 0 to avoid kernel BUG happening */
    	if (delta < 0) {
    		delta = 0;
    		/*
    		 * WARN_ONCE(1,
    		 * "WALT wallclock appears to have gone backwards or reset\n");
    		 */
    	}
    
    	if (delta < walt_ravg_window) // less than one full window has elapsed: nothing to do
    		return;
    
    	nr_windows = div64_u64(delta, walt_ravg_window);// number of whole windows elapsed
    	rq->window_start += (u64)nr_windows * (u64)walt_ravg_window;// advance window_start by that many windows
    
    	rq->cum_window_demand = rq->cumulative_runnable_avg;// in effect reset to cumulative_runnable_avg
    }
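
    The rollover arithmetic is easy to check outside the kernel. The sketch below mirrors only the integer arithmetic of update_window_start; the function name and the plain uint64_t interface are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * User-space sketch of the rollover arithmetic: advance window_start by a
 * whole number of windows so that it becomes the start of the window that
 * contains wallclock. Returns the number of windows that elapsed.
 */
static uint64_t advance_window_start(uint64_t *window_start, uint64_t wallclock,
				     uint64_t window_size)
{
	uint64_t nr_windows;

	assert(window_size > 0);
	if (wallclock < *window_start)	/* clock went backwards: treat as no progress */
		return 0;

	nr_windows = (wallclock - *window_start) / window_size;
	*window_start += nr_windows * window_size;
	return nr_windows;
}
```

    With a 20 ms window (20000000 ns), a window_start of 0 and a wallclock of 65 ms, three windows have elapsed and window_start moves to 60 ms; a later call at 70 ms leaves it unchanged.
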
    

    3.3.3 Updating demand

    3.3.3.1 Main demand logic:

    [Figure: update_task_demand flow]
    Corresponding function:

    /*
     * Account cpu demand of task and/or update task's cpu demand history
     *
     * ms = p->ravg.mark_start;
     * wc = wallclock
     * ws = rq->window_start
     *
     * Three possibilities:
     *
     *	a) Task event is contained within one window.
     *		window_start < mark_start < wallclock
     *
     *		ws   ms  wc
     *		|    |   |
     *		V    V   V
     *		|---------------|
     *
     *	In this case, p->ravg.sum is updated *iff* event is appropriate
     *	(ex: event == PUT_PREV_TASK)
     *
     *	b) Task event spans two windows.
     *		mark_start < window_start < wallclock
     *
     *		ms   ws   wc
     *		|    |    |
     *		V    V    V
     *		-----|-------------------
     *
     *	In this case, p->ravg.sum is updated with (ws - ms) *iff* event
     *	is appropriate, then a new window sample is recorded followed
     *	by p->ravg.sum being set to (wc - ws) *iff* event is appropriate.
     *
     *	c) Task event spans more than two windows.
     *
     *		ms ws_tmp			   ws  wc
     *		|  |				   |   |
     *		V  V				   V   V
     *		---|-------|-------|-------|-------|------
     *		   |				   |
     *		   |<------ nr_full_windows ------>|
     *
     *	In this case, p->ravg.sum is updated with (ws_tmp - ms) first *iff*
     *	event is appropriate, window sample of p->ravg.sum is recorded,
     *	'nr_full_window' samples of window_size is also recorded *iff*
     *	event is appropriate and finally p->ravg.sum is set to (wc - ws)
     *	*iff* event is appropriate.
     *
     * IMPORTANT : Leave p->ravg.mark_start unchanged, as update_cpu_busy_time()
     * depends on it!
     */
    static void update_task_demand(struct task_struct *p, struct rq *rq,
    		 int event, u64 wallclock)
    {
    	u64 mark_start = p->ravg.mark_start;// mark_start is a per-task value
    	u64 delta, window_start = rq->window_start;// window_start is a per-rq value
    	int new_window, nr_full_windows;
    	u32 window_size = walt_ravg_window;
    
    	// First check, ms versus ws: did the task's mark_start fall before the current window?
    	new_window = mark_start < window_start;
    	if (!account_busy_for_task_demand(p, event)) {
    		if (new_window)
    			/* If the time accounted isn't being accounted as
    			 * busy time, and a new window started, only the
    			 * previous window need be closed out with the
    			 * pre-existing demand. Multiple windows may have
    			 * elapsed, but since empty windows are dropped,
    			 * it is not necessary to account those. */
    			update_history(rq, p, p->ravg.sum, 1, event);
    		return;
    	}
    
    	// If ms >= ws this is case a: add wc - ms, the time actually spent within this window
    	if (!new_window) {
    		/* The simple case - busy time contained within the existing
    		 * window. */
    		add_to_task_demand(rq, p, wallclock - mark_start);
    		return;
    	}
    
    	// The interval spans more than one window
    	/* Busy time spans at least two windows. Temporarily rewind
    	 * window_start to first window boundary after mark_start. */
    	// Time from ms to ws, which may contain several full windows
    	delta = window_start - mark_start;
    	nr_full_windows = div64_u64(delta, window_size);
    	window_start -= (u64)nr_full_windows * (u64)window_size;
    	// window_start now points at ws_tmp
    	
    	/* Process (window_start - mark_start) first */
    	// First add the demand of the initial partial window
    	add_to_task_demand(rq, p, window_start - mark_start);
    
    	/* Push new sample(s) into task's demand history */
    	
    	// Update the history
    	update_history(rq, p, p->ravg.sum, 1, event);
    	if (nr_full_windows)
    		update_history(rq, p, scale_exec_time(window_size, rq),
    				   nr_full_windows, event);
    
    	/* Roll window_start back to current to process any remainder
    	 * in current window. */
    	// Restore window_start
    	window_start += (u64)nr_full_windows * (u64)window_size;
    
    	/* Process (wallclock - window_start) next */
    	// Account the final partial window; overall this resembles the PELT calculation, with the history step added
    	mark_start = window_start;
    	add_to_task_demand(rq, p, wallclock - mark_start);
    }		
    
    // Accumulating demand:
    static void add_to_task_demand(struct rq *rq, struct task_struct *p,
    		u64 delta)
    {
    	// The delta is converted from raw run time to capacity-scaled time; typically it is multiplied by the CPU's current capacity (capcurr) and divided by 1024
    	delta = scale_exec_time(delta, rq);
    	p->ravg.sum += delta;
    	// Clamp sum to the window size if it overshoots
    	if (unlikely(p->ravg.sum > walt_ravg_window))
    		p->ravg.sum = walt_ravg_window;
    }
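
    As a worked illustration of cases a) to c), the user-space sketch below splits one busy interval into the pieces that update_task_demand accounts. The struct and function names are hypothetical, and scale_time only models the scaling described in the comment above (current capacity out of 1024); it is an assumption, not a copy of scale_exec_time.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: how one busy interval is spread over windows. */
struct demand_split {
	uint64_t first_partial;	/* busy time that closes the window containing ms */
	uint64_t full_windows;	/* number of completely busy windows in between */
	uint64_t last_partial;	/* busy time that falls into the current window */
};

/* Assumed scaling: run time weighted by current CPU capacity out of 1024. */
static uint64_t scale_time(uint64_t delta, uint64_t capacity_curr)
{
	return (delta * capacity_curr) >> 10;
}

static struct demand_split split_busy_interval(uint64_t ms, uint64_t ws,
					       uint64_t wc, uint64_t window_size)
{
	struct demand_split s = { 0, 0, 0 };

	assert(window_size > 0 && ms <= wc && ws <= wc);

	if (ms >= ws) {			/* case a: all inside the current window */
		s.last_partial = wc - ms;
		return s;
	}

	s.full_windows = (ws - ms) / window_size;
	/* ws_tmp = ws - full_windows * window_size is the first boundary after ms */
	s.first_partial = (ws - s.full_windows * window_size) - ms;
	s.last_partial = wc - ws;
	return s;
}
```

    With ms = 15, ws = 60, wc = 65 and a window of 20, the interval breaks down into 5 units that close the first window, two full windows, and 5 units in the current window. On a CPU running at half capacity, scale_time(20, 512) turns a full window of 20 units into 10.
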
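
    3.3.3.2 update_history logic:

    Summary of update_history:

    1. It is called when a task enters a new window;
    2. It updates the task's demand from the task's behaviour over the past few windows;
    3. It also updates the usage in the rq, based on the newly computed demand;
      [Figure: update_history flow]
      Corresponding function:
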
    /*
     * Called when new window is starting for a task, to record cpu usage over
     * recently concluded window(s). Normally 'samples' should be 1. It can be > 1
     * when, say, a real-time task runs without preemption for several windows at a
     * stretch.
     */
     
    static void update_history(struct rq *rq, struct task_struct *p,
    			 u32 runtime, int samples, int event)
    {
    	u32 *hist = &p->ravg.sum_history[0];// start of the task's per-window history array
    	int ridx, widx;
    	u32 max = 0, avg, demand;
    	u64 sum = 0;
    
    	/* Ignore windows where task had no activity */
    	if (!runtime || is_idle_task(p) || exiting_task(p) || !samples)
    			goto done;
    
    	/* Push new 'runtime' value onto stack */
    	widx = walt_ravg_hist_size - 1;// write index: last (oldest) history slot
    	ridx = widx - samples;// read index: the oldest 'samples' entries get overwritten
    
    // The two loops below shift the history and insert the new window sample(s), updating sum and max along the way
    	for (; ridx >= 0; --widx, --ridx) {
    		hist[widx] = hist[ridx];
    		sum += hist[widx];
    		if (hist[widx] > max)
    			max = hist[widx];
    	}
    
    	for (widx = 0; widx < samples && widx < walt_ravg_hist_size; widx++) {
    		hist[widx] = runtime;
    		sum += hist[widx];
    		if (hist[widx] > max)
    			max = hist[widx];
    	}
    // Reset the task's sum for the new window
    	p->ravg.sum = 0;
    
    // demand is derived from the history according to the policy; our default is policy 2, WINDOW_STATS_MAX_RECENT_AVG, which takes the larger of the historical average and the most recent value
    	if (walt_window_stats_policy == WINDOW_STATS_RECENT) {
    		demand = runtime;
    	} else if (walt_window_stats_policy == WINDOW_STATS_MAX) {
    		demand = max;
    	} else {
    		avg = div64_u64(sum, walt_ravg_hist_size);
    		if (walt_window_stats_policy == WINDOW_STATS_AVG)
    			demand = avg;
    		else
    			demand = max(avg, runtime);
    	}
    
    	/*
    	 * A throttled deadline sched class task gets dequeued without
    	 * changing p->on_rq. Since the dequeue decrements hmp stats
    	 * avoid decrementing it here again.
    	 *
    	 * When window is rolled over, the cumulative window demand
    	 * is reset to the cumulative runnable average (contribution from
    	 * the tasks on the runqueue). If the current task is dequeued
    	 * already, it's demand is not included in the cumulative runnable
    	 * average. So add the task demand separately to cumulative window
    	 * demand.
    	 */
    // Fix up the runnable_avg figures, unless p is a throttled deadline task
    	if (!task_has_dl_policy(p) || !p->dl.dl_throttled) {
    		if (task_on_rq_queued(p))// p is queued on the runqueue
    			fixup_cumulative_runnable_avg(rq, p, demand);// add the difference between the new demand and the demand recorded in the task to the rq's cumulative_runnable_avg
    		else if (rq->curr == p)// p is the task currently running
    			fixup_cum_window_demand(rq, demand);// add demand to the rq's cum_window_demand
    	}
    // Finally store the computed demand in the task
    	p->ravg.demand = demand;
    
    done:
    	trace_walt_update_history(rq, p, runtime, samples, event);
    	return;
    }
    
    // Update cumulative_runnable_avg
    static void
    fixup_cumulative_runnable_avg(struct rq *rq,
    			      struct task_struct *p, u64 new_task_load)
    {
    // Difference between the new demand and the demand recorded in p (may be negative)
    	s64 task_load_delta = (s64)new_task_load - task_load(p);
    // Apply it to the rq
    	rq->cumulative_runnable_avg += task_load_delta;
    	if ((s64)rq->cumulative_runnable_avg < 0)
    		panic("cra less than zero: tld: %lld, task_load(p) = %u\n",
    			task_load_delta, task_load(p));
    // Apply the same delta to cum_window_demand
    	fixup_cum_window_demand(rq, task_load_delta);
    }
    
    // Update cum_window_demand by adding the passed-in delta (clamped at zero)
    static inline void fixup_cum_window_demand(struct rq *rq, s64 delta)
    {
    	rq->cum_window_demand += delta;
    	if (unlikely((s64)rq->cum_window_demand < 0))
    		rq->cum_window_demand = 0;
    }
    
    // So this path actually updates cum_window_demand and cumulative_runnable_avg.
    // Both are also updated in the two functions below: one adds, the other subtracts.
    void
    walt_inc_cumulative_runnable_avg(struct rq *rq,
    				 struct task_struct *p)
    {
    	rq->cumulative_runnable_avg += p->ravg.demand;
    
    	/*
    	 * Add a task's contribution to the cumulative window demand when
    	 *
    	 * (1) task is enqueued with on_rq = 1 i.e migration,
    	 *     prio/cgroup/class change.
    	 * (2) task is waking for the first time in this window.
    	 */
    	if (p->on_rq || (p->last_sleep_ts < rq->window_start))
    		fixup_cum_window_demand(rq, p->ravg.demand);
    }
    
    void
    walt_dec_cumulative_runnable_avg(struct rq *rq,
    				 struct task_struct *p)
    {
    	rq->cumulative_runnable_avg -= p->ravg.demand;
    	BUG_ON((s64)rq->cumulative_runnable_avg < 0);
    
    	/*
    	 * on_rq will be 1 for sleeping tasks. So check if the task
    	 * is migrating or dequeuing in RUNNING state to change the
    	 * prio/cgroup/class.
    	 */
    	if (task_on_rq_migrating(p) || p->state == TASK_RUNNING)
    		fixup_cum_window_demand(rq, -(s64)p->ravg.demand);
    }
    
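    // Searching the code for callers of these two functions:
    // they are called from the fair, dl, rt and stop_task scheduling classes: inc on enqueue, dec on dequeue;
    // this accounting is done before the rq's nr_running is updated;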
    

    The functions are annotated inline in the code above; questions are welcome;
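
    The history shift and the default policy in update_history can be tried out in user space. The sketch below assumes a fixed history depth of 5 and implements only the WINDOW_STATS_MAX_RECENT_AVG policy; HIST_SIZE and push_and_get_demand are names invented for this example.

```c
#include <assert.h>
#include <stdint.h>

#define HIST_SIZE 5	/* assumed history depth, matching the 5 windows mentioned earlier */

/*
 * User-space sketch: push `samples` copies of `runtime` into hist[] (newest
 * at index 0) and return max(average of the history, runtime), i.e. the
 * WINDOW_STATS_MAX_RECENT_AVG policy.
 */
static uint32_t push_and_get_demand(uint32_t hist[HIST_SIZE], uint32_t runtime,
				    int samples)
{
	uint64_t sum = 0;
	uint32_t avg;
	int widx = HIST_SIZE - 1;
	int ridx = widx - samples;

	assert(samples > 0);

	/* Shift older samples towards the end, dropping the oldest ones. */
	for (; ridx >= 0; --widx, --ridx) {
		hist[widx] = hist[ridx];
		sum += hist[widx];
	}
	/* Fill the freed slots at the front with the new sample. */
	for (widx = 0; widx < samples && widx < HIST_SIZE; widx++) {
		hist[widx] = runtime;
		sum += hist[widx];
	}

	avg = (uint32_t)(sum / HIST_SIZE);
	return avg > runtime ? avg : runtime;
}
```

    Starting from a history of {4, 4, 4, 4, 4}, pushing a busy window of 14 yields a demand of 14 right away (the recent value wins), while pushing a quiet window of 2 afterwards yields 5 (the average wins). This is how the policy reacts quickly to a rise but falls back slowly.
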

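    3.3.3.3 Summary of the demand update function:

    The demand update does the following:

    1. It handles intervals that fall within one window as well as intervals that span several windows, following the window division described above;
    2. Note that window_start and mark_start are local variables in this function; the values stored in the task are not updated, because the busy time calculation that follows still needs them;
    3. What demand actually updates is ravg.sum in the task, plus cumulative_runnable_avg and cum_window_demand in the rq;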

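    3.3.4 Updating cpu busy time

    The logic of this function is even larger when drawn out. It handles a number of separate cases; the split follows the window division described earlier, but the values accounted differ slightly from case to case:
    [Figure: update_cpu_busy_time flow]
    Corresponding function: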

    /*
     * Account cpu activity in its busy time counters (rq->curr/prev_runnable_sum)
     */
    static void update_cpu_busy_time(struct task_struct *p, struct rq *rq,
    		 int event, u64 wallclock, u64 irqtime)
    {
    	int new_window, nr_full_windows = 0;
    	int p_is_curr_task = (p == rq->curr);
    	u64 mark_start = p->ravg.mark_start;    //ms
    	u64 window_start = rq->window_start;    //ws
    	u32 window_size = walt_ravg_window;    //window size 
    	u64 delta;
    
    	// Derive the initial values
    	new_window = mark_start < window_start;// is task period in a new window?
    	if (new_window) {
    		// update nr_full_windows
    		nr_full_windows = div64_u64((window_start - mark_start),
    						window_size);
    		if (p->ravg.active_windows < USHRT_MAX)
    			p->ravg.active_windows++;
    	}
    
    	/* Handle per-task window rollover. We don't care about the idle
    	 * task or exiting tasks. */
    	if (new_window && !is_idle_task(p) && !exiting_task(p)) {
    		u32 curr_window = 0;
    
    		if (!nr_full_windows)
    			curr_window = p->ravg.curr_window;
    		//update prev
    		p->ravg.prev_window = curr_window;
    		p->ravg.curr_window = 0;
    	}
    
    	// Use the event and irqtime to decide whether this interval counts as busy; if it does not, no busy time is added
    	if (!account_busy_for_cpu_time(rq, p, irqtime, event)) {
    		/* account_busy_for_cpu_time() = 0, so no update to the
    		 * task's current window needs to be made. This could be
    		 * for example
    		 *
    		 *   - a wakeup event on a task within the current
    		 *     window (!new_window below, no action required),
    		 *   - switching to a new task from idle (PICK_NEXT_TASK)
    		 *     in a new window where irqtime is 0 and we aren't
    		 *     waiting on IO */
    
    		if (!new_window)
    			return;
    
    		/* A new window has started. The RQ demand must be rolled
    		 * over if p is the current task. */
    		if (p_is_curr_task) {
    			u64 prev_sum = 0;
    
    			/* p is either idle task or an exiting task */
    			if (!nr_full_windows) {
    				prev_sum = rq->curr_runnable_sum;
    			}
    
    			rq->prev_runnable_sum = prev_sum;
    			rq->curr_runnable_sum = 0;
    		}
    
    		return;
    	}
    
    	// mark_start lies within the current window: decide which delta applies (this is the core step), then scale and accumulate it
    	if (!new_window) {
    		/* account_busy_for_cpu_time() = 1 so busy time needs
    		 * to be accounted to the current window. No rollover
    		 * since we didn't start a new window. An example of this is
    		 * when a task starts execution and then sleeps within the
    		 * same window. */
    		// Check: no irqtime, or p is not the idle task, or the CPU is waiting on IO
    		if (!irqtime || !is_idle_task(p) || cpu_is_waiting_on_io(rq))
    			delta = wallclock - mark_start;
    		else
    			delta = irqtime;
    		// Scale the time and add it to the current window sums
    		delta = scale_exec_time(delta, rq);
    		rq->curr_runnable_sum += delta;
    		if (!is_idle_task(p) && !exiting_task(p))
    			p->ravg.curr_window += delta;
    
    		return;
    	}
    
    	// A new window has started and busy time must be accounted, but p is not the task currently running on this rq
    	if (!p_is_curr_task) {
    		/* account_busy_for_cpu_time() = 1 so busy time needs
    		 * to be accounted to the current window. A new window
    		 * has also started, but p is not the current task, so the
    		 * window is not rolled over - just split up and account
    		 * as necessary into curr and prev. The window is only
    		 * rolled over when a new window is processed for the current
    		 * task.
    		 *
    		 * Irqtime can't be accounted by a task that isn't the
    		 * currently running task. */
    		// Split the accounting into two parts: prev and curr
    		if (!nr_full_windows) {
    			/* A full window hasn't elapsed, account partial
    			 * contribution to previous completed window. */
    			delta = scale_exec_time(window_start - mark_start, rq);
    			if (!exiting_task(p))
    				p->ravg.prev_window += delta;
    		} else {
    			/* Since at least one full window has elapsed,
    			 * the contribution to the previous window is the
    			 * full window (window_size). */
    			delta = scale_exec_time(window_size, rq);
    			if (!exiting_task(p))
    				p->ravg.prev_window = delta;
    		}
    		rq->prev_runnable_sum += delta;
    
    		/* Account piece of busy time in the current window. */
    		delta = scale_exec_time(wallclock - window_start, rq);
    		rq->curr_runnable_sum += delta;
    		if (!exiting_task(p))
    			p->ravg.curr_window = delta;
    
    		return;
    	}
    
    	// p is the running (current) task
    	if (!irqtime || !is_idle_task(p) || cpu_is_waiting_on_io(rq)) {
    		/* account_busy_for_cpu_time() = 1 so busy time needs
    		 * to be accounted to the current window. A new window
    		 * has started and p is the current task so rollover is
    		 * needed. If any of these three above conditions are true
    		 * then this busy time can't be accounted as irqtime.
    		 *
    		 * Busy time for the idle task or exiting tasks need not
    		 * be accounted.
    		 *
    		 * An example of this would be a task that starts execution
    		 * and then sleeps once a new window has begun. */
    
    		if (!nr_full_windows) {
    			/* A full window hasn't elapsed, account partial
    			 * contribution to previous completed window. */
    			delta = scale_exec_time(window_start - mark_start, rq);
    			if (!is_idle_task(p) && !exiting_task(p))
    				p->ravg.prev_window += delta;
    
    			delta += rq->curr_runnable_sum;
    		} else {
    			/* Since at least one full window has elapsed,
    			 * the contribution to the previous window is the
    			 * full window (window_size). */
    			delta = scale_exec_time(window_size, rq);
    			if (!is_idle_task(p) && !exiting_task(p))
    				p->ravg.prev_window = delta;
    
    		}
    		/*
    		 * Rollover for normal runnable sum is done here by overwriting
    		 * the values in prev_runnable_sum and curr_runnable_sum.
    		 * Rollover for new task runnable sum has completed by previous
    		 * if-else statement.
    		 */
    		rq->prev_runnable_sum = delta;
    
    		/* Account piece of busy time in the current window. */
    		delta = scale_exec_time(wallclock - window_start, rq);
    		rq->curr_runnable_sum = delta;
    		if (!is_idle_task(p) && !exiting_task(p))
    			p->ravg.curr_window = delta;
    
    		return;
    	}
    
    	// Interrupt time
    	if (irqtime) {
    		/* account_busy_for_cpu_time() = 1 so busy time needs
    		 * to be accounted to the current window. A new window
    		 * has started and p is the current task so rollover is
    		 * needed. The current task must be the idle task because
    		 * irqtime is not accounted for any other task.
    		 *
    		 * Irqtime will be accounted each time we process IRQ activity
    		 * after a period of idleness, so we know the IRQ busy time
    		 * started at wallclock - irqtime. */
    
    		BUG_ON(!is_idle_task(p));
    		mark_start = wallclock - irqtime;
    
    		/* Roll window over. If IRQ busy time was just in the current
    		 * window then that is all that need be accounted. */
    		rq->prev_runnable_sum = rq->curr_runnable_sum;
    		if (mark_start > window_start) {
    			rq->curr_runnable_sum = scale_exec_time(irqtime, rq);
    			return;
    		}
    
    		/* The IRQ busy time spanned multiple windows. Process the
    		 * busy time preceding the current window start first. */
    		delta = window_start - mark_start;
    		if (delta > window_size)
    			delta = window_size;
    		delta = scale_exec_time(delta, rq);
    		rq->prev_runnable_sum += delta;
    
    		/* Process the remaining IRQ busy time in the current window. */
    		delta = wallclock - window_start;
    		rq->curr_runnable_sum = scale_exec_time(delta, rq);
    
    		return;
    	}
    
    	BUG();
    }	
    

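    The details are annotated in the function; a brief summary:

    1. Busy time is computed differently depending on the kind of task;
    2. The core calculation is the same in every case; only the concrete values differ;
    3. The data updated is:
      prev_window and curr_window in the task
      prev_runnable_sum and curr_runnable_sum in the rq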
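
    The rollover of the two rq counters for the running task can be reduced to a few lines. The sketch below covers only the "p is the current task and a new window has started" branch and leaves out frequency scaling (scale_exec_time is treated as the identity); the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two rq counters discussed above. */
struct rq_busy {
	uint64_t prev_runnable_sum;
	uint64_t curr_runnable_sum;
};

/*
 * Sketch of the rollover for the running task when its interval [ms, wc)
 * crosses into a new window starting at ws (ms < ws <= wc). Frequency
 * scaling is left out, i.e. scale_exec_time() is treated as the identity.
 */
static void rollover_busy_time(struct rq_busy *rq, uint64_t ms, uint64_t ws,
			       uint64_t wc, uint64_t window_size)
{
	uint64_t nr_full_windows;
	uint64_t prev;

	assert(window_size > 0 && ms < ws && ws <= wc);

	nr_full_windows = (ws - ms) / window_size;
	if (!nr_full_windows)
		/* tail of the previous window plus what it had already accumulated */
		prev = (ws - ms) + rq->curr_runnable_sum;
	else
		/* the task ran through at least one whole window */
		prev = window_size;

	rq->prev_runnable_sum = prev;
	rq->curr_runnable_sum = wc - ws;
}
```

    For instance, with curr_runnable_sum = 7, ms = 15, ws = 20, wc = 26 and a window of 20, the previous window ends up with 5 + 7 = 12 and the current window starts at 6.
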

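    3.3.5 irq load accounting

    3.3.5.1 Three irq-related variables:

    cur_irqload: the irq load currently being accumulated on the rq, i.e. irq execution time;
    avg_irqload: the rq's average irq load; it depends on interrupt frequency, decays gradually, and is a running total;
    irqload_ts: the time at which WALT irqload was last computed; it is used to judge how frequent interrupts are;

    3.3.5.2 Call flow

    All three values are set to 0 in sched_init. As seen earlier, this accounting is invoked on interrupts. In detail: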

    void walt_account_irqtime(int cpu, struct task_struct *curr,
    				 u64 delta, u64 wallclock)
    {
    	struct rq *rq = cpu_rq(cpu);
    	unsigned long flags, nr_windows;
    	u64 cur_jiffies_ts;
    
    	raw_spin_lock_irqsave(&rq->lock, flags);
    
    	/*
    	 * cputime (wallclock) uses sched_clock so use the same here for
    	 * consistency.
    	 */
    	 // Add the time elapsed between reading wallclock and reaching this point, as a correction;
    	 // the incoming delta is sched_clock_cpu - irq_start_time,
    	 // i.e. delta is the execution time of the irq;
    	delta += sched_clock() - wallclock;
    	cur_jiffies_ts = get_jiffies_64();
    
    	// If curr is the idle task, update the WALT statistics; the current time is fetched as wallclock and delta is the irq execution time
    	if (is_idle_task(curr))
    		walt_update_task_ravg(curr, rq, IRQ_UPDATE, walt_ktime_clock(),
    				 delta);
    
    	// Time between two irq accounting calls; despite its name, nr_windows counts ticks (jiffies)
    	nr_windows = cur_jiffies_ts - rq->irqload_ts;
    
    	// This reflects how often interrupts fire on this CPU, using 10 ticks as the threshold; with HZ set to 250, one tick is 4 ms
    	if (nr_windows) {
    		if (nr_windows < 10) {// fewer than 10 ticks elapsed: avg_irqload decays to 3/4 of its value
    			/* Decay CPU's irqload by 3/4 for each window. */
    			rq->avg_irqload *= (3 * nr_windows);
    			rq->avg_irqload = div64_u64(rq->avg_irqload,
    						    4 * nr_windows);
    		} else {// 10 or more ticks elapsed: avg_irqload is discarded and set to 0
    			rq->avg_irqload = 0;
    		}
    		// Add the irqload accumulated so far
    		rq->avg_irqload += rq->cur_irqload;
    		rq->cur_irqload = 0;
    	}
    
    	rq->cur_irqload += delta;
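    // irqload_ts is set to the current value; searching the code, irqload_ts is only used and updated in these two places, so it is the time of the last irq accounting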
    	rq->irqload_ts = cur_jiffies_ts;
    	raw_spin_unlock_irqrestore(&rq->lock, flags);
    }
    

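    account_irq_enter_time/account_irq_exit_time ==> irq_account_irq ==> walt_account_irqtime
    This flow is fairly simple:

    1. Data is accounted both when an interrupt is entered and when it exits;
    2. The accounted data is the interrupt execution time;
    3. How the rq totals accumulate depends on how frequently interrupts arrive;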
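
    The decay of avg_irqload can be followed step by step with a small user-space model. The struct below is a hypothetical stand-in for the three rq fields, and account_irq keeps only the arithmetic of walt_account_irqtime (no locking, no clock correction, no WALT update for the idle task).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the three rq fields described above. */
struct irq_load {
	uint64_t cur_irqload;	/* irq time gathered since the last jiffy change */
	uint64_t avg_irqload;	/* decayed running total */
	uint64_t irqload_ts;	/* jiffy of the last update */
};

static void account_irq(struct irq_load *s, uint64_t irq_time,
			uint64_t now_jiffies)
{
	uint64_t elapsed;

	assert(now_jiffies >= s->irqload_ts);
	elapsed = now_jiffies - s->irqload_ts;

	if (elapsed) {
		if (elapsed < 10)
			s->avg_irqload = s->avg_irqload * 3 / 4; /* recent: keep 3/4 */
		else
			s->avg_irqload = 0;			 /* stale: forget it */
		s->avg_irqload += s->cur_irqload;
		s->cur_irqload = 0;
	}
	s->cur_irqload += irq_time;
	s->irqload_ts = now_jiffies;
}
```

    Interrupts that arrive within the same jiffy only grow cur_irqload; the first interrupt in a later jiffy folds that sum into avg_irqload after decaying the old average, and a gap of 10 or more jiffies discards the old average entirely.
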

    3.3.5.3 First use of irqload

    Checking a CPU's irq load; straight to the code:

    #define WALT_HIGH_IRQ_TIMEOUT 3
    
    u64 walt_irqload(int cpu) {
    	struct rq *rq = cpu_rq(cpu);
    	s64 delta;
    	delta = get_jiffies_64() - rq->irqload_ts;
    
            /*
    	 * Current context can be preempted by irq and rq->irqload_ts can be
    	 * updated by irq context so that delta can be negative.
    	 * But this is okay and we can safely return as this means there
    	 * was recent irq occurrence.
    	 */
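    // Treating a small or negative delta as recent avoids problems when rq->irqload_ts changes after preemption by an irq; why the threshold is 3 is still unclear to me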
            if (delta < WALT_HIGH_IRQ_TIMEOUT)
    		return rq->avg_irqload;
            else
    		return 0;
    }
    
    // This function is used in find_best_target, i.e. to check the load when choosing the next CPU during migration;
    int walt_cpu_high_irqload(int cpu) {
    	return walt_irqload(cpu) >= sysctl_sched_walt_cpu_high_irqload;// the default is 10 ms
    }
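
    The recency check in walt_irqload relies on taking the jiffies difference as a signed value. The sketch below isolates that comparison; irqload_if_recent and HIGH_IRQ_TIMEOUT are names chosen for this example, with the timeout set to the same value as WALT_HIGH_IRQ_TIMEOUT above.

```c
#include <stdint.h>

#define HIGH_IRQ_TIMEOUT 3	/* jiffies, same value as WALT_HIGH_IRQ_TIMEOUT above */

/*
 * Sketch: report avg_irqload only if the last irq update is recent. The
 * difference is taken as a signed value, so a timestamp that was bumped by
 * an interrupt after `now_jiffies` was read (negative delta) still counts
 * as recent.
 */
static uint64_t irqload_if_recent(uint64_t avg_irqload, uint64_t now_jiffies,
				  uint64_t irqload_ts)
{
	int64_t delta = (int64_t)(now_jiffies - irqload_ts);

	if (delta < HIGH_IRQ_TIMEOUT)
		return avg_irqload;
	return 0;
}
```

    A timestamp that is 2 jiffies old, or even 3 jiffies "in the future", returns the stored average; one that is 3 or more jiffies old returns 0, so a CPU whose interrupts stopped a while ago is no longer treated as irq-heavy.
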
    

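    3.4 Key structures

    1. rq // statistics added to the runqueue
    2. task_struct // corresponding fields added to task_struct
    3. ravg // the structure that holds the WALT calculation state

    3.4.1 rq

    [Figure: WALT fields in struct rq]
    Corresponding structure definition: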

    struct rq {
    	
    	...
    	
    #ifdef CONFIG_SCHED_WALT
    	u64 cumulative_runnable_avg;
    	u64 window_start;
    	u64 curr_runnable_sum;
    	u64 prev_runnable_sum;
    	u64 nt_curr_runnable_sum;
    	u64 nt_prev_runnable_sum;
    	u64 cur_irqload;
    	u64 avg_irqload;
    	u64 irqload_ts;
    	u64 cum_window_demand;
    #endif /* CONFIG_SCHED_WALT */
    
    	...
    
    };
    

    3.4.2 task_struct

    [Figure: WALT fields in struct task_struct]

    struct task_struct {
    
    	...
    	
    #ifdef CONFIG_SCHED_WALT
    	struct ravg ravg;
    	/*
    	 * 'init_load_pct' represents the initial task load assigned to children
    	 * of this task
    	 */
    	u32 init_load_pct;
    	u64 last_sleep_ts;
    #endif
    
    	...
    };
    /* ravg represents frequency scaled cpu-demand of tasks */
    struct ravg {
    	/*
    	 * 'mark_start' marks the beginning of an event (task waking up, task
    	 * starting to execute, task being preempted) within a window
    	 *
    	 * 'sum' represents how runnable a task has been within current
    	 * window. It incorporates both running time and wait time and is
    	 * frequency scaled.
    	 *
    	 * 'sum_history' keeps track of history of 'sum' seen over previous
    	 * RAVG_HIST_SIZE windows. Windows where task was entirely sleeping are
    	 * ignored.
    	 *
    	 * 'demand' represents maximum sum seen over previous
    	 * sysctl_sched_ravg_hist_size windows. 'demand' could drive frequency
    	 * demand for tasks.
    	 *
    	 * 'curr_window' represents task's contribution to cpu busy time
    	 * statistics (rq->curr_runnable_sum) in current window
    	 *
    	 * 'prev_window' represents task's contribution to cpu busy time
    	 * statistics (rq->prev_runnable_sum) in previous window
    	 */
    	u64 mark_start; // beginning of an event (task waking up, starting to execute, being preempted) within a window
    	u32 sum, demand; // sum: how runnable the task has been within the current window; demand: see the comment above
    	u32 sum_history[RAVG_HIST_SIZE_MAX]; // 'sum' values seen over previous windows
    	u32 curr_window, prev_window;
    	u16 active_windows;
    };
    #endif
    

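    4. Appendix

    4.1 How Linux scheduling has evolved

    1. Runqueues split by priority into active and expired arrays, for faster scheduling;
    2. CFS introduced virtual time, mapping priority onto different amounts of physical time;
    3. CFS + PELT: tasks are placed and migrated more sensibly;
    4. CFS + WALT: faster response, better suited to devices such as phones, with a good balance between performance and power;

    4.2 To be added

    1. update history code [done]
    2. irq call flow [done]
    3. How the updated data is used ==> I plan to trace the top flow and hope to have a first pass done tomorrow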
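    [Author] 张昺华
    [大饼教你学 series] https://edu.csdn.net/course/detail/10393
    [Sina Weibo] 张昺华--sky
    [Twitter] @sky2030_
    [WeChat official account] 张昺华
    Copyright of this article is shared by the author and 博客园 (cnblogs). Reposting is welcome, but unless the author agrees otherwise this notice must be kept and a clearly visible link to the original must be given on the article page; otherwise the right to pursue legal liability is reserved.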
  • Original URL: https://www.cnblogs.com/sky-heaven/p/13901074.html