Linux Process Scheduling and Preemption

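I. Introduction to Linux kernel preemption

1. Necessary conditions for preemption

a. The preemption counter preempt_count must be 0. A non-zero value means some code path has called a function that disables preemption, such as the spin_lock family.
b. Interrupts must be enabled, because preemption is driven by interrupts.

For reference, preempt_schedule() is implemented as follows: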
    asmlinkage __visible void __sched notrace preempt_schedule(void)
    {
        /*
         * If there is a non-zero preempt_count or interrupts are disabled,
         * we do not want to preempt the current task. Just return..
         */
        /* preempt_disable() increments preempt_count */
        if (likely(!preemptible()))
            return;

        preempt_schedule_common();
    }

    #define preemptible()   (preempt_count() == 0 && !irqs_disabled())
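
To make the two conditions concrete, here is a minimal user-space sketch of the check that preemptible() performs. The struct cpu_state and the function model_preemptible() are hypothetical stand-ins invented for illustration; they model the per-CPU preempt count and the interrupt flag and are not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the per-CPU state that preemptible() inspects. */
struct cpu_state {
    int  preempt_count;   /* > 0 while preemption is disabled */
    bool irqs_disabled;   /* true while local interrupts are masked */
};

/* Mirrors the macro: preempt_count() == 0 && !irqs_disabled() */
bool model_preemptible(const struct cpu_state *cpu)
{
    assert(cpu->preempt_count >= 0);   /* a negative count would be a bug */
    return cpu->preempt_count == 0 && !cpu->irqs_disabled;
}
```

Only a state with a zero count and enabled interrupts is preemptible; a held spinlock (count 1) or masked interrupts each block preemption on their own.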

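2. The spin_lock family of functions

a. spin_lock() calls preempt_disable() to disable preemption.
b. spin_lock_irq() does what spin_lock() does and additionally calls local_irq_disable() to disable interrupts.
c. spin_lock_irqsave() does what spin_lock() does and additionally calls local_irq_save(), which disables interrupts and saves the CPU's previous interrupt-mask state.

spin_lock():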
    /* include/linux/spinlock.h */
    static __always_inline void spin_lock(spinlock_t *lock)
    {
        raw_spin_lock(&lock->rlock);
    }

    #define raw_spin_lock(lock) _raw_spin_lock(lock)

    /* kernel/locking/spinlock.c */
    void __lockfunc _raw_spin_lock(raw_spinlock_t *lock)
    {
        __raw_spin_lock(lock);
    }

    /* include/linux/spinlock_api_smp.h */
    static inline void __raw_spin_lock(raw_spinlock_t *lock)
    {
        preempt_disable();  /* disable preemption */
        spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
        LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
    }

spin_unlock():
    /* kernel/locking/spinlock.c */
    void __lockfunc _raw_spin_unlock(raw_spinlock_t *lock)
    {
        __raw_spin_unlock(lock);
    }

    /* include/linux/spinlock_api_smp.h */
    static inline void __raw_spin_unlock(raw_spinlock_t *lock)
    {
        spin_release(&lock->dep_map, 1, _RET_IP_);
        do_raw_spin_unlock(lock);
        preempt_enable();
    }

preempt_enable():
    /* include/linux/preempt.h */
    /*
     * __preempt_schedule() below is a preemption point: a higher-priority
     * process can preempt right here.
     */
    #define preempt_enable() \
    do { \
        barrier(); \
        if (unlikely(preempt_count_dec_and_test())) \
            __preempt_schedule(); \
    } while (0)

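As shown above, the spin_unlock() family can directly trigger kernel preemption, because the preempt_enable() call inside it provides a preemption point.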
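
The interaction between nested locks and the preemption point can be sketched in user space as follows. This is a simplified model under stated assumptions: struct cpu_model and all model_* functions are hypothetical, the need_resched flag is kept separate (the real kernel folds it into preempt_count, which is why preempt_count_dec_and_test() is enough there), and "rescheduling" is reduced to bumping a counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU model, not a kernel structure. */
struct cpu_model {
    int  preempt_count;   /* nesting depth of preempt_disable() */
    bool need_resched;    /* set when a higher-priority task became runnable */
    int  reschedules;     /* how many times the preemption point fired */
};

void model_preempt_disable(struct cpu_model *cpu)
{
    cpu->preempt_count++;
}

void model_preempt_enable(struct cpu_model *cpu)
{
    assert(cpu->preempt_count > 0);      /* an unbalanced enable is a bug */
    if (--cpu->preempt_count == 0 && cpu->need_resched) {
        cpu->need_resched = false;       /* stands in for __preempt_schedule() */
        cpu->reschedules++;
    }
}

/* spin_lock()/spin_unlock() reduced to their effect on preemption. */
void model_spin_lock(struct cpu_model *cpu)   { model_preempt_disable(cpu); }
void model_spin_unlock(struct cpu_model *cpu) { model_preempt_enable(cpu); }
```

With two nested locks, a pending reschedule request is ignored by the inner unlock (the count drops only to 1) and is honored by the outer unlock, which brings the count back to 0.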

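3. The difference between preempt_disable() and local_irq_disable()

From the necessary conditions above, both functions prevent preemption. The difference lies not in the disabling functions but in their counterparts, preempt_enable() and local_irq_enable(). preempt_enable() re-enables preemption and also provides a preemption point, whereas local_irq_enable() only re-enables interrupts (which makes preemption possible again) without providing a preemption point.

4. Possible preemption points include: return from the timer tick interrupt handler, return from an interrupt, the end of softirq processing, yield() (called by a process to give up the CPU), and so on.

5. Note that the spin_lock family disables preemption, but it does not disable scheduling! Code holding the lock can still reach schedule() explicitly, for example by calling a function that sleeps.

6. Sleeping is not allowed in atomic context. You can enable the kernel option CONFIG_DEBUG_ATOMIC_SLEEP; at run time, as soon as a possible sleep in atomic context is detected, a stack backtrace is printed.

7. The priority of a normal (non-real-time) process is expressed by its nice value.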

II. Process scheduling

1. The 4.14.35 kernel currently has only the following sched_class instances:

    fair_sched_class: .next = idle_sched_class
    rt_sched_class  : .next = fair_sched_class
    dl_sched_class  : .next = rt_sched_class
    idle_sched_class: .next = NULL
    stop_sched_class: .next = dl_sched_class

All scheduling classes form a singly linked list:
stop_sched_class --> dl_sched_class --> rt_sched_class --> fair_sched_class --> idle_sched_class --> NULL

    #ifdef CONFIG_SMP
    #define sched_class_highest (&stop_sched_class)
    #else
    #define sched_class_highest (&dl_sched_class)
    #endif

    #define for_each_class(class) \
        for (class = sched_class_highest; class; class = class->next)
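
The traversal order produced by for_each_class can be reproduced with a toy list in user space. The sketch below assumes the SMP case (the list starts at the stop class); struct toy_class, the toy_* objects and toy_list_classes() are hypothetical names that only mimic the next pointers listed above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for struct sched_class: only a name and the next pointer. */
struct toy_class {
    const char *name;
    const struct toy_class *next;
};

static const struct toy_class toy_idle = { "idle", NULL };
static const struct toy_class toy_fair = { "fair", &toy_idle };
static const struct toy_class toy_rt   = { "rt",   &toy_fair };
static const struct toy_class toy_dl   = { "dl",   &toy_rt };
static const struct toy_class toy_stop = { "stop", &toy_dl };

#define toy_class_highest (&toy_stop)
#define for_each_toy_class(c) \
    for ((c) = toy_class_highest; (c); (c) = (c)->next)

/* Writes the class names, highest priority first, into buf; returns the count. */
int toy_list_classes(char *buf, size_t len)
{
    const struct toy_class *c;
    int n = 0;

    assert(len > 0);
    buf[0] = '\0';
    for_each_toy_class(c) {
        if (n > 0)
            strncat(buf, " ", len - strlen(buf) - 1);
        strncat(buf, c->name, len - strlen(buf) - 1);
        n++;
    }
    return n;
}
```

Called with a 64-byte buffer, toy_list_classes() visits five classes and leaves "stop dl rt fair idle" in the buffer, matching the priority chain shown above.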

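SCHED_NORMAL: ordinary time-sharing processes; uses the fair_sched_class scheduling class.

SCHED_FIFO: first-in, first-out real-time processes; uses the rt_sched_class scheduling class.
When the scheduler assigns the CPU to the process, it leaves the process descriptor at its current position in the run queue list. A process with this policy keeps running once it has the CPU: as long as no higher-priority real-time process is runnable, it continues to use the CPU for as long as it wishes, even if other real-time processes of the same priority are runnable.

SCHED_RR: round-robin (time-sliced) real-time processes; uses the rt_sched_class scheduling class.
When the scheduler assigns the CPU to the process, it puts the process descriptor at the end of the run queue list. This policy ensures a fair share of CPU time among all SCHED_RR real-time processes of the same priority.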
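
The difference between the two real-time policies comes down to what happens to the head of the run list when its time slice would expire. Below is a minimal sketch, assuming a single priority level and a policy chosen per queue rather than per task (a simplification; in the kernel the policy belongs to each task). All names are hypothetical.

```c
#include <assert.h>
#include <string.h>

enum toy_policy { TOY_FIFO, TOY_RR };

/* One priority level's run list; ids[0] is the task currently on the CPU. */
struct toy_rt_queue {
    int ids[8];
    int len;
};

/*
 * Called when the running task's time slice would expire.
 * SCHED_RR: move the head to the tail so equal-priority peers get a turn.
 * SCHED_FIFO: nothing happens; the head keeps the CPU.
 * Returns the id of the task that runs next.
 */
int toy_rt_tick(struct toy_rt_queue *q, enum toy_policy policy)
{
    assert(q->len > 0 && q->len <= 8);
    if (policy == TOY_RR && q->len > 1) {
        int head = q->ids[0];

        memmove(&q->ids[0], &q->ids[1],
                (size_t)(q->len - 1) * sizeof q->ids[0]);
        q->ids[q->len - 1] = head;
    }
    return q->ids[0];
}
```

Starting from the list 1, 2, 3, repeated ticks under TOY_FIFO keep returning task 1, while under TOY_RR they return 2, then 3, then 1 again.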

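SCHED_BATCH: a variant of SCHED_NORMAL; uses the fair_sched_class scheduling class.
It uses a time-sharing policy and allocates CPU time according to dynamic priority. When real-time processes are present, they are scheduled first. The policy is optimized for throughput: apart from not preempting other tasks, it behaves like a normal process, letting tasks run longer and make better use of the cache, which suits batch workloads.

SCHED_IDLE: the lowest priority; such tasks run only when the system is otherwise idle. Tasks with this policy are handled by fair_sched_class. The idle_sched_class scheduling class is a separate thing: it serves the per-CPU idle task (process 0).

SCHED_DEADLINE: a newer real-time scheduling policy; uses the dl_sched_class scheduling class.
It targets bursty computation that is sensitive to latency and completion time, and is based on EDF (earliest deadline first).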

2. The scheduling class, struct sched_class

    struct sched_class {
        const struct sched_class *next;

        void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
        void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
        void (*yield_task) (struct rq *rq);
        bool (*yield_to_task) (struct rq *rq, struct task_struct *p, bool preempt);

        void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);

        /*
         * It is the responsibility of the pick_next_task() method that will
         * return the next task to call put_prev_task() on the @prev task or
         * something equivalent.
         *
         * May return RETRY_TASK when it finds a higher prio class has runnable tasks.
         */
        struct task_struct * (*pick_next_task) (struct rq *rq,
                                                struct task_struct *prev,
                                                struct rq_flags *rf);
        void (*put_prev_task) (struct rq *rq, struct task_struct *p);

    #ifdef CONFIG_SMP
        int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
        void (*migrate_task_rq)(struct task_struct *p);

        void (*task_woken)(struct rq *this_rq, struct task_struct *task);

        void (*set_cpus_allowed)(struct task_struct *p, const struct cpumask *newmask);

        void (*rq_online)(struct rq *rq);
        void (*rq_offline)(struct rq *rq);
    #endif

        void (*set_curr_task) (struct rq *rq);
        void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
        void (*task_fork) (struct task_struct *p);
        void (*task_dead) (struct task_struct *p);

        /*
         * The switched_from() call is allowed to drop rq->lock, therefore we
         * cannot assume the switched_from/switched_to pair is serialized by
         * rq->lock. They are however serialized by p->pi_lock.
         */
        void (*switched_from) (struct rq *this_rq, struct task_struct *task);
        void (*switched_to) (struct rq *this_rq, struct task_struct *task);
        void (*prio_changed) (struct rq *this_rq, struct task_struct *task, int oldprio);

        unsigned int (*get_rr_interval) (struct rq *rq, struct task_struct *task);

        void (*update_curr) (struct rq *rq);

    #define TASK_SET_GROUP  0
    #define TASK_MOVE_GROUP 1

    #ifdef CONFIG_FAIR_GROUP_SCHED
        void (*task_change_group) (struct task_struct *p, int type);
    #endif
    };

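next: points to the next scheduling class. It is used in pick_next_task, check_preempt_curr, set_rq_online and set_rq_offline to walk all scheduling classes and select one according to class priority.
The priority order is: stop_sched_class --> dl_sched_class --> rt_sched_class --> fair_sched_class --> idle_sched_class
enqueue_task: adds a task to the scheduling class
dequeue_task: removes a task from the scheduling class
yield_task/yield_to_task: voluntarily gives up the CPU
check_preempt_curr: checks whether the current process should be preempted
pick_next_task: selects the next process to run from the scheduling class
put_prev_task: puts the process back into the scheduling class
select_task_rq: selects a suitable CPU run queue for the process
migrate_task_rq: migrates the process to another CPU's run queue
pre_schedule: called before scheduling (a hook from older kernels; it is not in the 4.14 struct above)
post_schedule: notifies the scheduler that the switch is complete (also from older kernels)
task_woken: used when a process is woken up
set_cpus_allowed: changes the CPU affinity of the process
rq_online: brings the run queue online
rq_offline: takes the run queue offline
set_curr_task: called when a process changes its scheduling class or task group
task_tick: called from the timer tick; it may cause a process switch, which drives preemption of the running task
task_fork: called when a process is created; initialization differs between scheduling policies
task_dead: called when a process exits
switched_from/switched_to: used when a process changes its scheduling class
prio_changed: called when the priority of a process changes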
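
The idea behind struct sched_class is a table of function pointers that the core scheduler calls without knowing which policy sits behind it. Here is a drastically reduced user-space sketch with only three of the callbacks; the toy class keeps tasks (plain integer ids) in arrival order. Every name in it is hypothetical, and the signatures are simplified compared with the kernel's.

```c
#include <assert.h>
#include <stddef.h>

struct toy_rq {
    int queue[16];
    int len;
};

/* A reduced sched_class: three of the callbacks listed above. */
struct toy_sched_class {
    void (*enqueue_task)(struct toy_rq *rq, int task_id);
    void (*dequeue_task)(struct toy_rq *rq, int task_id);
    int  (*pick_next_task)(struct toy_rq *rq);   /* -1 when empty */
};

static void fifo_enqueue(struct toy_rq *rq, int task_id)
{
    assert(rq->len < 16);
    rq->queue[rq->len++] = task_id;              /* append at the tail */
}

static void fifo_dequeue(struct toy_rq *rq, int task_id)
{
    int i, j;

    for (i = 0; i < rq->len; i++) {
        if (rq->queue[i] != task_id)
            continue;
        for (j = i; j < rq->len - 1; j++)        /* close the gap */
            rq->queue[j] = rq->queue[j + 1];
        rq->len--;
        return;
    }
}

static int fifo_pick_next(struct toy_rq *rq)
{
    return rq->len > 0 ? rq->queue[0] : -1;      /* head of the queue */
}

const struct toy_sched_class toy_fifo_class = {
    .enqueue_task   = fifo_enqueue,
    .dequeue_task   = fifo_dequeue,
    .pick_next_task = fifo_pick_next,
};
```

A caller only ever goes through the table, for example toy_fifo_class.enqueue_task(&rq, 7) followed by toy_fifo_class.pick_next_task(&rq), so a different policy could be swapped in by providing another table.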

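3. How scheduling is triggered

Scheduling is triggered in two main ways:

(1) The local timer interrupt calls scheduler_tick(), which then invokes task_tick of the scheduling class of the currently running process.
(2) schedule() is called voluntarily.
Either way, execution eventually reaches __schedule(), which calls pick_next_task. If the check (rq->nr_running == rq->cfs.h_nr_running) shows that every runnable process on the current run queue belongs to CFS, the CFS scheduling class is called directly (the kernel wraps this check in likely(), which indicates that the condition holds most of the time). Otherwise, the classes are walked in priority order stop_sched_class --> dl_sched_class --> rt_sched_class --> fair_sched_class --> idle_sched_class to select the next process to run, and then the context switch takes place.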
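
The selection logic described above (a CFS shortcut, then a walk over the classes) can be sketched as follows. This is a user-space model under simplifying assumptions: the run queue is reduced to per-class counters, the function returns a class id instead of a task, and the shortcut is only taken when at least one task is runnable. All names are hypothetical.

```c
#include <assert.h>

/* Toy class ids, highest priority first. */
enum toy_class_id { TOY_STOP, TOY_DL, TOY_RT, TOY_FAIR, TOY_IDLE };

/* Toy run queue: only runnable counts. */
struct toy_rq {
    int nr_running;            /* all runnable tasks on this CPU */
    int per_class[TOY_IDLE];   /* runnable tasks in stop, dl, rt, fair */
};

/*
 * Returns the class that supplies the next task. *fast_path is set to 1
 * when the CFS shortcut was taken and to 0 otherwise.
 */
enum toy_class_id toy_pick_next_class(const struct toy_rq *rq, int *fast_path)
{
    int i;

    assert(rq->nr_running >= rq->per_class[TOY_FAIR]);

    /* Shortcut: every runnable task belongs to CFS, so ask CFS directly. */
    *fast_path = rq->nr_running > 0 &&
                 rq->nr_running == rq->per_class[TOY_FAIR];
    if (*fast_path)
        return TOY_FAIR;

    /* Slow path: walk the classes from highest to lowest priority. */
    for (i = TOY_STOP; i < TOY_IDLE; i++)
        if (rq->per_class[i] > 0)
            return (enum toy_class_id)i;

    return TOY_IDLE;           /* nothing runnable: the idle task runs */
}
```

A queue holding three CFS tasks takes the shortcut; as soon as one of the three is a real-time task, the walk is used and the rt class wins.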

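4. When scheduling happens

Only processes in the TASK_RUNNING state are selected by the scheduler; processes in other states are not considered. Scheduling happens at the following times:
a. when cond_resched() is called
b. when schedule() is called explicitly
c. on return from interrupt context
When kernel preemption is enabled, there are additional scheduling opportunities:
d. when preempt_enable() is called in a system call or in interrupt context

5. Implementation of __schedule()
TODO: analyze it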

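6. CFS (Completely Fair Scheduler) scheduling

This code lives in linux/kernel/sched/fair.c, which defines const struct sched_class fair_sched_class, the scheduling class object for CFS. It contains essentially the whole CFS implementation.

CFS implements three scheduling policies:
SCHED_NORMAL: the policy used for regular tasks.
SCHED_BATCH: does not preempt nearly as often as regular tasks would, thereby allowing tasks to run longer and make better use of caches, at the cost of interactivity. This policy suits batch jobs.
SCHED_IDLE: even weaker than nice 19, but it is not a true idle timer scheduler; this avoids priority inversion problems that would deadlock the machine.

The CFS scheduling class fair_sched_class:
enqueue_task(): called when a task enters the runnable state. It puts the task's scheduling entity into the red-black tree and increments nr_running.
dequeue_task(): called when a task is no longer runnable. It removes the task's scheduling entity from the red-black tree and decrements nr_running.
yield_task(): basically just a dequeue followed by an enqueue, unless the compat_yield sysctl is turned on; in that case, it places the scheduling entity at the right-most end of the red-black tree.
check_preempt_curr(): checks whether a task that has entered the runnable state should preempt the currently running task.
pick_next_task(): chooses the most appropriate task to run next.
set_curr_task(): called when a task changes its scheduling class or its task group.
task_tick(): mostly called from the timer tick. It may lead to a process switch, which drives running preemption.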
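    /*
     * A scheduling entity (a node of the red-black tree). It represents either
     * a single process or a group of processes, and has its own run queue,
     * a parent pointer, and a pointer to the queue on which it is scheduled.
     */
    struct sched_entity {
        /* For load-balancing: */
        struct load_weight      load;       /* weight; prio_to_weight[] maps priority to weight */
        struct rb_node          run_node;   /* this entity's node in the red-black tree */
        struct list_head        group_node; /* list node linking the entity into the run queue's cfs_tasks list */
        unsigned int            on_rq;      /* whether the entity is on the CFS run queue */

        u64                     exec_start;         /* time at which it started running */
        u64                     sum_exec_runtime;   /* total run time */
        /*
         * Virtual run time, updated on timer interrupts and whenever the task's
         * state changes. It keeps growing, at a rate inversely proportional to
         * the load weight: the higher the weight, the slower it grows and the
         * more likely the entity sits at the left of the red-black tree and
         * gets scheduled. See calc_delta_fair().
         */
        u64                     vruntime;
        /* value of sum_exec_runtime when the process was switched onto the CPU */
        u64                     prev_sum_exec_runtime;

        /* number of times this entity has been migrated to another CPU */
        u64                     nr_migrations;

        struct sched_statistics statistics;

    #ifdef CONFIG_FAIR_GROUP_SCHED
        int                     depth;
        /*
         * Parent entity. For a process, it points to the scheduling entity of
         * the run queue it is on; for a process group, it points to the
         * scheduling entity of the parent group. Set in set_task_rq().
         */
        struct sched_entity     *parent;
        /* rq on which this entity is (to be) queued: */
        struct cfs_rq           *cfs_rq;    /* the CFS run queue this entity is on */
        /* rq "owned" by this entity/group: */
        struct cfs_rq           *my_q;      /* the entity's own run queue: NULL for a process, non-NULL for a group */
    #endif

    #ifdef CONFIG_SMP
        /*
         * Per entity load average tracking.
         *
         * Put into separate cache line so it does not
         * collide with read-mostly values above.
         */
        struct sched_avg        avg ____cacheline_aligned_in_smp;
    #endif
    };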

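load
The weight, which determines each entity's share of the total load on the queue. Computing load weights is an important job of the scheduler, because the speed of the virtual clock that CFS relies on ultimately depends on load. The weight is derived from the priority and is the key input to the vruntime calculation.
run_node
The entity's node in the red-black tree, which lets scheduling entities be ordered in the tree.
sum_exec_runtime
Records the CPU time consumed by the process, for use by CFS.
on_rq
Indicates whether the scheduling entity is on the CFS run queue. One point needs to be clear: the CFS run queue contains a red-black tree, but the tree is not the whole run queue, because the tree only serves to select the next entity to schedule. A simple example: while an ordinary process is running, it is not in the red-black tree, but it is still on the CFS run queue and its on_rq is true. on_rq is false only for processes that are about to exit, are about to sleep, or are being turned into real-time processes.
vruntime
Virtual run time, the key to scheduling. It is computed as: virtual run time for one scheduling interval = actual run time * (NICE_0_LOAD / weight). It thus depends on the actual run time and the weight, and the red-black tree is ordered by it. The higher the priority of a process, the more slowly its vruntime grows while it runs, so it gets to run for relatively longer and is more likely to be the leftmost node of the tree; the scheduler always picks the leftmost node as the next process. Note that the value increases monotonically: on every scheduler tick, the virtual run time of the current process is incremented. Simply put, processes compete for the smallest vruntime, and the smallest one gets scheduled.
cfs_rq
The CFS run queue this scheduling entity is on.
my_q
If this scheduling entity represents a process group, it owns a CFS run queue of its own, which holds the processes of that group. Those processes are not contained in the red-black tree of any other CFS run queue (not even the top-level one; they belong only to the group's tree).
sum_exec_runtime
Run time is tracked by update_curr, which accumulates it continuously. The function is called from many places in the kernel, for example when a new process is added to the run queue, or from the periodic scheduler. On each call, it computes the difference between the current time and exec_start, and sets exec_start to the current time. The difference is added to sum_exec_runtime.
The amount of time that has elapsed on the virtual clock while the process executes is accounted in vruntime.
When a process is switched onto the CPU, its current sum_exec_runtime value is saved in prev_sum_exec_runtime. This value is needed later when deciding whether to preempt the process. Note that saving the value in prev_sum_exec_runtime does not reset sum_exec_runtime, which keeps growing monotonically.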
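
The accounting described for update_curr, together with the vruntime formula, can be sketched in user space like this. It is a simplified model: struct toy_entity keeps only four fields, NICE_0_LOAD is taken as the unscaled value 1024, and the division is done directly (the kernel's calc_delta_fair() reaches approximately the same result with precomputed inverse weights). The toy_* names are hypothetical; the weights 3121, 1024 and 335 used in the note below are the entries for nice -5, 0 and 5 in the kernel's nice-to-weight table.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NICE_0_LOAD 1024ULL   /* weight of a nice-0 task, unscaled */

/* Only the accounting fields of sched_entity discussed above. */
struct toy_entity {
    uint64_t weight;            /* load weight */
    uint64_t exec_start;        /* timestamp of the last update (ns) */
    uint64_t sum_exec_runtime;  /* total real CPU time (ns) */
    uint64_t vruntime;          /* weighted virtual time (ns) */
};

/* delta_vruntime = delta_exec * NICE_0_LOAD / weight */
uint64_t toy_calc_delta_fair(uint64_t delta_exec, uint64_t weight)
{
    assert(weight > 0);
    return delta_exec * TOY_NICE_0_LOAD / weight;
}

/* Charge the time since exec_start to the entity, then restart the interval. */
void toy_update_curr(struct toy_entity *se, uint64_t now)
{
    uint64_t delta_exec;

    if (now <= se->exec_start)
        return;                             /* no time has passed */

    delta_exec = now - se->exec_start;
    se->exec_start = now;
    se->sum_exec_runtime += delta_exec;     /* real time, never reset */
    se->vruntime += toy_calc_delta_fair(delta_exec, se->weight);
}
```

With these numbers, 1 ms of real time costs a nice 5 task (weight 335) about 3.06 ms of virtual time, while a nice -5 task (weight 3121) is charged only about 0.33 ms, which is why the latter stays further left in the tree.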

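Every process's task_struct embeds a sched_entity object, so a process is a schedulable entity; but a schedulable entity is not necessarily a process, it can also be a process group.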

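7. CFS scheduling summary:

The tick interrupt mainly updates scheduling information and then re-evaluates where the current process belongs in the red-black tree. If, after the update, the current process would no longer be the leftmost node, it is marked with the need_resched flag and schedule() is called on return from the interrupt to perform the switch; otherwise the current process keeps the CPU. This shows that CFS has abandoned the traditional time-slice concept: the tick interrupt only needs to update the red-black tree.

The key of the red-black tree is vruntime, which is updated by update_curr. It is a 64-bit variable that keeps increasing. __enqueue_entity uses vruntime as the key when inserting an entity into the red-black tree, and __pick_first_entity returns the leftmost entity of the tree, i.e. the one with the smallest vruntime.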
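
To see how "always run the smallest vruntime" turns weights into CPU shares, here is a small user-space simulation. It is a sketch under loud assumptions: a linear scan replaces the red-black tree lookup done by __pick_first_entity, the choice is re-made on every tick (the kernel applies extra granularity rules), and the weights 2048 and 1024 used in the note below are hypothetical round numbers, not entries of the kernel's weight table. All toy_* names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_task {
    uint64_t weight;     /* hypothetical load weight */
    uint64_t vruntime;   /* ordering key */
    unsigned ticks_run;  /* how many ticks this task received */
};

/* Linear scan standing in for the leftmost-node lookup in the tree. */
size_t toy_pick_leftmost(const struct toy_task *tasks, size_t n)
{
    size_t i, best = 0;

    assert(n > 0);
    for (i = 1; i < n; i++)
        if (tasks[i].vruntime < tasks[best].vruntime)
            best = i;                 /* ties keep the lower index */
    return best;
}

/* Run nticks ticks of tick_ns each; every tick goes to the smallest vruntime. */
void toy_simulate(struct toy_task *tasks, size_t n,
                  unsigned nticks, uint64_t tick_ns)
{
    unsigned t;

    for (t = 0; t < nticks; t++) {
        struct toy_task *cur = &tasks[toy_pick_leftmost(tasks, n)];

        assert(cur->weight > 0);
        cur->vruntime += tick_ns * 1024 / cur->weight;   /* weighted charge */
        cur->ticks_run++;
    }
}
```

Running 300 ticks with two tasks of weight 2048 and 1024 gives them 200 and 100 ticks respectively, a 2:1 split that matches the weight ratio, and both end with the same vruntime.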

Recommended reading:

Inside the Linux 2.6 Completely Fair Scheduler: https://www.ibm.com/developerworks/cn/linux/l-completely-fair-scheduler/index.html

Original article: https://www.cnblogs.com/hellokitty2/p/10741600.html