How current is implemented in the kernel


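I. The current process: current

In the kernel, current is one of the most frequently used variables; nearly every system call touches it. Because it is used so often, reading it has to be as cheap as possible. Its implementation has changed repeatedly from the earliest kernel versions onward, and the variable also leads to some interesting questions.

II. Implementation in early kernels
In kernel 1.0, current is a global variable that initially points to init_task. Every process switch updates the global, and all other code obtains the current process by reading it. The code that updates current in 1.0 is copied here: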

#define switch_to(tsk) \
__asm__("cmpl %%ecx,_current\n\t" \
    "je 1f\n\t" \
    "cli\n\t" \
    "xchgl %%ecx,_current\n\t" \
    "ljmp %0\n\t" \
    "sti\n\t" \
    "cmpl %%ecx,_last_task_used_math\n\t" \
    "jne 1f\n\t" \
    "clts\n" \
    "1:" \
    : /* no output */ \
    :"m" (*(((char *)&tsk->tss.tr)-4)), \
     "c" (tsk) \
    :"cx")

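The inline assembly effectively hooks the switch, so current is updated at the moment the switch happens.

Frankly, reading current this way is very efficient: on the 386 a single instruction is enough to load the variable. But the approach has a serious flaw: it cannot support multiple processors. With several CPUs and only one global variable, every CPU sees the same current, which defeats the purpose of having more than one CPU.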
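The single-global scheme, and the way it breaks down with a second CPU, can be modelled in a few lines of user-space C. This is only a sketch: struct task, current_task, switch_to_model and the cpu argument are hypothetical names made up for the illustration, not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for task_struct. */
struct task { int pid; };

static struct task init_task = { 0 };

/* One global for the whole system, as in the 1.0 kernel. */
static struct task *current_task = &init_task;

/* Model of switch_to: the switch itself updates the global.
 * The cpu argument is ignored on purpose, because a single global
 * has no room for a per-CPU answer. */
static void switch_to_model(int cpu, struct task *next)
{
    (void)cpu;
    assert(next != NULL);
    current_task = next;
}

/* Returns the pid that "CPU 0" reads after "CPU 1" has also switched. */
static int pid_seen_by_cpu0(void)
{
    static struct task a = { 1 }, b = { 2 };

    switch_to_model(0, &a);   /* CPU 0 starts running pid 1 */
    switch_to_model(1, &b);   /* CPU 1 starts running pid 2 */
    return current_task->pid; /* CPU 0 now reads CPU 1's task */
}
```

In this model pid_seen_by_cpu0() yields 2 even though CPU 0 switched to pid 1, which is exactly the problem described above.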

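III. Multiprocessor support in 2.4.0

A major feature of kernel 2.0 was multiprocessor support, so I looked at how 2.0 implements current. It turns out that SMP support in 2.0 seems to be limited to the 386 and sparc architectures, and the implementation is odd enough that we will skip it and look at the more recent 2.4.0 instead: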

static inline struct task_struct * get_current(void)
{
    struct task_struct *current;
    __asm__("andl %%esp,%0; ":"=r" (current) : "0" (~8191UL));
    return current;
}

#define current get_current()

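Here too, a single instruction yields current, and this is the implementation most descriptions refer to. It relies on the layout of the kernel-mode stack: each task owns an 8 KB-aligned block (8191 = 0x1fff) with the kernel stack at the high end and the task_struct at the low end, so clearing the low 13 bits of esp gives the address of the task_struct. Anyone who has read a book on the early Linux kernel will probably remember this layout.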
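The masking trick can be tried in user space. The following is a minimal sketch, assuming an 8 KB block obtained from aligned_alloc stands in for the kernel's per-task allocation; struct task, task_from_sp and demo_pid_from_stack are illustrative names only.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define THREAD_SIZE 8192UL   /* two 4 KB pages, matching the ~8191 mask */

struct task { int pid; };    /* hypothetical stand-in for task_struct */

/* Task structure and kernel stack share one THREAD_SIZE-aligned block. */
union task_union {
    struct task task;
    unsigned char stack[THREAD_SIZE];
};

/* C equivalent of "andl %esp,%0" with ~8191: clear the low 13 bits. */
static struct task *task_from_sp(uintptr_t sp)
{
    return (struct task *)(sp & ~(uintptr_t)(THREAD_SIZE - 1));
}

/* Allocate an aligned block, pick a stack address inside it,
 * and recover the pid stored at the bottom of the block. */
static int demo_pid_from_stack(int pid)
{
    union task_union *tu = aligned_alloc(THREAD_SIZE, sizeof *tu);
    assert(tu != NULL);
    tu->task.pid = pid;

    /* Any address inside the block works; the stack grows down from the top. */
    uintptr_t sp = (uintptr_t)&tu->stack[THREAD_SIZE - 64];
    int found = task_from_sp(sp)->pid;

    free(tu);
    return found;
}
```

The only requirement is the alignment: as long as the block starts on a THREAD_SIZE boundary, every stack address inside it masks down to the same base.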

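This only describes the mechanism, though. It still does not show how the same code ends up with a different current on each CPU.

1. Hardware task-management support on the 386

The 386 introduced privilege levels. User tasks run at a low privilege level (ring 3) and the kernel runs at a high one (ring 0). Certain events force a switch to the higher level: an int instruction, a hardware interrupt, a divide-by-zero exception, a memory-access fault, and so on.

"Running" involves two basic things: a stack and an instruction address. Interrupts arrive unpredictably, so when one does arrive, how does the CPU find the stack and the instruction address it needs at the target privilege level?

The 386 defines a dedicated privileged register, TR (user mode cannot modify it). Like the familiar segment registers such as CS and FS, it selects a descriptor in the GDT; the type of that descriptor is Task State Segment (TSS). Interrupt gates, trap gates, call gates and so on belong to the same family of system descriptors, each identified by a particular combination of type bits. The descriptor selected by TR locates one particular TSS, the structure Intel designed to hold a task's context. In Intel's intended design, the kernel switches tasks by setting up a task gate in the GDT and referencing it with a far control transfer (call, jmp, iret, and so on). When the CPU finds that the target descriptor is a task gate, it saves the state of the current task into its TSS, loads TR with the TSS selector named by the task gate, and, for a nested switch such as call, sets the Previous Task Link field in the new TSS. One small detail: a far transfer takes two operands, a segment and an offset, but when the descriptor is a task gate the offset is ignored. The description of the call instruction in the Intel instruction set reference says so: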

Executing a task switch with the CALL instruction is similar to executing a call through a call gate. The target operand specifies the segment selector of the task gate for the new task activated by the switch (the offset in the target operand is ignored).

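The exact layout of the TSS is fixed by Intel's architecture specification, which documents it as a fixed-format structure.

The structure holds a separate stack address for each of privilege levels 0 to 2 (SS and ESP together determine a stack address; segmentation is still in use here).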
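For orientation, the 32-bit TSS can be written down as a C structure. This is a sketch of the 104-byte layout described in the Intel manuals; the field names are my own choice and do not match any particular kernel header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit Task State Segment; field names are illustrative. */
struct tss32 {
    uint16_t prev_task_link, reserved0;
    uint32_t esp0; uint16_t ss0, reserved1;   /* ring 0 stack */
    uint32_t esp1; uint16_t ss1, reserved2;   /* ring 1 stack */
    uint32_t esp2; uint16_t ss2, reserved3;   /* ring 2 stack */
    uint32_t cr3;
    uint32_t eip, eflags;
    uint32_t eax, ecx, edx, ebx, esp, ebp, esi, edi;
    uint16_t es, reserved4;
    uint16_t cs, reserved5;
    uint16_t ss, reserved6;
    uint16_t ds, reserved7;
    uint16_t fs, reserved8;
    uint16_t gs, reserved9;
    uint16_t ldt_selector, reserved10;
    uint16_t debug_trap;      /* bit 0 is the T flag */
    uint16_t io_map_base;
};

/* The layout is fixed by hardware, so check it at compile time. */
static_assert(sizeof(struct tss32) == 104, "TSS must be 104 bytes");
static_assert(offsetof(struct tss32, esp0) == 4, "esp0 lives at offset 4");
static_assert(offsetof(struct tss32, ss0) == 8, "ss0 lives at offset 8");
```

The esp0/ss0 pair at offsets 4 and 8 is the part that matters for the rest of this article: it is where the CPU looks for the ring 0 stack.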

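When a privilege-level change occurs (for example on an interrupt or an int instruction), the hardware uses the TSS located through TR to reload CPU registers automatically. The ones that matter most here are the stack registers SS and ESP; the new EIP comes from the gate descriptor, as the next part shows.

2. What the CPU does when an interrupt occurs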

An interrupt gate or trap gate references an exception- or interrupt-handler procedure that runs in the context of the currently executing task (see Figure 6-3). The segment selector for the gate points to a segment descriptor for an executable code segment in either the GDT or the current LDT. The offset field of the gate descriptor points to the beginning of the exception- or interrupt-handling procedure.

When the processor performs a call to the exception- or interrupt-handler procedure:

• If the handler procedure is going to be executed at a numerically lower privilege level, a stack switch occurs. When the stack switch occurs:
  a. The segment selector and stack pointer for the stack to be used by the handler are obtained from the TSS for the currently executing task. On this new stack, the processor pushes the stack segment selector and stack pointer of the interrupted procedure.
  b. The processor then saves the current state of the EFLAGS, CS, and EIP registers on the new stack (see Figures 6-4).
  c. If an exception causes an error code to be saved, it is pushed on the new stack after the EIP value.
• If the handler procedure is going to be executed at the same privilege level as the interrupted procedure:
  a. The processor saves the current state of the EFLAGS, CS, and EIP registers on the current stack (see Figures 6-4).
  b. If an exception causes an error code to be saved, it is pushed on the current stack after the EIP value.

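When an interrupt or privilege switch occurs, the destination address is determined by the segment selector and the offset field of the gate descriptor, while SS and ESP in the current task's TSS determine where the stack is after the switch. In the gate descriptor format, the Segment Selector field selects a code segment descriptor in the GDT, and the Offset field is a 32-bit offset from that segment's base.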
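The gate format can be sketched as an 8-byte C structure in which the 32-bit handler offset is split into two halves around the selector. The structure and the helper names make_gate and gate_offset are illustrative assumptions, not kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit interrupt or trap gate; field names are illustrative. */
struct gate32 {
    uint16_t offset_low;    /* handler offset, bits 15..0 */
    uint16_t selector;      /* code segment selector */
    uint8_t  reserved;      /* must be zero */
    uint8_t  type_attr;     /* present bit, DPL and gate type */
    uint16_t offset_high;   /* handler offset, bits 31..16 */
};

static_assert(sizeof(struct gate32) == 8, "a gate descriptor is 8 bytes");

/* Split a handler address into the two offset halves of a gate. */
static struct gate32 make_gate(uint16_t selector, uint32_t offset,
                               uint8_t type_attr)
{
    struct gate32 g;
    g.offset_low  = (uint16_t)(offset & 0xFFFFu);
    g.selector    = selector;
    g.reserved    = 0;
    g.type_attr   = type_attr;
    g.offset_high = (uint16_t)(offset >> 16);
    return g;
}

/* Reassemble the 32-bit offset the CPU would jump to. */
static uint32_t gate_offset(const struct gate32 *g)
{
    assert(g != NULL);
    return ((uint32_t)g->offset_high << 16) | g->offset_low;
}
```

For example, make_gate(0x10, handler_address, 0x8E) describes a present, DPL 0, 32-bit interrupt gate; the selector value 0x10 is just a placeholder here.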

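4. What the Linux kernel does

Linux gives each CPU its own private TSS (rather than one TSS per task, as the Intel design intended) and updates fields in that TSS on every process switch. Taking the 386 implementation as an example: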

void __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
{
    struct thread_struct *prev = &prev_p->thread,
                         *next = &next_p->thread;
    struct tss_struct *tss = init_tss + smp_processor_id();

    unlazy_fpu(prev_p);

    /*
     * Reload esp0, LDT and the page table pointer:
     */
    tss->esp0 = next->esp0;

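On a process switch, the esp0 value of the next process is copied into esp0 of the TSS; each process's esp0 is the top of its own kernel stack.

esp0 is a per-thread field. Its initial value is set in do_fork-->>copy_thread: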

int copy_thread(int nr, unsigned long clone_flags, unsigned long esp,
    unsigned long unused,
    struct task_struct * p, struct pt_regs * regs)
{
    struct pt_regs * childregs;

    childregs = ((struct pt_regs *) (THREAD_SIZE + (unsigned long) p)) - 1;
    struct_cpy(childregs, regs);
    childregs->eax = 0;
    childregs->esp = esp;

    /* Current top of the kernel-mode stack. The whole pt_regs structure is
     * saved on kernel entry by SAVE_ALL (linux-2.6.21/arch/i386/kernel/entry.S). */
    p->thread.esp = (unsigned long) childregs;
    /* esp0 is the very top of the kernel-mode stack (THREAD_SIZE bytes). */
    p->thread.esp0 = (unsigned long) (childregs+1);

    p->thread.eip = (unsigned long) ret_from_fork;

    savesegment(fs,p->thread.fs);
    savesegment(gs,p->thread.gs);

    unlazy_fpu(current);
    struct_cpy(&p->thread.i387, &current->thread.i387);

    return 0;
}

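In short: on every process switch the kernel stores the stack top of the incoming process (the new current) in the esp0 field of the TSS that TR refers to. When an interrupt arrives, the CPU guarantees that the kernel is entered on the stack described by that TSS, and from the stack pointer the kernel can derive the task_struct of that CPU's current.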
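The whole chain (per-CPU TSS, esp0 updated on a switch, stack pointer masked on kernel entry) can be put together in a small user-space model. Everything here is a hypothetical sketch: init_tss_model, switch_to_model and current_on are invented names, and the five-word frame only imitates what the hardware pushes on entry from user mode.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define THREAD_SIZE 8192UL
#define NR_CPUS 2

struct task { int pid; uintptr_t esp0; };      /* stand-in for task_struct */
union task_union { struct task task; unsigned char stack[THREAD_SIZE]; };

struct tss_model { uintptr_t esp0; };          /* only the field we need */
static struct tss_model init_tss_model[NR_CPUS];  /* one TSS per CPU */

static union task_union *new_task(int pid)
{
    union task_union *tu = aligned_alloc(THREAD_SIZE, sizeof *tu);
    assert(tu != NULL);
    tu->task.pid = pid;
    /* Top of the block, the same value copy_thread computes for esp0. */
    tu->task.esp0 = (uintptr_t)tu + THREAD_SIZE;
    return tu;
}

/* Model of the esp0 update in __switch_to. */
static void switch_to_model(int cpu, struct task *next)
{
    init_tss_model[cpu].esp0 = next->esp0;
}

/* On entry from user mode the CPU loads esp from its TSS and pushes
 * SS, ESP, EFLAGS, CS and EIP, so esp ends up just below esp0. */
static struct task *current_on(int cpu)
{
    uintptr_t esp = init_tss_model[cpu].esp0 - 5 * sizeof(uint32_t);
    return (struct task *)(esp & ~(uintptr_t)(THREAD_SIZE - 1));
}

/* Run pid 1 on CPU 0 and pid 2 on CPU 1; encode both answers as 10*a+b. */
static int demo_two_cpus(void)
{
    union task_union *a = new_task(1), *b = new_task(2);

    switch_to_model(0, &a->task);
    switch_to_model(1, &b->task);
    int result = current_on(0)->pid * 10 + current_on(1)->pid;

    free(a);
    free(b);
    return result;
}
```

Note that esp0 itself points one byte past the block, so masking it directly would land in the next block; it is the frame pushed by the hardware that brings esp back inside.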

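5. How smp_processor_id is implemented

The previous part left one question open. The code does not locate the current CPU's TSS through TR; it uses the smp_processor_id macro to get the CPU number and indexes an in-memory array with it. So how does a CPU learn its own number, given that every CPU executes the same instructions yet must get a different result?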

#define smp_processor_id() (current->processor)

So the task_struct of each process records the number of the processor it is running on. Where does that number come from?

It is set in the schedule function:

asmlinkage void schedule(void)
{
    struct schedule_data * sched_data;
    struct task_struct *prev, *next, *p;
    struct list_head *tmp;
    int this_cpu, c;

    if (!current->active_mm) BUG();
need_resched_back:
    prev = current;
    this_cpu = prev->processor;

……

#ifdef CONFIG_SMP
    next->has_cpu = 1;
    next->processor = this_cpu;
#endif

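When a process is scheduled in, it inherits the processor field of the task it replaces. Every CPU has its own process 0 (the idle task), whose processor field is assigned by the bootstrap processor (BSP) as each CPU is brought up, and from then on the value is handed down from task to task on that CPU.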
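The hand-me-down behaviour of the processor field is easy to model. In this sketch struct task, schedule_model and demo_chain are hypothetical names, and schedule_model copies only the two lines of schedule that matter here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for task_struct with the 2.4 processor field. */
struct task { int pid; int processor; };

/* smp_processor_id() in 2.4 reads the field from the running task. */
static int cpu_of(const struct task *cur)
{
    return cur->processor;
}

/* Model of the relevant lines of schedule(): next inherits prev's CPU. */
static struct task *schedule_model(struct task *prev, struct task *next)
{
    assert(prev != NULL && next != NULL);
    int this_cpu = prev->processor;
    next->processor = this_cpu;
    return next;
}

/* Start from an idle task numbered at boot and switch twice. */
static int demo_chain(int boot_cpu)
{
    struct task idle = { 0, boot_cpu };   /* numbered when the CPU came up */
    struct task a = { 1, -1 }, b = { 2, -1 };
    struct task *cur = &idle;

    cur = schedule_model(cur, &a);
    cur = schedule_model(cur, &b);
    return cpu_of(cur);                   /* still the boot-time number */
}
```

However many switches happen, the number seen through the current task is the one given to the idle task at boot.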

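IV. current on multiprocessor systems in 2.6 (2.6.21 source)

The previous implementation has a drawback: every task_struct carries a processor field recording which CPU it is on. That is somewhat wasteful, since the field only means something for a task that is running. In principle a CPU should be able to find its own number without going through a task_struct.

1. The code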

static __always_inline struct task_struct *get_current(void)
{
    return read_pda(pcurrent);
}

#define current get_current()

#define read_pda(field) pda_from_op("mov",field)

/* This variable is never instantiated.  It is only used as a stand-in
   for the real per-cpu PDA memory, so that gcc can understand what
   memory operations the inline asms() below are performing.  This
   eliminates the need to make the asms volatile or have memory
   clobbers, so gcc can readily analyse them. */
extern struct i386_pda _proxy_pda;

#define pda_from_op(op,field) \
    ({ \
        typeof(_proxy_pda.field) ret__; \
        switch (sizeof(_proxy_pda.field)) { \
        case 1: \
            asm(op "b %%fs:%c1,%0" \
                : "=r" (ret__) \
                : "i" (pda_offset(field)), \
                  "m" (_proxy_pda.field)); \
            break; \
        case 2: \
            asm(op "w %%fs:%c1,%0" \
                : "=r" (ret__) \
                : "i" (pda_offset(field)), \
                  "m" (_proxy_pda.field)); \
            break; \
        case 4: \
            asm(op "l %%fs:%c1,%0" \
                : "=r" (ret__) \
                : "i" (pda_offset(field)), \
                  "m" (_proxy_pda.field)); \
            break; \
        default: __bad_pda_field(); \
        } \
        ret__; })

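After expansion the macro is a single instruction that reads a variable at a fixed offset from the base of the fs segment. In other words, the code assumes that each CPU's fs register refers to a CPU-private structure: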

struct i386_pda
{
    struct i386_pda *_pda;          /* pointer to self */

    int cpu_number;
    struct task_struct *pcurrent;   /* current process */
    struct pt_regs *irq_regs;
};

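Both the task_struct the CPU is currently running and the CPU's logical number live in this structure. Compared with the earlier scheme the reference is reversed: before, the task_struct stored the CPU number; now the per-CPU data holds both current and the CPU number.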
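A user-space model makes the "base plus fixed offset" idea concrete. In this sketch a per-CPU array entry plays the role of the memory that fs refers to, and fs_base_of, init_pda_model, current_on and cpu_number_on are hypothetical helpers; the real code does the load with a single fs-relative mov.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NR_CPUS 2

struct task { int pid; };            /* stand-in for task_struct */

/* Simplified copy of the PDA fields used in this example. */
struct pda_model {
    struct pda_model *_pda;          /* pointer to self */
    int cpu_number;
    struct task *pcurrent;           /* current process */
};

static struct pda_model pda_table[NR_CPUS];

/* Stand-in for the segment base that fs would supply on this CPU. */
static unsigned char *fs_base_of(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    return (unsigned char *)&pda_table[cpu];
}

/* Counterpart of the PDA setup done when a CPU is brought up. */
static void init_pda_model(int cpu, struct task *idle)
{
    struct pda_model *pda = (struct pda_model *)fs_base_of(cpu);
    memset(pda, 0, sizeof *pda);
    pda->_pda = pda;
    pda->cpu_number = cpu;
    pda->pcurrent = idle;
}

/* Same instruction stream for every CPU: load from base + fixed offset. */
static struct task *current_on(int cpu)
{
    struct task *t;
    memcpy(&t, fs_base_of(cpu) + offsetof(struct pda_model, pcurrent), sizeof t);
    return t;
}

static int cpu_number_on(int cpu)
{
    int n;
    memcpy(&n, fs_base_of(cpu) + offsetof(struct pda_model, cpu_number), sizeof n);
    return n;
}
```

The offsets are compile-time constants shared by all CPUs; only the base differs, which is what the per-CPU fs descriptor provides in the kernel.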

2. Initializing what fs points to

cpu_init-->>cpu_set_gdt-->>set_kernel_fs

static inline void set_kernel_fs(void)
{
    /* Set %fs for this CPU's PDA.  Memory clobber is to create a
       barrier with respect to any PDA operations, so the compiler
       doesn't move any before here. */
    asm volatile ("mov %0, %%fs" : : "r" (__KERNEL_PDA) : "memory");
}

cpu_init--->>init_gdt

pda = cpu_pda(cpu);

……

pack_descriptor((u32 *)&gdt[GDT_ENTRY_PDA].a,
        (u32 *)&gdt[GDT_ENTRY_PDA].b,
        (unsigned long)pda, sizeof(*pda) - 1,
        0x80 | DESCTYPE_S | 0x2, 0); /* present read-write data segment */

memset(pda, 0, sizeof(*pda));
pda->_pda = pda;
pda->cpu_number = cpu;
pda->pcurrent = idle;

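So once a CPU comes up, it is given a PDA structure and its fs is made to refer to the start of that structure. On each entry into the kernel, the SAVE_ALL macro reloads fs, just as it reloads ds and es (linux-2.6.21/arch/i386/kernel/entry.S):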

#define SAVE_ALL \

……

    movl $(__USER_DS), %edx; \
    movl %edx, %ds; \
    movl %edx, %es; \
    movl $(__KERNEL_PDA), %edx; \
    movl %edx, %fs

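V. Multiprocessor boot and CPU numbering

The Intel manuals describe a protocol for starting multiple processors, the MP initialization protocol. When a system has several processors, only one, the bootstrap processor (BSP), is active at first; the others wait as application processors (APs). Once the BSP has finished booting, it assigns numbers to the other CPUs and starts them one by one.

Early in boot, the kernel learns how many CPUs are present from information provided by the BIOS. The kernel contains a great deal of ACPI (Advanced Configuration and Power Interface) code; as far as multiprocessor startup is concerned, the key code path is: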

acpi_initialize_tables-->>acpi_os_get_root_pointer--->>acpi_find_rsdp

unsigned long __init acpi_find_rsdp(void)
{
    unsigned long rsdp_phys = 0;

    if (efi_enabled) {
        if (efi.acpi20 != EFI_INVALID_TABLE_ADDR)
            return efi.acpi20;
        else if (efi.acpi != EFI_INVALID_TABLE_ADDR)
            return efi.acpi;
    }
    /*
     * Scan memory looking for the RSDP signature. First search EBDA (low
     * memory) paragraphs and then search upper memory (E0000-FFFFF).
     */
    rsdp_phys = acpi_scan_rsdp(0, 0x400);
    if (!rsdp_phys)
        rsdp_phys = acpi_scan_rsdp(0xE0000, 0x20000);

    return rsdp_phys;
}

static unsigned long __init acpi_scan_rsdp(unsigned long start, unsigned long length)
{
    unsigned long offset = 0;
    unsigned long sig_len = sizeof("RSD PTR ") - 1;

    /*
     * Scan all 16-byte boundaries of the physical memory region for the
     * RSDP signature.
     */
    for (offset = 0; offset < length; offset += 16) {
        if (strncmp((char *)(phys_to_virt(start) + offset), "RSD PTR ", sig_len))
            continue;
        return (start + offset);
    }

    return 0;
}

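That is, the kernel searches the agreed address ranges for a specific signature and validates what it finds. If EFI is enabled, it simply uses the table address handed over by the firmware.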
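The search-and-validate step can be sketched over an ordinary byte buffer. This is not the kernel's code: scan_rsdp, rsdp_checksum_ok and demo_scan are illustrative names, and the check assumes the ACPI 1.0 rule that the first 20 bytes of the RSDP (with the checksum at byte 8) sum to zero modulo 256.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RSDP_SIG     "RSD PTR "
#define RSDP_SIG_LEN 8
#define RSDP_V1_LEN  20   /* bytes covered by the ACPI 1.0 checksum */

/* A structure is valid when its first 20 bytes sum to 0 (mod 256). */
static int rsdp_checksum_ok(const uint8_t *p)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < RSDP_V1_LEN; i++)
        sum = (uint8_t)(sum + p[i]);
    return sum == 0;
}

/* Scan 16-byte boundaries; return the offset of the first valid RSDP or -1. */
static long scan_rsdp(const uint8_t *buf, size_t len)
{
    assert(buf != NULL);
    for (size_t off = 0; off + RSDP_V1_LEN <= len; off += 16) {
        if (memcmp(buf + off, RSDP_SIG, RSDP_SIG_LEN) != 0)
            continue;
        if (rsdp_checksum_ok(buf + off))
            return (long)off;
    }
    return -1;
}

/* Fake region: a decoy signature at 0 (bad checksum), a valid RSDP at 32. */
static long demo_scan(void)
{
    uint8_t region[64] = { 0 };
    uint8_t *r = region + 32;
    uint8_t sum = 0;

    memcpy(region, RSDP_SIG, RSDP_SIG_LEN);   /* decoy without checksum */
    memcpy(r, RSDP_SIG, RSDP_SIG_LEN);
    for (size_t i = 0; i < RSDP_V1_LEN; i++)
        sum = (uint8_t)(sum + r[i]);
    r[8] = (uint8_t)(0u - sum);               /* byte 8 is the checksum field */

    return scan_rsdp(region, sizeof region);
}
```

The decoy shows why the signature alone is not enough: the string can appear in memory by accident, and only the checksum confirms that a real table pointer follows.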

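VI. The fundamental constraint

Fundamentally, if every CPU executes the same code yet obtains a different value, something private to each CPU must be involved, so that the same code means something different on each CPU. What a CPU owns privately is its register set: every CPU has its own. On the 386 the kernel first relied on the TR register (and the stack pointer loaded through it) and later on the fs register.

Other architectures do the same. PowerPC uses the r1 register: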

static inline struct thread_info *current_thread_info(void)
{
    register unsigned long sp asm("r1");

    /* gcc4, at least, is smart enough to turn this into a single
     * rlwinm for ppc32 and clrrdi for ppc64 */
    return (struct thread_info *)(sp & ~(THREAD_SIZE-1));
}

The alpha architecture uses the $8 register:

/* How to get the thread information struct from C.  */
register struct thread_info *__current_thread_info __asm__("$8");
#define current_thread_info()  __current_thread_info

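This is similar to how many compilers implement thread-local variables. If you are interested, look at gcc's implementation of thread-local variables: in essence it also relies on the fact that different threads have different register sets.

VII. Thread-local variables in glibc and the kernel

glibc-2.6/nptl/sysdeps/i386/tls.h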

/* Code to initially initialize the thread pointer.  This might need
   special attention since 'errno' is not yet available and if the
   operation can cause a failure 'errno' must not be touched.  */
# define TLS_INIT_TP(thrdescr, secondcall) \
  ({ void *_thrdescr = (thrdescr); \
     tcbhead_t *_head = _thrdescr; \
     union user_desc_init _segdescr; \
     int _result; \
 \
     _head->tcb = _thrdescr; \
     /* For now the thread descriptor is at the same address.  */ \
     _head->self = _thrdescr; \
     /* New syscall handling support.  */ \
     INIT_SYSINFO; \
 \
     /* The 'entry_number' field.  Let the kernel pick a value.  */ \
     if (secondcall) \
       _segdescr.vals[0] = TLS_GET_GS () >> 3; \
     else \
       _segdescr.vals[0] = -1; \
     /* The 'base_addr' field.  Pointer to the TCB.  */ \
     _segdescr.vals[1] = (unsigned long int) _thrdescr; \
     /* The 'limit' field.  We use 4GB which is 0xfffff pages.  */ \
     _segdescr.vals[2] = 0xfffff; \
     /* Collapsed value of the bitfield: \
          .seg_32bit = 1 \
          .contents = 0 \
          .read_exec_only = 0 \
          .limit_in_pages = 1 \
          .seg_not_present = 0 \
          .useable = 1 */ \
     _segdescr.vals[3] = 0x51; \
 \
     /* Install the TLS.  */ \
     asm volatile (TLS_LOAD_EBX \
                   "int $0x80\n\t" \
                   TLS_LOAD_EBX \
                   : "=a" (_result), "=m" (_segdescr.desc.entry_number) \
                   : "0" (__NR_set_thread_area), \
                     TLS_EBX_ARG (&_segdescr.desc), "m" (_segdescr.desc)); \
     /* Author's note: set_thread_area asks the kernel to allocate and \
        install a descriptor whose base is the user-space TCB address \
        passed in _segdescr.desc. The descriptor's index in the GDT comes \
        back in _segdescr.desc.entry_number, because user mode cannot \
        modify the kernel's GDT itself.  */ \
 \
     if (_result == 0) \
       /* We know the index in the GDT, now load the segment register. \
          The use of the GDT is described by the value 3 in the lower \
          three bits of the segment descriptor value. \
 \
          Note that we have to do this even if the numeric value of \
          the descriptor does not change.  Loading the segment register \
          causes the segment information from the GDT to be loaded which \
          is necessary since we have changed it.   */ \
       TLS_SET_GS (_segdescr.desc.entry_number * 8 + 3); \
     /* Author's note: the GDT index returned by the kernel is turned \
        into a selector and loaded into the gs register.  */ \
 \
     _result == 0 ? NULL \
     : "set_thread_area failed when setting up thread-local storage\n"; })

/* Return the address of the dtv for the current thread.  */
# define THREAD_DTV() \
  ({ struct pthread *__pd; \
     THREAD_GETMEM (__pd, header.dtv); })
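The expression entry_number * 8 + 3 in the macro builds an x86 segment selector: the descriptor index goes in bits 15..3, the table indicator (0 for the GDT) in bit 2, and the requested privilege level in bits 1..0. A minimal sketch of that arithmetic follows; make_selector and selector_index are hypothetical helper names, not glibc functions.

```c
#include <assert.h>
#include <stdint.h>

/* Build a selector from a descriptor index, table indicator and RPL. */
static uint16_t make_selector(unsigned index, unsigned ti, unsigned rpl)
{
    assert(index < 8192 && ti < 2 && rpl < 4);
    return (uint16_t)((index << 3) | (ti << 2) | rpl);
}

/* Recover the descriptor index, as TLS_GET_GS () >> 3 does in the macro. */
static unsigned selector_index(uint16_t sel)
{
    return sel >> 3;
}
```

With an example index of 6, a GDT descriptor and ring 3, make_selector(6, 0, 3) gives 6 * 8 + 3 = 0x33, and shifting that value right by 3 returns the index 6 again.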

Original article: https://www.cnblogs.com/tsecer/p/10487596.html