    Reading Notes: x86 Syscall Primer

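    Original author: Russ Blaine
    Original source: http://blogs.sun.com/roller/page/rab

    Translated and annotated by: Badcoffee
    Email: blog.oliver@gmail.com
    Blog: http://blog.csdn.net/yayong
    July 2005

    Foreword: Getting started with something as complex as an operating system is a real headache. To help newcomers get their bearings, this article discusses the basic system call infrastructure of Solaris x86 and Solaris x64.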

    x86 syscall primer

    Getting started on a project as complex as an operating system can be quite a daunting task. To help OpenSolaris newcomers sort out their head from their tail, here's a look at the system call infrastructure on Solaris x86 and Solaris x64.

    I'll go over the different system call methods used, their departure points in userland and entry points in the kernel, and then we'll actually follow one into the kernel with the debugger to see it all in action.

    Note:
    1. Personally, I think the best starting point for learning an operating system is the system call, because it is the entry point from user mode into kernel mode. Apparently we are not the only ones who find operating systems complex: even a kernel developer calls it "quite a daunting task". So don't lose heart when you get stuck.

    2. "Sort out their head from their tail" is an idiom meaning roughly "make sense of things".

    Background

    Processors in the x86 world support a number of different system call methods, and some are faster than others. In Solaris, unoptimized system calls take one of three possible paths into the kernel:

    Note:
    3. x86 processors support many system call methods, some faster than others. In Solaris, an unoptimized system call uses one of the three methods listed below.

    lcall $0x27
    Used for years as the standard Solaris syscall method.
    int $0x91
    Linux has used the int instruction (with vector 0x80) for years. Solaris finally adopted int as the base syscall method in Solaris 11 (under development), and earned a significant performance increase as a result. It will be available soon in a Solaris 10 update.
    lcall $0x7
    Used by some (very old) statically linked binaries.
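    Note:
    4. lcall uses the x86 call gate mechanism. lcall $0x27 is the standard way to make a system call on Solaris, while lcall $0x7 appears only in very old statically linked Solaris binaries.
    5. The int method uses the interrupt gate provided by x86. int $0x91 is the method Solaris is about to adopt in version 11; it improves performance significantly and will soon appear in a Solaris 10 update release as well. Linux and FreeBSD use the same mechanism, except that they use int $0x80, so only the interrupt vector number differs.

    x86 CPUs support four kinds of gates:

    Interrupt gate -- used to respond to hardware interrupts
    Trap gate -- used by Windows/Linux/Unix systems for interrupt handling, system calls, and exception handling
    Call gate -- used by Linux/Unix to implement system services and to stay compatible with applications built for earlier versions
    Task gate -- not used by current operating systems because it is slow and limits the number of tasks; only early Linux 2.0 used it

    For details on x86 gates, see Volume 3 (System Programming Guide) of the Intel Pentium 4 manuals.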

    Fast Syscalls and Hardware Capability Libraries

    When a well-behaved application makes a system call, it jumps through a wrapper function in libc. Changing the instruction used to enter the kernel becomes a matter of changing the wrappers in libc. Recently I integrated support for faster, chip-proprietary system calls into Solaris 10: sysenter (from Intel) and syscall (from AMD). Along with new kernel entry points, new hwcap (as in "hardware capability") versions of libc were provided to take advantage of these new, faster instructions (Tim Marsland has written about the hw capability architecture and Darren Moffat has written about how the system goes about selecting and using a hwcap libc).

    Note:
    6. An application normally makes a system call through a wrapper function in libc. The wrapper ends up executing one of the instructions described above to enter kernel mode.
    7. The author recently integrated support for faster, chip-specific system call instructions into Solaris 10: Intel's sysenter and AMD's syscall. Along with the new kernel entry points, new hardware capability versions of libc were provided that use these faster instructions. The author also points to two related articles.

    I often get confused about which system call method is used on which type of system. For the record, the following table shows which methods are supported by the various flavor combinations of x86 kernels, CPUs, and user application types shipping today:

    u64 = 64-bit user applications        u32 = 32-bit user applications



    Kernel          CPU           syscall                          sysenter
    64-bit kernel   Intel Xeon    u64 (64-bit libc)                u32 (hwcap1)
                    AMD Opteron   u64 (64-bit libc), u32 (hwcap2)  -
    32-bit kernel   Intel Xeon    -                                u32 (hwcap1)
                    AMD Opteron   u32                              u32 (hwcap1)
    (The hwcap libraries referenced live in the /usr/lib/libc directory.)
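
    The support table above can also be written down as a small lookup structure. The Python sketch below is illustrative only: the dictionary transcribes the table as printed here, and the names SUPPORT and fast_methods are made up for this example and are not part of Solaris.

```python
# Support matrix transcribed from the table above.
# Keys: (kernel_bits, cpu); values: {instruction: [application types]}.
SUPPORT = {
    (64, "Intel Xeon"): {
        "syscall": ["u64 (64-bit libc)"],
        "sysenter": ["u32 (hwcap1)"],
    },
    (64, "AMD Opteron"): {
        "syscall": ["u64 (64-bit libc)", "u32 (hwcap2)"],
        "sysenter": [],
    },
    (32, "Intel Xeon"): {
        "syscall": [],
        "sysenter": ["u32 (hwcap1)"],
    },
    (32, "AMD Opteron"): {
        "syscall": ["u32"],
        "sysenter": ["u32 (hwcap1)"],
    },
}


def fast_methods(kernel_bits, cpu, app):
    """Return the fast syscall instructions the table lists for `app`.

    `app` is "u32" or "u64". An empty list means the table lists no
    fast instruction for that combination.
    """
    row = SUPPORT[(kernel_bits, cpu)]
    return [insn for insn, apps in row.items()
            if any(entry.startswith(app) for entry in apps)]


if __name__ == "__main__":
    print(fast_methods(64, "AMD Opteron", "u32"))  # ['syscall']
    print(fast_methods(64, "Intel Xeon", "u32"))   # ['sysenter']
```

    Keeping the table as data rather than as nested conditionals makes it easy to compare against the original if the supported combinations change.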


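    Note:
    8. The table above shows which libc version is used on Intel Xeon and AMD Opteron under 32-bit and 64-bit kernels:

    With a 64-bit Solaris kernel, the 64-bit libc (u64) uses the syscall instruction on both Xeon and Opteron, presumably because AMD was a step ahead in 64-bit technology and Intel had to follow.
    With a 64-bit Solaris kernel, a 32-bit libc is also provided to support 32-bit applications, and Solaris ships a different 32-bit libc for each of the two CPUs:

    u32 (hwcap1) -- the hardware capability 1 version of libc, which supports Intel's fast system call instructions SYSENTER/SYSEXIT
    u32 (hwcap2) -- the hardware capability 2 version of libc, which supports AMD's fast system call instructions SYSCALL/SYSRET


    For more on the Intel and AMD fast system call instructions, see the article "Linux 2.6 对新型 CPU 快速系统调用的支持" (Linux 2.6 support for fast system calls on new CPUs). For the full details, read the Intel and AMD system programming manuals.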

    Digging In

    To illustrate this, let's take a look at the libc source code. It lives under the usr/src/lib/libc directory. The important entries here are:

    • i386/ - 32-bit source code and unoptimized binary
    • amd64/ - 64-bit source code and binary
    • i386_hwcap1/ - Intel CPU-specific source code and binary
    • i386_hwcap2/ - AMD CPU-specific source code and binary


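    Note:
    9. These are the libc source paths. Reading syscall.s under i386/sys and amd64/sys, together with the macro definitions in the Makefiles under the i386_hwcap1 and i386_hwcap2 source directories, shows how the four libc versions differ.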


    A simple system call to use for this example is mkdir(2). We can use mdb to disassemble the text bits and see how libc jumps into the kernel:

    rab> mdb /lib/libc.so.1
    Loading modules: [ libc.so.1 ]
    > mkdir::dis
    mkdir: movl $0x50,%eax
    mkdir+5: syscall
    mkdir+7: jb -0x82847 <__cerror>
    mkdir+0xd: ret

    We can see that the system call number (See Eric Schrock's post for more information on system call numbers) is stashed away in register %eax so the kernel can find it later, and then the syscall instruction is executed to transfer control to the kernel.

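    Note:
    10. Disassembling the libc mkdir(2) entry with mdb shows that it is just a simple wrapper function: it puts the system call number in the eax register and then enters the kernel with the syscall instruction.
    12. The system call number of mkdir is 0x50, i.e., 80 in decimal. The definition can be found in syscall.h: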

    #define SYS_mkdir 80


    This example is on an AMD Opteron system, because otherwise we'd expect to find either lcall $0x27 or sysenter as the control transfer instruction. We can get at the unoptimized libc by unmounting the hwcap library:

    rab> su
    Password:
    # umount /lib/libc.so.1
    rab> mdb /lib/libc.so.1
    Loading modules: [ libc.so.1 ]
    > mkdir::dis
    mkdir: movl $0x50,%eax
    mkdir+5: lcall $0x27,$0x0
    mkdir+0xc: jb -0x82b2c <__cerror>
    mkdir+0x12: ret

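    Note:
    13. After libc.so.1 is unmounted, what remains is the unoptimized libc, and the instruction that issues the system call is now lcall $0x27. The author presumably ran this experiment on Solaris 10. On OpenSolaris, the unoptimized libc should already use int $0x91 for system calls; see notes 15 and 16 below.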
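
    The instruction offsets in the two mkdir disassembly listings can be cross-checked against the standard 32-bit x86 encodings of the instructions involved. The sketch below is only a sanity check, not something from the original article; the names ENCODED_LENGTH, HWCAP2, BASE, and offsets are invented for this example.

```python
# Encoded lengths in bytes of the instructions seen in the listings
# (32-bit x86 code): opcode bytes plus operand bytes.
ENCODED_LENGTH = {
    "movl imm32,%eax": 1 + 4,     # B8 + 4-byte immediate
    "syscall": 2,                 # 0F 05
    "lcall ptr16:32": 1 + 4 + 2,  # 9A + 4-byte offset + 2-byte selector
    "jb rel32": 2 + 4,            # 0F 82 + 4-byte displacement
    "ret": 1,                     # C3
}

# Instruction sequences of the hwcap2 wrapper and the unoptimized wrapper.
HWCAP2 = ["movl imm32,%eax", "syscall", "jb rel32", "ret"]
BASE = ["movl imm32,%eax", "lcall ptr16:32", "jb rel32", "ret"]


def offsets(instructions):
    """Return the starting offset of each instruction in the sequence."""
    result, position = [], 0
    for name in instructions:
        result.append(position)
        position += ENCODED_LENGTH[name]
    return result


if __name__ == "__main__":
    print([hex(o) for o in offsets(HWCAP2)])  # ['0x0', '0x5', '0x7', '0xd']
    print([hex(o) for o in offsets(BASE)])    # ['0x0', '0x5', '0xc', '0x12']
```

    The computed offsets match mkdir+5, mkdir+7, and mkdir+0xd in the syscall listing and mkdir+5, mkdir+0xc, and mkdir+0x12 in the lcall listing: the only difference between the two wrappers is the 2-byte syscall versus the 7-byte far call.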


    Tracing it back to the source

    Ah-hah - now let's look at the source for the libc mkdir(2) wrapper to complete the userland picture:

    rab> pwd
    .../usr/src/lib/libc/common/sys
    rab> cat mkdir.s
    [ snip ]
    #include "SYS.h"

    SYSCALL_RVAL1(mkdir)
    RET
    SET_SIZE(mkdir)

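    Note:
    14. This shows how mkdir is implemented in libc: it simply uses the SYSCALL_RVAL1 macro. Judging by its name, the macro is meant for system calls that return a single value.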

    In order to organize the source in a portable way that avoids reproducing the same code in more than one place, many portions of libc are implemented as preprocessor macros. mkdir(2) is so simple that it needs nothing but the SYSCALL macro, found in SYS.h. For reasons too boring to repeat here, the SYSCALL macro eventually expands into a corresponding SYSTRAP macro. All 32-bit variants of libc share one SYS.h, and preprocessor macros defined via Makefiles in the binary directories determine which instructions go into the SYSTRAP macro:

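    Note:
    15. The SYSCALL* macros exist mainly to avoid duplicating code in several places; each expands to a corresponding SYSTRAP macro. The SYSCALL* macros are defined in SYS.h, and the macro definitions in the Makefiles under the i386_hwcap1 and i386_hwcap2 source directories decide which SYSTRAP variant is used.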

    rab> pwd
    .../usr/src/lib/libc/i386/inc
    rab> grep SYSTRAP_RVAL1 SYS.h
    #define SYSTRAP_RVAL1(name) __SYSCALL(name)
    #define SYSTRAP_RVAL1(name) __SYSENTER(name)
    #define SYSTRAP_RVAL1(name) __SYSLCALL(name)

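    One of the above macros is used depending on which libc is being built: __SYSCALL() for hwcap2, __SYSENTER() for hwcap1, and __SYSLCALL() for the unoptimized base libc at /lib/libc.so.1.

    Note:
    16. As shown, depending on the macro definitions in the Makefiles under the i386_hwcap1 and i386_hwcap2 directories, libc is built as the hwcap2 version that uses __SYSCALL(), as the hwcap1 version that uses __SYSENTER(), or as the unoptimized version (as noted earlier, Solaris 10 uses lcall $0x27 and OpenSolaris uses int $0x91).

    In fact, in every 32-bit libc, including the hwcap1 libc, not every system call goes through __SYSENTER(). System calls with more than one return value still use lcall $0x27 or int $0x91. The 32-bit libc source file SYS.h in OpenSolaris contains the following definitions: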

    #define SYSTRAP_RVAL2(name)	__SYSCALLINT(name)
    #define SYSTRAP_2RVALS(name) __SYSCALLINT(name)
    #define SYSTRAP_64RVAL(name) __SYSCALLINT(name)

    As shown, OpenSolaris implements system calls with multiple return values using int $0x91.

    rab> cat SYS.h
    [ snip ]
    #define __SYSLCALL(name) \
    /* CSTYLED */ \
    movl $SYS_/**/name, %eax; \
    lcall $SYSCALL_TRAPNUM, $0
    [ snip ]
    #define __SYSCALL(name) \
    /* CSTYLED */ \
    movl $SYS_/**/name, %eax; \
    .byte 0xf, 0x5 /* syscall */

    We added support for AMD's syscall instruction to Solaris, but we were using a slightly older version of our assembler which (embarrassingly enough) didn't yet recognize the instruction, so its opcode had to be manually hard-coded into libc.

    Note:
    17. Because the assembler used for development was slightly out of date and did not yet recognize AMD Opteron's syscall instruction, the __SYSCALL macro hard-codes the instruction's machine code.

    In addition, the OpenSolaris SYS.h contains the implementation of the new int $0x91 method:

    #define __SYSCALLINT(name) \
    /* CSTYLED */ \
    movl $SYS_/**/name, %eax; \
    int $T_SYSCALLINT
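
    To make the macro expansion concrete, the Python sketch below imitates what the preprocessor does with the three trap macros quoted from SYS.h. It is a simulation only. It assumes the traditional preprocessor behavior in which the empty comment in $SYS_/**/name joins the two tokens, and the names TRAP_BODIES and expand are invented for this example.

```python
# Bodies of the trap macros as quoted from SYS.h, with the `/**/`
# token paste replaced by a Python format field.
TRAP_BODIES = {
    "__SYSLCALL": ["movl $SYS_{name}, %eax", "lcall $SYSCALL_TRAPNUM, $0"],
    "__SYSCALL": ["movl $SYS_{name}, %eax", ".byte 0xf, 0x5"],
    "__SYSCALLINT": ["movl $SYS_{name}, %eax", "int $T_SYSCALLINT"],
}


def expand(macro, name):
    """Return the assembly lines that `macro(name)` would expand to."""
    return [line.format(name=name) for line in TRAP_BODIES[macro]]


if __name__ == "__main__":
    for macro in TRAP_BODIES:
        print(macro, "->", "; ".join(expand(macro, "mkdir")))
    # The hard-coded bytes in __SYSCALL are the two-byte syscall opcode.
    print(bytes([0xf, 0x5]).hex())  # 0f05
```

    In every variant the first line is the same load of SYS_mkdir into %eax; only the instruction that transfers control to the kernel changes.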

    Jumping Over the Fence

    That's all for userland; the easy part is over. Because the actual workings of the differing system call instructions vary widely, the kernel uses separate code paths to deal with each. The function entry points used are (shown are only those for 32-bit applications making system calls):


    Kernel          Entry Instruction   Kernel Entry Point
    64-bit kernel   lcall*              trap()
                    syscall             sys_syscall32()
                    sysenter            sys_sysenter()
    32-bit kernel   lcall               sys_call()
                    sysenter            sys_sysenter()

    * In the 64-bit kernel, 32-bit system calls made via lcall come in to the system via a segment-not-present trap (#np), a matter which is beyond the scope of this document. Trust me, you don't want to get into segmentation now...
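
    The entry point table above can be turned into a small helper that names the kernel function to break on for a given kernel and entry instruction. This is a hedged sketch: the dictionary transcribes the table for 32-bit applications only, the addr:b form follows the kmdb breakpoint syntax used in this article, and ENTRY_POINTS and breakpoint_command are hypothetical names.

```python
# Kernel entry points for system calls made by 32-bit applications,
# keyed by (kernel_bits, entry instruction), transcribed from the table.
ENTRY_POINTS = {
    (64, "lcall"): "trap",
    (64, "syscall"): "sys_syscall32",
    (64, "sysenter"): "sys_sysenter",
    (32, "lcall"): "sys_call",
    (32, "sysenter"): "sys_sysenter",
}


def breakpoint_command(kernel_bits, instruction):
    """Build the kmdb command that sets a breakpoint on the entry point."""
    symbol = ENTRY_POINTS[(kernel_bits, instruction)]
    return symbol + ":b"


if __name__ == "__main__":
    print(breakpoint_command(64, "syscall"))   # sys_syscall32:b
    print(breakpoint_command(32, "sysenter"))  # sys_sysenter:b
```

    A combination that is missing from the table, such as syscall on the 32-bit kernel, raises KeyError here rather than guessing an entry point.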

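    Note:
    18. The table above lists only the kernel entry points for system calls made by 32-bit applications. To support the various system call instructions, the kernel implements a separate handler code path for each.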

    Seeing it in Action

    Using the kernel debugger we can step out of the classroom and watch these creatures in their native wild habitats. Boot a machine and from the system console get the kernel debugger loaded and ready. Enter the debugger, and then set a breakpoint on the syscall entry point. I'm still using the same Opteron machine as above (running the 64-bit kernel), so I need to re-mount the hwcap library:

    root> mount -O -F lofs /usr/lib/libc/libc_hwcap2.so.1 /lib/libc.so.1

    Note:
    19. Because the author unmounted the hwcap2 libc earlier, it has to be mounted on /lib/libc.so.1 again before the hwcap2 version can be used.
    root> mdb -K
    Welcome to kmdb
    Loaded modules: [ cpc ptm ufs unix krtld sppp nca lofs genunix ip logindmux usba specfs nfs random sctp ]
    [0]> sys_syscall32:b
    [0]> :c
    kmdb: stop at sys_syscall32
    kmdb: target stopped at:
    sys_syscall32: swapgs
    [1]> ::cpuinfo
    ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
    0 fffffffffbc230a0 1b 0 0 60 no no t-0 ffffffff82b38520 fsflush
    1 ffffffff8bdd1800 1b 0 0 49 no no t-0 ffffffff8cc991e0 ksh

    We set a breakpoint, and tripped over it immediately after continuing (because system calls are a very common occurrence on even an idle machine). We can see that CPU1 tripped the breakpoint first (as evidenced by the [1] in the kmdb prompt), and that ksh is the process running. Which system call is the shell making?

    Note:
    20. The author uses mdb -K to enter kmdb, sets a kernel breakpoint on sys_syscall32, the entry point for system calls from 32-bit applications in the 64-bit kernel (see the table earlier), and then resumes the kernel with :c:

    [0]> sys_syscall32:b ; set the breakpoint
    [0]> :c ; resume execution

    21. Even on an idle machine, system calls happen very frequently, so CPU1 reaches the breakpoint almost at once; the kmdb prompt [1] indicates that execution stopped on CPU1. ::cpuinfo shows that the user process ksh is running on CPU1.

    Remember that the libc wrapper function stashed the system call number in register %eax. When we are in the 64-bit kernel, %eax is the lower 32 bits of register %rax:

    [1]> <rax=D
    98

    Note:
    22. The libc wrapper stores the call number in eax, and on Opteron eax is the low 32 bits of rax, so the author inspects rax directly and prints it in decimal.
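
    The relationship between %rax and %eax is a plain bit mask, which the short Python sketch below illustrates. The function name eax_from_rax and the second sample value (with arbitrary upper bits) are made up for this example; only the value 98 comes from the debugger session above.

```python
def eax_from_rax(rax):
    """Return the low 32 bits of a 64-bit register value."""
    return rax & 0xFFFFFFFF


if __name__ == "__main__":
    # Value seen in the kmdb session: the upper 32 bits are zero.
    print(eax_from_rax(98))                  # 98
    # Hypothetical value with non-zero upper bits: only the low half counts.
    print(eax_from_rax(0xDEADBEEF00000062))  # 98
```

    Printing the result in decimal corresponds to what the D format character does in the <rax=D command.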

    syscall 98, which -- according to the sysent table (see sysent.c) -- is the shell doing a sigaction(2) (which makes sense, because shells are always messing around with signals).


    Note:
    23. So system call 98 is sigaction(2), which is plausible because shells deal with signals all the time.

    Clear the breakpoint and try the same thing with the 64-bit entry point (it is sys_syscall()), but this time enter the debugger by sending a break over the console (how one does this varies depending on the terminal being used to access the console):

    [1]> :z
    [1]> sys_syscall:b
    [1]> :c
    root>
    root>
    root>

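    Note:
    24. Clear the previous breakpoint, set a breakpoint on sys_syscall, the entry point for system calls from 64-bit applications in the 64-bit kernel, and then resume execution.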

    Because this is an otherwise idle machine, nothing trips the 64-bit syscall breakpoint just yet. There just aren't very many 64-bit processes running. We can run one manually to trigger the breakpoint:

    root> /usr/bin/amd64/ls 
    kmdb: stop at sys_syscall
    kmdb: target stopped at:
    sys_syscall: swapgs
    [1]> <rax=D
    115

    We see that the first 64-bit system call made by the 64-bit ls is mmap(2), which makes sense because the 64-bit dynamic linker needs to begin setting up the new process's address space.

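    Note:
    25. Because the machine is idle and few 64-bit applications are running, nothing hits the breakpoint after execution resumes. The author therefore runs the 64-bit ls by hand to trigger it. The system call number turns out to be that of mmap(2), which also makes sense: when a program starts, the 64-bit dynamic linker first uses mmap to set up the new process's address space.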
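
    The three system call numbers observed in this walkthrough can be collected in a small table that converts between the hexadecimal form shown in the disassembly and the decimal form printed by kmdb. This is a sketch for illustration only: the dictionary contains just the numbers mentioned in this article, not the full sysent table, and the names SEEN and describe are invented.

```python
# System call numbers mentioned in this article (decimal), with the
# call each one corresponds to.
SEEN = {80: "mkdir", 98: "sigaction", 115: "mmap"}


def describe(number):
    """Format a system call number in decimal and hex with its name."""
    name = SEEN.get(number, "not mentioned in this article")
    return f"{number} (0x{number:x}): {name}"


if __name__ == "__main__":
    print(describe(0x50))  # 80 (0x50): mkdir
    print(describe(98))    # 98 (0x62): sigaction
    print(describe(115))   # 115 (0x73): mmap
```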


  • Original URL: https://www.cnblogs.com/ainima/p/6330832.html