ESR information printed on a kernel exception
<1>[ 7766.006249] Unhandled fault at 0xffffff800188d408
<1>[ 7766.006256] Mem abort info:
<1>[ 7766.006259] ESR = 0x86000003
<1>[ 7766.006264] Exception class = IABT (current EL), IL = 32 bits
<1>[ 7766.006268] SET = 0, FnV = 0
<1>[ 7766.006271] EA = 0, S1PTW = 0
<1>[ 7766.006277] swapper pgtable: 4k pages, 39-bit VAs, pgdp = 00000000352033d5
<1>[ 7766.006281] [ffffff800188d408] pgd=000000009d7fe003, pud=000000009d7fe003, pmd=00000000625c6003, pte=0040080063544793
<0>[ 7766.006294] Internal error: level 3 address size fault: 86000003 [#1] PREEMPT SMP
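Explanation of the ESR information
The ESR (Exception Syndrome Register (EL1)) value printed in the kernel exception above is 0x86000003. Look at the ESR_EL1 register bit assignment:
ESR_EL1 is a 64-bit register. Start with the EC (exception class) field, which occupies bits [31:26] of the register, 6 bits in total.
The meaning of the ISS field depends on the value of EC.
In this example the EC value is 0x21 (0b100001). The EC table below shows that 0b100001 is an Instruction Abort, so the next step is to look at the ISS encoding for an Instruction Abort.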
EC | Meaning | ISS | Applies when |
---|---|---|---|
0b000000 | Unknown reason. | ISS encoding for exceptions with an unknown reason | |
0b000001 | Trapped WF* instruction execution. Conditional WF* instructions that fail their condition code check do not cause an exception. | ISS encoding for an exception from a WF* instruction | |
0b100001 | Instruction Abort taken without a change in Exception level. Used for MMU faults generated by instruction accesses and synchronous External aborts, including synchronous parity or ECC errors. Not used for debug-related exceptions. | ISS encoding for an exception from an Instruction Abort | |
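The field extraction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `decode_esr` and `EC_NAMES` are made up for this example, and the dictionary covers just the three EC values listed in the table above.

```python
# Sketch: split an ESR_EL1 value into EC, IL and ISS.
# decode_esr and EC_NAMES are hypothetical names used only in this example.

EC_NAMES = {
    0b000000: "Unknown reason",
    0b000001: "Trapped WF* instruction execution",
    0b100001: "Instruction Abort taken without a change in Exception level",
}

def decode_esr(esr):
    """Return (EC, IL, ISS) extracted from the low 32 bits of ESR_EL1."""
    ec = (esr >> 26) & 0x3F    # bits [31:26], exception class
    il = (esr >> 25) & 0x1     # bit [25], instruction length
    iss = esr & 0x1FFFFFF      # bits [24:0], instruction specific syndrome
    return ec, il, iss

if __name__ == "__main__":
    ec, il, iss = decode_esr(0x86000003)
    name = EC_NAMES.get(ec, "not listed in this sketch")
    print(f"EC=0x{ec:02x} ({name}), IL={il}, ISS=0x{iss:x}")
```

For the ESR value in the log, this yields EC 0x21, IL 1 and ISS 0x3, which agrees with the "IABT (current EL), IL = 32 bits" line printed by the kernel.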
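The field that matters most is IFSC. Its values are explained in the table below. In this example the IFSC field is 3 (0b000011), which means "Address size fault, level 3" and matches the "level 3 address size fault" in the Internal error line of the log.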
ISS encoding for an exception from an Instruction Abort
Bits | Field |
---|---|
[24:13] | RES0 |
[12:11] | SET |
[10] | FnV |
[9] | EA |
[8] | RES0 |
[7] | S1PTW |
[6] | RES0 |
[5:0] | IFSC |
IFSC, bits [5:0]
Instruction Fault Status Code.
IFSC | Meaning | Applies when |
---|---|---|
0b000000 | Address size fault, level 0 of translation or translation table base register. | |
0b000001 | Address size fault, level 1. | |
0b000010 | Address size fault, level 2. | |
0b000011 | Address size fault, level 3. | |
0b000100 | Translation fault, level 0. | |
0b000101 | Translation fault, level 1. | |
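The printed "IL = 32 bits" means the IL (instruction length) bit is 1, which denotes a 32-bit instruction, i.e. an instruction 4 bytes long. Note that for an Instruction Abort the architecture always sets IL to 1, so here the value says nothing further about the faulting instruction.
For the full description of the ESR_EL1 register, see the following link: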
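As a rough illustration of the ISS decoding above, the following Python sketch pulls the SET, FnV, EA, S1PTW and IFSC fields out of an ESR value. The names `decode_iabt_iss` and `IFSC_MEANING` are assumptions made for this example, and the lookup table contains only the six IFSC values quoted in this article.

```python
# Sketch: decode the ISS of an Instruction Abort (EC = 0b100001).
# decode_iabt_iss and IFSC_MEANING are hypothetical names for this example.

IFSC_MEANING = {
    0b000000: "Address size fault, level 0 of translation or translation table base register",
    0b000001: "Address size fault, level 1",
    0b000010: "Address size fault, level 2",
    0b000011: "Address size fault, level 3",
    0b000100: "Translation fault, level 0",
    0b000101: "Translation fault, level 1",
}

def decode_iabt_iss(esr):
    """Return the Instruction Abort ISS fields of an ESR_EL1 value as a dict."""
    iss = esr & 0x1FFFFFF              # ISS is bits [24:0]
    return {
        "SET": (iss >> 11) & 0x3,      # bits [12:11]
        "FnV": (iss >> 10) & 0x1,      # bit [10]
        "EA": (iss >> 9) & 0x1,        # bit [9]
        "S1PTW": (iss >> 7) & 0x1,     # bit [7]
        "IFSC": iss & 0x3F,            # bits [5:0]
    }

if __name__ == "__main__":
    fields = decode_iabt_iss(0x86000003)
    print(fields)
    print(IFSC_MEANING.get(fields["IFSC"], "not listed in this sketch"))
```

With the ESR from the log, all of SET, FnV, EA and S1PTW come out as 0 and IFSC as 3, consistent with the values the kernel printed.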
https://developer.arm.com/documentation/ddi0595/2021-06/AArch64-Registers/ESR-EL1--Exception-Syndrome-Register--EL1-?lang=en#fieldset_0-24_0_14-5_0
On a kernel exception, the PGD/PUD/PMD/PTE entries for the current fault address are also printed
<1>[ 7766.006281] [ffffff800188d408] pgd=000000009d7fe003, pud=000000009d7fe003, pmd=00000000625c6003, pte=0040080063544793
pgd=000000009d7fe003
pud=000000009d7fe003
pmd=00000000625c6003
pte=0040080063544793
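This kernel exception (KE) occurred on an ARM64 machine with 2 GB of DRAM, so the PGD/PUD/PMD page table descriptor values look normal. The PTE page table descriptor value, however, is wrong: the physical address it encodes is 0x80063544000, while on a machine with 2 GB of DRAM the physical address should not exceed 0xFFFFFFFF.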
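The check above can be reproduced with a small Python sketch. It assumes the 4 KB granule shown in the log ("4k pages") and a 48-bit output address, so the address sits in descriptor bits [47:12]. The names `output_address` and `PA_LIMIT` are hypothetical, and the 0xFFFFFFFF limit is simply the bound stated in this article for the 2 GB machine.

```python
# Sketch: extract the output address from ARM64 page table descriptors.
# Assumes a 4 KB granule and a 48-bit output address (bits [47:12]).
# output_address and PA_LIMIT are hypothetical names for this example.

ADDR_MASK = ((1 << 48) - 1) & ~((1 << 12) - 1)   # keep bits [47:12]
PA_LIMIT = 0xFFFFFFFF                            # bound assumed in this article

def output_address(descriptor):
    """Return the physical address field of a table/page descriptor."""
    return descriptor & ADDR_MASK

if __name__ == "__main__":
    entries = {
        "pgd": 0x000000009D7FE003,
        "pud": 0x000000009D7FE003,
        "pmd": 0x00000000625C6003,
        "pte": 0x0040080063544793,
    }
    for name, desc in entries.items():
        pa = output_address(desc)
        verdict = "ok" if pa <= PA_LIMIT else "out of range"
        print(f"{name}: PA=0x{pa:x} ({verdict})")
```

Only the pte entry falls outside the assumed range, which is in line with the level 3 address size fault reported by the kernel.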
The Code line in the kernel oops log
[ 794.274311] Code: f946a2c9 12001eea 0b350157 9b1b2789 (39402529)
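When an oops occurs in the kernel, for example a data abort or an instruction abort, the kernel prints the instruction that triggered the abort (shown in parentheses) together with the few instructions before it. This instruction can be used to locate the corresponding position in the source code.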
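For example, if the oops occurred in a function of some .ko, disassemble that function and search the output for 39402529. That instruction and the ones just before it are:
227c7c: 12001eea and w10, w23, #0xff
227c80: 0b350157 add w23, w10, w21, uxtb
227c84: 9b1b2789 madd x9, x28, x27, x9
227c88: 39402529 ldrb w9, [x9,#9]
Running llvm-symbolizer on the address shown in front of the 39402529 instruction (0x227c88, not the opcode itself) then gives the corresponding source code location:
llvm-symbolizer -e xxx.ko 0x227c88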
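The search can also be automated. The Python sketch below takes the Code line and an objdump-style listing, then reports every address where the parenthesised opcode appears with the same preceding opcodes. It is only an illustration: the function names are made up for this example, and the listing format ("address: opcode mnemonic ...") is assumed to match the excerpt shown in this article.

```python
import re

# Sketch: locate the faulting instruction of an oops "Code:" line in a
# disassembly listing. All function names are hypothetical.

CODE_LINE = "[ 794.274311] Code: f946a2c9 12001eea 0b350157 9b1b2789 (39402529)"
DISASSEMBLY = """\
227c7c: 12001eea and w10, w23, #0xff
227c80: 0b350157 add w23, w10, w21, uxtb
227c84: 9b1b2789 madd x9, x28, x27, x9
227c88: 39402529 ldrb w9, [x9,#9]
"""

def parse_code_line(line):
    """Return (opcodes before the fault, faulting opcode) as hex strings."""
    body = line.split("Code:", 1)[1]
    m = re.search(r"\(([0-9a-fA-F]{8})\)", body)
    if m is None:
        raise ValueError("no parenthesised faulting instruction found")
    before = re.findall(r"\b[0-9a-fA-F]{8}\b", body[:m.start()])
    return [w.lower() for w in before], m.group(1).lower()

def parse_disassembly(text):
    """Return a list of (address, opcode) pairs from the listing."""
    rows = []
    for line in text.splitlines():
        m = re.match(r"\s*([0-9a-fA-F]+):\s+([0-9a-fA-F]{8})\b", line)
        if m:
            rows.append((int(m.group(1), 16), m.group(2).lower()))
    return rows

def find_fault_addresses(code_line, disassembly):
    """Return addresses whose opcode and preceding opcodes match the Code line."""
    before, faulting = parse_code_line(code_line)
    rows = parse_disassembly(disassembly)
    hits = []
    for i, (addr, opcode) in enumerate(rows):
        if opcode != faulting:
            continue
        n = min(len(before), i)              # compare only what the listing has
        previous = [op for _, op in rows[i - n:i]]
        if previous == before[len(before) - n:]:
            hits.append(addr)
    return hits

if __name__ == "__main__":
    for addr in find_fault_addresses(CODE_LINE, DISASSEMBLY):
        print(hex(addr))
```

Matching on the preceding opcodes as well as the faulting one reduces the chance of picking the wrong occurrence when the same instruction encoding appears several times in a function.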
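Before doing this, compare the size of the function that the PC points to with the size of the same function in your disassembly. If the two are equal, you can be confident that this .ko or vmlinux matches the image on which the problem occurred. For example, the function that the PC points to below has a size of 0xb10: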
[ 794.235944] XXX_OSD_WindowDestroy+0xb0/0xb10 [xxx.ko]
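When searching the disassembled function for the faulting instruction, you may find more than one match. In that case, either analyze the surrounding assembly to decide which occurrence it is, or, once you have confirmed that the size reported for the function at the PC equals the size of the function in the disassembly, add the offset to the function's base address and use the result to locate the source code. In the example above, the PC is at offset 0xb0 within XXX_OSD_WindowDestroy().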
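The base-plus-offset approach can be sketched as follows in Python. The helper names are hypothetical, and the base address used in the usage part is a placeholder rather than a value from the log; in practice it would come from the symbol table or disassembly of the matching .ko.

```python
import re

# Sketch: parse a "symbol+0xoffset/0xsize [module]" entry from an oops log
# and compute the address to symbolize. All names here are hypothetical.

def parse_symbol_entry(text):
    """Return (symbol, offset, size, module) parsed from an oops log entry."""
    m = re.search(
        r"([\w.]+)\+0x([0-9a-fA-F]+)/0x([0-9a-fA-F]+)(?:\s+\[(\S+)\])?", text)
    if m is None:
        raise ValueError("no symbol+offset/size entry found")
    return m.group(1), int(m.group(2), 16), int(m.group(3), 16), m.group(4)

def address_to_symbolize(base, offset, logged_size, disassembled_size):
    """Return base + offset after checking that the function sizes agree."""
    if logged_size != disassembled_size:
        raise ValueError("function size mismatch: binary may not match the image")
    if offset >= logged_size:
        raise ValueError("offset lies outside the function")
    return base + offset

if __name__ == "__main__":
    entry = "[ 794.235944] XXX_OSD_WindowDestroy+0xb0/0xb10 [xxx.ko]"
    symbol, offset, size, module = parse_symbol_entry(entry)
    placeholder_base = 0x1000   # hypothetical base address, not from the log
    addr = address_to_symbolize(placeholder_base, offset, size, 0xB10)
    print(symbol, module, hex(addr))
```

The resulting address can then be passed to llvm-symbolizer in the same way as an address found by searching the disassembly.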