  • ARM Linux Kernel Panic and Cache Coherency: Cortex-A9 Multi-Core Cache and TLB Coherency Broadcast

     

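    A multi-core Cortex-A9 CPU can receive and execute coherency broadcast operations when broadcasting is enabled and the core is in SMP mode. Using a kernel panic as the example, this article first gives the real cause of the panic and then discusses how cache and TLB coherency broadcast on a multi-core Cortex-A9 should be configured in practice.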

     

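    1 Multi-core Android boot fails

    Kernel version: 3.0.15           CPU: Freescale i.MX6Q (quad-core Cortex-A9)

    Chip feature: supports ARM TrustZone

    Procedure: the primary core, CPU0, boots in Secure mode, switches to NS (Non-secure) mode, and then starts the kernel. The kernel brings up the other three CPUs, which also switch to NS mode, and finally the Android system starts.

    The boot failed, however. It later turned out that the kernel had only panicked and had not hung completely. To confirm the state after the panic, I printed the CPU ID and the timer tick from the do_local_timer function in arch/arm/kernel/smp.c. After the panic this interrupt still fires, and the messages are still printed.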
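
    The heartbeat idea can be sketched in user space as follows. This is not the patch that was applied to the kernel: the once-per-second rate limit, the HZ value, and the helper names heartbeat_due and format_heartbeat are assumptions made for illustration. In the kernel, the print would be a printk() call using __LINE__ and smp_processor_id().

```c
#include <stdio.h>

#define HZ 100 /* assumed tick rate, not taken from the article */

/* Returns 1 on every HZ-th tick, i.e. about once per second. */
int heartbeat_due(unsigned long ticks)
{
    return ticks % HZ == 0;
}

/*
 * Builds the heartbeat line. In the kernel this would be a printk() call,
 * with __LINE__ and smp_processor_id() supplying the two numbers.
 */
int format_heartbeat(char *buf, size_t len, int line, int cpu)
{
    return snprintf(buf, len, "in do_local_timer, line:%d  cpu:%d", line, cpu);
}
```

    If such a line keeps appearing after the panic message, the CPU that panicked is still taking its local timer interrupt.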

    The original log is as follows:

    [   24.707074] request_suspend_state: wakeup (3->0) at 24564020006 (1970-01-02 00:14:26.233322336 UTC)

    [   24.719704] in panic, line:75     cpu:3

    [   24.726012] Kernel panic - not syncing: Attempted to kill init!

    [   24.732338] [<c004c65c>] (unwind_backtrace+0x0/0xfc) from [<c05633a0>] (panic+0x88/0x1b4)

    [   24.740586] [<c05633a0>] (panic+0x88/0x1b4) from [<c0076c60>] (do_exit+0x664/0x710)

    [   24.748278] [<c0076c60>] (do_exit+0x664/0x710) from [<c0076d48>] (do_group_exit+0x3c/0xbc)

    [   24.756581] [<c0076d48>] (do_group_exit+0x3c/0xbc) from [<c0082918>] (get_signal_to_deliver+0x1f8/0x430)

    [   24.766107] [<c0082918>] (get_signal_to_deliver+0x1f8/0x430) from [<c0048fec>] (do_signal+0x94/0x534)

    [   24.775440] [<c0048fec>] (do_signal+0x94/0x534) from [<c00494c4>] (do_notify_resume+0x38/0x44)

    [   24.784162] [<c00494c4>] (do_notify_resume+0x38/0x44) from [<c0046698>] (work_pending+0x24/0x28)

    [   24.792971] CPU1: stopping

    [   24.795700] [<c004c65c>] (unwind_backtrace+0x0/0xfc) from [<c00402d4>] (do_IPI+0x188/0x1bc)

    [   24.804064] [<c00402d4>] (do_IPI+0x188/0x1bc) from [<c004608c>] (__irq_svc+0x4c/0xe8)

    [   24.811895] Exception stack(0xd7551d78 to 0xd7551dc0)

    [   24.816948] 1d60:                                                       4657775f 00000001

    [   24.825130] 1d80: 00000101 00000101 cbf48cc0 d6c4a758 40464000 cbecaee0 4657775f d6c0d190

    [   24.833313] 1da0: c07dde00 0004a466 c0771c80 d7551dc0 c00de290 c0050688 60000113 ffffffff

    [   24.841505] [<c004608c>] (__irq_svc+0x4c/0xe8) from [<c0050688>] (__sync_icache_dcache+0x14/0xa0)

    [   24.850385] [<c0050688>] (__sync_icache_dcache+0x14/0xa0) from [<d6c0d000>] (0xd6c0d000)

    [   24.858482] CPU0: stopping

    [   24.861209] [<c004c65c>] (unwind_backtrace+0x0/0xfc) from [<c00402d4>] (do_IPI+0x188/0x1bc)

    [   24.869573] [<c00402d4>] (do_IPI+0x188/0x1bc) from [<c004608c>] (__irq_svc+0x4c/0xe8)

    [   24.877405] Exception stack(0xd752bec8 to 0xd752bf10)

    [   24.882460] bec0:                   d7610180 00100073 00000000 00000000 d7610180 00000000

    [   24.890642] bee0: 4ae957df d76cf208 cbaed9ec 40083000 d6739000 40082fff 00000001 d752bf10

    [   24.898821] bf00: c00e5d80 c00e5d80 60000013 ffffffff

    [   24.903884] [<c004608c>] (__irq_svc+0x4c/0xe8) from [<c00e5d80>] (mprotect_fixup+0x318/0x410)

    [   24.912417] [<c00e5d80>] (mprotect_fixup+0x318/0x410) from [<c00e5f94>] (sys_mprotect+0x11c/0x1c0)

    [   24.921385] [<c00e5f94>] (sys_mprotect+0x11c/0x1c0) from [<c0046640>] (ret_fast_syscall+0x0/0x30)

    [   24.930264] CPU2: stopping

    [   24.932988] [<c004c65c>] (unwind_backtrace+0x0/0xfc) from [<c00402d4>] (do_IPI+0x188/0x1bc)

    [   24.941349] [<c00402d4>] (do_IPI+0x188/0x1bc) from [<c0046328>] (__irq_usr+0x48/0xe0)

    [   24.949181] Exception stack(0xd75e9fb0 to 0xd75e9ff8)

    [   24.954236] 9fa0:                                     405923ec 401d9688 01010101 07000000

    [   24.962418] 9fc0: 78635f5f 5f5f0076 4058ecb4 4058e344 40590bd4 401d9686 401d9686 40238a60

    [   24.970598] 9fe0: 00005f5f becb84a8 b0003c5b b0001774 60000010 ffffffff

    [   25.073049] in do_local_timer, line:453  cpu:3

    [   26.073044] in do_local_timer, line:453    cpu:3

    Tracing through the kernel shows that the panic follows this execution path.

    work_pending -> do_notify_resume -> do_signal -> get_signal_to_deliver -> do_group_exit -> 

    do_exit -> exit_notify -> forget_original_parent -> find_new_reaper -> panic("Attempted to kill init!");
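
    To make the last step of this chain concrete, here is a simplified user-space model of the decision taken in find_new_reaper. It is a sketch, not the kernel source: struct task, find_live_sibling, and exit_would_panic are hypothetical names, and the real function also handles PID namespaces and locking.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's task_struct (hypothetical). */
struct task {
    int pid;
    bool exiting;      /* models the PF_EXITING flag */
    struct task *next; /* next thread in the same thread group (circular list) */
};

/*
 * When `father` exits, look for another thread of its group that is still
 * alive. Returns that thread, or NULL when none is left.
 */
struct task *find_live_sibling(struct task *father)
{
    struct task *t;

    for (t = father->next; t != father; t = t->next) {
        if (!t->exiting)
            return t;
    }
    return NULL;
}

/*
 * True when the exit of `father` would end in
 * panic("Attempted to kill init!"): the exiting task is init and no other
 * live thread of init can take over as the reaper.
 */
bool exit_would_panic(struct task *father, struct task *init_task)
{
    return father == init_task && find_live_sibling(father) == NULL;
}
```

    In other words, the message means that init itself received a fatal signal and exited. The panic is a symptom, and the reason init was killed has to be found elsewhere.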

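    This path involves thread and process exit as well as the parent-child relationships between threads, and I could not analyze it further at this point.

    How could execution reach the kill init step at all? Since the problem appeared in a multi-core environment, I tried booting the system on a single core and then bringing up the other CPUs manually, as described in the next section.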

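    2 Bringing up the other CPUs manually

    Booting Android on a single core does not crash. At that point, the other CPUs can be brought up manually with a command.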

    echo 1 > /sys/devices/system/cpu/cpu1/online
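
    The same step can be done from a small C program, which is handy when the bring-up has to be repeated in a test loop. This is a minimal sketch: cpu_online_path and cpu_set_online are hypothetical helper names, and the write only succeeds as root on a kernel with CPU hotplug support.

```c
#include <stdio.h>

/* Builds the sysfs path of the hotplug control file for the given CPU. */
int cpu_online_path(char *buf, size_t len, int cpu)
{
    return snprintf(buf, len, "/sys/devices/system/cpu/cpu%d/online", cpu);
}

/*
 * Writes "1" (online) or "0" (offline) to the control file.
 * Returns 0 on success and -1 on failure.
 */
int cpu_set_online(int cpu, int online)
{
    char path[64];
    FILE *f;
    int rc;

    cpu_online_path(path, sizeof(path), cpu);
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    rc = (fputs(online ? "1" : "0", f) == EOF) ? -1 : 0;
    if (fclose(f) == EOF) /* the kernel reports a rejected write here */
        rc = -1;
    return rc;
}
```

    Calling cpu_set_online(1, 1) corresponds to the echo command above.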

    CPU1 then comes up, but after a while the kernel panics again, with the following log.

    [   88.604151] XXXXXXXXXX  in panic, line:75        cpu:0

    [   88.610321] Kernel panic - not syncing: Attempted to kill init!    

    [   88.619172] [<c004c65c>] (unwind_backtrace+0x0/0xfc) from [<c05633a0>] (panic+0x88/0x1b4)

    [   88.627741] [<c05633a0>] (panic+0x88/0x1b4) from [<c0076c60>] (do_exit+0x664/0x710)

    [   88.635424] [<c0076c60>] (do_exit+0x664/0x710) from [<c0076d48>] (do_group_exit+0x3c/0xbc)

    [   88.643713] [<c0076d48>] (do_group_exit+0x3c/0xbc) from [<c0082918>] (get_signal_to_deliver+0x1f8/0x430)

    root@android:/ # [   88.653215] [<c0082918>] (get_signal_to_deliver+0x1f8/0x430) from [<c0048fec>] (do_signal+0x94/0x534)

    [   88.663905] [<c0048fec>] (do_signal+0x94/0x534) from [<c00494c4>] (do_notify_resume+0x38/0x44)

    [   88.672545] [<c00494c4>] (do_notify_resume+0x38/0x44) from [<c0046698>] (work_pending+0x24/0x28)

    [   88.681352] CPU1: stopping

    [   88.684082] [<c004c65c>] (unwind_backtrace+0x0/0xfc) from [<c00402d4>] (do_IPI+0x188/0x1bc)

    [   88.692449] [<c00402d4>] (do_IPI+0x188/0x1bc) from [<c004608c>] (__irq_svc+0x4c/0xe8)

    [   88.700281] Exception stack(0xd2cf1f90 to 0xd2cf1fd8)

    [   88.705337] 1f80:                                     00000020 c0771aa4 d2cf1fd8 00000000

    [   88.713520] 1fa0: d2cf0000 c07d0624 c0567c74 c077a0f4 1000406a 412fc09a 00000000 00000000

    [   88.721702] 1fc0: 00000001 d2cf1fd8 c0053aec c00471dc 60000013 ffffffff

    [   88.728328] [<c004608c>] (__irq_svc+0x4c/0xe8) from [<c00471dc>] (default_idle+0x24/0x28)

    [   88.736514] [<c00471dc>] (default_idle+0x24/0x28) from [<c00475b4>] (cpu_idle+0xbc/0xfc)

    [   88.744612] [<c00475b4>] (cpu_idle+0xbc/0xfc) from [<10560094>] (0x10560094)

    [   89.321214] in do_local_timer, line:453    cpu:0

    [   90.291213] in do_local_timer, line:453    cpu:0

    [   91.291213] in do_local_timer, line:453    cpu:0

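    The panic information is the same as in the previous section: the same path, ending at the kill init step.

    3 Why multi-core SMP panics

    Since the problem can be pinned on multi-core operation, the only option was to go through the multi-core related registers carefully.

    3.1 Non-secure Access Control Register

    The NSACR register is described in the figure below. It is read/write in Secure (S) mode and read-only in NS mode.

    Its NS_SMP bit determines whether the SMP bit of the Auxiliary Control Register can be modified in NS mode.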

     

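    3.2 Auxiliary Control Register

    The Auxiliary Control Register (ACTLR) is described below. The relevant points are:

    Coherency mode, SMP or AMP;

    Broadcast of cache, branch predictor, and TLB coherency operations.

    Read/write in S mode;

    In NS mode it is read-only if NSACR.NS_SMP is 0. If that bit is 1, it is read/write in NS mode, but in that case writes to every bit other than the SMP bit are ignored.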
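
    The access rules above can be modelled with a few lines of C. This is only a sketch of the described behaviour, not hardware code: actlr_after_ns_write is a hypothetical name, and the bit positions (NSACR.NS_SMP at bit 18, ACTLR.SMP at bit 6, ACTLR.FW at bit 0) are taken from the Cortex-A9 TRM rather than from this article, so check them against your TRM revision.

```c
#include <stdint.h>

#define NSACR_NS_SMP (1u << 18) /* NSACR: NS may modify ACTLR.SMP (assumed bit) */
#define ACTLR_SMP    (1u << 6)  /* ACTLR: core takes part in coherency (assumed bit) */
#define ACTLR_FW     (1u << 0)  /* ACTLR: broadcast cache/TLB maintenance (assumed bit) */

/*
 * Models what ACTLR holds after a Non-secure write of `value`.
 * NS_SMP == 0: the register is read-only in NS mode, nothing changes.
 * NS_SMP == 1: only the SMP bit takes the written value.
 */
uint32_t actlr_after_ns_write(uint32_t actlr, uint32_t nsacr, uint32_t value)
{
    if (!(nsacr & NSACR_NS_SMP))
        return actlr;
    return (actlr & ~ACTLR_SMP) | (value & ACTLR_SMP);
}
```

    Under this model, an NS-side write of SMP and FW together changes only SMP, so FW keeps whatever value the Secure side left in it.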

     

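    According to this register's description, regardless of whether its FW bit is set, the CPU can send and receive coherency requests for Inner Shared Write-Back Write-Allocate accesses to and from the other CPUs in the same cluster.

    The implication, as I understand it: the SMP bit by itself only covers these data coherency requests, so if the SMP bit is set, the FW bit must be set as well for cache and TLB maintenance operations to be broadcast.

    Based on this guess and the register description above, the CPUs are configured as follows.

    In S mode, first set the NS_SMP bit of NSACR to 1, then set the SMP and FW bits of the Auxiliary Control Register to 1. After the switch to NS mode, the SMP bit of the Auxiliary Control Register can still be modified, and its FW bit is already 1.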
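
    A minimal sketch of this Secure-side setup is shown below. It is not the code that was used on the board: the function names are hypothetical, the bit positions (NS_SMP at bit 18, SMP at bit 6, FW at bit 0) come from the Cortex-A9 TRM rather than from this article, and the CP15 accessors only compile for a 32-bit ARMv7 target with a GCC-compatible compiler.

```c
#include <stdint.h>

#define NSACR_NS_SMP (1u << 18) /* assumed bit position, see the Cortex-A9 TRM */
#define ACTLR_SMP    (1u << 6)  /* assumed bit position */
#define ACTLR_FW     (1u << 0)  /* assumed bit position */

/* Pure helpers: compute the new register values from the current ones. */
uint32_t nsacr_allow_ns_smp(uint32_t nsacr)
{
    return nsacr | NSACR_NS_SMP;
}

uint32_t actlr_enable_smp_fw(uint32_t actlr)
{
    return actlr | ACTLR_SMP | ACTLR_FW;
}

#if defined(__arm__) && defined(__GNUC__) && (__ARM_ARCH >= 7)
/* Must run in a Secure privileged mode on each core, before the switch to NS. */
void secure_smp_setup(void)
{
    uint32_t v;

    __asm__ volatile("mrc p15, 0, %0, c1, c1, 2" : "=r"(v));  /* read NSACR */
    v = nsacr_allow_ns_smp(v);
    __asm__ volatile("mcr p15, 0, %0, c1, c1, 2" : : "r"(v)); /* write NSACR */

    __asm__ volatile("mrc p15, 0, %0, c1, c0, 1" : "=r"(v));  /* read ACTLR */
    v = actlr_enable_smp_fw(v);
    __asm__ volatile("mcr p15, 0, %0, c1, c0, 1" : : "r"(v)); /* write ACTLR */
    __asm__ volatile("isb" : : : "memory");
}
#endif
```

    The order follows the text: NSACR first, then the Auxiliary Control Register. Because FW cannot be changed from NS mode, it has to be set here, on every core, before that core leaves Secure mode.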

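    With this configuration, multi-core Android boot succeeded and the system no longer panicked.

    4 How to approach follow-up problems

    The problem above was solved by modifying registers once it had been pinned on multi-core operation.

    As for how to trace from the panic's kill init message and deduce that mishandled cache coherency is what finally crashed the kernel, I have no good approach.

    That is, when the following appears: Kernel panic - not syncing: Attempted to kill init!

    I still have not found a fundamental way to tackle this kind of problem.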

  • Original URL: https://www.cnblogs.com/fozu/p/4582213.html