  • Crash caused by a hardware error

    [683650.031028] BUG: unable to handle kernel paging request at 000000000001b790-----------------------------bad address
    [683650.031060] IP: [<ffffffff94b13520>] native_queued_spin_lock_slowpath+0x110/0x200
    [683650.031089] PGD 8000005f702d4067 PUD 510c27a067 PMD 0
    [683650.031109] Oops: 0002 [#1] SMP
    [683650.031567] CPU: 40 PID: 474165 Comm: dfget Kdump: loaded Tainted: G ------------ T 3.10.0-957.27.2.el7.x86_64 #1
    [683650.031599] Hardware name: Dell Inc. PowerEdge R640/0RJCR7, BIOS 1.6.13 12/17/2018
    [683650.031621] task: ffff8d37d86ac100 ti: ffff8d4adbbdc000 task.ti: ffff8d4adbbdc000
    [683650.031643] RIP: 0010:[<ffffffff94b13520>] [<ffffffff94b13520>] native_queued_spin_lock_slowpath+0x110/0x200
    [683650.031673] RSP: 0018:ffff8d4adbbdfea0 EFLAGS: 00010006
    [683650.031689] RAX: 000000000000175f RBX: ffff8d37d86ac100 RCX: 0000000001410000
    [683650.031709] RDX: 000000000001b790 RSI: 00000000bafb5140 RDI: ffff8d88afab1048----------this is sighand_struct.siglock
    [683650.031729] RBP: ffff8d4adbbdfea0 R08: ffff8d58bdf1b780 R09: 0000000000000000
    [683650.031749] R10: 0000000000000008 R11: 0000000000000206 R12: ffff8d4adbbdfef0
    [683650.031769] R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000000
    [683650.031789] FS: 00007fb5b67fc700(0000) GS:ffff8d58bdf00000(0000) knlGS:0000000000000000
    [683650.031812] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [683650.031828] CR2: 000000000001b790 CR3: 0000005f757a8000 CR4: 00000000007607e0
    [683650.031849] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [683650.031869] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [683650.031889] PKRU: 55555554
    [683650.031898] Call Trace:
    [683650.031912] [<ffffffff9515e2cb>] queued_spin_lock_slowpath+0xb/0xf
    [683650.031935] [<ffffffff9516c6a8>] _raw_spin_lock_irq+0x28/0x30
    [683650.031955] [<ffffffff94ab0a97>] __set_current_blocked+0x37/0x70
    [683650.031974] [<ffffffff94ab0c67>] sigprocmask+0x77/0xb0
    [683650.032768] [<ffffffff94ab0d32>] SyS_rt_sigprocmask+0x92/0x100
    [683650.033480] [<ffffffff95176ddb>] system_call_fastpath+0x22/0x27
    [683650.034184] Code: 87 47 02 c1 e0 10 45 31 c9 85 c0 74 44 48 89 c2 c1 e8 13 48 c1 ea 0d 48 98 83 e2 30 48 81 c2 80 b7 01 00 48 03 14 c5 a0 bf 74 95 <4c> 89 02 41 8b 40 08 85 c0 75 0f 0f 1f 44 00 00 f3 90 41 8b 40
    [683650.035684] RIP [<ffffffff94b13520>] native_queued_spin_lock_slowpath+0x110/0x200
    [683650.036404] RSP <ffff8d4adbbdfea0>
    [683650.037151] CR2: 000000000001b790

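    Analysis against the kernel source shows that the value of the corresponding sighand_struct.siglock is wrong, and that is what triggered the fault.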

    crash> sighand_struct.siglock 0xffff8d88afab0840 -x
      siglock = {
        {
          rlock = {
            raw_lock = {
              val = {
                counter = 0x1415140
              }
            }
          }
        }
      }

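    According to the qspinlock layout, the lowest 8 bits (the locked byte) should be 1 while the lock is held; if another CPU is waiting for the lock, the pending bit should be 1; and above those, the tail field records the number of the queued waiting CPU. Here the low byte is 0x40, which is not a valid locked value, so from years of crash-debugging experience this value is clearly bogus.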

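    Because sighand_struct (and therefore its siglock) is shared by all threads of the process, compare the crashing thread's sighand pointer with those of its sibling threads: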

    crash> task_struct.sighand ffff8d88983da080
    sighand = 0xffff8d88afeb0840-------------------------e is 1110
    crash> task_struct.sighand ffff8d88983d8000
    sighand = 0xffff8d88afeb0840
    crash> task_struct.sighand ffff8d37d86ac100
    sighand = 0xffff8d88afab0840------------------------a is 1010

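    Looking closely, a bit has flipped: the crashing thread's sighand pointer differs from its siblings' in a single bit (e = 1110 vs a = 1010), so the siglock it tries to take lies in the wrong memory, which explains the garbage lock value above. Waiting on a lock like this is like waiting for a ship at a bus stop; maybe only in Wuhan does one occasionally show up.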

    Also, no error corresponding to this bit flip was found in either the IPMI logs or the MCE records.

  • Original post: https://www.cnblogs.com/10087622blog/p/12149572.html