zoukankan      html  css  js  c++  java
  • oom_kill_process造成数据库挂起并出现found dead shared server

    这篇博客是上一篇博客Oracle shutdown immediate遭遇ORA-24324 ORA-24323 ORA-01089的延伸(数据库挂起hang时,才去重启的),其实这是我们海外一工厂的遇到的案例,把内容拆开是因为这个case分开讲述显得主题明确一些。正式进入主题:

     

    服务器数据库版本Oracle Database 10g Release 10.2.0.4.0,操作系统为Red Hat Enterprise Linux Server release 5.7,虚拟机。当时告警日志里面出现大量的found dead shared server这里信息。数据库也出现连接不上的情况

     

    found dead shared server 'S016', pid = (35, 23)
    found dead shared server 'S023', pid = (42, 1)
    Fri Aug  5 10:28:48 2016
    found dead shared server 'S013', pid = (32, 110)
    found dead shared server 'S021', pid = (40, 1)
    Fri Aug  5 10:33:53 2016
    found dead shared server 'S012', pid = (31, 132)
    found dead shared server 'S023', pid = (38, 3)
    Fri Aug  5 10:38:55 2016
    found dead shared server 'S013', pid = (32, 111)
    found dead shared server 'S022', pid = (42, 3)
    Fri Aug  5 10:40:53 2016
    found dead shared server 'S020', pid = (39, 4)
    found dead shared server 'S021', pid = (40, 3)
    failed to start shared server, oer=0

    clip_image001

     

    通过检查发现系统内存耗尽,出现了oom_kill 。OOMkiller,即out of memory killer,是linux下面当内存耗尽时的的一种处理机制。当内存较少时,OOM会遍历整个进程链表,然后根据进程的内存使用情况以及它的oom score值最终找到得分较高的进程,然后发送kill信号将其杀掉。

     

    clip_image002

    # grep -i kill /var/log/messages | more
    Aug  5 10:12:10 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
    Aug  5 10:12:10 xxxxx kernel:  [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
    Aug  5 10:12:11 xxxxx kernel: Out of memory: kill process 21687 (oracle) score 2296119 or a child
    Aug  5 10:12:11 xxxxx kernel: Killed process 21687 (oracle)
    Aug  5 10:12:11 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
    Aug  5 10:12:11 xxxxx kernel:  [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
    Aug  5 10:12:11 xxxxx kernel: Out of memory: kill process 21668 (oracle) score 2144517 or a child
    Aug  5 10:12:11 xxxxx kernel: Killed process 21668 (oracle)
    Aug  5 10:23:09 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
    Aug  5 10:23:09 xxxxx kernel:  [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
    Aug  5 10:23:09 xxxxx kernel: Out of memory: kill process 21756 (oracle) score 2144517 or a child
    Aug  5 10:23:09 xxxxx kernel: Killed process 21756 (oracle)
    Aug  5 10:23:09 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
    Aug  5 10:23:09 xxxxx kernel:  [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
    Aug  5 10:23:09 xxxxx kernel: Out of memory: kill process 21732 (oracle) score 2138384 or a child
    Aug  5 10:23:09 xxxxx kernel: Killed process 21732 (oracle)
    Aug  5 10:28:08 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
    Aug  5 10:28:08 xxxxx kernel:  [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
    Aug  5 10:28:09 xxxxx kernel: Out of memory: kill process 21752 (oracle) score 2144521 or a child
    Aug  5 10:28:09 xxxxx kernel: Killed process 21752 (oracle)
    Aug  5 10:28:09 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
    Aug  5 10:28:09 xxxxx kernel:  [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
    Aug  5 10:28:09 xxxxx kernel: Out of memory: kill process 21722 (oracle) score 2138377 or a child
    Aug  5 10:28:09 xxxxx kernel: Killed process 21722 (oracle)
    Aug  5 10:32:24 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
    Aug  5 10:32:24 xxxxx kernel:  [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
    Aug  5 10:32:24 xxxxx kernel: Out of memory: kill process 21718 (oracle) score 2135307 or a child
    Aug  5 10:32:24 xxxxx kernel: Killed process 21718 (oracle)
    Aug  5 10:32:24 xxxxx kernel: gdm-rh-security invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
    Aug  5 10:32:24 xxxxx kernel:  [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
    Aug  5 10:32:24 xxxxx kernel: Out of memory: kill process 22053 (oracle) score 2135300 or a child
    Aug  5 10:32:24 xxxxx kernel: Killed process 22053 (oracle)
    Aug  5 10:37:54 xxxxx kernel: beremote invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
    Aug  5 10:37:54 xxxxx kernel:  [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
    Aug  5 10:37:54 xxxxx kernel: Out of memory: kill process 22238 (oracle) score 2134274 or a child
    Aug  5 10:37:54 xxxxx kernel: Killed process 22238 (oracle)
    Aug  5 10:37:54 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
    Aug  5 10:37:54 xxxxx kernel:  [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
    Aug  5 10:37:54 xxxxx kernel: Out of memory: kill process 22128 (oracle) score 2133001 or a child
    --More--

    clip_image003

     

    从上面可以看到大量的ORACLE的进程被kill掉,从而导致ORACLE出现"found dead shared server 'S016', pid = (35, 23)"这类错误,在官方文档Found Dead Shared Server Messages Reported In Alert.Log (文档 ID 760872.1) 有如下介绍(这个文档较老旧,不过原理依然适用于此处环境):

     

    SYMPTOMS


    The following is being reported in the alert.log file

    Mon Dec 22 16:48:31 2008
    found dead shared server 'S004', pid = (13, 1)
    found dead shared server 'S001', pid = (10, 1)

    No further errors accompany those messages in the alert.log file.

    Listener.log shows no errors.

    CAUSE

    When a session is killed, in some circumstances, the shared server could die as well leading to those messages being reported in the alert.log file.
    This has been verified by << Bug 4270723 >> FOUND DEAD SHARED SERVER

     

    也就是说当一个会话被杀,在某些情况下, shared sever进程会死掉,导致上面“found dead shared server”出现在告警日志中。 这个案例中,由于系统大量kill掉会话进程,导致shared server进程死掉。所以Oracle数据库出现无法访问的情况。突然挂起。

     

    $grep "Out of memory" messages
    Aug  5 10:12:11 xxxxx kernel: Out of memory: kill process 21687 (oracle) score 2296119 or a child
    Aug  5 10:12:11 xxxxx kernel: Out of memory: kill process 21668 (oracle) score 2144517 or a child
    Aug  5 10:23:09 xxxxx kernel: Out of memory: kill process 21756 (oracle) score 2144517 or a child
    Aug  5 10:23:09 xxxxx kernel: Out of memory: kill process 21732 (oracle) score 2138384 or a child
    Aug  5 10:28:09 xxxxx kernel: Out of memory: kill process 21752 (oracle) score 2144521 or a child
    Aug  5 10:28:09 xxxxx kernel: Out of memory: kill process 21722 (oracle) score 2138377 or a child
    Aug  5 10:32:24 xxxxx kernel: Out of memory: kill process 21718 (oracle) score 2135307 or a child
    Aug  5 10:32:24 xxxxx kernel: Out of memory: kill process 22053 (oracle) score 2135300 or a child
    Aug  5 10:37:54 xxxxx kernel: Out of memory: kill process 22238 (oracle) score 2134274 or a child
    Aug  5 10:37:54 xxxxx kernel: Out of memory: kill process 22128 (oracle) score 2133001 or a child
    Aug  5 10:38:46 xxxxx kernel: Out of memory: kill process 22227 (oracle) score 2132996 or a child
    Aug  5 10:39:14 xxxxx kernel: Out of memory: kill process 22229 (oracle) score 2134280 or a child
    Aug  5 10:40:57 xxxxx kernel: Out of memory: kill process 22286 (oracle) score 2135299 or a child
    Aug  5 10:41:24 xxxxx kernel: Out of memory: kill process 22245 (oracle) score 2135302 or a child
    Aug  5 10:41:25 xxxxx kernel: Out of memory: kill process 22485 (oracle) score 2133009 or a child
    Aug  5 10:41:56 xxxxx kernel: Out of memory: kill process 21779 (oracle) score 2132880 or a child
    Aug  5 10:42:08 xxxxx kernel: Out of memory: kill process 22068 (oracle) score 2132873 or a child
    Aug  5 10:42:26 xxxxx kernel: Out of memory: kill process 22249 (oracle) score 2132873 or a child
    Aug  5 10:42:26 xxxxx kernel: Out of memory: kill process 22278 (oracle) score 2132873 or a child
    Aug  5 10:42:31 xxxxx kernel: Out of memory: kill process 21662 (oracle) score 2132872 or a child
    Aug  5 10:42:47 xxxxx kernel: Out of memory: kill process 22045 (oracle) score 2132872 or a child
    Aug  5 10:42:57 xxxxx kernel: Out of memory: kill process 22314 (oracle) score 2132872 or a child
    Aug  5 10:43:35 xxxxx kernel: Out of memory: kill process 22336 (oracle) score 2132872 or a child
    Aug  5 10:43:35 xxxxx kernel: Out of memory: kill process 22435 (oracle) score 2132870 or a child
    Aug  5 10:43:55 xxxxx kernel: Out of memory: kill process 21666 (oracle) score 2132869 or a child
    Aug  5 10:44:02 xxxxx kernel: Out of memory: kill process 22263 (oracle) score 2132869 or a child
    Aug  5 10:44:19 xxxxx kernel: Out of memory: kill process 22405 (oracle) score 2132866 or a child
    Aug  5 10:44:20 xxxxx kernel: Out of memory: kill process 22438 (oracle) score 2132866 or a child
    Aug  5 10:44:20 xxxxx kernel: Out of memory: kill process 22453 (oracle) score 2132865 or a child
    Aug  5 10:44:23 xxxxx kernel: Out of memory: kill process 22466 (oracle) score 2132737 or a child
    Aug  5 10:44:26 xxxxx kernel: Out of memory: kill process 22499 (oracle) score 2132607 or a child
    Aug  5 10:44:27 xxxxx kernel: Out of memory: kill process 21716 (oracle) score 1078417 or a child
    Aug  5 10:44:27 xxxxx kernel: Out of memory: kill process 21670 (oracle) score 1066455 or a child
    Aug  5 10:48:02 xxxxx kernel: Out of memory: kill process 22829 (oracle) score 2134273 or a child
    Aug  5 10:49:47 xxxxx kernel: Out of memory: kill process 22900 (oracle) score 2133007 or a child
    Aug  5 10:50:36 xxxxx kernel: Out of memory: kill process 22842 (oracle) score 2133095 or a child
    Aug  5 10:51:25 xxxxx kernel: Out of memory: kill process 22990 (oracle) score 2134285 or a child
    Aug  5 10:51:25 xxxxx kernel: Out of memory: kill process 23054 (oracle) score 2132994 or a child
    Aug  5 10:51:49 xxxxx kernel: Out of memory: kill process 22933 (oracle) score 2134277 or a child
    Aug  5 10:51:49 xxxxx kernel: Out of memory: kill process 23103 (oracle) score 2132996 or a child
    Aug  5 10:52:52 xxxxx kernel: Out of memory: kill process 23211 (oracle) score 2134267 or a child

     

    在官方文档Oracle VM Server hangs after Invoking the OOM Killer and having hundreds of kpartx processes spawned and "state S non-preferred supports toluSnA" reported on the FC LUNs (文档 ID 2123877.1)介绍了这么一个案例

     

    APPLIES TO:

    Oracle VM - Version 3.3.2 and later
    Linux x86-64

    SYMPTOMS

    On Oracle VM Server, during normal server operation, the server suddenly hangs, with the following kind of messages being logged about invoking the Out Of Memory Killer :

    $ grep "Out of memory" messages
    Mar 19 09:17:50 ovs01 kernel: Out of memory: Kill process 7382 (xend) score 2 or sacrifice child
    Mar 19 09:17:51 ovs01 kernel: Out of memory: Kill process 7382 (xend) score 2 or sacrifice child
    Mar 19 09:17:56 ovs01 kernel: Out of memory: Kill process 7425 (xend) score 2 or sacrifice child
    Mar 19 09:17:57 ovs01 kernel: Out of memory: Kill process 7425 (xend) score 2 or sacrifice child
    Mar 19 09:17:58 ovs01 kernel: Out of memory: Kill process 7425 (xend) score 2 or sacrifice child
    Mar 19 09:17:58 ovs01 kernel: Out of memory: Kill process 7425 (xend) score 2 or sacrifice child
    Mar 19 09:18:38 ovs01 kernel: Out of memory: Kill process 7557 (xend) score 2 or sacrifice child
    Mar 19 09:19:21 ovs01 kernel: Out of memory: Kill process 17926 (mounted.ocfs2) score 1 or sacrifice child

    Stack traces like this one are also being printed on the console :

    Mar 19 09:17:50 ovs01 kernel: Out of memory: Kill process 7382 (xend) score 2 or sacrifice child
    Mar 19 09:17:50 ovs01 kernel: Killed process 7422 (sh) total-vm:5512kB, anon-rss:40kB, file-rss:108kB
    Mar 19 09:17:51 ovs01 kernel: snmpd invoked oom-killer: gfp_mask=0x2800d0, order=0, oom_score_adj=0
    Mar 19 09:17:51 ovs01 kernel: snmpd cpuset=/ mems_allowed=0
    Mar 19 09:17:51 ovs01 kernel: Pid: 6602, comm: snmpd Tainted: G W 3.8.13-55.1.8.el6uek.x86_64 #2
    Mar 19 09:17:51 ovs01 kernel: Call Trace:
    Mar 19 09:17:51 ovs01 kernel: [<ffffffff810d6bba>] ? cpuset_print_task_mems_allowed+0xba/0xe0
    Mar 19 09:17:51 ovs01 kernel: [<ffffffff81133c43>] dump_header+0x93/0x1f0
    Mar 19 09:17:51 ovs01 kernel: [<ffffffff8128d330>] ? ___ratelimit+0xa0/0x120
    Mar 19 09:17:51 ovs01 kernel: [<ffffffff8159a43a>] ? error_exit+0x2a/0x60
    Mar 19 09:17:51 ovs01 kernel: [<ffffffff811343da>] oom_kill_process+0x20a/0x350
    Mar 19 09:17:51 ovs01 kernel: [<ffffffff8122b735>] ? security_capable_noaudit+0x15/0x20
    Mar 19 09:17:51 ovs01 kernel: [<ffffffff8106bcc5>] ? has_ns_capability_noaudit+0x15/0x20
    Mar 19 09:17:51 ovs01 kernel: [<ffffffff8106bce7>] ? has_capability_noaudit+0x17/0x20
    Mar 19 09:17:51 ovs01 kernel: [<ffffffff811347a6>] out_of_memory+0x286/0x2d0
    Mar 19 09:17:51 ovs01 kernel: [<ffffffff8113a1a2>] __alloc_pages_slowpath+0x742/0x7c0
    Mar 19 09:17:51 ovs01 kernel: [<ffffffff8113a51b>] __alloc_pages_nodemask+0x2fb/0x320
    Mar 19 09:17:51 ovs01 kernel: [<ffffffff8117ec27>] kmem_getpages+0x67/0x1c0
    Mar 19 09:17:51 ovs01 kernel: [<ffffffff8117fa99>] fallback_alloc+0x189/0x250
    Mar 19 09:17:51 ovs01 kernel: [<ffffffff8117f85a>] ____cache_alloc_node+0x9a/0x150
    Mar 19 09:17:51 ovs01 kernel: [<ffffffff8118102f>] kmem_cache_alloc+0x15f/0x2f0
    Mar 19 09:17:51 ovs01 kernel: [<ffffffff811fa370>] proc_alloc_inode+0x20/0xa0
    Mar 19 09:17:51 ovs01 kernel: [<ffffffff811adec7>] alloc_inode+0x27/0xa0
    Mar 19 09:17:51 ovs01 kernel: [<ffffffff811adf5b>] new_inode_pseudo+0x1b/0x70
    Mar 19 09:17:51 ovs01 kernel: [<ffffffff811adfcd>] new_inode+0x1d/0x40
    Mar 19 09:17:51 ovs01 kernel: [<ffffffff811fc648>] proc_pid_make_inode+0x28/0x100
    Mar 19 09:17:51 ovs01 kernel: [<ffffffff811fdbbb>] proc_pid_instantiate+0x1b/0xd0
    Mar 19 09:17:51 ovs01 kernel: [<ffffffff811fc2d6>] proc_fill_cache+0x126/0x150
    Mar 19 09:17:51 ovs01 kernel: [<ffffffff811a58f0>] ? sys_ioctl+0xb0/0xb0
    Mar 19 09:17:51 ovs01 kernel: [<ffffffff811fdba0>] ? pid_revalidate+0x120/0x120
    Mar 19 09:17:51 ovs01 kernel: [<ffffffff811a58f0>] ? sys_ioctl+0xb0/0xb0
    Mar 19 09:17:51 ovs01 kernel: [<ffffffff811a58f0>] ? sys_ioctl+0xb0/0xb0
    Mar 19 09:17:51 ovs01 kernel: [<ffffffff811fe25e>] proc_pid_readdir+0xfe/0x170
    Mar 19 09:17:51 ovs01 kernel: [<ffffffff811a58f0>] ? sys_ioctl+0xb0/0xb0
    Mar 19 09:17:51 ovs01 kernel: [<ffffffff811a58f0>] ? sys_ioctl+0xb0/0xb0
    Mar 19 09:17:51 ovs01 kernel: [<ffffffff811fae3a>] proc_root_readdir+0x4a/0x60
    Mar 19 09:17:51 ovs01 kernel: [<ffffffff811a58f0>] ? sys_ioctl+0xb0/0xb0
    Mar 19 09:17:51 ovs01 kernel: [<ffffffff811a5ab8>] vfs_readdir+0xb8/0xd0
    Mar 19 09:17:51 ovs01 kernel: [<ffffffff811a5c65>] sys_getdents+0x95/0x110
    Mar 19 09:17:51 ovs01 kernel: [<ffffffff815a2599>] system_call_fastpath+0x16/0x1b

    In the processes dump before the Out Of Memory Killer starts killing processes, there are hundreds of kpartx instances :

    $ grep -c kpartx messages
    124438

    Error messages like the following one are also being reported :

    Mar 13 03:59:20 ovs01 kernel: sd 12:0:2:98: alua: port group 01 state S non-preferred supports toluSnA
    Mar 13 03:59:40 ovs01 kernel: sd 11:0:3:98: alua: port group 01 state S non-preferred supports toluSnA
    Mar 13 03:59:55 ovs01 kernel: sd 12:0:2:98: alua: port group 01 state S non-preferred supports toluSnA
    Mar 13 04:00:13 ovs01 kernel: sd 11:0:3:98: alua: port group 01 state S non-preferred supports toluSnA
    Mar 13 04:00:24 ovs01 kernel: sd 12:0:2:98: alua: port group 01 state S non-preferred supports toluSnA
    Mar 13 04:00:58 ovs01 kernel: sd 11:0:3:98: alua: port group 01 state S non-preferred supports toluSnA
    Mar 13 04:01:09 ovs01 kernel: sd 12:0:2:98: alua: port group 01 state S non-preferred supports toluSnA
    Mar 13 04:01:34 ovs01 kernel: sd 11:0:3:98: alua: port group 01 state S non-preferred supports toluSnA
    Mar 13 04:01:46 ovs01 kernel: sd 12:0:2:98: alua: port group 01 state S non-preferred supports toluSnA

     

    CAUSE

    Issue(s) on the Fiber Channel Storage Server which is hosting the LUNs used by the Oracle VM Server.
     

    SOLUTION

    To resolve this issue, please resolve the issue on the Fiber Channel Storage Array - The issue on the VM Server will then resolve as the issues on the Storage is fixed.

    跟这个案例非常类似,猜测就有可能就是这个原因造成的,需要海外那边的系统管理员去检查、处理。哎,但是一直没给我反馈。有时候要帮他们看着这些属于他们职责范围内的事情,还得不到有效支持,真心觉得累!

  • 相关阅读:
    cs231n --- 3 : Convolutional Neural Networks (CNNs / ConvNets)
    cs231n --- 1:线性svm与softmax
    Python的下划线_
    Python的类(class)
    python self
    MFC中应用对象的成员:窗口指针m_pMainWnd说明
    MSTP+VRRP组合组网
    VRRP组网下同网段内配置基于全局地址池的DHCP服务器
    路由器/交换机Console口登录密码丢失后如何恢复
    交换机处于同一网络作为DHCP中继与服务器
  • 原文地址:https://www.cnblogs.com/kerrycode/p/5743617.html
Copyright © 2011-2022 走看看