  • OEM error "Failed to connect to ASM instance. The connection is closed: The connection is closed": diagnosis and fix


    Preface

    On the principle of getting to the bottom of every error that shows up, here is an alert raised by our recently deployed OEM 13c:

    Host=xxxxx1 
    Target type=Automatic Storage Management 
    Target name=+ASM1_xxxxx1 
    Categories=Availability 
    Message=Failed to connect to ASM instance. The connection is closed: The connection is closed 
    Severity=Fatal 
    Event reported time=Aug 9, 2020 10:08:18 AM CST 
    Operating System=Linux
    Platform=x86_64
    Associated Incident Id=88 
    Associated Incident Status=New 
    Associated Incident Owner= 
    Associated Incident Acknowledged By Owner=No 
    Associated Incident Priority=None 
    Associated Incident Escalation Level=0 
    Event Type=Target Availability 
    Event name=Status 
    Availability status=Down
    Root Cause Analysis Status=Neither Cause Nor Symptom 
    Causal analysis result=Neither a cause nor a symptom 
    Rule Name=Incident management rule set for all targets,Incident creation rule for a Target Down availability status 
    Rule Owner=System Generated 
    Update Details:
    Failed to connect to ASM instance. The connection is closed: The connection is closed
    Incident created by rule (Name = Incident management rule set for all targets, Incident creation rule for a Target Down availability status [System generated rule]).

    As usual, a Baidu search turned up nothing useful.

    A search on My Oracle Support (MOS), however, found a matching note:

    EM 13c: Enterprise Manager 13.2 Cloud Control ASM Incident Reported with Message=Failed To Connect To ASM Instance. The Connection Is Closed: The Connection Is Closed (Doc ID 2251591.1)

    The note states that this is a bug.

    Verification

    According to the note, gcagent.log contains errors like the following (example):

    [65336:GC.Executor.126 (osm_instance:+ASM__host.company.com:ofs_performance_metrics) (osm_instance:+ASM__host.company.com:ofs_performance_metrics:Instance_Volume_Performance)] ERROR - The connection is closed: The connection is closed
    java.sql.SQLException: The connection is closed: The connection is closed
    at oracle.ucp.util.UCPErrorHandler.newSQLException(UCPErrorHandler.java:464)
    at oracle.ucp.util.UCPErrorHandler.newSQLException(UCPErrorHandler.java:448)
    at oracle.ucp.jdbc.proxy.JDBCConnectionProxyFactory.invoke(JDBCConnectionProxyFactory.java:307)
    at oracle.ucp.jdbc.proxy.ConnectionProxyFactory.invoke(ConnectionProxyFactory.java:50)
    at com.sun.proxy.$Proxy27.prepareCall(Unknown Source)

    On the monitored host (agent side), the log is at the following location:

    [oracle@xxxxx1 log]$ ll $AGENT_HOME/sysman/log/gcagent.log
    -rw-r----- 1 oracle oinstall 960998 Aug 10 14:20 /u01/app/oem13c/agent/agent_inst/sysman/log/gcagent.log

    Checking the log shows that similar entries are indeed present:

    2020-08-09 10:08:18,645 [99899:GC.Executor.23807 (osm_instance:+ASM1_xxxxx1:Response) (osm_instance:+ASM1_xxxxx1:Response:Response)] ERROR - The connection is closed: The connection is closed
    java.sql.SQLException: The connection is closed: The connection is closed
            at oracle.ucp.util.UCPErrorHandler.newSQLException(UCPErrorHandler.java:464)
            at oracle.ucp.util.UCPErrorHandler.newSQLException(UCPErrorHandler.java:448)
            at oracle.ucp.jdbc.proxy.JDBCConnectionProxyFactory.invoke(JDBCConnectionProxyFactory.java:307)
            at oracle.ucp.jdbc.proxy.ConnectionProxyFactory.invoke(ConnectionProxyFactory.java:50)
            at com.sun.proxy.$Proxy31.prepareCall(Unknown Source)
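To see how often the error occurs and for which target and metric, the log can be summarized with standard tools. The following is a minimal sketch, not part of the MOS note: the function name `count_closed_errors` is made up for illustration, and it assumes the log lines have the same layout as the entries shown above.

```shell
# Summarize "The connection is closed" errors in an agent log,
# grouped by the last "(osm_instance:...)" tag on each ERROR line.
# Usage: count_closed_errors "$AGENT_HOME/sysman/log/gcagent.log"
count_closed_errors() {
    grep 'ERROR - The connection is closed' "$1" |
        sed -n 's/.*(\(osm_instance:[^)]*\)).*/\1/p' |
        sort | uniq -c
}
```

Each output line is a count followed by the target and metric tag, so a tag whose count keeps growing between runs is the one worth watching.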

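    The note also says:

    In addition, if a thread dump is taken of the EM agent process, a large number of "Timer-" threads will be observed (they increase over time and are never closed or ended). Example: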

    jstack <Agent PID>|grep "Timer-"|wc -l
    983

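    Note: as a rule of thumb, the number of "Timer-" threads should stay constant over time, below 50. This is only an approximation, because it depends on the number of targets, the monitoring settings, the jobs executed, and many other factors. The key point is that the number of such threads should stay constant as time (days) passes.

    Checking the problem node: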

    [oracle@xxxxx1 ~]$ ps -ef | grep java
    ... (some output omitted) ...
    oracle    7687  7601  3 Aug05 ?        03:34:38 /u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/jdk/bin/java -Xmx128M -XX:MaxPermSize=128M -server -Djava.security.egd=file:///dev/./urandom -Dsun.lang.ClassLoader.allowArraySyntax=true -XX:-UseLargePages -XX:+UseLinuxPosixThreadCPUClocks -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:+UseCompressedOops -Dwatchdog.pid=7601 -cp /u01/app/oem13c/agent/agent_13.3.0.0.0/jdbc/lib/ojdbc7.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/ucp/lib/ucp.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/jsch-0.1.53.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/com.oracle.http_client.http_client_12.1.3.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/oracle.xdk_12.1.3/xmlparserv2.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/oracle.dms_12.1.3/dms.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/oracle.odl_12.1.3/ojdl.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/oracle.odl_12.1.3/ojdl2.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/lib/optic.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/sysman/jlib/log4j-core.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/jlib/gcagent_core.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/sysman/jlib/emagentSDK-intg.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/sysman/jlib/emagentSDK.jar oracle.sysman.gcagent.tmmain.TMMain   
    [oracle@xxxxx1 ~]$ /u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/jdk/bin/jstack 7687 | grep "Timer-" | wc -l
    83

    For comparison, a node that raised no alert:

    [oracle@xxxxx2 ~]$ ps -ef | grep 13.3
    oracle   31845 31753  0 Aug05 ?        00:26:26 /u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/jdk/bin/java -Xmx128M -XX:MaxPermSize=128M -server -Djava.security.egd=file:///dev/./urandom -Dsun.lang.ClassLoader.allowArraySyntax=true -XX:-UseLargePages -XX:+UseLinuxPosixThreadCPUClocks -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:+UseCompressedOops -Dwatchdog.pid=31753 -cp /u01/app/oem13c/agent/agent_13.3.0.0.0/jdbc/lib/ojdbc7.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/ucp/lib/ucp.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/jsch-0.1.53.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/com.oracle.http_client.http_client_12.1.3.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/oracle.xdk_12.1.3/xmlparserv2.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/oracle.dms_12.1.3/dms.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/oracle.odl_12.1.3/ojdl.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/oracle.odl_12.1.3/ojdl2.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/lib/optic.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/sysman/jlib/log4j-core.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/jlib/gcagent_core.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/sysman/jlib/emagentSDK-intg.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/sysman/jlib/emagentSDK.jar oracle.sysman.gcagent.tmmain.TMMain
    [oracle@xxxxx2 ~]$ /u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/jdk/bin/jstack 31845 | grep "Timer-" | wc -l
    7

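    The note gives a rule-of-thumb value of <50. On the problem node the "Timer-" thread count is 83 (versus 7 on the healthy node). Taken as a reference value, this suggests the node has very likely hit the bug.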
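Because the note stresses that what matters is whether the count stays constant over days, a single reading is only a hint. Below is a small sketch for taking repeated readings. The function names `count_timer_threads` and `timer_status` are hypothetical, and the default threshold of 50 is simply the rule of thumb quoted above, not a hard limit.

```shell
# Count lines mentioning "Timer-" in a jstack thread dump read from stdin
# (same result as `grep "Timer-" | wc -l`; prints 0 when there are none).
count_timer_threads() {
    grep -c 'Timer-' || true
}

# Print OK if the count is below the threshold (default 50), else SUSPECT.
timer_status() {
    count=$1
    threshold=${2:-50}
    if [ "$count" -lt "$threshold" ]; then
        echo "OK"
    else
        echo "SUSPECT"
    fi
}

# Example (run as the agent owner; 7687 is the agent PID found with ps above):
#   JSTACK=/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/jdk/bin/jstack
#   n=$("$JSTACK" 7687 | count_timer_threads)
#   echo "$(date '+%F %T') $n $(timer_status "$n")"
```

Appending the example's output to a file from cron once a day would show whether the count is growing or flat.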

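    Resolution

    This is caused by a bug. 13.2, 13.3, and 13.4 are all affected, but each has a different bug number and therefore a different patch.

    For 13.3 the bug is Bug 28406747; applying that patch on the agent side fixes it.

    How to apply the patch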

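    The first question is what exactly to patch. When I applied a PSU to the OMS earlier, it was my first time doing so, and even with experience applying PSUs to databases it took some fiddling.

    This time it is only a small one-off patch, but the readme says one of the steps is to shut down the Management Agent, and that had me puzzled for quite a while.

    Which Management Agent does it mean?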

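    Logically, the problem is in the agent on the database server, so the patch should go on the agent on the database server. However, the word "Management" made me think of the agent on the OMS host. And if it did mean the agents on the database servers, wouldn't that mean stopping and patching the agent on a great many hosts?

    To be honest, I was not even sure whether the agent on the OMS is the same thing as the agents on the database servers (I later confirmed that it is).

    Another round of Baidu and MOS followed, and this time nothing turned up.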

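    Then it occurred to me that right after the OMS is installed, the OMS server itself already shows up under the "Hosts" targets in the web console. So the agent on the OMS and the agents on the DB servers should essentially be the same thing. I therefore tried stopping the agent on the OMS: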

    $AGENT_HOME/bin/emctl stop agent

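    Sure enough, the OEM web console was still reachable, and under the "Hosts" targets the OMS machine itself now showed an unhealthy status, so they are indeed the same.

    In other words, every agent has to be patched, one by one...

    That raised another question: once the agent on the OMS is patched, will agents pushed to other servers afterwards already include the new patch?

    Enough talk. Patch the OMS agent first, then push a new agent to a DB server that is not yet monitored and see what happens.

    First of all, be sure to read the patch readme and follow its requirements step by step!

    Step 1: upgrade the agent's OPatch. The OPatch of the OMS agent had already been upgraded when I applied the PSU earlier, so this step was not needed here.

    Step 2: set the environment variable: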

    [oracle@oem13c agent]$ export ORACLE_HOME=/u01/app/oem13c/agent/agent_13.3.0.0.0
    [oracle@oem13c agent]$ /u01/app/oem13c/agent/agent_13.3.0.0.0/OPatch/opatch version
    OPatch Version: 13.9.3.3.0
    
    OPatch succeeded.

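    A short digression: the readme calls the directory /u01/app/oem13c/agent/agent_13.3.0.0.0 the agent core home. My AGENT_HOME environment variable, on the other hand, is set to /u01/app/oem13c/agent/agent_inst, which is called the instance directory when an agent is pushed to a host.

    Here /u01/app/oem13c/agent is the agent base directory. I set AGENT_HOME=/u01/app/oem13c/agent/agent_inst because the emctl command lives in the bin folder under that directory.

    The directory the one-off patch is actually applied to is the agent core home.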
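The three directories can be kept apart by giving each its own variable. This is only a sketch for this particular installation: the variable names `AGENT_BASE`, `AGENT_CORE_HOME`, and `AGENT_INST_HOME` are chosen here for illustration and are not names used by the readme or by Enterprise Manager.

```shell
# Agent base directory chosen when the agent was deployed.
AGENT_BASE=/u01/app/oem13c/agent

# "Agent core home" in the readme: the directory OPatch treats as ORACLE_HOME.
AGENT_CORE_HOME=$AGENT_BASE/agent_13.3.0.0.0

# Instance directory: holds the agent's logs (sysman/log) and its own bin/emctl.
AGENT_INST_HOME=$AGENT_BASE/agent_inst

# OPatch works against the core home, not the instance directory.
export ORACLE_HOME=$AGENT_CORE_HOME
```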

    Back to the patching.

    Step 3: stop the agent:

    [oracle@oem13c 28406747]$ export PATH=$ORACLE_HOME/bin:$ORACLE_HOME/OPatch:$PATH
    [oracle@oem13c 28406747]$ emctl stop agent
    Oracle Enterprise Manager Cloud Control 13c Release 3  
    Copyright (c) 1996, 2018 Oracle Corporation.  All rights reserved.
    Stopping agent ... stopped.
    [oracle@oem13c 28406747]$ opatch lspatches
    25237184;One-off
    24470104;
    
    OPatch succeeded.

    Step 4: apply the patch:

    [oracle@oem13c 28406747]$ opatch apply
    Oracle Interim Patch Installer version 13.9.3.3.0
    Copyright (c) 2020, Oracle Corporation.  All rights reserved.
    
    
    Oracle Home       : /u01/app/oem13c/agent/agent_13.3.0.0.0
    Central Inventory : /u01/app/oraInventory
       from           : /u01/app/oem13c/agent/agent_13.3.0.0.0/oraInst.loc
    OPatch version    : 13.9.3.3.0
    OUI version       : 13.9.1.0.0
    Log file location : /u01/app/oem13c/agent/agent_13.3.0.0.0/cfgtoollogs/opatch/opatch2020-08-10_16-39-09PM_1.log
    
    
    OPatch detects the Middleware Home as "/u01/app/oem13c/agent"
    
    Verifying environment and performing prerequisite checks...
    OPatch continues with these patches:   28406747  
    
    Do you want to proceed? [y|n]
    y
    User Responded with: Y
    All checks passed.
    Backing up files...
    Applying interim patch '28406747' to OH '/u01/app/oem13c/agent/agent_13.3.0.0.0'
    
    Patching component oracle.sysman.agent.ic, 13.3.0.0.0...
    Patch 28406747 successfully applied.
    Log file location: /u01/app/oem13c/agent/agent_13.3.0.0.0/cfgtoollogs/opatch/opatch2020-08-10_16-39-09PM_1.log
    
    OPatch succeeded.
    [oracle@oem13c 28406747]$ opatch lspatches
    28406747;
    25237184;One-off
    24470104;
    
    OPatch succeeded.

    Finally, start the agent:

    [oracle@oem13c 28406747]$ emctl start agent

    With that, the one-off patch has been applied successfully.

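    Afterwards I pushed a new agent to a DB server that was not yet monitored and found that the newly deployed agent does not include the new patch...

    So every agent still has to be patched by hand.

    The steps are the same and not particularly complicated.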
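With many agents to patch by hand, it helps to check which ones already have the patch. The sketch below parses `opatch lspatches` output in the format shown above. The function name `has_patch` is hypothetical, and the host names in the usage comment are placeholders.

```shell
# Succeed (exit 0) if the given patch ID is listed in `opatch lspatches`
# output read from stdin; lines look like "28406747;" or "25237184;One-off".
has_patch() {
    grep -Eq "^[[:space:]]*$1;"
}

# Example: check several agent hosts over ssh (host names are placeholders,
# and the agent core home is assumed to be the same on every host):
#   for h in dbhost1 dbhost2; do
#       if ssh "oracle@$h" /u01/app/oem13c/agent/agent_13.3.0.0.0/OPatch/opatch lspatches \
#               | has_patch 28406747; then
#           echo "$h: patched"
#       else
#           echo "$h: NOT patched"
#       fi
#   done
```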

    Going forward, I will keep watching whether the "Timer-" thread count becomes abnormal again and whether the alert recurs.

  • Original article: https://www.cnblogs.com/PiscesCanon/p/13469665.html