Diskmon
Master diskmon
• Monitors CELLSRVs and Network using Heartbeats
• Propagates Cell/Network state to ASM/RDBMS processes (dskm)
• Maintains a cluster-wide global view of Cells with other DISKMONs in the cluster
• Accepts fencing requests from CRS and delivers them to the Cells
• Accepts intradatabase IORM plans from RDBMS and sends them to the Cells
• Provides communication with the cells
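A quick way to confirm whether the master diskmon is active on a node is to look for the diskmon.bin process and for the ora.diskmon resource that runs in the lower (init) clusterware stack. The commands below are a minimal sketch, assuming an 11.2 Grid Infrastructure installation with crsctl in the PATH:

# Is the master diskmon daemon running on this node?
ps -ef | grep '[d]iskmon.bin'

# Current state of the ora.diskmon resource in the lower stack
crsctl status resource ora.diskmon -init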
The diskmon daemon monitors the cell servers (CELLSRV) in Exadata, so it is only useful in an Exadata environment. In releases 11.2.0.1 and 11.2.0.2, however, the daemon is started by default even on non-Exadata systems. Release 11.2.0.3 refines this detail: on non-Exadata systems diskmon is no longer started.

11.2.0.3 Grid Infrastructure diskmon Will be Offline by Default in Non-Exadata Environment

What is being announced?
As the Grid Infrastructure daemon diskmon.bin is used for Exadata fencing, starting with 11.2.0.3 the resource ora.diskmon will be offline in a non-Exadata environment. This is an expected behaviour change.

Prior to 11.2.0.3:
ps -ef| grep diskmon.bin
grid 3361 3166 0 22:57 ? 00:00:00 /ocw/grid/bin/diskmon.bin -d -f

On 11.2.0.3:
ps -ef| grep diskmon.bin
>> no more diskmon.bin

Some log entries related to the diskmon process:

[ CSSD]2009-07-27 10:27:36.419 [20] >TRACE: kgzf_dskm_conn4: unable to connect to master diskmon in 60174 msec
[ CSSD]2009-07-27 10:27:36.419 [20] >TRACE: kgzf_send_main1: connection to master diskmon timed out
[ CSSD]2009-07-27 10:27:36.421 [22] >TRACE: KGZF: Fatal diskmon condition, IO fencing is not available. For additional error info look at the master diskmon log file (diskmon.log)
[ CSSD]2009-07-27 10:27:36.421 [22] >ERROR: ASSERT clsssc.c 2471
[ CSSD]2009-07-27 10:27:36.421 [22] >ERROR: clssscSAGEInitFenceCompl: Fence completion failed, rc 56859

It seems that the newly started diskmon process registered with Oracle Clusterware is not able to communicate properly:

setsid: failed with -1/1
dskm_getenv_oracle_user: calling getpwnam_r for user oracle
dskm_getenv_oracle_user: info for user oracle complete
dskm_set_user: unable to change ownership for the log directory /optware/oracle/11.1.0.7/crs/log/shplab01/diskmon to user oracle, id 1101, errno 1
07/27/09 10:27:37: Master Diskmon starting

The tusc output for the cssd gives the following information:

...
1248953770.528145 [/optware/ora][20944]{2992772} unlink("/var/spool/sockets/pwgr/client20944") ERR#2 ENOENT
1248953770.612485 [/optware/ora][20944]{2992772} unlink("/tmp/.oracle_master_diskmon") ERR#1 EPERM
1248953770.649479 [/optware/ora][20944]{2992772} unlink("/tmp/.oracle_master_diskmon") ERR#1 EPERM
1248953770.656719 [/optware/ora][20944]{2992772} unlink("/var/spool/sockets/pwgr/client20944") ERR#1 EPERM
...

There is a permission error on the file /tmp/.oracle_master_diskmon.

Solution
The resolution is to change the permissions of the file /tmp/.oracle_master_diskmon, which should be owned by oracle.
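A minimal sketch of the check and the fix, assuming the Grid Infrastructure/RDBMS software owner is oracle and its primary group is oinstall (adjust both to your installation); the chown has to be run as root:

# Inspect the current ownership of the master diskmon socket file
ls -l /tmp/.oracle_master_diskmon

# As root: give the file back to the oracle user so diskmon can unlink and recreate it
chown oracle:oinstall /tmp/.oracle_master_diskmon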
diskmon.log
============
2011-12-01 22:14:49.510: [ DISKMON][14036:1093384512] SKGXP:[386927568.6]{0}: (14036 -> 12265) SKGXP_CHECK_HEART_BEAT_RESP_EXPIRE: NO PATH to Monitor entry: 0x17161490
2011-12-01 22:14:49.510: [ DISKMON][14036:1093384512] SKGXP:[386927568.7]{0}: Subnet: 0
2011-12-01 22:14:49.510: [ DISKMON][14036:1093384512] SKGXP:[386927568.8]{0}: Remote endpoint [192.168.10.3/44538] is DOWN
2011-12-01 22:14:49.510: [ DISKMON][14036:1093384512] SKGXP:[386927568.9]{0}: Local endpoint [192.168.10.1/45530] is UP
2011-12-01 22:14:49.510: [ DISKMON][14036:1093384512] SKGXP:[386927568.10]{0}: SKGXP_DO_HEART_BEAT_RESP: Matching Monitor Entry Not Found
2011-12-01 22:14:49.510: [ DISKMON][14036:1093384512] SKGXP:[386927568.11]{0}: SKGXPGPID Internet address 192.168.10.3 RDS port number 44538
2011-12-01 22:14:49.510: [ DISKMON][14036:1093384512] dskm_hb_thrd_main11: got status change
2011-12-01 22:14:49.510: [ DISKMON][14036:1093384512] dskm_ant_rsc_monitor_start: rscnam: o/192.168.10.3 rsc: 0x171609c0 state: UNREACHABLE reconn_attempts: 0 last_reconn_ts: 1322773921
2011-12-01 22:14:49.649: [ DISKMON][14036:1093384512] dskm_node_guids_are_offline: query SM done. retcode = 56891(REACHABLE)
2011-12-01 22:14:49.657: [ DISKMON][14036:1093384512] dskm_oss_get_net_info5: oss_get_net_info for device o/192.168.10.3 returned skgxpid 040302010001894cb5afca0419ed706ae92f000008000000000000000000000001030000c0a80a03000000000000000000000000adfa00000000000016000000 and the following 1 ip adresess. known_reid: Yes
2011-12-01 22:14:49.657: [ DISKMON][14036:1093384512] 192.168.10.1
2011-12-01 22:14:49.657: [ DISKMON][14036:1093384512] dskm_ant_rsc_monitor_start6.5:Cell does support TCP monitor, and does support SM Query, cell incarnation is 1, guid num is 2
2011-12-01 22:14:49.657: [ DISKMON][14036:1093384512] GUID-0 = 0x0021280001a0af15
2011-12-01 22:14:49.657: [ DISKMON][14036:1093384512] GUID-1 = 0x0021280001a0af16
2011-12-01 22:14:49.657: [ DISKMON][14036:1093384512] dskm_ant_rsc_monitor_start2: Connected to Same_Inc OSS device: o/192.168.10.3 numIP: 1
2011-12-01 22:14:49.657: [ DISKMON][14036:1093384512] 192.168.10.1
2011-12-01 22:15:07.501: [ DISKMON][14036:1108523328] dskm_slave_thrd_main3: peer disconnected
2011-12-01 22:15:07.501: [ DISKMON][14036:1108523328] dskm_slave_thrd_main5: client +ASM1/ASM/15374 disconnected, reid cid=14e1b2b4de58ef1eff5487b58dccc906,icin=188142332,nmn=1,lnid=188142332,gid=7,gin=1,gmn=0,umemid=0,opid=8,opsn=1,lvl=process hdr=0x 0
2011-12-01 22:15:08.440: [ CSSCLNT]clsssRecvMsg: got a disconnect from the server while waiting for message type 1
2011-12-01 22:15:08.440: [ CSSCLNT]clssgsGroupGetStatus: communications failed (0/3/-1)
2011-12-01 22:15:08.440: [ CSSCLNT]clssgsGroupGetStatus: returning 8
2011-12-01 22:15:08.440: [ DISKMON][14036:1102219584] CRITICAL: Diskmon exiting: dskm_rac_thrd_main10: Diskmon is shutting down due to CSSD ABORT event
2011-12-01 22:15:08.440: [ DISKMON][14036:1102219584] SHUTDOWN FORCE due to CSSD ABORT
2011-12-01 22:15:08.440: [ DISKMON][14036:1102219584] dskm_rac_thrd_main: exiting
2011-12-01 22:15:08.754: [ DISKMON][14036:1104320832] dskm_slave_thrd_main5: client orarootagent/13701 disconnected, reid cid=DUMMY,icin=-1,nmn=-1,lnid=-1,gid=-1,gin=-1,gmn=-1,umemid=-1,opid=-1,opsn=-1,lvl=process hdr=0xfece0100
2011-12-01 22:15:09.988: [ DISKMON][14036:1191118288] dskm_cleanup_thrds: cleaning up the rac event handling thread tid 1102219584
[ DISKMON][13016] I/O Fencing and SKGXP HA monitoring daemon -- Version 1.2.0.0
Process 13016 started on 2011-12-01 at 22:15:39.863
2011-12-01 22:15:39.867: [ DISKMON][13016] dskm main: starting up
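When reading excerpts like the one above, it is often easiest to follow the master diskmon log live. In the layout used in this note (compare the /optware/oracle/11.1.0.7/crs/log/shplab01/diskmon path shown earlier), the file sits under the Grid/CRS home; the command below is a sketch that assumes GRID_HOME is set to that home:

# Follow the master diskmon log (layout: <grid home>/log/<hostname>/diskmon/diskmon.log)
tail -f "$GRID_HOME/log/$(hostname -s)/diskmon/diskmon.log"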
ocssd.log
==========
2011-12-01 22:15:04.223: [ CSSD][1127139648]clssgmmkLocalKillThread: Time up. Timeout 60500 Start time 369099698 End time 369160198 Current time 369160198
2011-12-01 22:15:04.223: [ CSSD][1127139648]clssgmmkLocalKillResults: Replying to kill request from remote node 2 kill id 1 Success map 0x00000000 Fail map 0x00000000
2011-12-01 22:15:04.224: [GIPCHAUP][1094015296] gipchaUpperProcessDisconnect: processing DISCONNECT for hendp 0x2aa5550 [00000000000092e5] { gipchaEndpoint : port 'nm2_gts-cluster/af9c-724c-2e3f-3946', peer 'gts1db02:205f-3cac-025e-c962', srcCid 00000000-000092e5, dstCid 00000000-000009d9, numSend 0, maxSend 100, usrFlags 0x4000, flags 0x204 }
2011-12-01 22:15:04.224: [ CSSD][1122408768]clssnmeventhndlr: Disconnecting endp 0x932d ninf 0x1c3a2c0
2011-12-01 22:15:04.224: [ CSSD][1122408768]clssnmDiscHelper: gts1db02, node(2) connection failed, endp (0x932d), probe(0x3000000000), ninf->endp 0x932d
2011-12-01 22:15:04.224: [ CSSD][1122408768]clssnmDiscHelper: node 2 clean up, endp (0x932d), init state 3, cur state 3
2011-12-01 22:15:04.224: [GIPCXCPT][1122408768] gipcInternalDissociate: obj 0x2e99290 [000000000000932d] { gipcEndpoint : localAddr 'gipcha://gts1db01:nm2_gts-cluster/af9c-724c-2e3f-394', remoteAddr 'gipcha://gts1db02:205f-3cac-025e-c96', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x13860e, usrFlags 0x0 } not associated with any container, ret gipcretFail (1)
2011-12-01 22:15:04.224: [GIPCXCPT][1122408768] gipcDissociateF [clssnmDiscHelper : clssnm.c : 3284]: EXCEPTION[ ret gipcretFail (1) ] failed to dissociate obj 0x2e99290 [000000000000932d] { gipcEndpoint : localAddr 'gipcha://gts1db01:nm2_gts-cluster/af9c-724c-2e3f-394', remoteAddr 'gipcha://gts1db02:205f-3cac-025e-c96', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x13860e, usrFlags 0x0 }, flags 0x0
2011-12-01 22:15:04.224: [GIPCXCPT][1122408768] gipcInternalDissociate: obj 0x2e99290 [000000000000932d] { gipcEndpoint : localAddr 'gipcha://gts1db01:nm2_gts-cluster/af9c-724c-2e3f-394', remoteAddr 'gipcha://gts1db02:205f-3cac-025e-c96', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x13860e, usrFlags 0x0 } not associated with any container, ret gipcretFail (1)
2011-12-01 22:15:04.224: [GIPCXCPT][1122408768] gipcDissociateF [clssnmDiscHelper : clssnm.c : 3430]: EXCEPTION[ ret gipcretFail (1) ] failed to dissociate obj 0x2e99290 [000000000000932d] { gipcEndpoint : localAddr 'gipcha://gts1db01:nm2_gts-cluster/af9c-724c-2e3f-394', remoteAddr 'gipcha://gts1db02:205f-3cac-025e-c96', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x13860e, usrFlags 0x0 }, flags 0x0
2011-12-01 22:15:04.224: [ CSSD][1122408768]clssnmDiscEndp: gipcDestroy 0x932d
2011-12-01 22:15:04.603: [ CSSD][1104976192](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node gts1db02, number 2, sync 188142334, stamp 393990918
2011-12-01 22:15:04.603: [ CSSD][1104976192]###################################
2011-12-01 22:15:04.603: [ CSSD][1104976192]clssscExit: CSSD aborting from thread clssnmvKillBlockThread
2011-12-01 22:15:04.603: [ CSSD][1104976192]###################################
2011-12-01 22:15:04.603: [ CSSD][1104976192](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally gts1db01, number 1, has experienced a failure in thread number 10 and is shutting down
2011-12-01 22:15:04.603: [ CSSD][1104976192]clssscExit: Starting CRSD cleanup
2011-12-01 22:15:04.737: [ CSSD][1103399232]clssgmDiscEndpcl: gipcDestroy 0xa2ea2f7
2011-12-01 22:15:04.925: [ CSSD][1112947008](:CSSNM00058:)clssnmvDiskCheck: No I/O completions for 3472942430 ms for voting file o/192.168.10.5/DBFS_DG_CD_04_gts1cel03)
2011-12-01 22:15:04.925: [ CSSD][1112947008]clssnmvDiskAvailabilityChange: voting file o/192.168.10.5/DBFS_DG_CD_04_gts1cel03 now offline
2011-12-01 22:15:04.925: [ CSSD][1112947008](:CSSNM00058:)clssnmvDiskCheck: No I/O completions for 3472942450 ms for voting file o/192.168.10.4/DBFS_DG_CD_02_gts1cel02)
2011-12-01 22:15:04.925: [ CSSD][1112947008]clssnmvDiskAvailabilityChange: voting file o/192.168.10.4/DBFS_DG_CD_02_gts1cel02 now offline
2011-12-01 22:15:04.925: [ CSSD][1112947008](:CSSNM00058:)clssnmvDiskCheck: No I/O completions for 3472942480 ms for voting file o/192.168.10.3/DBFS_DG_CD_02_gts1cel01)
2011-12-01 22:15:04.926: [ CSSD][1112947008]clssnmvDiskAvailabilityChange: voting file o/192.168.10.3/DBFS_DG_CD_02_gts1cel01 now offline
2011-12-01 22:15:04.926: [ CSSD][1112947008](:CSSNM00018:)clssnmvDiskCheck: Aborting, 0 of 3 configured voting disks available, need 2
2011-12-01 22:15:04.926: [ CSSD][1112947008]clssscExit: abort already set 1
2011-12-01 22:15:04.926: [ SKGFD][1109793088]Lib :OSS:: closing handle 0x2538e70 for disk :o/192.168.10.5/DBFS_DG_CD_04_gts1cel03:
2011-12-01 22:15:04.926: [ SKGFD][1098676544]Lib :OSS:: closing handle 0x2aaaac0d7cb0 for disk :o/192.168.10.3/DBFS_DG_CD_02_gts1cel01:
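The ocssd.log excerpt ends with 0 of 3 configured voting files available, all of them Exadata cell disks (for example o/192.168.10.3/DBFS_DG_CD_02_gts1cel01). To list the configured voting files and their current state, crsctl can be queried; a minimal sketch, assuming crsctl from the Grid home is in the PATH:

# List voting files known to CSS, including the cell disk path backing each one
crsctl query css votedisk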
Heartbeat timeout logic may fail to detect dead cells if diskmon has been running for over 40 days.

Rediscovery Notes:
If diskmon has been running for over 40 days and DB processes start to hang after a cell death, you may have hit this bug. All nodes may hang because one of the heartbeat threads in diskmon gets stuck trying to notify the instance(s) that it has reconnected to the cell. However, if this occurs, there is insufficient diagnostic data collected to help confirm why the hang occurred. This fix is a diagnostic enhancement for that scenario: if diskmon/DSKM processes are hung or stuck, the fix may help collect additional useful diagnostics.
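To check whether a node is inside the exposure window for this bug, the elapsed run time of diskmon.bin can be read straight from ps; a minimal sketch (the [d] in the grep pattern just keeps grep from matching its own process):

# Elapsed time (etime) and start time (lstart) of the master diskmon process
ps -eo pid,etime,lstart,args | grep '[d]iskmon.bin'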
PROBLEM:
--------
Diskmon logs fill up quickly, causing their disk / volume /u00 to become full.

DIAGNOSTIC ANALYSIS:
--------------------
1) The following entries are consistently and repeatedly logged:

2010-12-17 03:22:25.848: [ DISKMON][17796:1089268032] dskm_rac_ini13: calling clssgsqgrp
2010-12-17 03:22:25.849: [ DISKMON][17796:1089268032] dskm_rac_ini80: called clssgsqgrp:
2010-12-17 03:22:25.849: [ DISKMON][17796:1089268032] dskm_dump_group_priv: vers: 0 flags: 0x0 confInc: 0 My confInc: 0
2010-12-17 03:22:25.849: [ DISKMON][17796:1089268032] dskm_dump_group_priv: CSS Msg Hdr: vers: 0 type: UNKNOWN (0) chunks: NO MORE CHUNKS (0) transport: UNKNOWN (0) mSize: 0
2010-12-17 03:22:25.849: [ DISKMON][17796:1089268032] dskm_dump_group_priv: Group Private Data is not of type DSKM_MSG_SS_REQ. Not proceeding with msg dump
2010-12-17 03:22:25.849: [ DISKMON][17796:1089268032] dskm_rac_ini15: Found my member number 1 to be busy. Waiting (attempts: 598) for OCSSD to clean up previous incarnation

2) Core files are generated, and the many stack dumps of diskmon further enlarge the diskmon.log. The following is frequently seen in the diskmon.log:

2010-12-17 03:22:28.855: [ DISKMON][17796:1089268032] dskm_rac_ini16: OCSSD has notified that another diskmon is currently running in this node. This might be a duplicate startup. If not consult OCSSD log for additional information.
2010-12-17 03:22:28.855: [ DISKMON][17796] INCIDENT : EXCEPTION (SIGNAL: 6) in [gsignal()]
2010-12-17 03:22:28.855: [ DISKMON][17796] Thread 1089268032 got exception 6
2010-12-17 03:22:28.855: [ DISKMON][17796] Stack dump for thread 1089268032
[ DISKMON][17796] ....
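When diskmon.log grows like this, a quick first check is the size of the diskmon log directory and any core files next to it; a sketch, assuming the 11.2-style layout under the Grid home and that GRID_HOME is set to that home (adjust the path to your installation):

# Space used by the diskmon logs and any cores collected alongside them
du -sh "$GRID_HOME/log/$(hostname -s)/diskmon"
ls -lh "$GRID_HOME/log/$(hostname -s)/diskmon" | head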