zoukankan      html  css  js  c++  java
  • 了解更多关于11gR2 diskmon

      下图显示了有ohasd管理的所有资源(resource)/守护进程(daemons)的依赖关系:    

    Diskmon

     

    Master diskmon

    • Monitors CELLSRVs and Network using Heartbeats • Propagates Cell/Network state to ASM/RDBMS processes (dskm) • Maintains a cluster-wide global view of Cells with other DISKMONs in the cluster • Accepts fencing requests from CRS and delivers them to the Cells • Accepts intradatabase IORM plans from RDBMS and sends them to the Cells • Provides communication with the cells

        Diskmon daemon用以监控Exadata 中的cell server,即只有在Exadata环境中才有用。但是在版本11.2.0.1-11.2.0.2,即便是非Exadata环境也会默认启动该守护进 程。  在版本11.2.0.3 中 改进了这一细节,非Exadata环境无发启动diskmon了。 11.2.0.3 Grid Infrastructure diskmon Will be Offline by Default in Non-Exadata Environment What is being announced? As Grid Infrastructure daemon diskmon.bin is used for Exadata fencing, started from 11.2.0.3, resource ora.diskmon will be offline in non-Exadata environment. This is expected behaviour change. Prior to 11.2.0.3: ps -ef| grep diskmon.bin grid      3361  3166  0 22:57 ?        00:00:00 /ocw/grid/bin/diskmon.bin -d -f On 11.2.0.3: ps -ef| grep diskmon.bin >> no more diskmon.bin     一些diskmon进程的日志:  
    [ CSSD]2009-07-27 10:27:36.419 [20] >TRACE: kgzf_dskm_conn4: unable to connect to master
    diskmon in 60174 msec
    
    [ CSSD]2009-07-27 10:27:36.419 [20] >TRACE: kgzf_send_main1: connection to master diskmon
    timed out
    
    [ CSSD]2009-07-27 10:27:36.421 [22] >TRACE: KGZF: Fatal diskmon condition, IO fencing is
    not available. For additional error info look at the master diskmon log file (diskmon.log)
    
    [ CSSD]2009-07-27 10:27:36.421 [22] >ERROR: ASSERT clsssc.c 2471
    [ CSSD]2009-07-27 10:27:36.421 [22] >ERROR: clssscSAGEInitFenceCompl: Fence completion
    failed, rc 56859
    
    It seems that the new process registered with Oracle Clusterware diskmon is not able to communicate properly .
    
    setsid: failed with -1/1
    dskm_getenv_oracle_user: calling getpwnam_r for user oracle
    dskm_getenv_oracle_user: info for user oracle complete
    dskm_set_user: unable to change ownership for the log directory
    /optware/oracle/11.1.0.7/crs/log/shplab01/diskmon to user oracle, id 1101, errno 1
    07/27/09 10:27:37: Master Diskmon starting
    
    The tusc output of the cssd log gives the following information
    ...
    1248953770.528145 [/optware/ora][20944]{2992772}
    unlink("/var/spool/sockets/pwgr/client20944") ERR#2 ENOENT
    1248953770.612485 [/optware/ora][20944]{2992772}
    unlink("/tmp/.oracle_master_diskmon") ERR#1 EPERM
    1248953770.649479 [/optware/ora][20944]{2992772}
    unlink("/tmp/.oracle_master_diskmon") ERR#1 EPERM
    1248953770.656719 [/optware/ora][20944]{2992772}
    unlink("/var/spool/sockets/pwgr/client20944") ERR#1 EPERM
    ...
    
     There is a permission error of the file /tmp/.oracle_master_diskmon.
    Solution
    
    The resolution is to change the permissions of the file  /tmp/.oracle_master_diskmon, which should be owned by oracle . 
    
    diskmon.log
    ============
    2011-12-01 22:14:49.510: [ DISKMON][14036:1093384512] SKGXP:[386927568.6]{0}:
    (14036 -> 12265) SKGXP_CHECK_HEART_BEAT_RESP_EXPIRE: NO PATH to Monitor
    entry: 0x17161490
    2011-12-01 22:14:49.510: [ DISKMON][14036:1093384512] SKGXP:[386927568.7]{0}:
      Subnet: 0
    2011-12-01 22:14:49.510: [ DISKMON][14036:1093384512] SKGXP:[386927568.8]{0}:
       Remote endpoint [192.168.10.3/44538] is DOWN
    2011-12-01 22:14:49.510: [ DISKMON][14036:1093384512] SKGXP:[386927568.9]{0}:
       Local endpoint [192.168.10.1/45530] is UP
    2011-12-01 22:14:49.510: [ DISKMON][14036:1093384512]
    SKGXP:[386927568.10]{0}: SKGXP_DO_HEART_BEAT_RESP: Matching Monitor Entry Not
    Found
    2011-12-01 22:14:49.510: [ DISKMON][14036:1093384512]
    SKGXP:[386927568.11]{0}:   SKGXPGPID Internet address 192.168.10.3 RDS port
    number 44538
    2011-12-01 22:14:49.510: [ DISKMON][14036:1093384512] dskm_hb_thrd_main11:
    got status change
    2011-12-01 22:14:49.510: [ DISKMON][14036:1093384512]
    dskm_ant_rsc_monitor_start: rscnam: o/192.168.10.3 rsc: 0x171609c0 state:
    UNREACHABLE reconn_attempts: 0 last_reconn_ts: 1322773921
    2011-12-01 22:14:49.649: [ DISKMON][14036:1093384512]
    dskm_node_guids_are_offline: query SM done. retcode = 56891(REACHABLE)
    2011-12-01 22:14:49.657: [ DISKMON][14036:1093384512] dskm_oss_get_net_info5:
    oss_get_net_info for device o/192.168.10.3 returned skgxpid
    040302010001894cb5afca0419ed706ae92f000008000000000000000000000001030000c0a80a
    03000000000000000000000000adfa00000000000016000000 and the following 1 ip
    adresess. known_reid: Yes
    2011-12-01 22:14:49.657: [ DISKMON][14036:1093384512]     192.168.10.1
    2011-12-01 22:14:49.657: [ DISKMON][14036:1093384512]
    dskm_ant_rsc_monitor_start6.5:Cell does support TCP monitor, and does support
    SM Query, cell incarnation is 1, guid num is 2
    2011-12-01 22:14:49.657: [ DISKMON][14036:1093384512] GUID-0 =
    0x0021280001a0af15
    2011-12-01 22:14:49.657: [ DISKMON][14036:1093384512] GUID-1 =
    0x0021280001a0af16
    2011-12-01 22:14:49.657: [ DISKMON][14036:1093384512]
    dskm_ant_rsc_monitor_start2: Connected to Same_Inc OSS device: o/192.168.10.3
    numIP: 1
    2011-12-01 22:14:49.657: [ DISKMON][14036:1093384512]     192.168.10.1
    
    2011-12-01 22:15:07.501: [ DISKMON][14036:1108523328] dskm_slave_thrd_main3:
    peer disconnected
    2011-12-01 22:15:07.501: [ DISKMON][14036:1108523328] dskm_slave_thrd_main5:
    client +ASM1/ASM/15374 disconnected, reid
    cid=14e1b2b4de58ef1eff5487b58dccc906,icin=188142332,nmn=1,lnid=188142332,gid=7
    ,gin=1,gmn=0,umemid=0,opid=8,opsn=1,lvl=process hdr=0x       0
    
    2011-12-01 22:15:08.440: [ CSSCLNT]clsssRecvMsg: got a disconnect from the
    server while waiting for message type 1
    2011-12-01 22:15:08.440: [ CSSCLNT]clssgsGroupGetStatus:  communications
    failed (0/3/-1)
    
    2011-12-01 22:15:08.440: [ CSSCLNT]clssgsGroupGetStatus: returning 8
    
    2011-12-01 22:15:08.440: [ DISKMON][14036:1102219584] CRITICAL: Diskmon
    exiting: dskm_rac_thrd_main10: Diskmon is shutting down due to CSSD ABORT
    event
    2011-12-01 22:15:08.440: [ DISKMON][14036:1102219584] SHUTDOWN FORCE due to
    CSSD ABORT
    2011-12-01 22:15:08.440: [ DISKMON][14036:1102219584] dskm_rac_thrd_main:
    exiting
    2011-12-01 22:15:08.754: [ DISKMON][14036:1104320832] dskm_slave_thrd_main5:
    client orarootagent/13701 disconnected, reid
    cid=DUMMY,icin=-1,nmn=-1,lnid=-1,gid=-1,gin=-1,gmn=-1,umemid=-1,opid=-1,opsn=-
    1,lvl=process hdr=0xfece0100
    2011-12-01 22:15:09.988: [ DISKMON][14036:1191118288] dskm_cleanup_thrds:
    cleaning up the rac event handling thread tid 1102219584
    [ DISKMON][13016]
    
    I/O Fencing and SKGXP HA monitoring daemon -- Version 1.2.0.0
    Process 13016 started on 2011-12-01 at 22:15:39.863
    
    2011-12-01 22:15:39.867: [ DISKMON][13016] dskm main: starting up
    
    ocssd.log
    ==========
    2011-12-01 22:15:04.223: [    CSSD][1127139648]clssgmmkLocalKillThread: Time
    up. Timeout 60500 Start time 369099698 End time 369160198 Current time
    369160198
    2011-12-01 22:15:04.223: [    CSSD][1127139648]clssgmmkLocalKillResults:
    Replying to kill request from remote node 2 kill id 1 Success map 0x00000000
    Fail map 0x00000000
    2011-12-01 22:15:04.224: [GIPCHAUP][1094015296] gipchaUpperProcessDisconnect:
    processing DISCONNECT for hendp 0x2aa5550 [00000000000092e5] { gipchaEndpoint
    : port 'nm2_gts-cluster/af9c-724c-2e3f-3946', peer
    'gts1db02:205f-3cac-025e-c962', srcCid 00000000-000092e5,  dstCid
    00000000-000009d9, numSend 0, maxSend 100, usrFlags 0x4000, flags 0x204 }
    2011-12-01 22:15:04.224: [    CSSD][1122408768]clssnmeventhndlr:
    Disconnecting endp 0x932d ninf 0x1c3a2c0
    2011-12-01 22:15:04.224: [    CSSD][1122408768]clssnmDiscHelper: gts1db02,
    node(2) connection failed, endp (0x932d), probe(0x3000000000), ninf->endp
    0x932d
    2011-12-01 22:15:04.224: [    CSSD][1122408768]clssnmDiscHelper: node 2 clean
    up, endp (0x932d), init state 3, cur state 3
    2011-12-01 22:15:04.224: [GIPCXCPT][1122408768] gipcInternalDissociate: obj
    0x2e99290 [000000000000932d] { gipcEndpoint : localAddr
    'gipcha://gts1db01:nm2_gts-cluster/af9c-724c-2e3f-394', remoteAddr
    'gipcha://gts1db02:205f-3cac-025e-c96', numPend 0, numReady 0, numDone 0,
    numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x13860e, usrFlags
    0x0 } not associated with any container, ret gipcretFail (1)
    2011-12-01 22:15:04.224: [GIPCXCPT][1122408768] gipcDissociateF
    [clssnmDiscHelper : clssnm.c : 3284]: EXCEPTION[ ret gipcretFail (1) ]  
    failed to dissociate obj 0x2e99290 [000000000000932d] { gipcEndpoint :
    localAddr 'gipcha://gts1db01:nm2_gts-cluster/af9c-724c-2e3f-394', remoteAddr
    'gipcha://gts1db02:205f-3cac-025e-c96', numPend 0, numReady 0, numDone 0,
    numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x13860e, usrFlags
    0x0 }, flags 0x0
    2011-12-01 22:15:04.224: [GIPCXCPT][1122408768] gipcInternalDissociate: obj
    0x2e99290 [000000000000932d] { gipcEndpoint : localAddr
    'gipcha://gts1db01:nm2_gts-cluster/af9c-724c-2e3f-394', remoteAddr
    'gipcha://gts1db02:205f-3cac-025e-c96', numPend 0, numReady 0, numDone 0,
    numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x13860e, usrFlags
    0x0 } not associated with any container, ret gipcretFail (1)
    2011-12-01 22:15:04.224: [GIPCXCPT][1122408768] gipcDissociateF
    [clssnmDiscHelper : clssnm.c : 3430]: EXCEPTION[ ret gipcretFail (1) ]  
    failed to dissociate obj 0x2e99290 [000000000000932d] { gipcEndpoint :
    localAddr 'gipcha://gts1db01:nm2_gts-cluster/af9c-724c-2e3f-394', remoteAddr
    'gipcha://gts1db02:205f-3cac-025e-c96', numPend 0, numReady 0, numDone 0,
    numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x13860e, usrFlags
    0x0 }, flags 0x0
    2011-12-01 22:15:04.224: [    CSSD][1122408768]clssnmDiscEndp: gipcDestroy
    0x932d
    2011-12-01 22:15:04.603: [    
    CSSD][1104976192](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by
    node gts1db02, number 2, sync 188142334, stamp 393990918
    2011-12-01 22:15:04.603: [    
    CSSD][1104976192]###################################
    2011-12-01 22:15:04.603: [    CSSD][1104976192]clssscExit: CSSD aborting from
    thread clssnmvKillBlockThread
    2011-12-01 22:15:04.603: [    
    CSSD][1104976192]###################################
    2011-12-01 22:15:04.603: [    CSSD][1104976192](:CSSSC00012:)clssscExit: A
    fatal error occurred and the CSS daemon is terminating abnormally
    gts1db01, number 1, has experienced a failure in thread number 10 and is
    shutting down
    2011-12-01 22:15:04.603: [    CSSD][1104976192]clssscExit: Starting CRSD
    cleanup
    
    2011-12-01 22:15:04.737: [    CSSD][1103399232]clssgmDiscEndpcl: gipcDestroy
    0xa2ea2f7
    2011-12-01 22:15:04.925: [    
    CSSD][1112947008](:CSSNM00058:)clssnmvDiskCheck: No I/O completions for
    3472942430 ms for voting file o/192.168.10.5/DBFS_DG_CD_04_gts1cel03)
    2011-12-01 22:15:04.925: [    CSSD][1112947008]clssnmvDiskAvailabilityChange:
    voting file o/192.168.10.5/DBFS_DG_CD_04_gts1cel03 now offline
    2011-12-01 22:15:04.925: [    
    CSSD][1112947008](:CSSNM00058:)clssnmvDiskCheck: No I/O completions for
    3472942450 ms for voting file o/192.168.10.4/DBFS_DG_CD_02_gts1cel02)
    2011-12-01 22:15:04.925: [    CSSD][1112947008]clssnmvDiskAvailabilityChange:
    voting file o/192.168.10.4/DBFS_DG_CD_02_gts1cel02 now offline
    2011-12-01 22:15:04.925: [    
    CSSD][1112947008](:CSSNM00058:)clssnmvDiskCheck: No I/O completions for
    3472942480 ms for voting file o/192.168.10.3/DBFS_DG_CD_02_gts1cel01)
    2011-12-01 22:15:04.926: [    CSSD][1112947008]clssnmvDiskAvailabilityChange:
    voting file o/192.168.10.3/DBFS_DG_CD_02_gts1cel01 now offline
    2011-12-01 22:15:04.926: [    
    CSSD][1112947008](:CSSNM00018:)clssnmvDiskCheck: Aborting, 0 of 3 configured
    voting disks available, need 2
    2011-12-01 22:15:04.926: [    CSSD][1112947008]clssscExit: abort already set
    1
    2011-12-01 22:15:04.926: [   SKGFD][1109793088]Lib :OSS:: closing handle
    0x2538e70 for disk :o/192.168.10.5/DBFS_DG_CD_04_gts1cel03:
    
    2011-12-01 22:15:04.926: [   SKGFD][1098676544]Lib :OSS:: closing handle
    0x2aaaac0d7cb0 for disk :o/192.168.10.3/DBFS_DG_CD_02_gts1cel01:
    
    Heartbeat timeout logic may to fail to detect dead cells if diskmon
    has been running for over 40 days.
    
    Rediscovery Notes:
     If diskmon has been running for over 40 days and DB processes start to hang
     after a cell death, you may have hit this bug.
    
     All nodes may hang due to one of the heartbeat threads in diskmon
    getting stuck trying to notify the instance(s) that it reconnected
    to the cell. However if this occurs there is insufficient diagnostic
    data collected to help confirm why the hang occurred.
    This fix is a diagnostic enhancement for this scenario.
    
    If diskmon/DSKM processes are hung/stuck this fix may help collect
    additional useful diagnostics.
    
    PROBLEM:
    --------
    Diskmon logs fill up quickly causing their disk / volume /u00  to become full
    
    DIAGNOSTIC ANALYSIS:
    --------------------
    1)  The following are consistently and repeatedly logged:
    
    2010-12-17 03:22:25.848: [ DISKMON][17796:1089268032] dskm_rac_ini13: calling
    clssgsqgrp
    2010-12-17 03:22:25.849: [ DISKMON][17796:1089268032] dskm_rac_ini80: called
    clssgsqgrp:
    2010-12-17 03:22:25.849: [ DISKMON][17796:1089268032] dskm_dump_group_priv:
    vers: 0 flags: 0x0 confInc: 0 My confInc: 0
    2010-12-17 03:22:25.849: [ DISKMON][17796:1089268032] dskm_dump_group_priv:
    CSS Msg Hdr: vers: 0 type: UNKNOWN (0) chunks: NO MORE CHUNKS (0) transport:
    UNKNOWN (0) mSize: 0
    2010-12-17 03:22:25.849: [ DISKMON][17796:1089268032] dskm_dump_group_priv:
    Group Private Data is not of type DSKM_MSG_SS_REQ. Not proceeding with msg
    dump
    2010-12-17 03:22:25.849: [ DISKMON][17796:1089268032] dskm_rac_ini15: Found
    my member number 1 to be busy. Waiting (attempts: 598) for OCSSD to clean up
    previous incarnation
    
    2) Core files generated, and many  stack dumps of diskmon futher enlarges the
    diskmon.log
    
       The following is frequently seen in the diskmon.log
    
    2010-12-17 03:22:28.855: [ DISKMON][17796:1089268032] dskm_rac_ini16: OCSSD
    has notified that another diskmon is currently running in this node.
    This might be a duplicate startup. If not consult OCSSD log for additional
    information.
    2010-12-17 03:22:28.855: [ DISKMON][17796] INCIDENT : EXCEPTION (SIGNAL: 6)
    in [gsignal()]
    2010-12-17 03:22:28.855: [ DISKMON][17796] Thread 1089268032 got exception 6
    2010-12-17 03:22:28.855: [ DISKMON][17796] Stack dump for thread 1089268032
    [ DISKMON][17796]
    ....
  • 相关阅读:
    [LeetCode] 617. Merge Two Binary Trees
    [LeetCode] 738. Monotone Increasing Digits
    289. Game of Life
    305. Number of Islands II
    288. Unique Word Abbreviation
    271. Encode and Decode Strings
    393. UTF-8 Validation
    317. Shortest Distance from All Buildings
    286. Walls and Gates
    296. Best Meeting Point
  • 原文地址:https://www.cnblogs.com/macleanoracle/p/2967667.html
Copyright © 2011-2022 走看看