zoukankan      html  css  js  c++  java
  • CASE:DB shutdown/open 过程中发生异常导致JOB不能自动执行

    CASE:DB shutdown/open 过程中发生异常导致JOB不能自动执行

    现象:
    一个DB中的所有JOB在3月25日之后就不再自动运行,查询DBA_JOBS,发现LAST_DATE定格在3月25日之前。
    SQL> show parameter job

    NAME TYPE VALUE
    ------------------------------------ ----------- ------------------------------
    job_queue_processes integer 5

    SQL> select job,LAST_DATE,LAST_SEC,NEXT_DATE,NEXT_SEC,TOTAL_TIME,BROKEN,INTERVAL,what from dba_jobs;

     JOB LAST_DATE LAST_SEC NEXT_DATE NEXT_SEC TOTAL_TIME BROKEN INTERVAL WHAT
    ---------- ----------- ---------------- ----------- ---------------- ---------- ------ ---------------------------------------- --------------------------------------------------
     1 2013/3/25 1 17:26:39 2013/3/25 1 17:27:39 180875 N sysdate + 1 / (24 * 60) EMD_MAINTENANCE.EXECUTE_EM_DBMS_JOB_PROCS();
     3 2013/3/25 6 06:10:18 2013/5/3 11 11:22:47 758193 N sysdate+1 P_IMG_CSU_STAT_INFO;
     4 2013/3/22 1 12:07:20 2013/5/3 10 10:33:37 4165016 N sysdate + 1 p_IMG_csu_stat_daily_ex_exec;
     5 2013/3/23 1 14:27:31 2013/4/10 6 06:00:00 13238996 N sysdate + 1 P_IMG_CSU_TWS_MONTHLY_EXEC;
     122 2013/3/25 1 10:55:58 2013/4/10 1 10:55:58 140067 N sysdate+1 p_IMG_csu_stat_wt_daily_exec;
     42 2013/3/25 1 17:23:29 2013/4/10 8 08:00:00 6803650 N SYSDATE+1/288 qpg_kj_tj;
     62 2013/3/22 0 00:10:05 2013/4/10 0 00:10:00 7147390 N TRUNC(SYSDATE)+1/(24*6)+1 neimg.pack_qpg_IMG_dailytj.DAILY_STATISTICS;
     63 2013/3/22 0 00:00:01

    分析:
    Step1,查看dba_jobs等视图
    Step2,查看alert日志:
    Mon Mar 25 09:05:22 2013
    Errors in file /app/oracle/admin/neimg/bdump/neimg_j006_27877.trc:  <===
    ORA-12012: error on auto execute of job 42
    ORA-01403: no data found
    ORA-06512: at "neimg.qpg_KJ_TJ", line 33
    ORA-06512: at line 1
    Mon Mar 25 09:08:29 2013

    通过alert日志的报错,追溯到neimg_j006_27877.trc,但没有有价值的信息。
    Step3,从OS层面trace进程cjq0
    [oracle@IMG_PIDB udump]$ ps -ef|grep cjq
    oracle 4161 4051 0 08:30 pts/0 00:00:00 grep cjq
    oracle 11984 1 0 Mar25 ? 00:00:02 ora_cjq0_neimg
    [oracle@IMG_PIDB udump]$ strace -p 11984
    Process 11984 attached - interrupt to quit
    times(NULL) = 790492277
    semtimedop(393217, 0x7fff39107150, 1, {1, 840000000}) = -1 EAGAIN (Resource temporarily unavailable)
    times(NULL) = 790492461
    times(NULL) = 790492461
    semtimedop(393217, 0x7fff39107150, 1, {3, 0}) = -1 EAGAIN (Resource temporarily unavailable) <<====
    getrusage(RUSAGE_SELF, {ru_utime={1, 664746}, ru_stime={0, 500923}, ...}) = 0
    getrusage(RUSAGE_SELF, {ru_utime={1, 664746}, ru_stime={0, 500923}, ...}) = 0
    times(NULL) = 790492761
    times(NULL) = 790492761
    semtimedop(393217, 0x7fff39107150, 1, {3, 0} <unfinished ...>
    Process 11984 detached

    semtimedop(393217, 0x7fff39107150, 1, {3, 0}) = -1 EAGAIN (Resource temporarily unavailable)
    通过trace,发现自从3月25日DB重启后一直在semtimedop内存段’0x7fff39107150'

    Step4,删除job 4,并重建,新的job号是292,但是新建的JOB 292仍然不能自动执行。列THIS_DATE,THIS_SEC值为空。

    Step5,做10046,如下:
    SQL> conn /as sysdba
    Connected.
    SQL> oradebug setospid 11984;
    oradebug unlimit;
    oradebug event 10046 trace name context forever,level 12;
    oradebug tracefile_nameOracle pid: 13, Unix process pid: 11984, image: oracle@IMG_PIDB (CJQ0)
    SQL> Statement processed.
    SQL> Statement processed.
    SQL>
    /app/oracle/admin/neimg/bdump/neimg_cjq0_11984.trc
    SQL> conn neimg
    Enter password:
    Connected.
    SQL> exec DBMS_JOB.NEXT_DATE(292,sysdate+1/24/60);  <===设定为1分钟后执行

    PL/SQL procedure successfully completed.

    SQL> commit;

    Commit complete.

    SQL>

    Unix process pid: 11984, image: oracle@IMG_PIDB (CJQ0)

    *** 2013-05-07 09:12:19.739
    Received ORADEBUG command 'unlimit' from process Unix process pid: 30143, image:
    Received ORADEBUG command 'event 10046 trace name context forever,level 12' from process Unix process pid: 30143, image:
    WAIT #0: nam='rdbms ipc message' ela= 23719 timeout=237 p2=0 p3=0 obj#=-1 tim=1335829238050367
    WAIT #0: nam='rdbms ipc message' ela= 2295845 timeout=235 p2=0 p3=0 obj#=-1 tim=1335829240346359
    WAIT #0: nam='rdbms ipc message' ela= 2930432 timeout=300 p2=0 p3=0 obj#=-1 tim=1335829243276873
    Received ORADEBUG command 'tracefile_name' from process Unix process pid: 30143, image:
    WAIT #0: nam='rdbms ipc message' ela= 1843426 timeout=300 p2=0 p3=0 obj#=-1 tim=1335829245120391
    WAIT #0: nam='rdbms ipc message' ela= 1084937 timeout=111 p2=0 p3=0 obj#=-1 tim=1335829246205459
    *** 2013-05-07 09:12:31.115
    WAIT #0: nam='rdbms ipc message' ela= 2930461 timeout=300 p2=0 p3=0 obj#=-1 tim=1335829249136013
    WAIT #0: nam='rdbms ipc message' ela= 2930386 timeout=300 p2=0 p3=0 obj#=-1 tim=1335829252066575
    WAIT #0: nam='rdbms ipc message' ela= 2930460 timeout=300 p2=0 p3=0 obj#=-1 tim=1335829254997126
    WAIT #0: nam='rdbms ipc message' ela= 2931251 timeout=300 p2=0 p3=0 obj#=-1 tim=1335829257928469

    并作了system dump:

    === ODM Data Collection ===

    PROCESS 13:
     ----------------------------------------
     SO: 0x1f218b2f0, type: 2, owner: (nil), flag: INIT/-/-/0x00
     (process) Oracle pid=13, calls cur/top: 0x1f62141c0/0x1f62141c0, flag: (2) SYSTEM
     int error: 0, call error: 0, sess error: 0, txn error 0
     (post info) last post received: 0 0 33
     last post received-location: ksrpublish
     last process to post me: 1f41b0ab0 224 0
     last post sent: 0 0 48
     last post sent-location: ksoreq_reply
     last process posted by me: 1fe1dc698 1 0
     (latch info) wait_event=0 bits=0
     Process Group: DEFAULT, pseudo proc: 0x1f31ae220
     O/S info: user: oracle, term: UNKNOWN, ospid: 11984
     OSD pid info: Unix process pid: 11984, image: oracle@IMG_PIDB (CJQ0)
     Short stack dump: ksdxfstk()+32<-ksdxcb()+1547<-sspuser()+90<-<0x3d0180eb10>

     ----------------------------------------
     SO: 0x1f220f3b8, type: 4, owner: 0x1f218b2f0, flag: INIT/-/-/0x00
     (session) sid: 544 trans: (nil), creator: 0x1f218b2f0, flag: (51) USR/- BSY/-/-/-/-/-
     DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000
     txn branch: (nil)
     oct: 0, prv: 0, sql: (nil), psql: (nil), user: 0/SYS
     waiting for 'rdbms ipc message' blocking sess=0x(nil) seq=47 wait_time=0 seconds since wait started=0
     timeout=39, =0, =0
     Dumping Session Wait History
     for 'rdbms ipc message' count=1 wait_time=2373861
     timeout=12c, =0, =0
     for 'rdbms ipc message' count=1 wait_time=2930517
     timeout=12c, =0, =0
     for 'rdbms ipc message' count=1 wait_time=2930558
     timeout=12c, =0, =0
     for 'rdbms ipc message' count=1 wait_time=2930496
     timeout=12c, =0, =0
     for 'rdbms ipc message' count=1 wait_time=2930497
     timeout=12c, =0, =0
     for 'rdbms ipc message' count=1 wait_time=2930666
     timeout=12c, =0, =0
     for 'rdbms ipc message' count=1 wait_time=2930559
     timeout=12c, =0, =0
     for 'rdbms ipc message' count=1 wait_time=2931411
     timeout=12c, =0, =0
     for 'rdbms ipc message' count=1 wait_time=2930440
     timeout=12c, =0, =0
     for 'rdbms ipc message' count=1 wait_time=2930679
     timeout=12c, =0, =0
     temporary object counter: 0

    还是没有一丝线索。
    Step6,查看Metalink上是否有相关的BUG或者案例(搜索job start)。
    Note = Jobs Not Executing Automatically [ID 313102.1]
    ==>
    11) Check for unsuccessful shutdowns:

    A shutdown immediate may get canceled because active sessions prevent the database close operation.

    Please review the alert log for the last two shutdown / startups and the messages:
    SHUTDOWN: Active sessions prevent database close operation

    Please refer to
    Note 434690.1 - Database Jobs Do Not Run After a Failed 'Shutdown Immediate'

    Database Jobs Do Not Run After a Failed 'Shutdown Immediate' [ID 434690.1]
    ==>
    Symptoms

    After issuing a 'shutdown immediate' command, if the command does not execute after an hour, the Oracle Server automatically cancels the operation. A message will appear in the alert log as follows:

    SHUTDOWN: Active sessions prevent database close operation

    Unfortunately this has the added effect of shutting down all the job queue slaves as part of the shutdown process. Any jobs listed in the DBA_JOBS view will not run. This can affect a wide variety of functions, such as statistics gathering, replication propagation, materialized view refresh, etc.
    Cause

    This is expected behavior. Once the 'shutdown' command is issued, the shutdown must proceed; the shutdown process cannot be undone. This is because the instance is not expected to live after a shutdown is canceled.
    那么是否可能是这两篇文档中提到的情况呢?客户恰恰是在3月25日对DB进行了重启,于是仔细看了那天shutdown和open的日志



    Mon Mar 25 13:11:48 2013
    Shutting down instance (immediate)
    License high water mark = 51
    Mon Mar 25 13:11:48 2013
    Stopping Job queue slave processes
    Mon Mar 25 13:11:53 2013
    Process OS id : 14300 alive after kill
    Errors in file
    Mon Mar 25 13:11:58 2013
    Job queue slave processes stopped
    All dispatchers and shared servers shutdown
    Mon Mar 25 13:12:09 2013
    PMON failed to acquire latch, see PMON dump
    PMON failed to acquire latch, see PMON dump
    PMON failed to acquire latch, see PMON dump
    PMON failed to acquire latch, see PMON dump
    PMON failed to acquire latch, see PMON dump
    ...
    Mon Mar 25 13:13:43 2013
    Completed: ALTER DATABASE CLOSE NORMAL
    Mon Mar 25 13:13:43 2013
    ALTER DATABASE DISMOUNT
    Completed: ALTER DATABASE DISMOUNT
    ARCH: Archival disabled due to shutdown: 1089
    Shutting down archive processes
    Archiving is disabled
    Archive process shutdown avoided: 0 active
    ARCH: Archival disabled due to shutdown: 1089
    Shutting down archive processes
    Archiving is disabled
    Archive process shutdown avoided: 0 active       <===shutdown过程没有问题
    Mon Mar 25 13:40:19 2013
    Starting ORACLE instance (normal)
    ...
    Mon Mar 25 17:27:03 2013
    Shutting down instance (abort)
    License high water mark = 55
    Instance terminated by USER, pid = 11609
    Mon Mar 25 17:27:22 2013
    Starting ORACLE instance (normal)
    ...
    Mon Mar 25 17:27:33 2013
    Errors in file /app/oracle/admin/neimg/udump/neimg_ora_11994.trc:
    ORA-00604: error occurred at recursive SQL level 2
    ORA-01013: user requested cancel of current operation ==> 启动过程中有客户有执行CANCEL的操作,并且有ora-7445报出
    Mon Mar 25 17:27:33 2013
    Errors in file /app/oracle/admin/neimg/udump/neimg_ora_11994.trc:
    ORA-07445: exception encountered: core dump [kkttrex()+1365] [SIGSEGV] [Address not mapped to object] [0x000000074] [] []
    ...
    Mon Mar 25 17:27:51 2013
    alter database open
    Mon Mar 25 17:27:51 2013
    ORA-1531 signalled during: alter database open... <===正常启动成功后会有'Completed: ALTER DATABASE OPEN'的信息,但这里没有
    Mon Mar 25 17:33:30 2013
    Shutting down archive processes
    Mon Mar 25 17:33:35 2013
    ARCH shutting down
    ARC2: Archival stopped
    所有极有可能是在DB启动中途的那次客户手动CANCEL引起的,建议客户尝试重启DB,看能否解决这个问题。
    shutdown abort
    startup restrict
    shutdown immediate
    startup
    重启DB之后问题解决。
    总结:看来DB在做SHUTDOWN或者OPEN中途不能随意CTRL+C执行CANCEL,否则会导致DB状态异常,影响系统的一些功能,出现各种各样诡异的问题。

  • 相关阅读:
    如何使用Python创建可视化对象
    Power Query中如何解析累积总计
    如何快速复制度量值?
    如何使用文本字段进行条件格式设置
    Power BI 3-4月功能更新培训合集
    2019微软Power BI 每月功能更新系列——Power BI 4月版本功能完整解读
    送你一份堆积柱形图小点心,请收下~
    命令行编译带外部包依赖的java源文件 [以JDBC MySQL8为例]
    JavaScript (JS)常用方法
    JDK8过渡到JDK11
  • 原文地址:https://www.cnblogs.com/fuhaots2009/p/3508690.html
Copyright © 2011-2022 走看看