SMON的作用还包括清理死事务:Recover Dead transaction。当服务进程在提交事务(commit)前就意外终止的话会形成死事务(dead transaction),PMON进程负责轮询Oracle进程,找出这类意外终止的死进程(dead process),通知SMON将与该dead process相关的dead transaction回滚清理,并且PMON还负责恢复dead process原本持有的锁和latch。
我们来具体了解dead transaction的恢复过程:
SQL> select * from v$version;
BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bi
PL/SQL Release 10.2.0.4.0 - Production
CORE 10.2.0.4.0 Production
TNS for Linux: Version 10.2.0.4.0 - Production
NLSRTL Version 10.2.0.4.0 - Production
SQL> select * from global_name;
GLOBAL_NAME
--------------------------------------------------------------------------------
www.oracledatabase12g.com
SQL>alter system set fast_start_parallel_rollback=false;
System altered.
设置10500,10046事件以跟踪SMON进程的行为
SQL> alter system set events '10500 trace name context forever,level 8';
System altered.
SQL> oradebug setospid 4424
Oracle pid: 8, Unix process pid: 4424, image: oracle@rh2.oracle.com (SMON)
SQL> oradebug event 10046 trace name context forever,level 8;
Statement processed.
在一个新的terminal中执行大批量的删除语句,在执行一段时间后使用操作系统命令将执行该删除操作的
服务进程kill掉,模拟一个大的dead transaction的场景
SQL> delete large_rb;
delete large_rb
[oracle@rh2 bdump]$ kill -9 4535
等待几秒后pmon进程会找出dead process:
[claim lock for dead process][lp 0x7000003c70ceff0][p 0x7000003ca63dad8.1290666][hist x9a514951]
在x$ktube内部视图中出现ktuxecfl(Transaction flags)标记为DEAD的记录:
SQL> select sum(distinct(ktuxesiz)) from x$ktuxe where ktuxecfl = 'DEAD';
SUM(DISTINCT(KTUXESIZ))
-----------------------
29386
SQL> /
SUM(DISTINCT(KTUXESIZ))
-----------------------
28816
以上KTUXESIZ代表事务所使用的undo块总数(number of undo blocks used by the transaction)
==================smon trace content==================
SMON: system monitor process posted
WAIT #0: nam='log file switch completion' ela= 0 p1=0 p2=0 p3=0 obj#=1 tim=1278243332801935
WAIT #0: nam='log file switch completion' ela= 0 p1=0 p2=0 p3=0 obj#=1 tim=1278243332815568
WAIT #0: nam='latch: row cache objects' ela= 95 address=2979418792 number=200 tries=1 obj#=1 tim=1278243333332734
WAIT #0: nam='latch: row cache objects' ela= 83 address=2979418792 number=200 tries=1 obj#=1 tim=1278243333356173
WAIT #0: nam='latch: undo global data' ela= 104 address=3066991984 number=187 tries=1 obj#=1 tim=1278243347987705
WAIT #0: nam='latch: object queue header operation' ela= 89 address=3094817048 number=131 tries=0 obj#=1 tim=1278243362468042
WAIT #0: nam='log file switch (checkpoint incomplete)' ela= 0 p1=0 p2=0 p3=0 obj#=1 tim=1278243419588202
Dead transaction 0x00c2.008.0000006d recovered by SMON
=====================
PARSING IN CURSOR #3 len=358 dep=1 uid=0 oct=3 lid=0 tim=1278243423594568 hv=3186851936 ad='ae82c1b8'
select smontabv.cnt,
smontab.time_mp,
smontab.scn,
smontab.num_mappings,
smontab.tim_scn_map,
smontab.orig_thread
from smon_scn_time smontab,
(select max(scn) scnmax,
count(*) + sum(NVL2(TIM_SCN_MAP, NUM_MAPPINGS, 0)) cnt
from smon_scn_time
where thread = 0) smontabv
where smontab.scn = smontabv.scnmax
and thread = 0
END OF STMT
PARSE #3:c=0,e=1354526,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,tim=1278243423594556
EXEC #3:c=0,e=106,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,tim=1278243423603269
FETCH #3:c=0,e=47065,p=0,cr=319,cu=0,mis=0,r=1,dep=1,og=4,tim=1278243423650375
*** 2011-06-24 21:19:25.899
WAIT #0: nam='smon timer' ela= 299999999 sleep time=300 failed=0 p3=0 obj#=1 tim=1278243716699171
kglScanDependencyHandles4Unpin():
cumscan=3 cumupin=4 time=776 upinned=0
以上SMON回滚清理Dead transaction的过程从"system monitor process posted"开始到"Dead transaction 0x00c2.008.0000006d recovered by SMON"结束。另外可以看到在恢复过程中SMON先后请求了'latch: row cache objects'、'latch: undo global data'、'latch: object queue header operation'三种不同类型的latch。
现象
fast_start_parallel_rollback参数决定了SMON在回滚事务时使用的并行度,若将该参数设置为false那么并行回滚将被禁用,若设置为Low(默认值)那么会以2*CPU_COUNT数目的并行度回滚,当设置为High则4*CPU_COUNT数目的回滚进程将参与进来。当我们通过以下查询发现系统中存在大的dead tranacation需要回滚时我们可以通过设置fast_start_parallel_rollback为HIGH来加速恢复:
select sum(distinct(ktuxesiz)) from x$ktuxe where ktuxecfl = 'DEAD';
==============parallel transaction recovery===============
*** 2011-06-24 20:31:01.765
SMON: system monitor process posted msgflag:0x0000 (-/-/-/-/-/-/-)
*** 2011-06-24 20:31:01.765
SMON: process sort segment requests begin
*** 2011-06-24 20:31:01.765
SMON: process sort segment requests end
*** 2011-06-24 20:31:01.765
SMON: parallel transaction recovery begin
WAIT #0: nam='DFS lock handle' ela= 504 type|mode=1413545989 id1=3 id2=11 obj#=2 tim=1308918661765715
WAIT #0: nam='DFS lock handle' ela= 346 type|mode=1413545989 id1=3 id2=12 obj#=2 tim=1308918661766135
WAIT #0: nam='DFS lock handle' ela= 565 type|mode=1413545989 id1=3 id2=13 obj#=2 tim=1308918661766758
WAIT #0: nam='DFS lock handle' ela= 409 type|mode=1413545989 id1=3 id2=14 obj#=2 tim=1308918661767221
WAIT #0: nam='DFS lock handle' ela= 332 type|mode=1413545989 id1=3 id2=15 obj#=2 tim=1308918661767746
WAIT #0: nam='DFS lock handle' ela= 316 type|mode=1413545989 id1=3 id2=16 obj#=2 tim=1308918661768146
WAIT #0: nam='DFS lock handle' ela= 349 type|mode=1413545989 id1=3 id2=17 obj#=2 tim=1308918661768549
WAIT #0: nam='DFS lock handle' ela= 258 type|mode=1413545989 id1=3 id2=18 obj#=2 tim=1308918661768858
WAIT #0: nam='DFS lock handle' ela= 310 type|mode=1413545989 id1=3 id2=19 obj#=2 tim=1308918661769224
WAIT #0: nam='DFS lock handle' ela= 281 type|mode=1413545989 id1=3 id2=20 obj#=2 tim=1308918661769555
*** 2011-06-24 20:31:01.769
SMON: parallel transaction recovery end
但是在real world的实践中可以发现当fast_start_parallel_rollback= Low/High,即启用并行回滚时常有并行进程因为各种资源互相阻塞导致回滚工作停滞的例子,当遭遇到这种问题时将fast_start_parallel_rollback设置为FALSE一般可以保证恢复工作以串行形式在较长时间内完成。
如何禁止SMON Recover Dead transaction
可以设置10513事件来临时禁止SMON恢复死事务,这在我们做某些异常恢复的时候显得异常有效,当然不建议在一个正常的生产环境中设置这个事件:
SQL> alter system set events '10513 trace name context forever, level 2';
System altered.
10531 -- event disables transaction recovery which was initiated by SMON
SQL> select ktuxeusn,
2 to_char(sysdate, 'DD-MON-YYYY HH24:MI:SS') "Time",
3 ktuxesiz,
4 ktuxesta
5 from x$ktuxe
6 where ktuxecfl = 'DEAD';
KTUXEUSN Time KTUXESIZ KTUXESTA
---------- -------------------------- ---------- ----------------
17 24-JUN-2011 22:03:10 0 INACTIVE
66 24-JUN-2011 22:03:10 0 INACTIVE
105 24-JUN-2011 22:03:10 0 INACTIVE
193 24-JUN-2011 22:03:10 33361 ACTIVE
194 24-JUN-2011 22:03:10 0 INACTIVE
194 24-JUN-2011 22:03:10 0 INACTIVE
197 24-JUN-2011 22:03:10 20171 ACTIVE
7 rows selected.
SQL> /
KTUXEUSN Time KTUXESIZ KTUXESTA
---------- -------------------------- ---------- ----------------
17 24-JUN-2011 22:03:10 0 INACTIVE
66 24-JUN-2011 22:03:10 0 INACTIVE
105 24-JUN-2011 22:03:10 0 INACTIVE
193 24-JUN-2011 22:03:10 33361 ACTIVE
194 24-JUN-2011 22:03:10 0 INACTIVE
194 24-JUN-2011 22:03:10 0 INACTIVE
197 24-JUN-2011 22:03:10 20171 ACTIVE
7 rows selected.
================smon disabled trans recover trace==================
SMON: system monitor process posted
*** 2011-06-24 22:02:57.980
SMON: Event 10513 is level 2, trans recovery disabled.