zoukankan      html  css  js  c++  java
  • 超融合硬件损坏导致Oracle RAC异常恢复实录

    墨墨导读:一套Oracle RAC环境运行在HW超融合环境中,由于硬件问题导致数据库crash,期间出现了不少数据坏块,本文详述整个恢复过程,希望对大家有帮助。

    前几天某客户遇到一个棘手问题:其一套Oracle RAC环境运行在HW超融合环境中,由于硬件问题导致数据库crash,期间出现了不少数据坏块,不过还好客户有RMAN物理备份,因此客户提前进行了全库Restore。

    首先我们来看下相关日志信息:

    Trying mirror side DATADG_0002.
    Reread of blocknum=145005, file=+DATADG/xxxxx/datafile/supervise. found same corrupt data
    Reread of blocknum=145005, file=+DATADG/xxxxx/datafile/supervise. found valid data
    Sun Apr 26 01:22:35 2020
    Hex dump of (file 11, block 230454) in trace file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_ora_13254.trc
    Corrupt block relative dba: 0x02c38436 (file 11, block 230454)
    Bad header found during backing up datafile
    Data in bad block:
    type: 6 format: 2 rdba: 0x0082f736
    last change scn: 0x0000.4b5a95d7 seq: 0x1 flg: 0x06
    spare1: 0x0 spare2: 0x0 spare3: 0x0
    consistency value in tail: 0x95d70601
    check value in block header: 0xa3d1
    computed block checksum: 0x0
    Trying mirror side DATADG_0002.
    Reread of blocknum=230454, file=+DATADG/xxxxx/datafile/supdb_etyy.dbf. found same corrupt data
    Reread of blocknum=230454, file=+DATADG/xxxxx/datafile/supdb_etyy.dbf. found valid data
    Hex dump of (file 11, block 230458) in trace file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_ora_13254.trc
    Corrupt block relative dba: 0x02c3843a (file 11, block 230458)
    

    上述类似的大量坏块信息,最终导致数据库挂掉。

    我们可以看到客户这里asm diskgroup为normal冗余;当primary extent数据有问题时,Oracle会尝试从mirror extent去获取;如果mirror extent是正常的,那么Oracle会自动进行修复,否则会导致数据丢失,严重的话会导致数据库宕机。

    Mon Apr 27 02:01:15 2020
    Read of datafile '+DATADG/orclsk/datafile/sysaux.260.956077231' (fno 2) header failed with ORA-01207
    Rereading datafile 2 header from mirror side 'DATADG_0000' failed with ORA-01207
    Rereading datafile 2 header from mirror side 'DATADG_0006' failed with ORA-01207
    Errors in file /u01/app/oracle/diag/rdbms/orclsk/orclsk1/trace/orclsk1_ckpt_4893.trc:
    ORA-63999: data file suffered media failure
    ORA-01122: database file 2 failed verification check
    ORA-01110: data file 2: '+DATADG/orclsk/datafile/sysaux.260.956077231'
    ORA-01207: file is more recent than control file - old control file
    Errors in file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_ckpt_4893.trc:
    ORA-63999: data file suffered media failure
    ORA-01122: database file 2 failed verification check
    ORA-01110: data file 2: '+DATADG/xxxxx/datafile/sysaux.260.956077231'
    ORA-01207: file is more recent than control file - old control file
    CKPT (ospid: 4893): terminating the instance due to error 63999
    

    因为部分文件需要介质恢复(因为primary和mirror 数据都异常),而数据库强行终止了实例。

    数据库实例重启后则开始出现大量错误:

    Mon Apr 27 02:02:03 2020
    Read of datafile '+DATADG/xxxxx/datafile/mscsz.dbf' (fno 40) header failed with ORA-01207
    Rereading datafile 40 header from mirror side 'DATADG_0000' failed with ORA-01207
    Rereading datafile 40 header from mirror side 'DATADG_0005' failed with ORA-01207
    Errors in file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_dbw0_115709.trc:
    ORA-01186: file 40 failed verification tests
    ORA-01122: database file 40 failed verification check
    ORA-01110: data file 40: '+DATADG/xxxxx/datafile/mscsz.dbf'
    ORA-01207: file is more recent than control file - old control file
    File 40 not verified due to error ORA-01122
    .......
    Mon Apr 27 02:02:07 2020
    QMNC started with pid=46, OS id=115887 
    Errors in file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_ora_115771.trc  (incident=299881):
    ORA-01578: ORACLE data block corrupted (file # 1, block # 3620)
    ORA-01110: data file 1: '+DATADG/xxxxx/datafile/system.259.956077225'
    Incident details in: /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/incident/incdir_299881/xxxxx1_ora_115771_i299881.trc
    Mon Apr 27 02:02:16 2020
    Dumping diagnostic data in directory=[cdmp_20200427020216], requested by (instance=1, osid=115771), summary=[incident=299881].
    Mon Apr 27 02:02:17 2020
    Sweep [inc][299881]: completed
    Mon Apr 27 02:02:18 2020
    Completed: ALTER DATABASE OPEN /* db agent *//* {0:13:1899} */
    Mon Apr 27 02:02:18 2020
    ORA-21700 encountered when generating server alert SMG-4120
    ORA-21700 encountered when generating server alert SMG-4121
    ORA-21700 encountered when generating server alert SMG-4121
    Mon Apr 27 02:02:20 2020
    Starting background process CJQ0
    Mon Apr 27 02:02:20 2020
    CJQ0 started with pid=54, OS id=115990 
    Mon Apr 27 02:02:20 2020
    Errors in file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_m004_115984.trc  (incident=300025):
    ORA-01578: ORACLE data block corrupted (file # 2, block # 101561)
    ORA-01110: data file 2: '+DATADG/xxxxx/datafile/sysaux.260.956077231'
    Incident details in: /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/incident/incdir_300025/xxxxx1_m004_115984_i300025.trc
    Dumping diagnostic data in directory=[cdmp_20200427020222], requested by (instance=1, osid=115984 (M004)), summary=[incident=300025].
    Mon Apr 27 02:02:22 2020
    Sweep [inc][300025]: completed
    Errors in file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_m004_115984.trc  (incident=300026):
    ORA-01578: ORACLE data block corrupted (file # , block # )
    Incident details in: /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/incident/incdir_300026/xxxxx1_m004_115984_i300026.trc
    Errors in file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_m004_115984.trc  (incident=300027):
    ORA-01578: ORACLE data block corrupted (file # , block # )
    ORA-06512: at "SYS.PRVT_ADVISOR", line 6144
    ORA-06512: at line 1
    Incident details in: /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/incident/incdir_300027/xxxxx1_m004_115984_i300027.trc
    Dumping diagnostic data in directory=[cdmp_20200427020224], requested by (instance=1, osid=115984 (M004)), summary=[incident=300026].
    Dumping diagnostic data in directory=[cdmp_20200427020224], requested by (instance=1, osid=115984 (M004)), summary=[incident=300027].
    Mon Apr 27 02:02:46 2020
    Hex dump of (file 1, block 24895) in trace file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_ora_116134.trc
    Corrupt block relative dba: 0x0040613f (file 1, block 24895)
    Bad header found during buffer read
    Data in bad block:
    type: 6 format: 2 rdba: 0x00405c3f
    last change scn: 0x0000.0002a412 seq: 0x1 flg: 0x06
    spare1: 0x0 spare2: 0x0 spare3: 0x0
    consistency value in tail: 0xa4120601
    check value in block header: 0xa486
    computed block checksum: 0x0
    Reading datafile '+DATADG/xxxxx/datafile/system.259.956077225' for corruption at rdba: 0x0040613f (file 1, block 24895)
    Read datafile mirror 'DATADG_0001' (file 1, block 24895) found same corrupt data (no logical check)
    Read datafile mirror 'DATADG_0003' (file 1, block 24895) found valid data
    Hex dump of (file 1, block 24895) in trace file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_ora_116134.trc
    Repaired corruption at (file 1, block 24895)
    

    不难看出数据库控制文件和system都出现了异常,这也难怪数据库无法正常打开。

    期间还出现了ora-00600 [4193]等常见错误:

    Mon Apr 27 02:55:26 2020
    QMNC started with pid=44, OS id=3072 
    Mon Apr 27 02:55:26 2020
    Errors in file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_mmon_2939.trc  (incident=323867):
    ORA-00600: internal error code, arguments: [4193], [], [], [], [], [], [], [], [], [], [], []
    Incident details in: /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/incident/incdir_323867/xxxxx1_mmon_2939_i323867.trc
    Use ADRCI or Support Workbench to package the incident.
    See Note 411.1 at My Oracle Support for error and packaging details.
    Completed: ALTER DATABASE OPEN /* db agent *//* {0:17:166} */
    ARC3: Archival started
    ARC0: STARTING ARCH PROCESSES COMPLETE
    Mon Apr 27 02:55:29 2020
    Dumping diagnostic data in directory=[cdmp_20200427025529], requested by (instance=1, osid=2939 (MMON)), summary=[incident=323867].
    Block recovery from logseq 17833, block 46 to scn 1551059886
    Recovery of Online Redo Log: Thread 1 Group 1 Seq 17833 Reading mem 0
    Mem# 0: +DATADG/xxxxx/onlinelog/group_1.257.956077213
    Mem# 1: +ARCHDG/xxxxx/onlinelog/group_1.257.956077217
    Block recovery stopped at EOT rba 17833.54.16
    Block recovery completed at rba 17833.54.16, scn 0.1551059805
    Block recovery from logseq 17833, block 46 to scn 1551059575
    Recovery of Online Redo Log: Thread 1 Group 1 Seq 17833 Reading mem 0
    Mem# 0: +DATADG/xxxxx/onlinelog/group_1.257.956077213
    Mem# 1: +ARCHDG/xxxxx/onlinelog/group_1.257.956077217
    Block recovery completed at rba 17833.46.16, scn 0.1551059664
    Errors in file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_mmon_2939.trc  (incident=323868):
    ORA-00600: internal error code, arguments: [4193], [], [], [], [], [], [], [], [], [], [], []
    Incident details in: /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/incident/incdir_323868/xxxxx1_mmon_2939_i323868.trc
    Use ADRCI or Support Workbench to package the incident.
    See Note 411.1 at My Oracle Support for error and packaging details.
    Dumping diagnostic data in directory=[cdmp_20200427025534], requested by (instance=1, osid=2939 (MMON)), summary=[incident=323868].
    Block recovery from logseq 17833, block 46 to scn 1551059886
    Recovery of Online Redo Log: Thread 1 Group 1 Seq 17833 Reading mem 0
    Mem# 0: +DATADG/xxxxx/onlinelog/group_1.257.956077213
    Mem# 1: +ARCHDG/xxxxx/onlinelog/group_1.257.956077217
    Block recovery completed at rba 17833.54.16, scn 0.1551059888
    Block recovery from logseq 17833, block 46 to scn 1551059888
    Recovery of Online Redo Log: Thread 1 Group 1 Seq 17833 Reading mem 0
    Mem# 0: +DATADG/xxxxx/onlinelog/group_1.257.956077213
    Mem# 1: +ARCHDG/xxxxx/onlinelog/group_1.257.956077217
    Block recovery completed at rba 17833.56.16, scn 0.1551059889
    Mon Apr 27 02:55:36 2020
    。。。。。。
    

    对于此次恢复case总体来讲比较简单,这里提供一下处理思路:

    1. 首先利用客户的归档和Redo(Redo log客户已copy到了本地进行备份),进行正常的recover database using backup controlfile until cancel操作;

    2. 由于恢复过程中出现了ora-00600 [3020]错误,因此需要通过recover database allow xx corruption的方式进行;

    3. 完成恢复之后尝试打开数据库;

    4. 打开数据库时仍然提示ora-01113和ora-01110错误,即system文件还需要进行恢复;这种情况下只能先强制拉库;通过加入*._allow_resetlogs_corruption=TRUE *._allow_error_simulation=true 等隐含参数即可;

    5. 若上述参数后仍然提示undo 存在坏块,由于该数据库版本在报错时不会直接提示具体是哪个回滚段有问题,在alter database open resetlogs之前通过10046 event定位error cursor,找到回滚段名称即可;

    6. 使用_corrupted_rollback_segments参数屏蔽回滚段;

    7. open resetlogs数据库成功;

    8. 最后就是重建undo以及处理相关坏块等善后工作。

     

    恢复完成之后,由于客户担心HW超融合环境再次出现故障因此进行了全库备份并进行数据迁移到新平台,到这里这个case告一段落。

     

    再次叮嘱大家,注意数据库备份、注意数据容灾环境建设!

    推荐阅读:144页!分享珍藏已久的数据库技术年刊

    数据和云

    ID:OraNews

    如有收获,请划至底部,点击“在看”,谢谢!

    点击下图查看更多 ↓

    云和恩墨大讲堂 | 一个分享交流的地方

    长按,识别二维码,加入万人交流社群

    请备注:云和恩墨大讲堂

      点个“在看”

    你的喜欢会被看到❤

  • 相关阅读:
    Restful
    RESTful
    UUID使用
    Security注解:@PreAuthorize,@PostAuthorize, @Secured, EL实现方法安全
    java Spring定时器 每个季度执行一次
    给文字加小手样式
    java 根据年月获取当前年月的最后一天,获取当前年月有多少天
    mui Picker DtPicker 固定内容以及变动内容Picker 不同情况下的应用 提高效率
    java封装返回结果应用 利用反射机制 返回结果以及错误信息
    java 封装返回结果实体类 返回结果以及错误信息
  • 原文地址:https://www.cnblogs.com/hzcya1995/p/13311680.html
Copyright © 2011-2022 走看看