zoukankan      html  css  js  c++  java
  • 质疑页修复实践

     
     

    背景

    最近我们在例行检查一台新迁不久的数据库服务器的JOB执行失败,查看历史记录,发现错误提示如下:
    Date        2019-01-06 0:25:00
    Log        Job History (新增门店更新数据)
    
    Step ID        1
    Server        ********
    Job Name        新增门店更新数据
    Step Name        更新信息
    Duration        00:00:47
    Sql Severity    24
    Sql Message ID    824
    Operator Emailed    
    Operator Net sent    
    Operator Paged    
    Retries Attempted    0
    
    Message
    Executed as user: NT AUTHORITYSYSTEM. SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0xca6a4ba5; actual: 0xca6a4b25). It occurred during a read of page (5:3768184) in database ID 5 at offset 0x0000072fef0000 in file 'F:data********_3.mdf'.  Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online. [SQLSTATE HY000] (Error 824).  The step failed.
     

    操作环境

    Microsoft SQL Server 2012 (SP4) (KB4018073) - 11.0.7001.0 (X64) 
     Aug 15 2017 10:23:29 
     Copyright (c) Microsoft Corporation
     Web Edition (64-bit) on Windows NT 6.3 <X64> (Build 9600: ) (Hypervisor)
     

    操作步骤

    通过上述的错误提示,初步判断是页质疑了,查看对应的页质疑的记录
    SELECT * FROM [msdb].[dbo].[suspect_pages];
    GO
     
    在问题数据库的查询窗口下,执行如下代码
     
    dbcc checkdb

    主要的错误信息如下:

    Msg 8939, Level 16, State 98, Line 1
    Table error: Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data), page (5:3768184). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 133129 and -4.
    Msg 8928, Level 16, State 1, Line 1
    Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data): Page (5:3768184) could not be processed.  See other errors for details.
    Msg 8976, Level 16, State 1, Line 1
    Table error: Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data). Page (5:3768184) was not seen in the scan although its parent (4:3692451) and previous (1:6408870) refer to it. Check any previous errors.
    Msg 8978, Level 16, State 1, Line 1
    Table error: Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data). Page (5:3768185) is missing a reference from previous page (5:3768184). Possible chain linkage problem.

    通过object id 2131746897,可以先查询一下是哪个表

    SELECT OBJECT_NAME(2131746897)
     
    如果是较大型的业务库,几百G以上的库,直接dbcc checkdb 耗时是非常久的,可以根据之前的job错误提示,直接执行如下:
    可以快速定位到哪个表,什么索引;
    本想尝试使用restore page ,可惜因维护策略已自动清理出现质疑前的备份,该方法只能放弃;
    尝试把全部记录插入新表也是失败
    Msg 824, Level 24, State 2, Line 1
    SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0xca6a4ba5; actual: 0xca6a4b25). It occurred during a read of page (5:3768184) in database ID 5 at offset 0x0000072fef0000 in file 'F:data******_3.mdf'.  Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

    为了验证能否修复,我们把生产环境的备份拷贝到测试环境并恢复。

    a.先验证恢复的库问题是否真实存在;
    dbcc checktable('dbo.sfa_t_tmonvivid')
    Msg 8939, Level 16, State 98, Line 1
    Table error: Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data), page (5:3768184). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 133129 and -4.
    Msg 8928, Level 16, State 1, Line 1
    Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data): Page (5:3768184) could not be processed.  See other errors for details.
    Msg 8976, Level 16, State 1, Line 1
    Table error: Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data). Page (5:3768184) was not seen in the scan although its parent (4:3692451) and previous (1:6408870) refer to it. Check any previous errors.
    Msg 8978, Level 16, State 1, Line 1
    Table error: Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data). Page (5:3768185) is missing a reference from previous page (5:3768184). Possible chain linkage problem.
    DBCC results for 'sfa_t_TMonVivid'.
    There are 7548683 rows in 629541 pages for object "sfa_t_TMonVivid".
    CHECKTABLE found 0 allocation errors and 4 consistency errors in table 'sfa_t_TMonVivid' (object ID 2131746897).
    repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKTABLE (*********.dbo.sfa_t_TMonVivid).
    DBCC execution completed. If DBCC printed error messages, contact your system administrator.
     
         暂时先不使用repair_allow_data_loss 修复,看错误提示是index id 1 试一下直接删除索引、再新建或重建;
    b.把该库置换成单用户;
         这个步骤方便维护后续进行丢失数据修复
    c.删除问题表上已有的索引或重建;
        在已打开的单用户模式下删除已有的索引
      重建索引
    都是报一样的错误,看来新建重建索引都不行。下图是未修改单用户时查看问题表的记录量的截图,单用户模式下可以使用 sp_spaceused [问题表名] 来查看记录数;
     
    d.使用丢失数据修复;
    dbcc checktable ('dbo.sfa_t_tmonvivid',repair_allow_data_loss )
    DBCC results for 'sfa_t_TMonVivid'.
    Repair: The Clustered index successfully rebuilt for the object "dbo.sfa_t_TMonVivid" in database "xw_dl_1008307_error".
    Repair: The page (5:3768184) has been deallocated from object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data).
    Msg 8945, Level 16, State 1, Line 1
    Table error: Object ID 2131746897, index ID 1 will be rebuilt.
            The error has been repaired.
    Msg 8928, Level 16, State 1, Line 1
    Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data): Page (5:3768184) could not be processed.  See other errors for details.
            The error has been repaired.
    Msg 8939, Level 16, State 98, Line 1
    Table error: Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data), page (5:3768184). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 2057 and -4.
            The error has been repaired.
    Msg 8976, Level 16, State 1, Line 1
    Table error: Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data). Page (5:3768184) was not seen in the scan although its parent (4:3692451) and previous (1:6408870) refer to it. Check any previous errors.
            The error has been repaired.
    Msg 8978, Level 16, State 1, Line 1
    Table error: Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data). Page (5:3768185) is missing a reference from previous page (5:3768184). Possible chain linkage problem.
            The error has been repaired.
    There are 7548683 rows in 629541 pages for object "sfa_t_TMonVivid".
    CHECKTABLE found 0 allocation errors and 4 consistency errors in table 'sfa_t_TMonVivid' (object ID 2131746897).
    CHECKTABLE fixed 0 allocation errors and 4 consistency errors in table 'sfa_t_TMonVivid' (object ID 2131746897).
    DBCC execution completed. If DBCC printed error messages, contact your system administrator.
     
    e.验证问题表问题是否已修复;
    f.结论
    质疑页的问题是解决了,但丢失了7548695-7548683=12条记录;
    好消息当时迁移完之后,原来旧的服务器还未释放资源即原来的环境还在且页未出现质疑,通过恢复问题数据库至旧的服务器上与原来的库对较一下,找出丢失的12条记录。该表sfa_t_tmonvivid 有一个自增主键字段VividID,只要比对旧服务器对应的表的最大VividID代码如下:
    use [迁移前的库名]
    select a.* into wen from sfa_t_tmonvivid a  left join [迁移后的生产备份库_error].dbo.sfa_t_tmonvivid b
    on a.VividID=b.VividID
    where b.VividID is null and a.VividID<=7531837
     
    再把表wen 的记录导成insert values 脚本,在问题库上执行
     
    USE [xw_dl_****]
    GO
    set identity_insert  [sfa_t_tmonvivid] on ;
    --INSERT [dbo].[sfa_t_tmonvivid] ([VividID], [StoreID], [UserNumber]
    --插入实际丢失的数据
    set identity_insert  [sfa_t_tmonvivid] off ;
     
    g.重新验证问题表,数据量一致且无错误提示
    看来平时备份并验证有效性是非常有必要,关键时刻能起大作用;

    参考

     
     
  • 相关阅读:
    psacct监视用户执行的命令,如cpu时间和内存战胜,实时进程记账
    iostat,mpstat,sar即时查看工具,sar累计查看工具
    sysstat服务负载统计,如CPU占有率,网络使用率,磁盘速度
    linux top命令VIRT,RES,SHR,DATA的含义
    lsof查看进程打开了哪些文件目录套接字
    linux系统监控常用工具
    Linux系统维护修复模式
    centos单用户模式:修改ROOT密码和grub加密
    Linux系统启动过程介绍
    Android原生APP内分享
  • 原文地址:https://www.cnblogs.com/jil-wen/p/10405054.html
Copyright © 2011-2022 走看看