zoukankan      html  css  js  c++  java
  • 数据库服务异常停止分析

    sqlserver数据库服务异常停止分析

    背景

    2019年11月13日早上9:37分钟,我们的运维监控系统告警提示某项目数据库服务1433端口无法访问,我们赶紧登陆服务器查看详情,发现数据库服务以及数据库agent服务都已停止,重新手工启动对应的服务恢复生产。接下来我们就需赶紧分析引起这个问题的原因以及如何去修复;
    服务器环境情况如下:
    Microsoft SQL Server 2017 (RTM-CU14) (KB4484710) - 14.0.3076.1 (X64) Mar 12 2019 19:29:19 Copyright (C) 2017 Microsoft Corporation Enterprise Edition: Core-based Licensing (64-bit) on Windows Server 2012 R2 Datacenter 6.3 <X64> (Build 9600: ) (Hypervisor) 
    硬件配置:8CPU , 32G memory,数据盘SSD云盘 550GiB (18300 IOPS)

    分析过程

    第一步 查询系统日志
    a.有个eventid 1000的错误
    - <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
    - <System>
      <Provider Name="Application Error" /> 
      <EventID Qualifiers="0">1000</EventID> 
      <Level>2</Level> 
      <Task>100</Task> 
      <Keywords>0x80000000000000</Keywords> 
      <TimeCreated SystemTime="2019-11-13T01:37:40.000000000Z" /> 
      <EventRecordID>8587</EventRecordID> 
      <Channel>Application</Channel> 
      <Computer>****************</Computer> 
      <Security /> 
      </System>
    - <EventData>
      <Data>sqlservr.exe</Data> 
      <Data>2017.140.3076.1</Data> 
      <Data>5c88918c</Data> 
      <Data>ntdll.dll</Data> 
      <Data>6.3.9600.19304</Data> 
      <Data>5c7f684f</Data> 
      <Data>c0000374</Data> 
      <Data>00000000000f1cd0</Data> 
      <Data>1b8c</Data> 
      <Data>01d564b93362c212</Data> 
      <Data>E:Program FilesMicrosoft SQL ServerMSSQL14.MSSQLSERVERMSSQLBinnsqlservr.exe</Data> 
      <Data>C:WindowsSYSTEM32
    tdll.dll</Data> 
      <Data>2db02342-05b6-11ea-80ba-a469ceaa5caa</Data> 
      <Data /> 
      <Data /> 
      </EventData>
      </Event>
     
     
     
    b.有个eventid 19019的错误;
    - <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
    - <System>
      <Provider Name="MSSQLSERVER" /> 
      <EventID Qualifiers="16384">19019</EventID> 
      <Level>2</Level> 
      <Task>2</Task> 
      <Keywords>0x80000000000000</Keywords> 
      <TimeCreated SystemTime="2019-11-13T01:37:51.000000000Z" /> 
      <EventRecordID>8589</EventRecordID> 
      <Channel>Application</Channel> 
      <Computer>*******************</Computer> 
      <Security /> 
      </System>
    - <EventData>
      <Data>MSSQLSERVER 服务意外终止.</Data> 
      <Binary>B9370000190000001000000069005A006100630075006900650062006400780031007300370069005A000000070000006D00610073007400650072000000</Binary> 
      </EventData>
      </Event>
     
     
    我是觉的这些信息不是非常明了,只是告知MSSQLSERVER服务确实在问题时间点意外终止;
    c.夹在两个错误日志中间,注意到一个eventid 1001 
    - <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
    - <System>
      <Provider Name="Windows Error Reporting" /> 
      <EventID Qualifiers="0">1001</EventID> 
      <Level>4</Level> 
      <Task>0</Task> 
      <Keywords>0x80000000000000</Keywords> 
      <TimeCreated SystemTime="2019-11-13T01:37:50.000000000Z" /> 
      <EventRecordID>8588</EventRecordID> 
      <Channel>Application</Channel> 
      <Computer>**********************</Computer> 
      <Security /> 
      </System>
    - <EventData>
      <Data /> 
      <Data>0</Data> 
      <Data>APPCRASH</Data> 
      <Data>不可用</Data> 
      <Data>0</Data> 
      <Data>sqlservr.exe</Data> 
      <Data>2017.140.3076.1</Data> 
      <Data>5c88918c</Data> 
      <Data>StackHash_f18b</Data> 
      <Data>6.3.9600.19304</Data> 
      <Data>5c7f684f</Data> 
      <Data>c0000374</Data> 
      <Data>PCH_99_FROM_ntdll+0x0000000000090D0A</Data> 
      <Data /> 
      <Data /> 
      <Data>C:UsersMSSQLSERVERAppDataLocalTempWERA49D.tmp.appcompat.txt C:UsersMSSQLSERVERAppDataLocalTempWERA6EF.tmp.WERInternalMetadata.xml C:ProgramDataMicrosoftWindowsWERReportQueueAppCrash_sqlservr.exe_4b5e12dc9b4c86bd34896c996afe289e6149ec98_cac3d1cb_cab_4d0da6ee	riagedump.dmp WERGenerationLog.txt</Data> 
      <Data>C:ProgramDataMicrosoftWindowsWERReportQueueAppCrash_sqlservr.exe_4b5e12dc9b4c86bd34896c996afe289e6149ec98_cac3d1cb_cab_4d0da6ee</Data> 
      <Data /> 
      <Data>0</Data> 
      <Data>2db02342-05b6-11ea-80ba-a469ceaa5caa</Data> 
      <Data>4</Data> 
      <Data /> 
      </EventData>
      </Event>
     
     
     
    发现系统已打出了dump文件了,直接把上述地址的文件,拷到本地来分析。
    通过windbg查看如下:
     # Child-SP          RetAddr           Call Site
    00 000000e1`14c192b8 00007ff8`6012b61e ntdll!NtWaitForMultipleObjects+0xa
    01 000000e1`14c192c0 00007ff8`6012b17c ntdll!RtlReportExceptionEx+0x452
    02 000000e1`14c19890 00007ff8`60151d26 ntdll!RtlReportException+0xbc
    03 000000e1`14c19920 00007ff8`600e1e26 ntdll!RtlReportCriticalFailure$filt$0+0x33
    04 000000e1`14c19950 00007ff8`600f030e ntdll!_C_specific_handler+0x96
    05 000000e1`14c199c0 00007ff8`600f349d ntdll!_GSHandlerCheck_SEH+0x76
    06 000000e1`14c199f0 00007ff8`600b48d7 ntdll!RtlpExecuteHandlerForException+0xd
    07 000000e1`14c19a20 00007ff8`600b3afd ntdll!RtlDispatchException+0x197
    08 000000e1`14c1a0f0 00007ff8`60151cd0 ntdll!RtlRaiseException+0x18d
    09 000000e1`14c1a8b0 00007ff8`60154f12 ntdll!RtlReportCriticalFailure+0x8c
    0a 000000e1`14c1a9c0 00007ff8`60155b10 ntdll!RtlpHeapHandleError+0x12
    0b 000000e1`14c1a9f0 00007ff8`6010a5ff ntdll!RtlpLogHeapFailure+0xa4
    0c 000000e1`14c1aa20 00007ff8`40d269d8 ntdll!RtlFreeHeap+0x74f2f
    0d 000000e1`14c1aac0 00007ff8`40d27692 msvcr120!free+0x1c
    0e 000000e1`14c1aaf0 00007ff8`31aa748e msvcr120!realloc+0x2e
    0f 000000e1`14c1ab20 00000000`00000000 psqlodbc35w+0x5748e
    从分析结果来看像是第三方功能引起,接下来就要去数据库去验证猜测
    第二步 数据库验证猜测
    select * from sys.dm_os_loaded_modules
     
     
     
    ODBC中设置如下:
    数据库中有个dblink如下:
                    
    跟对应项目负责人员了解到,最近他们有个临时需求使用job定时在9:30通过dblink与postgres做数据同步,基本时间点也能对上了,那问题基本已明确是第三方的dblink调用造成sql server 异常停止;
    那么接下的问题就是分析9:30调用的job的代码,就是运行一个存储过程,存储过程的详细代码如下:
    --同步并清洗**销量数据
    --1、分别将通过**link获取到的**和**销量数据同步到sfa_t_pflh_sales和sfa_t_drf_sales中
    --2、将同步过来的销量数据进行清洗,并按客户需求保存数据到sfa_t_KaSales
    CREATE PROCEDURE [dbo].[sfa_p_KaSalesLYNC]
    AS
    --同步****销量数据
    INSERT INTO sfa_t_pflh_sales
    SELECT *
    FROM OPENQUERY
         (POSTGRES,
          'select * from "tn_bflh_sales" where "tn_tenant_id" = ''****''and tn_product_code<>''合计''AND tn_dealstatus=0 '
         );
    --更新*****销量销量数据同步状态tn_dealstatus=1代表已同步
    UPDATE OPENQUERY
           (POSTGRES, 'select * from "tn_bflh_sales"')
    SET tn_dealstatus = 1
    WHERE tn_tenant_id = '****'
          AND tn_dealstatus = 0;
    
    --同步***销量数据
    INSERT INTO sfa_t_drf_sales
    SELECT *
    FROM OPENQUERY
         (POSTGRES, 'select * from "tn_drf_salses" where "tn_tenant_id" = ''****'' AND tn_dealstatus=0 ');
    
    --更新****销量销量数据同步状态tn_dealstatus=1代表已同步
    UPDATE OPENQUERY
           (POSTGRES, 'select * from "tn_drf_salses"')
    SET tn_dealstatus = 1
    WHERE tn_tenant_id = '****'
          AND tn_dealstatus = 0;
    
    --同步***link作业计划日志
    TRUNCATE  TABLE sfa_ka_tasklog;
    INSERT INTO sfa_ka_tasklog
    SELECT * 
    FROM OPENQUERY
         (POSTGRES,
          'SELECT t.tn_task_id,t.tn_ka_id,tn_ka_code,tn_dealers_account,t.tn_task_date,i.tn_status,i.tn_task_steps,i.tn_reak FROM tn_ka_task t LEFT JOIN tn_ka_task_info i on t.tn_task_id=i.tn_task_id
    where t.tn_tenant_id=''*****'''
         )
    	 ORDER BY tn_task_date,tn_task_steps
    
    DELETE FROM sfa_t_KaSales WHERE DATEDIFF(DAY, tn_date, GETDATE()) = 1--清洗数据前,先删除之前清洗好的前一天销量
    INSERT INTO sfa_t_KaSales
    --清洗******前一天销量数据
    SELECT tn_tenant_id,'****'kaname,
           tn_dealer_number,
    	   case    WHEN tn_supplier_code = '******' THEN
                   '***'
               WHEN tn_supplier_code = '*****' THEN
                   '***'
           END quyu,
           tn_supplier_code,
           tn_supplier_name,
           tn_end_date tn_date,
           tn_product_code,
           tn_product_desc producname,
           tn_store_code,
           tn_store_name,
           MAX(tn_sale_quantity) AS tn_day_sales,
           CAST(MAX(tn_sale_amount) AS NVARCHAR(20)) AS tn_sale_amount
    FROM sfa_t_pflh_sales
    WHERE DATEDIFF(DAY, tn_create_date, GETDATE()) = 0
          --AND tn_dealstatus = 0
    GROUP BY tn_tenant_id,
             tn_dealer_number,
             tn_supplier_code,
             tn_supplier_name,
             tn_end_date,
             tn_product_code,
             tn_product_desc,
             tn_store_code,
             tn_store_name
    UNION
    --清洗****前一天销量数据
    SELECT tn_tenant_id,'****'kaname,
           tn_dealer_number,
    	   CASE
               WHEN tn_dealer_number = '*****' THEN
                   '***'
               WHEN tn_dealer_number = '****' THEN
                   '***'
               WHEN tn_dealer_number = '****' THEN
                   '***'
           END quyu,
           CASE
               WHEN tn_dealer_number = '****' THEN
                   '****'
               WHEN tn_dealer_number = '*****' THEN
                   '****'
               WHEN tn_dealer_number = '****' THEN
                   '***'
           END tn_supplier_code,
           '******' tn_supplier_name,
           DATEADD(DAY, -1, tn_date) tn_date,
           tn_product_code,
           p.producname,
    	   '' tn_store_code,
           tn_store_name,      
           MAX(tn_day_sales) tn_day_sales,
           '' tn_sale_amount
    FROM dbo.sfa_t_drf_sales s
        LEFT JOIN dbo.sfa_t_KaProduct p
            ON s.tn_product_code = p.productcode
               AND p.KaName = 'drf'
    WHERE DATEDIFF(DAY, s.tn_create_date, GETDATE()) = 0
          --AND s.tn_dealstatus = 0
    GROUP BY tn_tenant_id,
             tn_dealer_number,
             tn_product_code,
             tn_store_name,
             p.producname,
             DATEADD(DAY, -1, tn_date);
    
    --为保持数据清洗速度,各***系统销量临时表只保留15天数据
    DELETE FROM  sfa_t_pflh_sales
    WHERE DATEDIFF(DAY, tn_create_date, GETDATE()) = 15
    
    DELETE FROM  sfa_t_drf_sales
    WHERE DATEDIFF(DAY, tn_create_date, GETDATE()) = 15
    GO
    
     
     
     
    /*--同步并清洗**销量数据
    --1、分别将通过**link获取到的**和**销量数据同步到sfa_t_pflh_sales和sfa_t_drf_sales中
    --2、将同步过来的销量数据进行清洗,并按客户需求保存数据到sfa_t_KaSales
    */
    CREATE PROCEDURE [dbo].[sfa_p_KaSalesLYNC]
    AS
    /*--同步****销量数据*/
    INSERT INTO sfa_t_pflh_sales
    SELECT *
    FROM OPENQUERY
         (POSTGRES,
          'select * from "tn_bflh_sales" where "tn_tenant_id" = ''****''and tn_product_code<>''合计''AND tn_dealstatus=0 '
         );
    /*--更新*****销量销量数据同步状态tn_dealstatus=1代表已同步*/
    UPDATE OPENQUERY
           (POSTGRES, 'select * from "tn_bflh_sales"')
    SET tn_dealstatus = 1
    WHERE tn_tenant_id = '****'
          AND tn_dealstatus = 0;
    
    /*--同步***销量数据*/
    INSERT INTO sfa_t_drf_sales
    SELECT *
    FROM OPENQUERY
         (POSTGRES, 'select * from "tn_drf_salses" where "tn_tenant_id" = ''****'' AND tn_dealstatus=0 ');
    
    /*--更新****销量销量数据同步状态tn_dealstatus=1代表已同步*/
    UPDATE OPENQUERY
           (POSTGRES, 'select * from "tn_drf_salses"')
    SET tn_dealstatus = 1
    WHERE tn_tenant_id = '****'
          AND tn_dealstatus = 0;
    
    /*--同步***link作业计划日志*/
    TRUNCATE  TABLE sfa_ka_tasklog;
    INSERT INTO sfa_ka_tasklog
    SELECT * 
    FROM OPENQUERY
         (POSTGRES,
          'SELECT t.tn_task_id,t.tn_ka_id,tn_ka_code,tn_dealers_account,t.tn_task_date,i.tn_status,i.tn_task_steps,i.tn_reak FROM tn_ka_task t LEFT JOIN tn_ka_task_info i on t.tn_task_id=i.tn_task_id
    where t.tn_tenant_id=''*****'''
         )
    	 ORDER BY tn_task_date,tn_task_steps
    
    DELETE FROM sfa_t_KaSales WHERE DATEDIFF(DAY, tn_date, GETDATE()) = 1  /*--清洗数据前,先删除之前清洗好的前一天销量*/
    INSERT INTO sfa_t_KaSales
    /*--清洗******前一天销量数据*/
    SELECT tn_tenant_id,'****'kaname,
           tn_dealer_number,
    	   case    WHEN tn_supplier_code = '******' THEN
                   '***'
               WHEN tn_supplier_code = '*****' THEN
                   '***'
           END quyu,
           tn_supplier_code,
           tn_supplier_name,
           tn_end_date tn_date,
           tn_product_code,
           tn_product_desc producname,
           tn_store_code,
           tn_store_name,
           MAX(tn_sale_quantity) AS tn_day_sales,
           CAST(MAX(tn_sale_amount) AS NVARCHAR(20)) AS tn_sale_amount
    FROM sfa_t_pflh_sales
    WHERE DATEDIFF(DAY, tn_create_date, GETDATE()) = 0
    /*      --AND tn_dealstatus = 0 */
    GROUP BY tn_tenant_id,
             tn_dealer_number,
             tn_supplier_code,
             tn_supplier_name,
             tn_end_date,
             tn_product_code,
             tn_product_desc,
             tn_store_code,
             tn_store_name
    UNION
    /*--清洗****前一天销量数据*/
    SELECT tn_tenant_id,'****'kaname,
           tn_dealer_number,
    	   CASE
               WHEN tn_dealer_number = '*****' THEN
                   '***'
               WHEN tn_dealer_number = '****' THEN
                   '***'
               WHEN tn_dealer_number = '****' THEN
                   '***'
           END quyu,
           CASE
               WHEN tn_dealer_number = '****' THEN
                   '****'
               WHEN tn_dealer_number = '*****' THEN
                   '****'
               WHEN tn_dealer_number = '****' THEN
                   '***'
           END tn_supplier_code,
           '******' tn_supplier_name,
           DATEADD(DAY, -1, tn_date) tn_date,
           tn_product_code,
           p.producname,
    	   '' tn_store_code,
           tn_store_name,      
           MAX(tn_day_sales) tn_day_sales,
           '' tn_sale_amount
    FROM dbo.sfa_t_drf_sales s
        LEFT JOIN dbo.sfa_t_KaProduct p
            ON s.tn_product_code = p.productcode
               AND p.KaName = 'drf'
    WHERE DATEDIFF(DAY, s.tn_create_date, GETDATE()) = 0
    /*      --AND s.tn_dealstatus = 0  */
    GROUP BY tn_tenant_id,
             tn_dealer_number,
             tn_product_code,
             tn_store_name,
             p.producname,
             DATEADD(DAY, -1, tn_date);
    
    /*--为保持数据清洗速度,各***系统销量临时表只保留15天数据  */
    DELETE FROM  sfa_t_pflh_sales
    WHERE DATEDIFF(DAY, tn_create_date, GETDATE()) = 15
    
    DELETE FROM  sfa_t_drf_sales
    WHERE DATEDIFF(DAY, tn_create_date, GETDATE()) = 15
    GO
     
     
     

    后记

    后续通过观察,发现该job某些时间段是可以正常执行,某些时间段是会失败;
    直接报错信息如下:
    Msg 109, Level 20, State 0, Line 6
    A transport-level error has occurred when receiving results from the server. (provider: Shared Memory Provider, error: 0 - The pipe has been ended.)
     
     
     
    微软官方有个hotfix针对这种情况,可惜下载执行提示已不适应当前系统了;
    该job使用dblink的内存与sql server设置的最大内存以及是否锁定内存页都没有直接关系,我们甚至把max memroy 值从28G调小至10G,该现象还是会出现,观察该job执行的时候,其中有一段脚本reads超15万,查询有类似的案例如下:
    后续跟该项目负责人沟通之后,他们评估认为直接停用该job,该问题是没有再重现了。但我认为如他们要实现需求还可以尝试:
    • 替换现有的ODBC的驱动;
    • 在job中代码做些优化;
    • 或换其他模式如SSIS等;

    结论

    从这个案例来看,直接使用DBlink做数据对接,且每次数据量超5万以上不是一个首选方案;

    参考

     
  • 相关阅读:
    20200226 Java IO流——廖雪峰
    20200225 Java 多线程(2)-廖雪峰
    20200225 Java 多线程(1)-廖雪峰
    20200224 尚硅谷ElasticSearch【归档】
    20200224 一 概述
    20200222 尚硅谷Dubbo【归档】
    20200222 四、dubbo原理
    Improved robustness of reinforcement learning policies upon conversion to spiking neuronal network platforms applied to Atari Breakout game
    Reinforcement learning in populations of spiking neurons
    Solving the Distal Reward Problem through Linkage of STDP and Dopamine Signaling
  • 原文地址:https://www.cnblogs.com/jil-wen/p/12486339.html
Copyright © 2011-2022 走看看