zoukankan      html  css  js  c++  java
  • 用windbg分析一个dead lock的问题

    难得Winform项目中碰到dead lock,记录一下。

    QA报告说,有时候晚上跑完自动化脚本,第二天早上来发现系统hang在屏保界面没反应,从日志看也没有报错。这种属于很少才会发生,也不知道怎么重现,但是很严重的bug,于是抓个dump来研究一下。

    # Windbg加载dump文件后的一些文件信息

    Microsoft (R) Windows Debugger Version 6.12.0002.633 AMD64

    Copyright (c) Microsoft Corporation. All rights reserved.

    Loading Dump File [F:DeadLockHang.DMP]

    User Mini Dump File with Full Memory: Only application data is available

    Symbol search path is: *** Invalid ***

    ****************************************************************************

    * Symbol loading may be unreliable without a symbol search path. *

    * Use .symfix to have the debugger choose a symbol path. *

    * After setting your symbol path, use .reload to refresh symbol locations. *

    ****************************************************************************

    Executable search path is:

    Windows 7 Version 7601 (Service Pack 1) MP (4 procs) Free x64

    Product: WinNt, suite: EmbeddedNT SingleUserTS

    Machine Name:

    Debug session time: Thu Feb 27 02:47:42.000 2014 (UTC + 8:00)

    System Uptime: 4 days 16:41:57.807

    Process Uptime: 4 days 16:41:28.000

    ................................................................

    Loading unloaded module list

    ................................................................

    *** ERROR: Symbol file could not be found. Defaulted to export symbols for ntdll.dll -

    ntdll!NtWaitForMultipleObjects+0xa:

    00000000`7706186a c3 ret

    # 设置symbol path

    0:000> .sympath srv*c:MyServerSymbols*http://msdl.microsoft.com/download/symbols

    Symbol search path is: srv*c:MyServerSymbols*http://msdl.microsoft.com/download/symbols

    Expanded Symbol search path is: srv*c:myserversymbols*http://msdl.microsoft.com/download/symbols

    #手动load SOS.dll,因为我们的调试机和测试机上的SOS不是同一个版本,所以不用.loadby sos clr

    0:000> .load F:SOS.dll1008SOS.dll

    #看看线程都在干啥

    0:000> ~* e!clrstack

    OS Thread Id: 0xfe8 (0)

    Child SP IP Call Site

    00000000001d5f80 000000007706186a [GCFrame: 00000000001d5f80]

    00000000001d6128 000000007706186a [GCFrame: 00000000001d6128]

    00000000001d60d8 000000007706186a [HelperMethodFrame_1OBJ: 00000000001d60d8] System.Threading.Monitor.Enter(System.Object)

    00000000001d6220 000007ff02b1c734 TheSystem.CCDialog.OnClosed(System.EventArgs)

    00000000001d62c0 000007fef18f4695 System.Windows.Forms.Form.CheckCloseDialog(Boolean)*** WARNING: Unable to verify checksum for System.Windows.Forms.ni.dll

    00000000001d6330 000007fef1e64073 System.Windows.Forms.Application+ThreadContext.System.Windows.Forms.UnsafeNativeMethods.IMsoComponent.FContinueMessageLoop(Int32, Int32, MSG[])

    00000000001d6470 000007fef12aadbe System.Windows.Forms.Application+ComponentManager.System.Windows.Forms.UnsafeNativeMethods.IMsoComponentManager.FPushMessageLoop(IntPtr, Int32, Int32)

    00000000001d66b0 000007fef12aa813 System.Windows.Forms.Application+ThreadContext.RunMessageLoopInner(Int32, System.Windows.Forms.ApplicationContext)

    00000000001d6810 000007fef12aa1a1 System.Windows.Forms.Application+ThreadContext.RunMessageLoop(Int32, System.Windows.Forms.ApplicationContext)

    00000000001d68a0 000007fef18f8a52 System.Windows.Forms.Form.ShowDialog(System.Windows.Forms.IWin32Window)

    00000000001d6c10 000007ff02c844f6 TheSystem.CCMessageBox.ShowInternal(System.Windows.Forms.Form, MessageBoxIcon, System.String, System.String, System.Windows.Forms.MessageBoxButtons, Boolean, System.Collections.Generic.Dictionary`2<System.String,System.String>, System.Windows.Forms.Control, System.Drawing.ContentAlignment)

    00000000001d70e0 000007ff02c862cd TheSystem.CCMessageBox.Show(MessageBoxIcon, System.String, System.String, System.Windows.Forms.MessageBoxButtons)

    ......<cut to save space>......

    OS Thread Id: 0x2180 (132)

    Child SP IP Call Site

    0000000043cad940 000000007706186a [GCFrame: 0000000043cad940]

    0000000043cadb98 000000007706186a [GCFrame: 0000000043cadb98]

    0000000043cadb48 000000007706186a [HelperMethodFrame: 0000000043cadb48] System.Threading.Monitor.Enter(System.Object)

    0000000043cadca0 000007ff02ca8a18 TheSystem.CCDialog.CloseModalDialogs()

    0000000043cadd30 000007ff02ca8937 TheSystem.CCUserActivityManager.autoLogoutTimer_Elapsed(System.Object, System.Timers.ElapsedEventArgs)

    0000000043cadda0 000007fef5f01b51 System.Timers.Timer.MyTimerCallback(System.Object)

    0000000043cade20 000007fef6ba2f78 System.Threading.ExecutionContext.runTryCode(System.Object)

    0000000043cae548 000007fef7ba44c4 [HelperMethodFrame_PROTECTOBJ: 0000000043cae548] System.Runtime.CompilerServices.RuntimeHelpers.ExecuteCodeWithGuaranteedCleanup(TryCode, CleanupCode, System.Object)

    0000000043cae670 000007fef6b91661 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)

    0000000043cae6d0 000007fef6c0ced7 System.Threading._TimerCallback.PerformTimerCallback(System.Object)

    0000000043cae948 000007fef7ba44c4 [GCFrame: 0000000043cae948]

    0000000043caeb20 000007fef7ba44c4 [DebuggerU2MCatchHandlerFrame: 0000000043caeb20]

    ......<cut to save space>......

    OS Thread Id: 0x363c (1076)

    Child SP IP Call Site

    00000000d190e7a0 000000007706186a [GCFrame: 00000000d190e7a0]

    00000000d190e9f8 000000007706186a [GCFrame: 00000000d190e9f8]

    00000000d190e9a8 000000007706186a [HelperMethodFrame: 00000000d190e9a8] System.Threading.Monitor.Enter(System.Object)

    00000000d190eb00 000007ff02ca8a18 TheSystem.CCDialog.CloseModalDialogs()

    00000000d190eb90 000007ff02ca8937 TheSystem.CCUserActivityManager.autoLogoutTimer_Elapsed(System.Object, System.Timers.ElapsedEventArgs)

    00000000d190ec00 000007fef5f01b51 System.Timers.Timer.MyTimerCallback(System.Object)

    00000000d190ec80 000007fef6ba2f78 System.Threading.ExecutionContext.runTryCode(System.Object)

    00000000d190f3a8 000007fef7ba44c4 [HelperMethodFrame_PROTECTOBJ: 00000000d190f3a8] System.Runtime.CompilerServices.RuntimeHelpers.ExecuteCodeWithGuaranteedCleanup(TryCode, CleanupCode, System.Object)

    00000000d190f4d0 000007fef6b91661 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)

    00000000d190f530 000007fef6c0ced7 System.Threading._TimerCallback.PerformTimerCallback(System.Object)

    00000000d190f7a8 000007fef7ba44c4 [GCFrame: 00000000d190f7a8]

    00000000d190f980 000007fef7ba44c4 [DebuggerU2MCatchHandlerFrame: 00000000d190f980]

    ......<cut to save space>......

    OS Thread Id: 0x11a0 (1120)

    Failed to start stack walk: 80004005

    结果发现Windbg异常繁忙,打印了很久才停下来,1120个线程…正常情况下应该是百来个线程。而且从堆栈上来看中间绝大多数(好几百个)线程的call stack都是和132号线程是一样的,看上去都在等一个什么锁。

      #等啥呢?

    0:000>!syncblk

    Index SyncBlock MonitorHeld Recursion Owning Thread Info SyncBlock Owner

    1453 0000000039f170a8 21 1 000000002e74f8e0 1a3c 82 00000000143cccd0 System.Object

    3711 000000002dba2568 1823 1 0000000067884fe0 1920 151 0000000011271e58 System.Collections.Hashtable

    -----------------------------

    Total 4562

    CCW 551

    RCW 21

    ComClassFactory 14

    Free 1196

    也就是说151号线程拥有0000000011271e58这个hashtable对象的锁,然后还有 823个线程在等这个锁!!!

    (可以参考Tess对!syncblk的解释http://blogs.msdn.com/b/tess/archive/2006/01/09/a-hang-scenario-locks-and-critical-sections.aspx)

    #先看看151号线程

    0:000> ~151 e!clrstack

    OS Thread Id: 0x1920 (151)

    Child SP IP Call Site

    000000005ec9dbb8 000000007706186a [HelperMethodFrame_1OBJ: 000000005ec9dbb8] System.Threading.WaitHandle.WaitOneNative(System.Runtime.InteropServices.SafeHandle, UInt32, Boolean, Boolean)

    000000005ec9dcf0 000007fef6c1c094 System.Threading.WaitHandle.InternalWaitOne(System.Runtime.InteropServices.SafeHandle, Int64, Boolean, Boolean)*** WARNING: Unable to verify checksum for mscorlib.ni.dll

    000000005ec9dd30 000007fef18b01ad System.Windows.Forms.Control.WaitForWaitHandle(System.Threading.WaitHandle)*** WARNING: Unable to verify checksum for System.Windows.Forms.ni.dll

    000000005ec9de20 000007fef12a41e2 System.Windows.Forms.Control.MarshaledInvoke(System.Windows.Forms.Control, System.Delegate, System.Object[], Boolean)

    000000005ec9dfa0 000007fef18b2d99 System.Windows.Forms.Control.Invoke(System.Delegate, System.Object[])

    000000005ec9e020 000007ff02ca8b34 TheSystem.CCDialog.CloseModalDialogs()***

    000000005ec9e0b0 000007ff02ca8937 TheSystem.CCUserActivityManager.autoLogoutTimer_Elapsed(System.Object, System.Timers.ElapsedEventArgs)*** ERROR: Module load completed but symbols could not be loaded for TheSystem.dll

    000000005ec9e120 000007fef5f01b51 System.Timers.Timer.MyTimerCallback(System.Object)*** WARNING: Unable to verify checksum for System.ni.dll

    ......<cut to save space>......

    #等UI线程呢,看看UI线程在干啥。

    0:000> !clrstack

    OS Thread Id: 0xfe8 (0)

    Child SP IP Call Site

    00000000001d5f80 000000007706186a [GCFrame: 00000000001d5f80]

    00000000001d6128 000000007706186a [GCFrame: 00000000001d6128]

    00000000001d60d8 000000007706186a [HelperMethodFrame_1OBJ: 00000000001d60d8] System.Threading.Monitor.Enter(System.Object)

    00000000001d6220 000007ff02b1c734 TheSystem.CCDialog.OnClosed(System.EventArgs)

    00000000001d62c0 000007fef18f4695 System.Windows.Forms.Form.CheckCloseDialog(Boolean)*** WARNING: Unable to verify checksum for System.Windows.Forms.ni.dll

    00000000001d6330 000007fef1e64073 System.Windows.Forms.Application+ThreadContext.System.Windows.Forms.UnsafeNativeMethods.IMsoComponent.FContinueMessageLoop(Int32, Int32, MSG[])

    00000000001d6470 000007fef12aadbe System.Windows.Forms.Application+ComponentManager.System.Windows.Forms.UnsafeNativeMethods.IMsoComponentManager.FPushMessageLoop(IntPtr, Int32, Int32)

    00000000001d66b0 000007fef12aa813 System.Windows.Forms.Application+ThreadContext.RunMessageLoopInner(Int32, System.Windows.Forms.ApplicationContext)

    00000000001d6810 000007fef12aa1a1 System.Windows.Forms.Application+ThreadContext.RunMessageLoop(Int32, System.Windows.Forms.ApplicationContext)

    00000000001d68a0 000007fef18f8a52 System.Windows.Forms.Form.ShowDialog(System.Windows.Forms.IWin32Window)

    00000000001d6c10 000007ff02c844f6 TheSystem.CCMessageBox.ShowInternal(System.Windows.Forms.Form, MessageBoxIcon, System.String, System.String, System.Windows.Forms.MessageBoxButtons, Boolean, System.Collections.Generic.Dictionary`2<System.String,System.String>, System.Windows.Forms.Control, System.Drawing.ContentAlignment)

    00000000001d70e0 000007ff02c862cd TheSystem.CCMessageBox.Show(MessageBoxIcon, System.String, System.String, System.Windows.Forms.MessageBoxButtons)

    ......<cut to save space>......

    同时!do了一下hashtable对象0000000011271e58,发现里面有三个dialog,和log里面的一致。接下来看代码:

    拥有锁的151号线程 - TheSystem.CCDialog.CloseModalDialogs()

    source1

    UI线程  TheSystem.CCDialog.OnClosed(System.EventArgs)

    2

    按照系统的逻辑,一段时间没有用户操作后会自动logout,如果自动logout前存在modal dialog,那么先自动close modal dialog,然后再logout。正常情况下logout前最多只有1个modal dialog,问题就出在出问题时logout前有3个model dialog,

    那么Hang的过程是怎么产生的呢? 151号线程获取了锁,然后Invoke UI线程去Close了第一个Dialog,注意Invoke完了之后还是继续回到本线程执行,同时这个时候UI线程去响应WM_CLOSE消息然后调用OnClosed () 方法等锁, 占着锁的 151号线程接着去Invoke UI线程Close下一个dialog.

    也就是说151号线程占有了锁,然后调Invoke—同步调用,去调用UI线程。但是UI线程呢,在等151号线程占着的锁然后才能往下执行,但是151号线程一直占着锁不释放,所以UI线程也一直在那里等着 – 死锁, dead lock。 正常情况下,只有一个dialog的时候,Invoke()完了就直接退出了,不用等下次Invoke(),所以不会有死锁。

    那么1120个线程又是怎么回事呢?因为TheSystem.CCDialog.CloseModalDialogs()是被timer触发的,所以每过一段时间就有一个新的线程起来挂在那里等那个锁,所以我们才会看到有那么多数量的线程。

    解决方案有多种,最简单的就是把Invoke改成BeginInvoke,这样非UI线程可以很快退出,把锁让给UI线程。当然也可以把CloseModalDialogs()放在UI线程里面调用。

  • 相关阅读:
    视频:JDBCRDD源码及自定义JDBCRDD的分区策略
    聊聊spark-submit的几个有用选项
    线性求第k大
    汇编基础 第一章_总线
    长度不超过n的连续最大和___优先队列
    ISAP
    次小生成树
    k短路
    求出欧拉回路,欧拉路径
    拓扑排序bfs_dfs
  • 原文地址:https://www.cnblogs.com/fbird/p/3582565.html
Copyright © 2011-2022 走看看