现场报错如下:
Process 0:0:0 (0x1ffc) Worker 0x00000001E580A1A0 appears to be non-yielding on Scheduler 3. Thread creation time: 13153975602106.
Approx Thread CPU Used: kernel 0 ms, user 0 ms. Process Utilization 21%. System Idle 76%. Interval: 73616 ms.
从报错来看是某个worker占用scheduler不释放导致,服务器具体的scheduler个数可以从sys.dm_os_sys_info.scheduler_count列中看到,一般来说服务器也就那么些核数,其中如果被non-yeilding的worker长久占用一些,将可能导致非常严重的性能问题。引发此问题的原因通常是某些查询并行度很高的SQL,也与一个BUG有关,目前从一些文档来看,此BUG并未被解决,详见:https://connect.microsoft.com/SQLServer/feedback/details/521928/worker-w-appears-to-be-non-yielding-on-scheduler-s
从2009年SQL Server2005开始,到2017年9月SQL Server 2016版本这个问题依然被报告,微软并未给出明确的解决办法。
因此目前只能在出现问题后及时找出相关的SQL优化,并且通过修改查询调控器或者并行查询阈值来避免问题出现。
从官网的解释来看:这是一个微软winsock API自己的BUG。
I am with the SQL Protocols Dev team. I've looked at the issue and most likely it is related to the creation of child processes from within SQL Server - e.g.
xp_cmdshell, 3rd party backup or compression software, etc. The problem occurs due to a limitation of the Winsock API which does not offer creation of
non-inheritable socket handles. If a child process of SQL Server is created after a connected socket is created and before the inheritance flag of the
socket is cleared, the child process inherits the socket handle. When closing the corresponding client connection, a proper Tcp cleanup cannot be performed
before the child process of SQL Server is closed.
To implement a temporary workaround, add the TcpAbortiveClose SQL parameter to registry as described in KB article http://support.microsoft.com/kb/307197.
Please be careful if you are doing this on a cluster to avoid checkpoint issue and registry misconfiguration which would prevent SQL from coming online.
SQL Server needs to be restarted in order the changes to take effect. No significant perf impact is expected from the change, no functional impact should be either.