http://blog.csdn.net/guoqifa29/article/details/42400943
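1. addMonitor()
Services such as WindowManagerService, ActivityManagerService, PowerManagerService, NetworkManagementService, MountService and InputManagerService each implement Watchdog.Monitor and register themselves with Watchdog.getInstance().addMonitor(this), which adds them to the Watchdog.mMonitorChecker.mMonitors list. Watchdog repeatedly calls monitor() on every entry in that list. The method is very simple: it only acquires the service's lock. If a thread is deadlocked, or the lock is held for some other reason, the lock cannot be acquired and the call to monitor() blocks. Watchdog relies on exactly this to decide whether system_server is deadlocked.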
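public void monitor() {
    // Acquire and immediately release the lock; this blocks for as long as
    // another thread holds mLock.
    synchronized (mLock) { }
    nativeMonitor(mPtr);
}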
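The lock probe above can be tried outside Android. The following is a minimal sketch in plain Java, not the framework code: the Monitor interface is a stand-in for Watchdog.Monitor, and DemoService and MonitorProbe are hypothetical names introduced only for illustration. It shows that monitor() returns promptly when the lock is free and does not return while another thread holds the lock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Stand-in for Watchdog.Monitor (the real interface lives in the Android framework). */
interface Monitor {
    void monitor();
}

/** Hypothetical service that guards its state with a single lock. */
class DemoService implements Monitor {
    final Object mLock = new Object();

    @Override
    public void monitor() {
        // Acquire and immediately release the lock. If another thread never
        // releases mLock, this call never returns.
        synchronized (mLock) { }
    }
}

/** Hypothetical helper: runs monitor() on a probe thread and waits with a timeout. */
class MonitorProbe {
    /** Returns true if service.monitor() finished within timeoutMs. */
    static boolean completesWithin(Monitor service, long timeoutMs) {
        CountDownLatch done = new CountDownLatch(1);
        Thread probe = new Thread(() -> {
            service.monitor();
            done.countDown();
        }, "monitor-probe");
        probe.setDaemon(true); // do not keep the JVM alive if the probe stays blocked
        probe.start();
        try {
            return done.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Calling MonitorProbe.completesWithin(service, 200) while the caller itself is inside synchronized (service.mLock) returns false, which is the situation Watchdog interprets as a possible deadlock.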
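2. addThread()
The main-thread Handlers of WindowManagerService, PowerManagerService, PackageManagerService and ActivityManagerService are saved in the Watchdog.mHandlerCheckers list. The mMonitorChecker from point 1 is stored in the same list, and so are the Handlers of UiThread, IOThread and MainThread. All of these threads live in the system_server process, which gives eight checkers in total: mMonitorChecker plus seven thread Handlers. Watchdog keeps checking whether each thread's Looper is idle; if a Looper never becomes idle again, the thread must be blocked.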
UiThread: mainly used for overlay displays (OverlayDisplay), drawing the touch pointer (PointerEventDispatcher), and possibly PhoneWindowManager;
IOThread: mainly used by BluetoothManagerService, JobStore, the OBB operations in MountService, writeSessionsAsync() in PackageInstallerService, Tethering, the handling of the ACTION_PAC_REFRESH broadcast in PacManager, and mWatchLogHandler in TvInputManagerService;
MainThread: presumably the SystemServer main thread.
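The per-thread check can be illustrated without Handler and Looper. This is a rough sketch under stated assumptions, not the framework's HandlerChecker: a java.util.concurrent.Executor backed by a single thread plays the role of the monitored message queue, and SketchChecker with all of its methods is a hypothetical name. The idea it demonstrates is the one described above: post a small probe to the thread, then see whether the probe has run.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;

/** Simplified stand-in for a handler checker; the Executor models the thread's message queue. */
class SketchChecker implements Runnable {
    private final Executor mQueue;      // assumption: backed by exactly one thread
    private boolean mCompleted = true;  // true when no probe is outstanding
    private long mStartTimeMs;          // when the outstanding probe was posted

    SketchChecker(Executor queue) {
        mQueue = queue;
    }

    /** Posts a probe to the monitored thread unless one is already pending. */
    synchronized void scheduleCheck() {
        if (!mCompleted) {
            return; // the previous probe has not run yet; keep its start time
        }
        mCompleted = false;
        mStartTimeMs = nowMs();
        mQueue.execute(this);
    }

    /** Runs on the monitored thread; getting here proves the thread is processing work. */
    @Override
    public synchronized void run() {
        mCompleted = true;
        notifyAll();
    }

    /** Milliseconds the outstanding probe has been pending, or 0 if none is pending. */
    synchronized long pendingMs() {
        return mCompleted ? 0 : nowMs() - mStartTimeMs;
    }

    /** Waits up to timeoutMs for the outstanding probe; returns true if it has run. */
    synchronized boolean awaitCompleted(long timeoutMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (!mCompleted) {
            long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
            if (remainingMs <= 0) {
                return false;
            }
            try {
                wait(remainingMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return mCompleted;
            }
        }
        return true;
    }

    private static long nowMs() {
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime());
    }
}
```

If the single worker thread is stuck in an earlier task, the probe stays queued behind it, awaitCompleted() times out and pendingMs() keeps growing. That growing latency is what the watchdog compares against its timeout.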
3. run()
public void run() {
    boolean waitedHalf = false;
    while (true) {
        final ArrayList<HandlerChecker> blockedCheckers;
        final String subject;
        final boolean allowRestart;
        synchronized (this) {
            long timeout = CHECK_INTERVAL;
            // Make sure we (re)spin the checkers that have become idle within
            // this wait-and-check interval
            // ① mHandlerCheckers holds mMonitorChecker plus the HandlerCheckers of the 7
            // most important threads in system_server. This loop starts the checks:
            // mMonitorChecker probes the important locks (deadlock detection) and the
            // other checkers probe whether the message queues of the 7 threads are idle
            // (block detection). In both cases the verdict is based on a timeout.
            for (int i=0; i<mHandlerCheckers.size(); i++) {
                HandlerChecker hc = mHandlerCheckers.get(i);
                // ② If a thread is deadlocked or blocked, its HandlerChecker keeps
                // mCompleted = false and mStartTime = the time the blocking began.
                // These two fields are the basis for the verdict.
                hc.scheduleCheckLocked();
            }
            // NOTE: We use uptimeMillis() here because we do not want to increment the time we
            // wait while asleep. If the device is asleep then the thing that we are waiting
            // to timeout on is asleep as well and won't have a chance to run, causing a false
            // positive on when to kill things.
            long start = SystemClock.uptimeMillis();
            // ③ Wait for one full CHECK_INTERVAL (half of the 60-second timeout).
            while (timeout > 0) {
                try {
                    wait(timeout);
                } catch (InterruptedException e) {
                    Log.wtf(TAG, e);
                }
                timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
            }
            // ④ Evaluate the wait state from mCompleted and mStartTime.
            final int waitState = evaluateCheckerCompletionLocked();
            if (waitState == COMPLETED) {
                // ⑤ Everything ran smoothly: start the next round from here, so a
                // check runs once per CHECK_INTERVAL.
                // The monitors have returned; reset
                waitedHalf = false;
                continue;
            } else if (waitState == WAITING) {
                // ⑤ Blocked for less than 30 seconds: keep checking.
                // still waiting but within their configured intervals; back off and recheck
                continue;
            } else if (waitState == WAITED_HALF) {
                // ⑤ Blocked for 30 to 60 seconds: dump stacks once, then keep checking.
                if (!waitedHalf) {
                    // We've waited half the deadlock-detection interval. Pull a stack
                    // trace and wait another half.
                    ArrayList<Integer> pids = new ArrayList<Integer>();
                    pids.add(Process.myPid());
                    ActivityManagerService.dumpStackTraces(true, pids, null, null,
                            NATIVE_STACKS_OF_INTEREST);
                    waitedHalf = true;
                }
                continue;
            }
            // something is overdue!
            // More than 60 seconds have passed, so something is wrong: collect the
            // blocked threads.
            blockedCheckers = getBlockedCheckersLocked();
            // Describe the blocked threads in one string so it can be written to the
            // event log below.
            subject = describeCheckersLocked(blockedCheckers);
            allowRestart = mAllowRestart;
        }
        // If we got here, that means that the system is most likely hung.
        // First collect stack traces from all threads of the system process.
        // Then kill this process so that the system will restart.
        EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
        ArrayList<Integer> pids = new ArrayList<Integer>();
        pids.add(Process.myPid());
        if (mPhonePid > 0) pids.add(mPhonePid);
        // Pass !waitedHalf so that just in case we somehow wind up here without having
        // dumped the halfway stacks, we properly re-initialize the trace file.
        // ⑥ ActivityManagerService.dumpStackTraces() calls
        // Process.sendSignal(firstPids.get(i), Process.SIGNAL_QUIT) to send signal 3 to
        // system_server so that the VM prints its traces. It also collects traces from the
        // three native processes /system/bin/mediaserver, /system/bin/sdcard and
        // /system/bin/surfaceflinger, and prints the CPU usage.
        final File stack = ActivityManagerService.dumpStackTraces(
                !waitedHalf, pids, null, null, NATIVE_STACKS_OF_INTEREST);
        // Give some extra time to make sure the stack traces get written.
        // The system's been hanging for a minute, another second or two won't hurt much.
        // The watchdog thread (not the whole system) sleeps for 2 seconds here.
        SystemClock.sleep(2000);
        // Pull our own kernel thread stacks as well if we're configured for that
        if (RECORD_KERNEL_THREADS) {
            // ⑥ Also dump the kernel traces, mainly the information found in
            // /proc/%pid/task and /proc/%tid/stack.
            dumpKernelStackTraces();
        }
        // Trigger the kernel to dump all blocked threads to the kernel log
        // ⑥ Writing the character "w" to /proc/sysrq-trigger makes the kernel write the
        // blocked threads to the kernel log.
        try {
            FileWriter sysrq_trigger = new FileWriter("/proc/sysrq-trigger");
            sysrq_trigger.write("w");
            sysrq_trigger.close();
        } catch (IOException e) {
            Slog.e(TAG, "Failed to write to /proc/sysrq-trigger");
            Slog.e(TAG, e.getMessage());
        }
        // Try to add the error to the dropbox, but assuming that the ActivityManager
        // itself may be deadlocked. (which has happened, causing this statement to
        // deadlock and the watchdog as a whole to be ineffective)
        // ⑦ Create a separate thread that adds the error to the DropBox.
        Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
            public void run() {
                mActivity.addErrorToDropBox(
                        "watchdog", null, "system_server", null, null,
                        subject, null, stack, null);
            }
        };
        dropboxThread.start();
        try {
            dropboxThread.join(2000); // wait up to 2 seconds for it to return.
        } catch (InterruptedException ignored) {}
        IActivityController controller;
        synchronized (this) {
            controller = mController;
        }
        if (controller != null) {
            Slog.i(TAG, "Reporting stuck state to activity controller");
            try {
                Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                // 1 = keep waiting, -1 = kill system
                int res = controller.systemNotResponding(subject);
                if (res >= 0) {
                    Slog.i(TAG, "Activity controller requested to coninue to wait");
                    waitedHalf = false;
                    continue;
                }
            } catch (RemoteException e) {
            }
        }
        // Only kill the process if the debugger is not attached.
        if (Debug.isDebuggerConnected()) {
            Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
        } else if (!allowRestart) {
            Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
        } else {
            Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
            for (int i=0; i<blockedCheckers.size(); i++) {
                Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
                StackTraceElement[] stackTrace
                        = blockedCheckers.get(i).getThread().getStackTrace();
                for (StackTraceElement element: stackTrace) {
                    Slog.w(TAG, "    at " + element);
                }
            }
            Slog.w(TAG, "*** GOODBYE!");
            // ⑧ Kill system_server; the system then restarts.
            Process.killProcess(Process.myPid());
            System.exit(10);
        }
        waitedHalf = false;
    }
}
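The inner wait loop in run() recomputes the remaining timeout after every wakeup, so an early notify or an interrupt cannot shorten the interval. Below is a minimal plain-Java sketch of that pattern. IntervalWaiter is a hypothetical name, and System.nanoTime() stands in for SystemClock.uptimeMillis(), which exists only on Android.

```java
/** Hypothetical helper showing the "wait for the full interval" loop from run(). */
class IntervalWaiter {

    /**
     * Blocks the calling thread on lock for at least intervalMs, going back to
     * wait() after any early wakeup. Returns the elapsed time in milliseconds.
     */
    static long waitFullInterval(Object lock, long intervalMs) {
        final long start = System.nanoTime();
        long timeout = intervalMs;
        synchronized (lock) {
            while (timeout > 0) {
                try {
                    lock.wait(timeout);
                } catch (InterruptedException e) {
                    // The framework code logs the interrupt and keeps waiting;
                    // this sketch simply keeps waiting.
                }
                // Remaining time = full interval minus what has already elapsed.
                timeout = intervalMs - elapsedMs(start);
            }
        }
        return elapsedMs(start);
    }

    private static long elapsedMs(long startNanos) {
        return (System.nanoTime() - startNanos) / 1_000_000;
    }
}
```

Even if another thread calls notifyAll() on the lock a few milliseconds in, waitFullInterval(lock, 150) does not return before roughly 150 ms have passed, because the loop re-enters wait() with the remaining time.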
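Summary: Watchdog runs in its own thread. Once per CHECK_INTERVAL (half of the 60-second timeout) it checks the important locks in system_server (those of WindowManagerService, ActivityManagerService, PowerManagerService, NetworkManagementService, MountService, InputManagerService and others) and also checks whether the message queues of the seven most important threads are idle (WindowManagerService, PowerManagerService, PackageManagerService, ActivityManagerService, UiThread, IOThread and MainThread). From the mCompleted and mStartTime values it decides whether anything has been blocked for more than 60 seconds. If so, it writes the trace log and the kernel trace log, and finally kills system_server so that it restarts.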
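How mCompleted and mStartTime turn into a verdict can be sketched as follows. This is a reconstruction for illustration, not a copy of evaluateCheckerCompletionLocked(): the names COMPLETED, WAITING and WAITED_HALF come from run() above, while OVERDUE, the numeric values, the class CheckerState and the "worst state wins" rule are assumptions chosen to match the 30-second and 60-second thresholds described in this article.

```java
import java.util.List;

/** Hypothetical snapshot of one checker: did its probe finish, and when was it posted? */
class CheckerState {
    static final int COMPLETED = 0;   // probe has run
    static final int WAITING = 1;     // pending for less than half the timeout
    static final int WAITED_HALF = 2; // pending for at least half the timeout
    static final int OVERDUE = 3;     // pending for the full timeout or longer (assumed name)

    final boolean completed;
    final long startTimeMs;
    final long waitMaxMs;

    CheckerState(boolean completed, long startTimeMs, long waitMaxMs) {
        this.completed = completed;
        this.startTimeMs = startTimeMs;
        this.waitMaxMs = waitMaxMs;
    }

    /** Classifies this checker at time nowMs. */
    int completionState(long nowMs) {
        if (completed) {
            return COMPLETED;
        }
        long latency = nowMs - startTimeMs;
        if (latency < waitMaxMs / 2) {
            return WAITING;
        }
        if (latency < waitMaxMs) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    /** The overall state is the worst state among all checkers. */
    static int evaluate(List<CheckerState> checkers, long nowMs) {
        int state = COMPLETED;
        for (CheckerState checker : checkers) {
            state = Math.max(state, checker.completionState(nowMs));
        }
        return state;
    }
}
```

With a 60,000 ms timeout, a probe pending for 10 seconds is WAITING, one pending for 40 seconds is WAITED_HALF (the point where run() dumps stacks once), and one pending for 70 seconds is OVERDUE, which leads to the kill path.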