zoukankan      html  css  js  c++  java
  • Android ANR分析(2)

    转自:http://blog.csdn.net/ruingman/article/details/53118202

    定义

    主线程在特定的时间内没有做完特定的事情

    常见的场景

       A.input事件超过5S没有处理完成
       B.service executing 超时(bind,create,start,unbind等等),前台20s,后台200s
       C.广播处理超时,前台10S,后台60s
       D.ContentProvider执行超时,20s
    

    常见的原因

            A.耗时操作,如复杂的layout,庞大的for循环,IO等。
    	B.被Binder 对端block
    	C.被子线程同步锁block
            D.Binder被占满导致主线程无法和SystemServer通信
    	E.得不到系统资源(CPU/RAM/IO)
    	其中ABCD比较好分析,而E比较困难。
    

    如何分析?

    指导思想:通过各种线索,寻求主线程在过去一段时间内发生block的可能原因。

    线索包括:
       A.traces.txt/dropbox
         AMS在ANR发生的时候,dump相关进程(ANR的进程、systemserver、mediaserver,surfaceFinger等)的当前栈到traces.txt。
         traces.txt存在的几种形式:
         1. adb pull /data/anr/
         2. slog/2012xxxxx/misc/anr/snapshot_xxxx.log
         3. slog/dropbox/data_app_anr_xxxxx.tgz
         4. slog/dropbox/system_app_anr_xxxx.tgz
         需要注意的是,traces.txt是抓取的是超时后(如input超时就是5s后)的snapshot,并不一定能够真实的反应出block的点。
    	   也存在抓到主线程没有block,在idle的情况。
       B.Eventlog中的dvm_lock_sample.
         在同步锁发生content的时候,虚拟机会将这个竞争记录在eventlog中:
    	 dvm_lock_sample: [system_server,1,Binder_7,    22,           ActivityManagerService.java,15921,-,9628,4]
    	                   进程名           被block线程  block时间(ms)  被block的行号                       持有者行号
    	实现可以参考art/runtime/monitor_android.cc LogContentionEvent函数
            如果主线程是被binder对端、被同步锁block,那么eventlog中就很有可能会有dvm_lock_sample的打印。
         华为项目上实现了更加强大的功能BlockMonitor,在主线程有耗时操作(如handlemsg、Binder调用耗时)的时候会打印出栈。
         
       C.搜索主线程在发生ANR前后的main,systemlog,结合代码查看可能block在哪里。
         现在发生ANR的时候,sprdruntimeinfo在mainlog中会Dump出比较多的信息,其中比较重要的是binder、cpu。
       D.ANR发生的时候,打印出的CPU的占用。
         注意,并不能因为有进程cpu占用高就果断的去怀疑他。
       因为最主要的线索traces.txt的有效性并不是非常高,所以ANR问题分析是存在一定的局限性的。 
    

    案例分析

    1. Settings 被对端block

     首先需要去看的就是traces.txt 中ANR进程的主线程的栈。
       以settting的这次ANR为例:
       
       "main" prio=5 tid=1 Native
      | group="main" sCount=1 dsCount=0 obj=0x73ee6470 self=0xb4d76500
      | sysTid=22831 nice=0 cgrp=default sched=0/0 handle=0xb6f4bc00
      | state=S schedstat=( 0 0 0 ) utm=22 stm=22 core=0 HZ=100
      | stack=0xbe283000-0xbe285000 stackSize=8MB
      | held mutexes=
      native: #00 pc 000410ac  /system/lib/libc.so (__ioctl+8)
      native: #01 pc 000477e5  /system/lib/libc.so (ioctl+14)
      native: #02 pc 0001e7c5  /system/lib/libbinder.so (android::IPCThreadState::talkWithDriver(bool)+132)
      native: #03 pc 0001ee17  /system/lib/libbinder.so (android::IPCThreadState::waitForResponse(android::Parcel*, int*)+38)
      native: #04 pc 0001efcd  /system/lib/libbinder.so (android::IPCThreadState::transact(int, unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+124)
      native: #05 pc 00019fb7  /system/lib/libbinder.so (android::BpBinder::transact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+30)
      native: #06 pc 00086de9  /system/lib/libandroid_runtime.so (???)
      native: #07 pc 00d94629  /data/dalvik-cache/arm/system@framework@boot.oat (Java_android_os_BinderProxy_transactNative__ILandroid_os_Parcel_2Landroid_os_Parcel_2I+140)
      at android.os.BinderProxy.transactNative(Native method)
      at android.os.BinderProxy.transact(Binder.java:503)
      at android.net.INetworkPolicyManager$Stub$Proxy.getNetworkPolicies(INetworkPolicyManager.java:410)
      at android.net.NetworkPolicyManager.getNetworkPolicies(NetworkPolicyManager.java:174)
      at com.android.settings.net.NetworkPolicyEditor.read(NetworkPolicyEditor.java:57)
      at com.android.settings.DataUsageSummary.onCreate(DataUsageSummary.java:361)
      at android.app.Fragment.performCreate(Fragment.java:2202)
      at android.app.FragmentManagerImpl.moveToState(FragmentManager.java:942)
      at android.app.FragmentManagerImpl.moveToState(FragmentManager.java:1148)
      at android.app.BackStackRecord.run(BackStackRecord.java:793)
      at android.app.FragmentManagerImpl.execPendingActions(FragmentManager.java:1535)
      at android.app.FragmentManagerImpl.executePendingTransactions(FragmentManager.java:562)
      at com.android.settings.SettingsActivity.switchToFragment(SettingsActivity.java:1084)
      at com.android.settings.SettingsActivity.onCreate(SettingsActivity.java:657)
      at android.app.Activity.performCreate(Activity.java:6251)
      at android.app.Instrumentation.callActivityOnCreate(Instrumentation.java:1110)
      at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:2370)
      at android.app.ActivityThread.handleLaunchActivity(ActivityThread.java:2477)
      at android.app.ActivityThread.-wrap11(ActivityThread.java:-1)
      at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1345)
      at android.os.Handler.dispatchMessage(Handler.java:102)
      at android.os.Looper.loop(Looper.java:148)
      at android.app.ActivityThread.main(ActivityThread.java:5438)
      at java.lang.reflect.Method.invoke!(Native method)
      at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:762)
      at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:652)
      主线程被对端block,而对端是在systemserver中的NetworkPolicyManager。
      接下来继续去查NetworkPolicyManager为何会被block。
      通常情况下,可以搜索getNetworkPolicies,一般的对端的函数并不会修改函数名:
      看到我们的对端:
      "Binder_4" prio=5 tid=56 Blocked
      | group="main" sCount=1 dsCount=0 obj=0x1321a0a0 self=0xad31e200
      | sysTid=2491 nice=0 cgrp=default sched=0/0 handle=0x9fd00930
      | state=S schedstat=( 0 0 0 ) utm=46441 stm=46582 core=1 HZ=100
      | stack=0x9fc04000-0x9fc06000 stackSize=1014KB
      | held mutexes=
      at com.android.server.net.NetworkPolicyManagerService.getNetworkPolicies(NetworkPolicyManagerService.java:1696)
      - waiting to lock <0x07439315> (a java.lang.Object) held by thread 35
      at android.net.INetworkPolicyManager$Stub.onTransact(INetworkPolicyManager.java:145)
      at android.os.Binder.execTransact(Binder.java:453
      被Tid=35的人拿住了一把锁(0x07439315),那么继续看tid=35是谁,有两种方法:
      1. 搜索tid=35
      2. 搜索0x07439315,找到  - locked <0x07439315> (a java.lang.Object)
      
      "NetworkPolicy" prio=5 tid=35 TimedWaiting
      | group="main" sCount=1 dsCount=0 obj=0x12d98940 self=0x9f91f700
      | sysTid=2415 nice=0 cgrp=default sched=0/0 handle=0xa0f33930
      | state=S schedstat=( 0 0 0 ) utm=7681 stm=7783 core=0 HZ=100
      | stack=0xa0e31000-0xa0e33000 stackSize=1038KB
      | held mutexes=
      at java.lang.Object.wait!(Native method)
      - waiting on <0x02580c1b> (a java.lang.Object)
      at java.lang.Thread.parkFor$(Thread.java:1220)
      - locked <0x02580c1b> (a java.lang.Object)
      at sun.misc.Unsafe.park(Unsafe.java:299)
      at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2053)
      at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:372)
      at com.android.server.NativeDaemonConnector$ResponseQueue.remove(NativeDaemonConnector.java:634)
      at com.android.server.NativeDaemonConnector.executeForList(NativeDaemonConnector.java:426)
      at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:345)
      at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:340)
      at com.android.server.NetworkManagementService.setInterfaceQuota(NetworkManagementService.java:1712)
      - locked <0x0b0f91b8> (a java.lang.Object)
      at com.android.server.net.NetworkPolicyManagerService.setInterfaceQuota(NetworkPolicyManagerService.java:2421)
      at com.android.server.net.NetworkPolicyManagerService.updateNetworkRulesLocked(NetworkPolicyManagerService.java:1232)
      at com.android.server.net.NetworkPolicyManagerService$14.onReceive(NetworkPolicyManagerService.java:1060)
      - locked <0x07439315> (a java.lang.Object)
      at android.app.LoadedApk$ReceiverDispatcher$Args.run(LoadedApk.java:881)
      at android.os.Handler.handleCallback(Handler.java:739)
      at android.os.Handler.dispatchMessage(Handler.java:95)
      at android.os.Looper.loop(Looper.java:148)
      at android.os.HandlerThread.run(HandlerThread.java:61)
      
      可以看到,NetworkPolicy在通过NativeDaemonConnector和netd通信(setInterfaceQuota)
      我们结合log来看下是否有有用信息,按照之前的经验,netd在执行完cmd的时候,会打印出slow operation。
      在发生ANR的前后,查找netd相关的打印:
      06-19 15:29:00.997  1235  1270 I am_anr  : [0,22831,com.android.settings,818429509,Input dispatching timed out (Waiting because no window has focus but there is a focused application that may eventually add a window when it finishes starting up.)]
      06-19 15:29:05.683  1235  2415 E NetdConnector: NDC Command {55445 bandwidth setiquota seth_w0 9223372036854775807} took too long (4755ms
      06-19 15:29:05.723  1235  2491 I dvm_lock_sample: [system_server,1,Binder_4,4919,NetworkPolicyManagerService.java,1696,-,1056,100]
      从eventslog中还可以发现,setting出现了两次连续的ANR,而上下文中都有类似上面的dvm_lock_sample NetworkPolicyManagerService相关的告警。  
      因此虽然15:28的这次ANR并没有打出有用的栈,但是我们还是可以猜测出这两次ANR的原因都是netd的cmd耗时太久导致的(在主线程的调用路径上存在必然的block)
      那么,在netd可能无法修改的情况下,我们应该如何去resolve这个问题呢。
      将可能存在block的操作放到非UI线程中去做。
      
    

    2.com.huawei.hwvplayer.youku对端block:

      打开dropbox中对应的system_app_anrxxxx:
         查看主线程的栈:
      "main" prio=5 tid=1 Native
      | group="main" sCount=1 dsCount=0 obj=0x752b0000 self=0xb4276500
      | sysTid=25390 nice=-1 cgrp=default sched=3/0 handle=0xb6f18b34
      | state=S schedstat=( 0 0 0 ) utm=81 stm=12 core=2 HZ=100
      | stack=0xbe78b000-0xbe78d000 stackSize=8MB
      | held mutexes=
      kernel: (couldn't read /proc/self/task/25390/stack)
      native: #00 pc 000422d0  /system/lib/libc.so (__ioctl+8)
      native: #01 pc 00047825  /system/lib/libc.so (ioctl+14)
      native: #02 pc 0001e835  /system/lib/libbinder.so (_ZN7android14IPCThreadState14talkWithDriverEb+132)
      native: #03 pc 0001ee93  /system/lib/libbinder.so (_ZN7android14IPCThreadState15waitForResponseEPNS_6ParcelEPi+38)
      native: #04 pc 0001f049  /system/lib/libbinder.so (_ZN7android14IPCThreadState8transactEijRKNS_6ParcelEPS1_j+124)
      native: #05 pc 00019fe3  /system/lib/libbinder.so (_ZN7android8BpBinder8transactEjRKNS_6ParcelEPS1_j+30)
      native: #06 pc 0008a035  /system/lib/libandroid_runtime.so (???)
      native: #07 pc 00d78869  /data/dalvik-cache/arm/system@framework@boot.oat (Java_android_os_BinderProxy_transactNative__ILandroid_os_Parcel_2Landroid_os_Parcel_2I+140)
      at android.os.BinderProxy.transactNative(Native method)
      at android.os.BinderProxy.transact(Binder.java:510)
      at android.os.storage.IMountService$Stub$Proxy.getVolumeList(IMountService.java:771)
      at android.os.storage.StorageManager.getVolumeList(StorageManager.java:883)
      at android.os.Environment$UserEnvironment.getExternalDirs(Environment.java:95)
      at android.os.Environment.getExternalStorageDirectory(Environment.java:354)
      at com.huawei.common.utils.PathUtils.<clinit>(PathUtils.java:51)
      at com.huawei.common.utils.PathUtils.getWorkspacePath(PathUtils.java:80)
      at com.huawei.common.components.log.Logger.<clinit>(Logger.java:37)
      at com.huawei.common.components.log.Logger.i(Logger.java:162)
      at com.huawei.hwvplayer.data.db.DbProvider.attachInfo(DbProvider.java:89)
      at android.app.ActivityThread.installProvider(ActivityThread.java:5279)
      at android.app.ActivityThread.installContentProviders(ActivityThread.java:4868)
      at android.app.ActivityThread.handleBindApplication(ActivityThread.java:4799)
      at android.app.ActivityThread.access$1600(ActivityThread.java:165)
      at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1436)
      at android.os.Handler.dispatchMessage(Handler.java:102)
      at android.os.Looper.loop(Looper.java:188)
      at android.app.ActivityThread.main(ActivityThread.java:5578)
      at java.lang.reflect.Method.invoke!(Native method)
      at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:794)
      at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:684)
      
      主线程尝试去调用MountService的 getVolumeList接口,可能没有返回。
      查看system_server中相关的,搜索getVolumeList
      发现systemserver中有三个Binder线程和主线程被block,我们的对端是哪个Binder线程暂时无法确认,但这并不影响我们继续分析,因为他们被blockd的路径是一致的:
      
      
    "main" prio=5 tid=1 Blocked
      | group="main" sCount=1 dsCount=0 obj=0x752b0000 self=0xb4276500
      | sysTid=22735 nice=-2 cgrp=default sched=0/0 handle=0xb6f18b34
      | state=S schedstat=( 0 0 0 ) utm=432 stm=85 core=1 HZ=100
      | stack=0xbe78b000-0xbe78d000 stackSize=8MB
      | held mutexes=
      at com.android.server.MountService.getVolumeList(MountService.java:2759)
      - waiting to lock <0x0eeb54f1> (a java.lang.Object) held by thread 40
      at android.os.storage.StorageManager.getVolumeList(StorageManager.java:883)
      at android.os.storage.StorageManager.getVolumeList(StorageManager.java:858)
      at android.os.storage.StorageManager.getPrimaryVolume(StorageManager.java:906)
      at com.android.server.usb.UsbDeviceManager.systemReady(UsbDeviceManager.java:327)
      at com.android.server.usb.UsbService.systemReady(UsbService.java:181)
      at com.android.server.usb.UsbService$Lifecycle.onBootPhase(UsbService.java:78)
      at com.android.server.SystemServiceManager.startBootPhase(SystemServiceManager.java:135)
      at com.android.server.SystemServer$3.run(SystemServer.java:1489)
      at com.android.server.am.ActivityManagerService.systemReady(ActivityManagerService.java:12417)
      at com.android.server.am.HwActivityManagerService.systemReady(HwActivityManagerService.java:960)
      at com.android.server.SystemServer.startOtherServices(SystemServer.java:1485)
      at com.android.server.SystemServer.run(SystemServer.java:381)
      at com.android.server.SystemServer.main(SystemServer.java:272)
      at java.lang.reflect.Method.invoke!(Native method)
      at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:794)
      at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:684)
      
      "Binder_8" prio=5 tid=76 Blocked
      | group="main" sCount=1 dsCount=0 obj=0x13bd60a0 self=0x9c1abe00
      | sysTid=25191 nice=-1 cgrp=default sched=0/0 handle=0x97158930
      | state=S schedstat=( 0 0 0 ) utm=9 stm=7 core=2 HZ=100
      | stack=0x9705c000-0x9705e000 stackSize=1014KB
      | held mutexes=
      at com.android.server.MountService.getVolumeList(MountService.java:2759)
      - waiting to lock <0x0eeb54f1> (a java.lang.Object) held by thread 40
      at android.os.storage.IMountService$Stub.onTransact(IMountService.java:1634)
      at android.os.Binder.execTransact(Binder.java:453)
      
      "Binder_2" prio=5 tid=8 Blocked
      | group="main" sCount=1 dsCount=0 obj=0x12cac0a0 self=0xaebf0300
      | sysTid=22761 nice=-1 cgrp=default sched=0/0 handle=0xaef7d930
      | state=S schedstat=( 0 0 0 ) utm=41 stm=25 core=0 HZ=100
      | stack=0xaee81000-0xaee83000 stackSize=1014KB
      | held mutexes=
      at com.android.server.MountService.getVolumeList(MountService.java:2759)
      - waiting to lock <0x0eeb54f1> (a java.lang.Object) held by thread 40
      at android.os.storage.IMountService$Stub.onTransact(IMountService.java:1634)
      at android.os.Binder.execTransact(Binder.java:453)
       
      他们均是被tid=40的人block,按照上面的方法搜索tid=40或者 0x0eeb54f1得到block的人:
      "MountService" prio=5 tid=40 TimedWaiting
      | group="main" sCount=1 dsCount=0 obj=0x132c1160 self=0x9ce57400
      | sysTid=23512 nice=0 cgrp=default sched=0/0 handle=0x9a239930
      | state=S schedstat=( 0 0 0 ) utm=29 stm=2 core=2 HZ=100
      | stack=0x9a137000-0x9a139000 stackSize=1038KB
      | held mutexes=
      at java.lang.Object.wait!(Native method)
      - waiting on <0x00fea1f3> (a java.lang.Object)
      at java.lang.Thread.parkFor$(Thread.java:1235)
      - locked <0x00fea1f3> (a java.lang.Object)
      at sun.misc.Unsafe.park(Unsafe.java:299)
      at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2053)
      at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:372)
      at com.android.server.NativeDaemonConnector$ResponseQueue.remove(NativeDaemonConnector.java:777)
      at com.android.server.NativeDaemonConnector.executeForList(NativeDaemonConnector.java:489)
      at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:386)
      at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:381)
      at com.android.server.MountService.resetIfReadyAndConnectedLocked(MountService.java:827)
      at com.android.server.MountService.handleSystemReady(MountService.java:776)
      - locked <0x0eeb54f1> (a java.lang.Object)
      at com.android.server.MountService.access$500(MountService.java:152)
      at com.android.server.MountService$MountServiceHandler.handleMessage(MountService.java:596)
      at android.os.Handler.dispatchMessage(Handler.java:102)
      at android.os.Looper.loop(Looper.java:150)
      at android.os.HandlerThread.run(HandlerThread.java:61)
     
      和上面netd类似的,mountservice也是通过ndc和vold通信,这里我们需要继续查看是否vold存在异常。
      之前提到过,类似这种同步锁block的,dvm_lock_sample一定会有打印,于是先去找eventslog,不过这个是华为的log,是没有eventlog的。
      而华为实现了一个blockMonitor的功能,和dvm_lock_sample类似,当某个操作特别耗时的时候,会将其打印出来:
      在ANR的附近找到如下:
     07-19 10:17:50.739 25271 25271 W BlockMonitor: The binder calling took 55209ms.
    07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.BlockMonitor.checkBinderTime(BlockMonitor.java:141)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.BinderProxy.transact(Binder.java:511)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.storage.IMountService$Stub$Proxy.getVolumeList(IMountService.java:771)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.storage.StorageManager.getVolumeList(StorageManager.java:883)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.Environment$UserEnvironment.getExternalDirs(Environment.java:95)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.Environment.getExternalStorageDirectory(Environment.java:354)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: android.support.v4.content.FileProvider.parsePathStrategy(FileProvider.java:583)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: android.support.v4.content.FileProvider.getPathStrategy(FileProvider.java:534)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: android.support.v4.content.FileProvider.attachInfo(FileProvider.java:352)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: android.app.ActivityThread.installProvider(ActivityThread.java:5279)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: android.app.ActivityThread.installContentProviders(ActivityThread.java:4868)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: android.app.ActivityThread.handleBindApplication(ActivityThread.java:4799)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: android.app.ActivityThread.access$1600(ActivityThread.java:165)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: android.app.ActivityThread$H.handleMessage(ActivityThread.java:1436)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.Handler.dispatchMessage(Handler.java:102)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.Looper.loop(Looper.java:188)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: android.app.ActivityThread.main(ActivityThread.java:5578)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: java.lang.reflect.Method.invoke(Native Method)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:794)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: com.android.internal.os.ZygoteInit.main(ZygoteInit.java:684)
    再加上之前的:
    07-19 10:17:50.729 22735 23512 E NativeDaemonConnector.ResponseQueue: Timeout waiting for response
    07-19 10:17:50.729 22735 23512 E VoldConnector: timed-out waiting for response to 4 volume reset
    07-19 10:17:50.729 22735 23512 E VoldConnector: timed-out waiting for response mOutputStream = android.net.LocalSocketImpl$SocketOutputStream@922644, mSocket = vold
    07-19 10:17:50.731 22735 23512 W MountService: Failed to reset vold
    07-19 10:17:50.731 22735 23512 W MountService: com.android.server.NativeDaemonTimeoutException: command '4 volume reset' failed with 'null'
    07-19 10:17:50.731 22735 23512 W MountService: 	at com.android.server.NativeDaemonConnector.executeForList(NativeDaemonConnector.java:501)
    07-19 10:17:50.731 22735 23512 W MountService: 	at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:386)
    07-19 10:17:50.731 22735 23512 W MountService: 	at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:381)
    07-19 10:17:50.731 22735 23512 W MountService: 	at com.android.server.MountService.resetIfReadyAndConnectedLocked(MountService.java:827)
    07-19 10:17:50.731 22735 23512 W MountService: 	at com.android.server.MountService.handleSystemReady(MountService.java:776)
    07-19 10:17:50.731 22735 23512 W MountService: 	at com.android.server.MountService.access$500(MountService.java:152)
    07-19 10:17:50.731 22735 23512 W MountService: 	at com.android.server.MountService$MountServiceHandler.handleMessage(MountService.java:596)
    07-19 10:17:50.731 22735 23512 W MountService: 	at android.os.Handler.dispatchMessage(Handler.java:102)
    07-19 10:17:50.731 22735 23512 W MountService: 	at android.os.Looper.loop(Looper.java:150)
    07-19 10:17:50.731 22735 23512 W MountService: 	at android.os.HandlerThread.run(HandlerThread.java:61)
    我们有理由去推断vold的状态是不对的,但是又没有vold的栈。
    那么全局grep vold试试能不能找到线索,在kernellog中发现了vold的异常:
    07-19 14:22:24.669 <3>[10772.492156] c0 Freezing of tasks failed after 20.008 seconds (1 tasks refusing to freeze, wq_busy=0):
    07-19 14:22:24.669 <6>[10772.492217] c0 vold            R running      0   224      1 0x00000001
    07-19 14:22:24.669 <4>[10772.492278] c0 [<c05ebecc>] (__schedule+0x38c/0x5bc) from [<c05ea478>] (schedule_timeout+0x18/0x1e8)
    07-19 14:22:24.669 <4>[10772.492309] c0 [<c05ea478>] (schedule_timeout+0x18/0x1e8) from [<c05eb90c>] (wait_for_common+0x11c/0x164)
    07-19 14:22:24.669 <4>[10772.492309] c0 [<c05eb90c>] (wait_for_common+0x11c/0x164) from [<c03cd8c8>] (mmc_wait_for_req+0xb4/0xe4)
    07-19 14:22:24.669 <4>[10772.492339] c0 [<c03cd8c8>] (mmc_wait_for_req+0xb4/0xe4) from [<c03cd95c>] (mmc_wait_for_cmd+0x64/0x74)
    07-19 14:22:24.669 <4>[10772.492370] c0 [<c03cd95c>] (mmc_wait_for_cmd+0x64/0x74) from [<c03d41f0>] (mmc_send_status+0x6c/0x8c)
    07-19 14:22:24.670 <4>[10772.492400] c0 [<c03d41f0>] (mmc_send_status+0x6c/0x8c) from [<c03d4504>] (sd_send_status+0x14/0x44)
    07-19 14:22:24.670 <4>[10772.492431] c0 [<c03d4504>] (sd_send_status+0x14/0x44) from [<c03d491c>] (mmc_lock_unlock_by_buf+0xac/0x168)
    07-19 14:22:24.670 <4>[10772.492431] c0 [<c03d491c>] (mmc_lock_unlock_by_buf+0xac/0x168) from [<c03dabd8>] (mmc_lockable_store+0x594/0x75c)
    07-19 14:22:24.670 <4>[10772.492461] c0 [<c03dabd8>] (mmc_lockable_store+0x594/0x75c) from [<c029d560>] (dev_attr_store+0x18/0x24)
    07-19 14:22:32.070 <4>[10772.492492] c0 [<c029d560>] (dev_attr_store+0x18/0x24) from [<c013b370>] (sysfs_write_file+0x104/0x148)
    07-19 14:22:32.070 <4>[10772.492522] c0 [<c013b370>] (sysfs_write_file+0x104/0x148) from [<c00eabb4>] (vfs_write+0xd0/0x180)
    07-19 14:22:32.070 <4>[10772.492553] c0 [<c00eabb4>] (vfs_write+0xd0/0x180) from [<c00eb070>] (SyS_write+0x38/0x68)
    07-19 14:22:32.071 <4>[10772.492583] c0 [<c00eb070>] (SyS_write+0x38/0x68) from [<c000e840>] (ret_fast_syscall+0x0/0x30)
    vold一直在这个操作中没有退出来,所以不能响应客户端的请求,从而导致了ANR。
    这个问题需要mmc的同事进一步去分析,目前怀疑是SD卡发生了错误。
    

    3. setting 主线程耗时操作

    "main" prio=5 tid=1 Native
      | group="main" sCount=1 dsCount=0 obj=0x7425caf8 self=0xb4827800
      | sysTid=671 nice=0 cgrp=default sched=0/0 handle=0xb6febbec
      | state=S schedstat=( 0 0 0 ) utm=1853 stm=50 core=2 HZ=100
      | stack=0xbe7bd000-0xbe7bf000 stackSize=8MB
      | held mutexes=
      kernel: (couldn't read /proc/self/task/671/stack)
      native: #00 pc 000133cc  /system/lib/libc.so (syscall+28)
      native: #01 pc 000a9a83  /system/lib/libart.so (art::ConditionVariable::Wait(art::Thread*)+82)
      native: #02 pc 001b16f1  /system/lib/libart.so (art::JNI::NewString(_JNIEnv*, unsigned short const*, int)+640)
      native: #03 pc 00075887  /system/lib/libandroid_runtime.so (???)
      native: #04 pc 008570ab  /data/dalvik-cache/arm/system@framework@boot.oat (Java_android_database_CursorWindow_nativeGetString__JII+110)
      at android.database.CursorWindow.nativeGetString(Native method)
      at android.database.CursorWindow.getString(CursorWindow.java:438)
      at android.database.AbstractWindowedCursor.getString(AbstractWindowedCursor.java:51)
      at android.database.CursorWrapper.getString(CursorWrapper.java:114)
      at com.android.settings.ApnSettings.fillList(ApnSettings.java:259)
      at com.android.settings.ApnSettings.onResume(ApnSettings.java:208)
      at android.app.Fragment.performResume(Fragment.java:2096)
      at android.app.FragmentManagerImpl.moveToState(FragmentManager.java:928)
      at android.app.FragmentManagerImpl.moveToState(FragmentManager.java:1067)
      at android.app.FragmentManagerImpl.moveToState(FragmentManager.java:1049)
      at android.app.FragmentManagerImpl.dispatchResume(FragmentManager.java:1879)
      at android.app.Activity.performResume(Activity.java:6113)
      at android.app.ActivityThread.performResumeActivity(ActivityThread.java:3015)
      at android.app.ActivityThread.handleResumeActivity(ActivityThread.java:3061)
      at android.app.ActivityThread.handleLaunchActivity(ActivityThread.java:2415)
      at android.app.ActivityThread.access$800(ActivityThread.java:151)
      at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1313)
      at android.os.Handler.dispatchMessage(Handler.java:102)
      at android.os.Looper.loop(Looper.java:135)
      at android.app.ActivityThread.main(ActivityThread.java:5345)
      at java.lang.reflect.Method.invoke!(Native method)
      at java.lang.reflect.Method.invoke(Method.java:372)
      at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:947)
      at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:742)
      
    
      从ApnSettings.java中可以看到fillList这个可能的耗时操作主线程处理的地方有OnResume和OnReceive。
      且代码中存在使用HandlerThread异步处理这个操作的地方,可以借鉴下将这两处都放到HandlerThread中去做。
      
    

    4. 系统资源紧张造成的ANR

     通常情况下,一般这种问题的表现的可能形式是
         1. traces.txt中的主线程的栈在一个非常common的操作中,如new一个变量,读取某个文件等
         2. ANR的trace中IOW相当高(IOW+CPU sys+usr=100)
         3. Kswapd非常活跃
         4. Low Ram 配置
         5. kernel log中 D状态的用户进程比较多,且都block在内存相关,且内存比较紧张.
         6. 连续出现ANR,slog和ams一直在dumptrace,造成比较大的IO压力
         7. 前台正在玩一个很大的游戏
         8. 某个应用正在安装做dex2oat.
         出现这种情况下,我们需要做的是:
         1. 针对性的调整lmk参数,减少后台的数量
         2. 检查是否有不应该常驻的进程
         3. 是否需要修改伪前台的adj,减少常驻内存
         4. 是否需要关闭dex2oat
         此类问题比较难解决。
    

    5. Binder被占满导致ANR

    1. traces.txt中主线程处于IDLE状态,要么是没有抓到现场,要么是出现了其他问题。
    2. 分析log,找到ANR的时间点如下:后台服务执行超时,这个timeout是200s
       07-28 03:37:33.681  1719  1732 I am_anr  : [0,15625,com.android.exchange,948452933,executing service com.android.exchange/.service.EasService]
       ----- pid 15625 at 2016-07-28 03:37:33 -----,dumptrace的时间点。
    3. 查看是否有主线程的消息:
       07-28 03:37:39.373 15625 15625 W BlockMonitor: Package name: com.android.exchange
        07-28 03:37:39.373 15625 15625 W BlockMonitor: The Message{ what=115 obj=ServiceArgsData{token=android.os.BinderProxy@49814e2 
          startId=6 args=Intent { act=com.android.email.EXCHANGE_INTENT flg=0x4 cmp=com.android.exchange/.service.EasService (has extras) }} 
        target=android.app.ActivityThread$H } took 5174ms.
       之前有说过,华为项目是没有dvmlocksample但是有blockMonitor的,如果有耗时的binder调用以及mesg处理,那么就会被他抓过来。
       从代码中可以看到,what=115是serviceArg消息,是startservice的一部分,而这个消息应该是34秒的时候才接受到并处理的。
       那么,为什么33s的时候dumptrace并没有抓到主线程有卡顿?且后续并没有blockmonitor的打印,也就是说200s内主线程和systemserver并没有出现block。
       那么还有可能是什么原因导致了前200s并没有收到这个消息?
       我们来看下这部分的代码:
    
      private final void sendServiceArgsLocked(ServiceRecord r, boolean execInFg,
                boolean oomAdjusted) throws TransactionTooLargeException {
            final int N = r.pendingStarts.size();
            if (N == 0) {
                return;
            }
    
            while (r.pendingStarts.size() > 0) {
                Exception caughtException = null;
                ServiceRecord.StartItem si;
                try {
                    si = r.pendingStarts.remove(0);
                    if (DEBUG_SERVICE) Slog.v(TAG_SERVICE, "Sending arguments to: "
                            + r + " " + r.intent + " args=" + si.intent);
                    if (si.intent == null && N > 1) {
                        // If somehow we got a dummy null intent in the middle,
                        // then skip it.  DO NOT skip a null intent when it is
                        // the only one in the list -- this is to support the
                        // onStartCommand(null) case.
                        continue;
                    }
                    si.deliveredTime = SystemClock.uptimeMillis();
                    r.deliveredStarts.add(si);
                    si.deliveryCount++;
                    if (si.neededGrants != null) {
                        mAm.grantUriPermissionUncheckedFromIntentLocked(si.neededGrants,
                                si.getUriPermissionsLocked());
                    }
                    bumpServiceExecutingLocked(r, execInFg, "start"); //超时timer开始计算
                    if (!oomAdjusted) {
                        oomAdjusted = true;
                        mAm.updateOomAdjLocked(r.app);
                    }
                    int flags = 0;
                    if (si.deliveryCount > 1) {
                        flags |= Service.START_FLAG_RETRY;
                    }
                    if (si.doneExecutingCount > 0) {
                        flags |= Service.START_FLAG_REDELIVERY;
                    }
                    r.app.thread.scheduleServiceArgs(r, si.taskRemoved, si.id, flags, si.intent);//通过binder给App发送一个消息,ActivityThread会收到并处理。
                } catch (TransactionTooLargeException e) {
                    if (DEBUG_SERVICE) Slog.v(TAG_SERVICE, "Transaction too large: intent="
                            + si.intent);
                    caughtException = e;
                } catch (RemoteException e) {
                    // Remote process gone...  we'll let the normal cleanup take care of this.
                    if (DEBUG_SERVICE) Slog.v(TAG_SERVICE, "Crashed while sending args: " + r);
                    caughtException = e;
                } catch (Exception e) {
                    Slog.w(TAG, "Unexpected exception", e);
                    caughtException = e;
                }
    
                if (caughtException != null) {
                    // Keep nesting count correct
                    final boolean inDestroying = mDestroyingServices.contains(r);
                    serviceDoneExecutingLocked(r, inDestroying, inDestroying);
                    if (caughtException instanceof TransactionTooLargeException) {
                        throw (TransactionTooLargeException)caughtException;
                    }
                    break;
                }
            }
        }
    
    从上面的代码可以看出,在启动计时器和给app发送消息之间并不存在耗时操作,且这个是在同步锁之内的,如果慢了一定会被blockmonitor抓出来。
    那么问题的原因只有一种可能情形:
    200s之前给app发送的消息,APP并没有及时的处理。
    从经验可以猜测,一般这种情况都是app的Binder被占满导致没有空闲的Binder线程能够处理这个消息导致。
    而我们的traces正好dump到了这一幕:
    
    app所有的Binder线程都全部在做一件事情,一个依赖于网络的操作,是被另外一个人通过contentprovider.query调过来的,有点不太正常。
    "Binder_10" prio=5 tid=30 Native
      | group="main" sCount=1 dsCount=0 obj=0x22dba160 self=0xb2f26700
      | sysTid=4613 nice=10 cgrp=bg_non_interactive sched=0/0 handle=0x9eb80930
      | state=S schedstat=( 0 0 0 ) utm=0 stm=0 core=3 HZ=100
      | stack=0x9ea84000-0x9ea86000 stackSize=1014KB
      | held mutexes=
      kernel: (couldn't read /proc/self/task/4613/stack)
      native: #00 pc 000423e8  /system/lib/libc.so (__pselect6+20)
      native: #01 pc 0001c115  /system/lib/libc.so (select+60)
      native: #02 pc 0000a279  /system/lib/libjavacrypto.so (???)
      native: #03 pc 0000b29f  /system/lib/libjavacrypto.so (???)
      native: #04 pc 004bc3fd  /data/dalvik-cache/arm/system@framework@boot.oat (Java_com_android_org_conscrypt_NativeCrypto_SSL_1read__JLjava_io_FileDescriptor_2Lcom_android_org_conscrypt_NativeCrypto_00024SSLHandshakeCallbacks_2_3BIII+192)
      at com.android.org.conscrypt.NativeCrypto.SSL_read(Native method)
      at com.android.org.conscrypt.OpenSSLSocketImpl$SSLInputStream.read(OpenSSLSocketImpl.java:705)
      - locked <0x0508e00e> (a java.lang.Object)
      at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:108)
      at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:196)
      at org.apache.http.impl.conn.DefaultResponseParser.parseHead(DefaultResponseParser.java:88)
      at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:179)
      at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:185)
      at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:240)
      at org.apache.http.impl.conn.AbstractClientConnAdapter.receiveResponseHeader(AbstractClientConnAdapter.java:264)
      at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:284)
      at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:126)
      at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:440)
      at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:596)
      at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:517)
      at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:495)
      at com.android.exchange.EasResponse.fromHttpRequest(EasResponse.java:91)
      at com.android.exchange.service.EasServerConnection.executeHttpUriRequest(EasServerConnection.java:384)
      at com.android.exchange.eas.EasOperation.performOperation(EasOperation.java:382)
      at com.android.exchange.service.EasService.searchGal(EasService.java:706)
      at com.android.exchange.provider.ExchangeDirectoryProvider.query(ExchangeDirectoryProvider.java:350)
      at android.content.ContentProvider.query(ContentProvider.java:1017)
      at android.content.ContentProvider$Transport.query(ContentProvider.java:238)
      at android.content.ContentProviderNative.onTransact(ContentProviderNative.java:112)
      at android.os.Binder.execTransact(Binder.java:453)
    
    sprdruntime在ANR的时候会dump BinderState,搜索当前进程的pid,查看是否能够找到线索:
    07-28 03:37:33.677  1719  1732 W SprdRuntimeInfo: proc 15625
    07-28 03:37:33.677  1719  1732 W SprdRuntimeInfo:   thread 4537: l 01
    07-28 03:37:33.677  1719  1732 W SprdRuntimeInfo:     incoming transaction 625339: ed9d1ec0 from 30228:4535 to 15625:4537 code 1 flags 10 pri 10 r1 node 222269 size 736:4 data fb600d84
    07-28 03:37:33.677  1719  1732 W SprdRuntimeInfo:   thread 4538: l 01
    07-28 03:37:33.678  1719  1732 W SprdRuntimeInfo:     incoming transaction 625344: c2c47080 from 30228:4536 to 15625:4538 code 1 flags 10 pri 10 r1 node 222269 size 736:4 data fb601090
    07-28 03:37:33.678  1719  1732 W SprdRuntimeInfo:   thread 4563: l 01
    07-28 03:37:33.678  1719  1732 W SprdRuntimeInfo:     incoming transaction 625759: e887e6c0 from 30228:4558 to 15625:4563 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb6013f4
    07-28 03:37:33.678  1719  1732 W SprdRuntimeInfo:   thread 4564: l 01
    07-28 03:37:33.678  1719  1732 W SprdRuntimeInfo:     incoming transaction 625805: e6223a40 from 30228:4561 to 15625:4564 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb601704
    07-28 03:37:33.678  1719  1732 W SprdRuntimeInfo:   thread 4578: l 01
    07-28 03:37:33.678  1719  1732 W SprdRuntimeInfo:     incoming transaction 626144: e96d0cc0 from 30228:4577 to 15625:4578 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb601a70
    07-28 03:37:33.678  1719  1732 W SprdRuntimeInfo:   thread 4580: l 01
    07-28 03:37:33.678  1719  1732 W SprdRuntimeInfo:     incoming transaction 626162: d0a97440 from 30228:4579 to 15625:4580 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb601db0
    07-28 03:37:33.678  1719  1732 W SprdRuntimeInfo:   thread 4591: l 01
    07-28 03:37:33.678  1719  1732 W SprdRuntimeInfo:     incoming transaction 626562: d34d60c0 from 30228:4588 to 15625:4591 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb6020ec
    07-28 03:37:33.678  1719  1732 W SprdRuntimeInfo:   thread 4592: l 01
    07-28 03:37:33.678  1719  1732 W SprdRuntimeInfo:     incoming transaction 626575: ecb23800 from 30228:4590 to 15625:4592 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb602430
    07-28 03:37:33.678  1719  1732 W SprdRuntimeInfo:   thread 4602: l 01
    07-28 03:37:33.678  1719  1732 W SprdRuntimeInfo:     incoming transaction 626867: d3c64580 from 30228:4600 to 15625:4602 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb602770
    07-28 03:37:33.678  1719  1732 W SprdRuntimeInfo:   thread 4603: l 01
    07-28 03:37:33.678  1719  1732 W SprdRuntimeInfo:     incoming transaction 626883: e8b8bfc0 from 30228:4601 to 15625:4603 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb602ab4
    07-28 03:37:33.678  1719  1732 W SprdRuntimeInfo:   thread 4612: l 01
    07-28 03:37:33.678  1719  1732 W SprdRuntimeInfo:     incoming transaction 627170: e8b8bdc0 from 30228:4610 to 15625:4612 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb602df4
    07-28 03:37:33.678  1719  1732 W SprdRuntimeInfo:   thread 4613: l 01
    07-28 03:37:33.678  1719  1732 W SprdRuntimeInfo:     incoming transaction 627176: e8ded5c0 from 30228:4611 to 15625:4613 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb603138
    07-28 03:37:33.678  1719  1732 W SprdRuntimeInfo:   thread 15635: l 02
    07-28 03:37:33.678  1719  1732 W SprdRuntimeInfo:     incoming transaction 623807: edbbfa40 from 30228:1612 to 15625:15635 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb600404
    07-28 03:37:33.678  1719  1732 W SprdRuntimeInfo:   thread 15636: l 01
    07-28 03:37:33.678  1719  1732 W SprdRuntimeInfo:     incoming transaction 624820: edc2bd80 from 30228:30240 to 15625:15636 code 1 flags 10 pri 10 r1 node 222269 size 736:4 data fb600714
    07-28 03:37:33.678  1719  1732 W SprdRuntimeInfo:   thread 15658: l 01
    07-28 03:37:33.678  1719  1732 W SprdRuntimeInfo:     incoming transaction 624824: c2c47400 from 30228:30238 to 15625:15658 code 1 flags 10 pri 10 r1 node 222269 size 736:4 data fb600a20
    07-28 03:37:33.678  1719  1732 W SprdRuntimeInfo:   thread 18343: l 01
    07-28 03:37:33.678  1719  1732 W SprdRuntimeInfo:     incoming transaction 623794: d5132380 from 30228:30241 to 15625:18343 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb6000f4
    
    可以看到我们binder线程被30228给占满了,而同时我们也发现30228也有点不太对:
    07-28 03:37:33.562  1719  1732 W SprdRuntimeInfo: proc 30228
    07-28 03:37:33.562  1719  1732 W SprdRuntimeInfo:   thread 1612: l 11
    07-28 03:37:33.562  1719  1732 W SprdRuntimeInfo:     outgoing transaction 623807: edbbfa40 from 30228:1612 to 15625:15635 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb600404
    07-28 03:37:33.562  1719  1732 W SprdRuntimeInfo:     incoming transaction 623738: e994e880 from 2332:4503 to 30228:1612 code 1 flags 10 pri 10 r1 node 538358 size 776:4 data f880009c
    07-28 03:37:33.562  1719  1732 W SprdRuntimeInfo:   thread 4535: l 11
    07-28 03:37:33.562  1719  1732 W SprdRuntimeInfo:     outgoing transaction 625339: ed9d1ec0 from 30228:4535 to 15625:4537 code 1 flags 10 pri 10 r1 node 222269 size 736:4 data fb600d84
    07-28 03:37:33.562  1719  1732 W SprdRuntimeInfo:     incoming transaction 625317: e82f4000 from 2332:4556 to 30228:4535 code 1 flags 10 pri 10 r1 node 538358 size 776:4 data f8800d6c
    07-28 03:37:33.562  1719  1732 W SprdRuntimeInfo:   thread 4536: l 11
    07-28 03:37:33.562  1719  1732 W SprdRuntimeInfo:     outgoing transaction 625344: c2c47080 from 30228:4536 to 15625:4538 code 1 flags 10 pri 10 r1 node 222269 size 736:4 data fb601090
    07-28 03:37:33.562  1719  1732 W SprdRuntimeInfo:     incoming transaction 625326: ed9d1c40 from 2332:4557 to 30228:4536 code 1 flags 10 pri 10 r1 node 538358 size 776:4 data f88010a0
    07-28 03:37:33.562  1719  1732 W SprdRuntimeInfo:   thread 4558: l 11
    07-28 03:37:33.562  1719  1732 W SprdRuntimeInfo:     outgoing transaction 625759: e887e6c0 from 30228:4558 to 15625:4563 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb6013f4
    07-28 03:37:33.562  1719  1732 W SprdRuntimeInfo:     incoming transaction 625733: e887e5c0 from 2332:4575 to 30228:4558 code 1 flags 10 pri 10 r1 node 538358 size 780:4 data f88013d4
    07-28 03:37:33.562  1719  1732 W SprdRuntimeInfo:   thread 4561: l 11
    07-28 03:37:33.562  1719  1732 W SprdRuntimeInfo:     outgoing transaction 625805: e6223a40 from 30228:4561 to 15625:4564 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb601704
    07-28 03:37:33.562  1719  1732 W SprdRuntimeInfo:     incoming transaction 625756: e8719880 from 2332:4576 to 30228:4561 code 1 flags 10 pri 10 r1 node 538358 size 780:4 data f880170c
    07-28 03:37:33.562  1719  1732 W SprdRuntimeInfo:   thread 4577: l 11
    07-28 03:37:33.562  1719  1732 W SprdRuntimeInfo:     outgoing transaction 626144: e96d0cc0 from 30228:4577 to 15625:4578 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb601a70
    07-28 03:37:33.562  1719  1732 W SprdRuntimeInfo:     incoming transaction 626125: d34d6ec0 from 2332:4585 to 30228:4577 code 1 flags 10 pri 10 r1 node 538358 size 780:4 data f8801a44
    07-28 03:37:33.562  1719  1732 W SprdRuntimeInfo:   thread 4579: l 11
    07-28 03:37:33.562  1719  1732 W SprdRuntimeInfo:     outgoing transaction 626162: d0a97440 from 30228:4579 to 15625:4580 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb601db0
    07-28 03:37:33.562  1719  1732 W SprdRuntimeInfo:     incoming transaction 626131: c3f98200 from 2332:4587 to 30228:4579 code 1 flags 10 pri 10 r1 node 538358 size 780:4 data f8801d7c
    07-28 03:37:33.562  1719  1732 W SprdRuntimeInfo:   thread 4588: l 11
    07-28 03:37:33.562  1719  1732 W SprdRuntimeInfo:     outgoing transaction 626562: d34d60c0 from 30228:4588 to 15625:4591 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb6020ec
    07-28 03:37:33.562  1719  1732 W SprdRuntimeInfo:     incoming transaction 626541: ede63540 from 2332:4598 to 30228:4588 code 1 flags 10 pri 10 r1 node 538358 size 784:4 data f88020b4
    07-28 03:37:33.562  1719  1732 W SprdRuntimeInfo:   thread 4590: l 11
    07-28 03:37:33.562  1719  1732 W SprdRuntimeInfo:     outgoing transaction 626575: ecb23800 from 30228:4590 to 15625:4592 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb602430
    07-28 03:37:33.563  1719  1732 W SprdRuntimeInfo:     incoming transaction 626546: e9a32b80 from 2332:4599 to 30228:4590 code 1 flags 10 pri 10 r1 node 538358 size 784:4 data f88023f0
    07-28 03:37:33.563  1719  1732 W SprdRuntimeInfo:   thread 4600: l 11
    07-28 03:37:33.563  1719  1732 W SprdRuntimeInfo:     outgoing transaction 626867: d3c64580 from 30228:4600 to 15625:4602 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb602770
    07-28 03:37:33.563  1719  1732 W SprdRuntimeInfo:     incoming transaction 626849: ecb23e00 from 2332:4608 to 30228:4600 code 1 flags 10 pri 10 r1 node 538358 size 784:4 data f880272c
    07-28 03:37:33.563  1719  1732 W SprdRuntimeInfo:   thread 4601: l 11
    07-28 03:37:33.563  1719  1732 W SprdRuntimeInfo:     outgoing transaction 626883: e8b8bfc0 from 30228:4601 to 15625:4603 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb602ab4
    07-28 03:37:33.563  1719  1732 W SprdRuntimeInfo:     incoming transaction 626856: e8389f00 from 2332:4609 to 30228:4601 code 1 flags 10 pri 10 r1 node 538358 size 784:4 data f8802a68
    07-28 03:37:33.563  1719  1732 W SprdRuntimeInfo:   thread 4610: l 11
    07-28 03:37:33.563  1719  1732 W SprdRuntimeInfo:     outgoing transaction 627170: e8b8bdc0 from 30228:4610 to 15625:4612 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb602df4
    07-28 03:37:33.563  1719  1732 W SprdRuntimeInfo:     incoming transaction 627144: c27891c0 from 2332:4618 to 30228:4610 code 1 flags 10 pri 10 r1 node 538358 size 784:4 data f8802da4
    07-28 03:37:33.563  1719  1732 W SprdRuntimeInfo:   thread 4611: l 11
    07-28 03:37:33.563  1719  1732 W SprdRuntimeInfo:     outgoing transaction 627176: e8ded5c0 from 30228:4611 to 15625:4613 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb603138
    07-28 03:37:33.563  1719  1732 W SprdRuntimeInfo:     incoming transaction 627151: e2b511c0 from 2332:4619 to 30228:4611 code 1 flags 10 pri 10 r1 node 538358 size 784:4 data f88030e0
    
    也就是说实际上是2332query了30228,30228才会去query16625导致anr的。
    那么2332是谁?
    eventslog中可以看到2332是email,30228是android.process.acore。
    搜索traces.txt,看到我们在anr的时候有dump到2332的栈:
    发现了异常,有很多个线程(也就是上面dump到的2332的线程)在查询,
    "Filter" prio=5 tid=42 Native
      | group="main" sCount=1 dsCount=0 obj=0x130cc8e0 self=0x98637500
      | sysTid=4502 nice=10 cgrp=bg_non_interactive sched=0/0 handle=0x9877e930
      | state=S schedstat=( 0 0 0 ) utm=1 stm=0 core=0 HZ=100
      | stack=0x9867c000-0x9867e000 stackSize=1038KB
      | held mutexes=
      kernel: (couldn't read /proc/self/task/4502/stack)
      native: #00 pc 00017618  /system/lib/libc.so (syscall+28)
      native: #01 pc 000f5cbd  /system/lib/libart.so (_ZN3art17ConditionVariable4WaitEPNS_6ThreadE+80)
      native: #02 pc 00275df1  /system/lib/libart.so (_ZN3art3JNI18CallBooleanMethodVEP7_JNIEnvP8_jobjectP10_jmethodIDSt9__va_list+592)
      native: #03 pc 0008617b  /system/lib/libandroid_runtime.so (???)
      native: #04 pc 00089c2b  /system/lib/libandroid_runtime.so (???)
      native: #05 pc 000198dd  /system/lib/libbinder.so (_ZN7android7BBinder8transactEjRKNS_6ParcelEPS1_j+60)
      native: #06 pc 0001ec25  /system/lib/libbinder.so (_ZN7android14IPCThreadState14executeCommandEi+584)
      native: #07 pc 0001ef73  /system/lib/libbinder.so (_ZN7android14IPCThreadState15waitForResponseEPNS_6ParcelEPi+262)
      native: #08 pc 0001f049  /system/lib/libbinder.so (_ZN7android14IPCThreadState8transactEijRKNS_6ParcelEPS1_j+124)
      native: #09 pc 00019fe3  /system/lib/libbinder.so (_ZN7android8BpBinder8transactEjRKNS_6ParcelEPS1_j+30)
      native: #10 pc 0008a035  /system/lib/libandroid_runtime.so (???)
      native: #11 pc 00d78919  /data/dalvik-cache/arm/system@framework@boot.oat (Java_android_os_BinderProxy_transactNative__ILandroid_os_Parcel_2Landroid_os_Parcel_2I+140)
      at android.os.BinderProxy.transactNative(Native method)
      at android.os.BinderProxy.transact(Binder.java:505)
      at android.content.ContentProviderProxy.query(ContentProviderNative.java:419)
      at android.content.ContentResolver.query(ContentResolver.java:502)
      at android.content.ContentResolver.query(ContentResolver.java:438)
      at com.huawei.mail.chips.BaseRecipientAdapter.doQuery(BaseRecipientAdapter.java:931)
      at com.huawei.mail.chips.BaseRecipientAdapter$DirectoryFilter.performFiltering(BaseRecipientAdapter.java:400)
      at android.widget.Filter$RequestHandler.handleMessage(Filter.java:234)
      at android.os.Handler.dispatchMessage(Handler.java:102)
      at android.os.Looper.loop(Looper.java:150)
      at android.os.HandlerThread.run(HandlerThread.java:61)
    
    那么接下来就需要结合代码查看这么多线程创建的原因。
    解决的思路也比较简单,是否需要采取线程池或者限制下线程的个数,防止由于此操作比较耗时(依赖于网络)导致对端ANR。
    
    1. Bug 582473 single touch point test,操作15次以上必现工厂测试anr
       从dropbox中找到Validationtools对应的traces,查看主线程的栈,发现主线程在IDLE状态。
          
    "main" prio=5 tid=1 NATIVE
      | group="main" sCount=1 dsCount=0 obj=0x4155acc0 self=0x414943f0
      | sysTid=3287 nice=-1 sched=0/0 cgrp=apps handle=1074065748
      | state=S schedstat=( 0 0 0 ) utm=1241 stm=340 core=0
      #00  pc 00021764  /system/lib/libc.so (epoll_wait+12)
      #01  pc 000105e3  /system/lib/libutils.so (android::Looper::pollInner(int)+94)
      #02  pc 00010811  /system/lib/libutils.so (android::Looper::pollOnce(int, int*, int*, void**)+92)
      #03  pc 0006b345  /system/lib/libandroid_runtime.so (android::NativeMessageQueue::pollOnce(_JNIEnv*, int)+22)
      #04  pc 0001e70c  /system/lib/libdvm.so (dvmPlatformInvoke+112)
      #05  pc 0004dc43  /system/lib/libdvm.so (dvmCallJNIMethod(unsigned int const*, JValue*, Method const*, Thread*)+398)
      #06  pc 00027778  /system/lib/libdvm.so
      #07  pc 0002e410  /system/lib/libdvm.so (dvmMterpStd(Thread*)+76)
      #08  pc 0002bac8  /system/lib/libdvm.so (dvmInterpret(Thread*, Method const*, JValue*)+156)
      #09  pc 00060259  /system/lib/libdvm.so (dvmInvokeMethod(Object*, Method const*, ArrayObject*, ArrayObject*, ClassObject*, bool)+392)
      #10  pc 000683eb  /system/lib/libdvm.so
      #11  pc 00027778  /system/lib/libdvm.so
      #12  pc 0002e410  /system/lib/libdvm.so (dvmMterpStd(Thread*)+76)
      #13  pc 0002bac8  /system/lib/libdvm.so (dvmInterpret(Thread*, Method const*, JValue*)+156)
      #14  pc 0005ff77  /system/lib/libdvm.so (dvmCallMethodV(Thread*, Method const*, Object*, bool, JValue*, std::__va_list)+338)
      #15  pc 0004982b  /system/lib/libdvm.so
      #16  pc 0004de13  /system/lib/libandroid_runtime.so
      #17  pc 0004ecdd  /system/lib/libandroid_runtime.so (android::AndroidRuntime::start(char const*, char const*)+532)
      #18  pc 00001423  /system/bin/app_process
      #19  pc 0000e403  /system/lib/libc.so (__libc_init+50)
      #20  pc 00000f34  /system/bin/app_process
      at android.os.MessageQueue.nativePollOnce(Native Method)
      at android.os.MessageQueue.next(MessageQueue.java:138)
      at android.os.Looper.loop(Looper.java:123)
      at android.app.ActivityThread.main(ActivityThread.java:5294)
      at java.lang.reflect.Method.invokeNative(Native Method)
      at java.lang.reflect.Method.invoke(Method.java:515)
      at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:932)
      at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:748)
      at dalvik.system.NativeStart.main(Native Method)
      然后发现第二个线程是Binder10 且在block状态,通常情况下,一个不是非常繁忙的需要对外提供服务的app是不许要16个binder线程的。 
      于是接着查看是否其他binder也存在异常,结果发现所有的binder线程均处于同样的调用栈。
    
    "Binder_D" prio=5 tid=21 WAIT
      | group="main" sCount=1 dsCount=0 obj=0x41d6f468 self=0x555f4898
      | sysTid=3344 nice=0 sched=0/0 cgrp=apps handle=1442108096
      | state=S schedstat=( 0 0 0 ) utm=0 stm=0 core=3
      at java.lang.Object.wait(Native Method)
      - waiting on <0x41d6f4c0> (a java.lang.VMThread) held by tid=21 (Binder_D)
      at java.lang.Thread.parkFor(Thread.java:1205)
      at sun.misc.Unsafe.park(Unsafe.java:325)
      at java.util.concurrent.locks.LockSupport.park(LockSupport.java:157)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:813)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:846)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1175)
      at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:180)
      at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:256)
      at android.view.SurfaceView$MyWindow.resized(SurfaceView.java:655)
      at android.view.IWindow$Stub.onTransact(IWindow.java:108)
      at android.os.Binder.execTransact(Binder.java:427)
      at dalvik.system.NativeStart.run(Native Method)
    
    "Binder_F" prio=5 tid=23 WAIT
      | group="main" sCount=1 dsCount=0 obj=0x41d4e600 self=0x55cb63f8
      | sysTid=3351 nice=0 sched=0/0 cgrp=apps handle=1435714912
      | state=S schedstat=( 0 0 0 ) utm=1 stm=1 core=0
      at java.lang.Object.wait(Native Method)
      - waiting on <0x41d4e658> (a java.lang.VMThread) held by tid=23 (Binder_F)
      at java.lang.Thread.parkFor(Thread.java:1205)
      at sun.misc.Unsafe.park(Unsafe.java:325)
      at java.util.concurrent.locks.LockSupport.park(LockSupport.java:157)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:813)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:846)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1175)
      at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:180)
      at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:256)
      at android.view.SurfaceView$MyWindow.resized(SurfaceView.java:655)
      at android.view.IWindow$Stub.onTransact(IWindow.java:108)
      at android.os.Binder.execTransact(Binder.java:427)
      at dalvik.system.NativeStart.run(Native Method)
    
    如上面一个例子,在这种Binder满的情况下,systemserver和app之间的通信是被block的,虽然这个是4.4并没有sprdruntimeInfo给我们dump出binder的状态,我们还是可以推断出
    ANR的直接原因就是此处不合常理的Binder被占满。
    首先通过代码可以看到,surfaceView中这行对应的代码为如下:
    
      @Override
            public void resized(Rect frame, Rect overscanInsets, Rect contentInsets,
                    Rect visibleInsets, boolean reportDraw, Configuration newConfig) {
                SurfaceView surfaceView = mSurfaceView.get();
                if (surfaceView != null) {
                    if (DEBUG) Log.v(
                            "SurfaceView", surfaceView + " got resized: w=" + frame.width()
                            + " h=" + frame.height() + ", cur w=" + mCurWidth + " h=" + mCurHeight);
                    surfaceView.mSurfaceLock.lock();
                    try {
                        if (reportDraw) {
                            surfaceView.mUpdateWindowNeeded = true;
                            surfaceView.mReportDrawNeeded = true;
                            surfaceView.mHandler.sendEmptyMessage(UPDATE_WINDOW_MSG);
                        } else if (surfaceView.mWinFrame.width() != frame.width()
                                || surfaceView.mWinFrame.height() != frame.height()) {
                            surfaceView.mUpdateWindowNeeded = true;
                            surfaceView.mHandler.sendEmptyMessage(UPDATE_WINDOW_MSG);
                        }
                    } finally {
                        surfaceView.mSurfaceLock.unlock();
                    }
                }
            }
           mSurfaceLocklock的路径还有如下:
            private void updateWindow(boolean force, boolean redrawNeeded) 
            private final Canvas internalLockCanvas(Rect dirty)
    由于这个是个必现问题,那么问题就简单了,我们可以在所有会lock和unlock的地方都加上log,再复现一次来得到真相。
    于是乎发现是在SingleTouchPointTest pass的时候忘记unlockCanvs导致。
    

    套路总结

       1. 查看traces.txt,如果有block栈则通过栈来找到block原因。
       2. 应用自身耗时操作尝试修改异步,被Binder block则进一步确认下对端的情况,如果是依赖于外部状态的可能block操作,应用修改为异步,如果是common接口,转FW
       3. 如果不存在栈,那么尽人事听天命,查看ANR点附近是否有主线程block的可能信息,如果有则顺腾摸瓜,没有就Needinfo。
    

    如何避免ANR

              A.减少复杂的layout
    	  B.主线程尽量不要做和显示无关的事情。
    	  C.如果存在可能会block、耗时的操作,不要放到主线程中,可以使用异步等方式来放到另外一个handlerthread或者asynctask中去做。
    	    需要注意的是,使用handler.post或者sendmessage的时候需要确认清楚handler的looper,如果是主线程,依然可能会ANR。
    	    在HandlerThread中如果需要处理和显示相关的,还需要到主线程中处理(非UI线程不能操作UI,在ViewRoot中有检查)
    	  
    

    Android 开发

  • 相关阅读:
    Go 语言简介(下)— 特性
    Array.length vs Array.prototype.length
    【转】javascript Object使用Array的方法
    【转】大话程序猿眼里的高并发架构
    【转】The magic behind array length property
    【转】Build Your own Simplified AngularJS in 200 Lines of JavaScript
    【转】在 2016 年做 PHP 开发是一种什么样的体验?(一)
    【转】大话程序猿眼里的高并发
    php通过token验证表单重复提交
    windows 杀进程软件
  • 原文地址:https://www.cnblogs.com/dirt2/p/6138826.html
Copyright © 2011-2022 走看看