zoukankan      html  css  js  c++  java
  • 排查golang的性能问题 go pprof 实践

     小结:

    1、内存消耗分析 list peek  定位到函数

     

     

    https://mp.weixin.qq.com/s/_LovnIqJYAuDpTm2QmUgrA

     

    使用pprof和go-torch排查golang的性能问题


    最近线上服务压力很大,api的p99有点扛不住。

    广告业务对延时的要求普遍比较严格,有些adx设置的超时时间低至100ms,因此亟需找出性能热点。

    根据对目前系统情况的估计(和metrics埋点数据),大致估计问题出在广告的正排环节。

    使用 pprof  也证明了这一块确实是热点:

     
    $ go tool pprof http://$IP:$PORT/debug/pprof/profile...(pprof) top 10Showing nodes accounting for 25.50s, 32.63% of 78.14s totalDropped 1533 nodes (cum <= 0.39s)Showing top 10 nodes out of 284      flat  flat%  sum%        cum  cum%    4.56s  5.84%  5.84%      4.87s  6.23%  syscall.Syscall    4.03s  5.16% 10.99%      4.03s  5.16%  runtime.aeshashbody    3.50s  4.48% 15.47%      6.01s  7.69%  git.xxx.org/xxx/target.NewFilter    2.78s  3.56% 19.03%      3.73s  4.77%  runtime.mapaccess2_fast64    2.63s  3.37% 22.40%      4.52s  5.78%  runtime.mapiternext    2.08s  2.66% 25.06%      2.16s  2.76%  runtime.heapBitsForObject    1.65s  2.11% 27.17%      1.93s  2.47%  runtime.mapaccess1_fast64    1.57s  2.01% 29.18%      2.96s  3.79%  runtime.mapaccess2    1.43s  1.83% 31.01%      2.06s  2.64%  runtime.runqgrab    1.27s  1.63% 32.63%      1.27s  1.63%  runtime.epollwait(pprof) pngGenerating report in profile001.png (使用生成的线框图查看耗时)

    其中第三行 NewFilter 就是正排过滤函数。因为一些历史原因,系统里不是所有定向条件都使用了倒排,正排实现起来毕竟简单、容易理解,而一旦开了这个口子,就会有越来越多正排加进来,推测是这个原因导致了性能的逐渐衰退。

    经过讨论,D同学花了一周多的时间,逐个梳理重写。在Libra(字节跳动内部的ABTest平台,参考谷歌分层实验框架方案)上开实验,平均耗时 -9%,从统计数据上来说,实验组比对照组有了显著的改进,但从最终结果上,整体的p95、p99超时都出现了进一步恶化。

    这说明真正的问题不在于正排的计算,优化的思路出现了偏差。

    考虑到晚高峰期间的cpu占用率也只是刚超过50%,也就是说有可能性能问题在于锁,但pprof的 block 和 mutex 都是空的,没有线索。

    猜测问题有可能在日志,代码里确实用得不少。日志用的是 github.com/ngaut/logging 库,每一次调用都会用到两个全局mutex。但通过调整log level 为error级别,大幅减少了日志量,并没有看到性能的改善。

    经过搜索,发现 uber 基于 pprof 开发了一个神器 go-torch,可以生成火焰图。安装好 go-torch 及依赖 FlameGraph 以后执行:

    $ go-torch  -u http://$IP:$PORT -f cpu.svgINFO[14:52:23] Run pprof command: go tool pprof -raw -seconds 30 http://$IP:$PORT/debug/pprof/profileINFO[14:52:54] Writing svg to cpu.svg

    用 Chrome 打开 cpu.svg,人肉排查:

    可以看到,在NewFilter旁边竟然还有一个耗时接近的 runtime.growslice ,结合实际代码(略作简化),可以推测是 slice 的初始化长度不足。

    matchAds := make([]*ad, 0, 4096)adsBitMap.GetList(func(seq int) {    if NewFilter(ctx, ad) {        matchAds = append(matchAds, adlist[seq])    }})
    // 顺便提一下,bitmap是一个uint64数组,GetList(f) 是将每一个等于1的bit索引传给 f// GetList方法里面用了cpu的BSF指令来提高性能。

    实际上最终定向后得到的广告往往在数万甚至数十万的级别,而 go 的 slice 扩容在超过1024个元素以后是1.25倍,可想而知会出现大量的内存分配和拷贝,导致性能随着广告数量的增加逐渐恶化。最近的广告数量也确实有了大幅的上升 —— 逻辑上形成了闭环。

    经过优化,使用更大的初始化长度,并且使用 sync.Pool 来进一步减少内存分配,最终上线后p95和p99都下降了超过50%,效果显著。

    参考文章:
    golang 使用pprof和go-torch做性能分析 

    https://www.cnblogs.com/li-peng/p/9391543.html

    Go语言性能优化工具pprof文本输出的含义_梁喜健_新浪博客 http://blog.sina.com.cn/s/blog_48c95a190102xtse.html

    Go语言性能优化工具pprof文本输出的含义

     (2018-10-13 12:02:19)
     
        对于广大gopher来说,pprof是做性能优化时经常会用到的一个工具,但很多刚刚入坑的开发者难免会对pprof文本输出中的某些字段感到困惑,这里不妨做一个简单的说明。以下是一段典型的pprof函数耗时的文本输出,其中前几行比较容易理解,其说明了显示出来的函数耗时总计占用了5.73s,而全部耗时为6.21秒,所以显示出来的函数耗时占总体的92.27%,其中有cum耗时小于0.03秒的67个函数耗时被丢弃而未予显示。
     
    (pprof) top78
     Showing nodes accounting for 5.73s, 92.27% of 6.21s total
     Dropped 67 nodes (cum <= 0.03s)
     Showing top 78 nodes out of 79
          flat  flat%   sum%        cum   cum%
         4.14s 66.67% 66.67%      4.14s 66.67%  runtime.kevent /usr/local/go/src/runtime/sys_darwin_amd64.s
         0.52s  8.37% 75.04%      0.52s  8.37%  runtime.mach_semaphore_signal /usr/local/go/src/runtime/sys_darwin_amd64.s
         0.34s  5.48% 80.52%      0.34s  5.48%  runtime.mach_semaphore_timedwait /usr/local/go/src/runtime/sys_darwin_amd64.s
         0.17s  2.74% 83.25%      0.17s  2.74%  runtime.mach_semaphore_wait /usr/local/go/src/runtime/sys_darwin_amd64.s
         0.13s  2.09% 85.35%      0.13s  2.09%  runtime.scanstack /usr/local/go/src/runtime/mgcmark.go
         0.07s  1.13% 86.47%      0.07s  1.13%  runtime.memmove /usr/local/go/src/runtime/memmove_amd64.s
         0.06s  0.97% 87.44%      0.06s  0.97%  runtime.usleep /usr/local/go/src/runtime/sys_darwin_amd64.s
     
        接下来的一大坨文本涉及到了这样几列字段:flat和flat%、sum%、cum和cum%,其中flat代表的是该函数自身代码的执行时长,而cum代表的是该函数自身代码+所有调用的函数的执行时长。这样解释可能不太直观,我们以下面的例子来说明,函数b由三部分组成:调用函数c、自己直接处理一些事情、调用函数d,其中调用函数c耗时1秒,自己直接处理事情耗时3秒,调用函数d耗时2秒,那么函数b的flat耗时就是3秒,cum耗时就是6秒。
     
    func b() {
         c() // takes 1s
         do something directly // takes 3s
         d() // takes 2s
    }
     
        至于flat%和cum%指的就是flat耗时和cum耗时占总耗时(也就是6.21秒)的百分比,而最后一个sum%指的就是每一行的flat%与上面所有行的flat%总和,代表从上到下的累计值,比如第二行的75.04%就等于第一行flat%的66.67%+本行flat%的8.37%,下面的以此类推。
     
     

    https://mp.weixin.qq.com/s/4Vn1Rq82wOFiLdEmjXU0fw

     

    go pprof与线上事故:一次成功的定位与失败的复现

     
    •  Flat:函数自身运行耗时

    • Flat%:函数自身耗时比例
    • Sum%:指的就是每一行的flat%与上面所有行的flat%总和
    • Cum:当前函数加上它之上的调用运行总耗时
    • Cum%:当前函数加上它之上的调用运行总耗时比例

    举例说明:函数b由三部分组成:调用函数c、自己直接处理一些事情、调用函数d,其中调用函数c耗时1秒,自己直接处理事情耗时3秒,调用函数d耗时2秒,那么函数bflat耗时就是3秒,cum耗时就是6秒。

    // 该示例在文末参考列表的博客中func b() {    c() // takes 1s    do something directly // takes 3s    d() // takes 2s}


    [test@demo ~]$ go tool pprof   http://localhost:8003/debug/pprof/profile
    Fetching profile over HTTP from http://localhost:8003/debug/pprof/profile
    Saved profile in /home/test/pprof/pprof.admin.samples.cpu.010.pb.gz
    File: admin
    Type: cpu
    Time: Dec 24, 2020 at 9:48am (CST)
    Duration: 30s, Total samples = 660ms ( 2.20%)
    Entering interactive mode (type "help" for commands, "o" for options)
    (pprof) top32
    Showing nodes accounting for 600ms, 90.91% of 660ms total
    Showing top 32 nodes out of 228
          flat  flat%   sum%        cum   cum%
         130ms 19.70% 19.70%      160ms 24.24%  syscall.Syscall
         100ms 15.15% 34.85%      100ms 15.15%  runtime.futex
          30ms  4.55% 39.39%       30ms  4.55%  runtime.epollwait
          30ms  4.55% 43.94%       30ms  4.55%  syscall.Syscall6
          20ms  3.03% 46.97%       20ms  3.03%  runtime.mapassign_faststr
          20ms  3.03% 50.00%       20ms  3.03%  runtime.mapiterinit
          20ms  3.03% 53.03%       20ms  3.03%  runtime.usleep
          10ms  1.52% 54.55%      120ms 18.18%  AdminSite/middleware/auth/v1.(*auth0Client).CheckAdminAuth
    
    // 注意此处,该函数内部发起了grpc请求,请求服务端认证token,故flat 10ms 而 cum 120ms	。
    	  
          10ms  1.52% 56.06%       10ms  1.52%  aeshashbody
          10ms  1.52% 57.58%       20ms  3.03%  github.com/go-kratos/kratos/pkg/net/http/blademaster.parseMetadataTo
          10ms  1.52% 59.09%       10ms  1.52%  golang.org/x/net/http2.(*Framer).WriteHeaders
          10ms  1.52% 60.61%       10ms  1.52%  google.golang.org/grpc.(*pickerWrapper).pick
          10ms  1.52% 62.12%       10ms  1.52%  google.golang.org/grpc/internal/transport.(*http2Client).handleData
          10ms  1.52% 63.64%       10ms  1.52%  google.golang.org/grpc/internal/transport.(*outStreamList).dequeue
          10ms  1.52% 65.15%       10ms  1.52%  google.golang.org/protobuf/internal/impl.(*stringConverter).PBValueOf
          10ms  1.52% 66.67%       10ms  1.52%  internal/poll.runtime_pollSetDeadline
          10ms  1.52% 68.18%       10ms  1.52%  memeqbody
          10ms  1.52% 69.70%       10ms  1.52%  net.sockaddrToTCP
          10ms  1.52% 71.21%       10ms  1.52%  net/http.Header.sortedKeyValues
          10ms  1.52% 72.73%       10ms  1.52%  net/url.escape
          10ms  1.52% 74.24%       10ms  1.52%  runtime.(*spanSet).push
          10ms  1.52% 75.76%       10ms  1.52%  runtime.acquireSudog
          10ms  1.52% 77.27%       10ms  1.52%  runtime.checkTimers
          10ms  1.52% 78.79%       10ms  1.52%  runtime.efaceeq
          10ms  1.52% 80.30%       10ms  1.52%  runtime.gcWriteBarrier
          10ms  1.52% 81.82%       10ms  1.52%  runtime.gentraceback
          10ms  1.52% 83.33%       10ms  1.52%  runtime.getitab
          10ms  1.52% 84.85%       30ms  4.55%  runtime.mallocgc
          10ms  1.52% 86.36%       10ms  1.52%  runtime.mapaccess1
          10ms  1.52% 87.88%       20ms  3.03%  runtime.mapaccess1_faststr
          10ms  1.52% 89.39%       10ms  1.52%  runtime.mapassign
          10ms  1.52% 90.91%       10ms  1.52%  runtime.nextFreeFast (inline)
    (pprof) top32 -cum  
    
    // 按照cum逆序
    
    Showing nodes accounting for 240ms, 36.36% of 660ms total
    Showing top 32 nodes out of 228
          flat  flat%   sum%        cum   cum%
             0     0%     0%      320ms 48.48%  net/http.(*conn).serve
             0     0%     0%      240ms 36.36%  github.com/go-kratos/kratos/pkg/net/http/blademaster.(*Engine).ServeHTTP
             0     0%     0%      240ms 36.36%  github.com/go-kratos/kratos/pkg/net/http/blademaster.(*Engine).handleContext
             0     0%     0%      240ms 36.36%  net/http.serverHandler.ServeHTTP
             0     0%     0%      200ms 30.30%  github.com/go-kratos/kratos/pkg/net/http/blademaster.(*Context).Next
             0     0%     0%      200ms 30.30%  github.com/go-kratos/kratos/pkg/net/http/blademaster.HandlerFunc.ServeHTTP
             0     0%     0%      200ms 30.30%  github.com/go-kratos/kratos/pkg/net/http/blademaster.Recovery.func1
             0     0%     0%      200ms 30.30%  github.com/go-kratos/kratos/pkg/net/http/blademaster.Trace.func1
             0     0%     0%      190ms 28.79%  github.com/go-kratos/kratos/pkg/net/http/blademaster.Logger.func1
         130ms 19.70% 19.70%      160ms 24.24%  syscall.Syscall
             0     0% 19.70%      150ms 22.73%  internal/poll.ignoringEINTR
             0     0% 19.70%      120ms 18.18%  AdminSite/internal/server/http.MidAuthHandler
          10ms  1.52% 21.21%      120ms 18.18%  AdminSite/middleware/auth/v1.(*auth0Client).CheckAdminAuth
             0     0% 21.21%      110ms 16.67%  github.com/go-kratos/kratos/pkg/net/rpc/warden.(*Client).recovery.func1
             0     0% 21.21%      110ms 16.67%  github.com/go-kratos/kratos/pkg/net/rpc/warden.chainUnaryClient.func2
             0     0% 21.21%      110ms 16.67%  github.com/go-kratos/kratos/pkg/net/rpc/warden.chainUnaryClient.func2.1
             0     0% 21.21%      110ms 16.67%  github.com/go-kratos/kratos/pkg/net/rpc/warden.clientLogging.func1
             0     0% 21.21%      110ms 16.67%  google.golang.org/grpc.(*ClientConn).Invoke
         100ms 15.15% 36.36%      100ms 15.15%  runtime.futex
             0     0% 36.36%       90ms 13.64%  google.golang.org/grpc/internal/transport.(*loopyWriter).run
             0     0% 36.36%       90ms 13.64%  google.golang.org/grpc/internal/transport.newHTTP2Client.func3
             0     0% 36.36%       90ms 13.64%  runtime.findrunnable
             0     0% 36.36%       90ms 13.64%  runtime.mcall
             0     0% 36.36%       90ms 13.64%  runtime.schedule
             0     0% 36.36%       80ms 12.12%  internal/poll.(*FD).Read
             0     0% 36.36%       80ms 12.12%  internal/poll.(*FD).Read.func1
             0     0% 36.36%       80ms 12.12%  net.(*conn).Read
             0     0% 36.36%       80ms 12.12%  net.(*netFD).Read
             0     0% 36.36%       80ms 12.12%  runtime.park_m
             0     0% 36.36%       80ms 12.12%  syscall.Read (inline)
             0     0% 36.36%       80ms 12.12%  syscall.read
             0     0% 36.36%       70ms 10.61%  github.com/go-kratos/kratos/pkg/net/rpc/warden.(*Client).handle.func1
    (pprof) top32 -flat
    
    // 按照flat逆序
    
    Ignore expression matched no samples
    Active filters:
       ignore=flat
    Showing nodes accounting for 600ms, 90.91% of 660ms total
    Showing top 32 nodes out of 228
          flat  flat%   sum%        cum   cum%
         130ms 19.70% 19.70%      160ms 24.24%  syscall.Syscall
         100ms 15.15% 34.85%      100ms 15.15%  runtime.futex
          30ms  4.55% 39.39%       30ms  4.55%  runtime.epollwait
          30ms  4.55% 43.94%       30ms  4.55%  syscall.Syscall6
          20ms  3.03% 46.97%       20ms  3.03%  runtime.mapassign_faststr
          20ms  3.03% 50.00%       20ms  3.03%  runtime.mapiterinit
          20ms  3.03% 53.03%       20ms  3.03%  runtime.usleep
          10ms  1.52% 54.55%      120ms 18.18%  AdminSite/middleware/auth/v1.(*auth0Client).CheckAdminAuth
          10ms  1.52% 56.06%       10ms  1.52%  aeshashbody
          10ms  1.52% 57.58%       20ms  3.03%  github.com/go-kratos/kratos/pkg/net/http/blademaster.parseMetadataTo
          10ms  1.52% 59.09%       10ms  1.52%  golang.org/x/net/http2.(*Framer).WriteHeaders
          10ms  1.52% 60.61%       10ms  1.52%  google.golang.org/grpc.(*pickerWrapper).pick
          10ms  1.52% 62.12%       10ms  1.52%  google.golang.org/grpc/internal/transport.(*http2Client).handleData
          10ms  1.52% 63.64%       10ms  1.52%  google.golang.org/grpc/internal/transport.(*outStreamList).dequeue
          10ms  1.52% 65.15%       10ms  1.52%  google.golang.org/protobuf/internal/impl.(*stringConverter).PBValueOf
          10ms  1.52% 66.67%       10ms  1.52%  internal/poll.runtime_pollSetDeadline
          10ms  1.52% 68.18%       10ms  1.52%  memeqbody
          10ms  1.52% 69.70%       10ms  1.52%  net.sockaddrToTCP
          10ms  1.52% 71.21%       10ms  1.52%  net/http.Header.sortedKeyValues
          10ms  1.52% 72.73%       10ms  1.52%  net/url.escape
          10ms  1.52% 74.24%       10ms  1.52%  runtime.(*spanSet).push
          10ms  1.52% 75.76%       10ms  1.52%  runtime.acquireSudog
          10ms  1.52% 77.27%       10ms  1.52%  runtime.checkTimers
          10ms  1.52% 78.79%       10ms  1.52%  runtime.efaceeq
          10ms  1.52% 80.30%       10ms  1.52%  runtime.gcWriteBarrier
          10ms  1.52% 81.82%       10ms  1.52%  runtime.gentraceback
          10ms  1.52% 83.33%       10ms  1.52%  runtime.getitab
          10ms  1.52% 84.85%       30ms  4.55%  runtime.mallocgc
          10ms  1.52% 86.36%       10ms  1.52%  runtime.mapaccess1
          10ms  1.52% 87.88%       20ms  3.03%  runtime.mapaccess1_faststr
          10ms  1.52% 89.39%       10ms  1.52%  runtime.mapassign
          10ms  1.52% 90.91%       10ms  1.52%  runtime.nextFreeFast (inline)
    (pprof)
    
     

    go tool pprof http://127.0.0.1:123/debug/pprof/heap

    使用list命令直接可以查看到具体是哪一行分配了多少内存

              list         Output annotated source for functions matching regexp

             pdf          Outputs a graph in PDF format

             peek         Output callers/callees of functions matching regexp

    (pprof) list new
    Total: 1.02MB
    ROUTINE ======================== google.golang.org/grpc/internal/transport.newFramer in /home/shawn/go/pkg/mod/google.golang.org/grpc@v1.29.1/internal/transport/http_util.go
             0   528.17kB (flat, cum) 50.74% of Total
             .          .    658:   if writeBufferSize < 0 {
             .          .    659:           writeBufferSize = 0
             .          .    660:   }
             .          .    661:   var r io.Reader = conn
             .          .    662:   if readBufferSize > 0 {
             .   528.17kB    663:           r = bufio.NewReaderSize(r, readBufferSize)
             .          .    664:   }
             .          .    665:   w := newBufWriter(conn, writeBufferSize)
             .          .    666:   f := &framer{
             .          .    667:           writer: w,
             .          .    668:           fr:     http2.NewFramer(w, r),
    ROUTINE ======================== google.golang.org/grpc/internal/transport.newHTTP2Client in /home/shawn/go/pkg/mod/google.golang.org/grpc@v1.29.1/internal/transport/http2_client.go
             0   528.17kB (flat, cum) 50.74% of Total
             .          .    244:           localAddr:             conn.LocalAddr(),
             .          .    245:           authInfo:              authInfo,
             .          .    246:           readerDone:            make(chan struct{}),
             .          .    247:           writerDone:            make(chan struct{}),
             .          .    248:           goAway:                make(chan struct{}),
             .   528.17kB    249:           framer:                newFramer(conn, writeBufSize, readBufSize, maxHeaderListSize),
             .          .    250:           fc:                    &trInFlow{limit: uint32(icwz)},
             .          .    251:           scheme:                scheme,
             .          .    252:           activeStreams:         make(map[uint32]*Stream),
             .          .    253:           isSecure:              isSecure,
             .          .    254:           perRPCCreds:           perRPCCreds,
    (pprof) 
    
    
    (pprof) peek new
    Showing nodes accounting for 1040.92kB, 100% of 1040.92kB total
    ----------------------------------------------------------+-------------
          flat  flat%   sum%        cum   cum%   calls calls% + context              
    ----------------------------------------------------------+-------------
                                              528.17kB   100% |   google.golang.org/grpc/internal/transport.newHTTP2Client
             0     0%     0%   528.17kB 50.74%                | google.golang.org/grpc/internal/transport.newFramer
                                              528.17kB   100% |   bufio.NewReaderSize (inline)
    ----------------------------------------------------------+-------------
                                              528.17kB   100% |   google.golang.org/grpc/internal/transport.NewClientTransport
             0     0%     0%   528.17kB 50.74%                | google.golang.org/grpc/internal/transport.newHTTP2Client
                                              528.17kB   100% |   google.golang.org/grpc/internal/transport.newFramer
    ----------------------------------------------------------+-------------
    (pprof) 
    

      

             list         Output annotated source for functions matching regexp
             pdf          Outputs a graph in PDF format
             peek         Output callers/callees of functions matching regexp
  • 相关阅读:
    日志类
    sql查询数据并导出问题
    高并发系统设计(十七):【系统架构】微服务化后,系统架构要如何改造?
    高并发系统设计(十五):【消息队列】如何降低消息队列系统中消息的延迟?
    高并发系统设计(十四):【消息队列】如何消息不丢失?并且保证消息仅仅被消费一次?
    高并发系统设计(十三):消息队列的三大作用:削峰填谷、异步处理、模块解耦
    高并发系统设计(十二):【缓存的正确使用姿势】缓存穿透了怎么办?如何最大程度避免缓存穿透
    高并发系统设计(十一):【缓存的正确使用姿势】缓存如何做到高可用?
    ThinkPad X1 Carbon无法识别第二屏幕
    如何设置两个TPLink路由器桥接
  • 原文地址:https://www.cnblogs.com/rsapaper/p/11576687.html
Copyright © 2011-2022 走看看