zoukankan      html  css  js  c++  java
  • nginx cpu高排查

    首先查看nginx的error日志,无异常打印。

    cpu占用如下图所示:

    top - 10:05:40 up 233 days, 16:28, 4 users, load average: 25.53, 25.39, 21.62
    Tasks: 836 total, 20 running, 816 sleeping, 0 stopped, 0 zombie
    Cpu(s): 51.8%us, 4.1%sy, 0.0%ni, 31.0%id, 6.1%wa, 0.0%hi, 6.9%si, 0.0%st
    Mem: 257940M total, 254533M used, 3407M free, 6481M buffers
    Swap: 0M total, 0M used, 0M free, 190273M cached

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    16645 root 20 0 15.6g 7.2g 7.1g R 99 2.9 19888:06 nginx
    16643 root 20 0 15.6g 7.2g 7.1g R 98 2.9 19917:31 nginx
    16641 root 20 0 15.6g 7.2g 7.1g R 98 2.9 19547:57 nginx
    16661 root 20 0 15.6g 7.2g 7.1g R 97 2.9 19957:15 nginx
    16659 root 20 0 15.6g 7.2g 7.1g R 97 2.9 20032:33 nginx
    16637 root 20 0 15.6g 7.2g 7.1g R 96 2.9 21085:44 nginx
    16640 root 20 0 15.6g 7.2g 7.1g R 96 2.9 16897:32 nginx
    16644 root 20 0 15.6g 7.2g 7.1g R 96 2.9 19860:42 nginx
    16638 root 20 0 15.6g 7.2g 7.1g R 96 2.9 16362:11 nginx
    16648 root 20 0 15.6g 7.2g 7.1g R 96 2.9 16627:30 nginx
    16669 root 20 0 15.6g 7.2g 7.1g R 96 2.9 17290:48 nginx
    16653 root 20 0 15.6g 7.2g 7.1g R 96 2.9 17093:40 nginx
    16656 root 20 0 15.6g 7.2g 7.1g R 96 2.9 19544:29 nginx

    nginx的cpu占用很高,且处于用户态占用。单独查看16645进程的cpu占用如下:

    top -H -p 16645

    top - 10:06:09 up 233 days, 16:28, 4 users, load average: 26.95, 25.73, 21.83
    Tasks: 68 total, 1 running, 67 sleeping, 0 stopped, 0 zombie
    Cpu(s): 8.0%us, 5.5%sy, 0.0%ni, 63.8%id, 13.0%wa, 0.0%hi, 9.7%si, 0.0%st
    Mem: 257940M total, 254387M used, 3552M free, 6481M buffers
    Swap: 0M total, 0M used, 0M free, 190106M cached

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    16645 root 20 0 15.6g 7.2g 7.1g R 94 2.9 15689:54 nginx
    18749 root 20 0 15.6g 7.2g 7.1g S 0 2.9 42:31.63 nginx
    18751 root 20 0 15.6g 7.2g 7.1g S 0 2.9 63:29.10 nginx
    18753 root 20 0 15.6g 7.2g 7.1g S 0 2.9 64:42.03 nginx
    18755 root 20 0 15.6g 7.2g 7.1g S 0 2.9 65:18.44 nginx
    18757 root 20 0 15.6g 7.2g 7.1g S 0 2.9 65:02.60 nginx

    还好,是单独一个线程cpu高,以前遇到复杂的,是cpu各个线程之间跳着高。

    想用perf看一下热点函数:

    perf top -p 16645
    If 'perf' is not a typo you can run the following command to lookup the package that contains the binary:
    command-not-found perf
    -bash: perf: command not found

    发现命令没装,由于是线上环境,出现一次也不容易,所以只能attch上去手工采样,看这个线程在忙什么。

    由于是多线程的进程,所以要确定线程号,然后看堆栈,info thr 确定线程号之后,切换到对应的堆栈,然后bt查看堆栈。

    部分堆栈如下:

    (gdb) info thr
    Id Target Id Frame
    68 Thread 0x7f4ef02e2700 (LWP 17129) "nginx" 0x00007f5154c1666c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
    67 Thread 0x7f4eefae1700 (LWP 17132) "nginx" 0x00007f5154c1666c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

    ......................//省略中间部分

    2 Thread 0x7f4d71635700 (LWP 19135) "nginx" 0x00007f5154015b1d in nanosleep () from /lib64/libc.so.6
    * 1 Thread 0x7f5155419720 (LWP 16645) "nginx" 0x00007f51547b4ac0 in ?? () from /usr/local/lib/libluajit-5.1.so.2

    (gdb) thr 1   --------因为我们cpu高的就是16645线程,所以切换到1
    [Switching to thread 1 (Thread 0x7f5155419720 (LWP 16645))]

    对应的堆栈如下:

    (gdb) bt
    #0 0x00007f51547b47e7 in ?? () from /usr/local/lib/libluajit-5.1.so.2
    #1 0x00007f51547b4865 in ?? () from /usr/local/lib/libluajit-5.1.so.2
    #2 0x00007f51547b4af4 in ?? () from /usr/local/lib/libluajit-5.1.so.2
    #3 0x00007f51547b4c17 in ?? () from /usr/local/lib/libluajit-5.1.so.2
    #4 0x00007f51547b489f in ?? () from /usr/local/lib/libluajit-5.1.so.2
    #5 0x00007f51547b54aa in ?? () from /usr/local/lib/libluajit-5.1.so.2
    #6 0x00007f515476c4eb in ?? () from /usr/local/lib/libluajit-5.1.so.2
    #7 0x00000000005400ab in ngx_http_lua_run_thread (L=0x417ac378, r=0xc26fd0, ctx=0xed1800, nrets=1) at ./required/lua-nginx-module-0.9.19/src/ngx_http_lua_util.c:1015
    #8 0x000000000052636a in ngx_http_lua_subrequest_resume (r=0xc26fd0) at ./required/lua-nginx-module-0.9.19/src/ngx_http_lua_subrequest.c:1595
    #9 0x0000000000545b0f in ngx_http_lua_content_wev_handler (r=0xc26fd0) at ./required/lua-nginx-module-0.9.19/src/ngx_http_lua_contentby.c:131

    由于用户态采样毕竟频率过低,最好要进行多次采样确定地址,连续多次采样,确认堆栈不变。

    ngx_http_lua_run_thread 是nginx调用的lua第三方模块,根据堆栈r=0xc26fd0,我们可以获取到如下信息:

    (gdb) p ((ngx_http_request_t*)0xc1cab0).request_line.data
    $22 = (
    u_char *) 0x1a4b5e0 "GET /001/2/ch00000090990000001073/2560000.m3u8?"...

    很可惜,url打印不全,这个是由于默认情况下,gdb会根据性能要求,打印一定长度的字符串,修改下gdb的字符串打印配置如下:

    (gdb) set print element 0           -----------这个0就是全部打印,如果是200,就默认打印200个字符,前提是没有遇到

    比如设置为1,则打印是如下:

    (gdb) set print element 1
    (gdb) p ((ngx_http_request_t*)0xc1cab0).request_line.data
    $28 = (u_char *) 0x1a4b5e0 "G"...

    重新设置为0,打印的数据涉及到保密,就不展示了,反正获取到了url,接下来就轻松了,curl请求该url,cpu立刻飙升至100,完美复现。

     后面的分析就不重要了,gdb跟一把ok。

    当然,有同事问到luajit后面的堆栈怎么处理,一般我是如下处理的:

    首先,获取对应pid的maps,然后看对应的库的的重定向起始地址,找到可执行x的那一行,

    7f5154763000-7f51547ce000 r-xp 00000000 08:02 294939 /usr/local/lib/libluajit-5.1.so.2.0.2--------------------7f5154763000为重定向的地址。
    7f51547ce000-7f51549cd000 ---p 0006b000 08:02 294939 /usr/local/lib/libluajit-5.1.so.2.0.2
    7f51549cd000-7f51549cf000 r--p 0006a000 08:02 294939 /usr/local/lib/libluajit-5.1.so.2.0.2
    7f51549cf000-7f51549d0000 rw-p 0006c000 08:02 294939 /usr/local/lib/libluajit-5.1.so.2.0.2

    然后将bt打印的地址减去7f5154763000,获取到偏移地址,

    0x00007f51547b47e7 -0x7f5154763000 = 0x517E7

    反汇编  /usr/local/lib/libluajit-5.1.so.2.0.2,objdump -D /usr/local/lib/libluajit-5.1.so.2.0.2 >caq.txt

    然后grep -i 517E7 caq.txt 找到对应的行号即可,我们这个问题的最终原因是因为m3u8文件内容异常,导致一直在处理lua的字符串。

    还有些人会喜欢用addr2line 来获取行号,不过我们编译的so文件不带符号表,获取出来是?。感兴趣的可以自己试一下。

    水平有限,如果有错误,请帮忙提醒我。如果您觉得本文对您有帮助,可以点击下面的 推荐 支持一下我。版权所有,需要转发请带上本文源地址,博客一直在更新,欢迎 关注 。
  • 相关阅读:
    如何实现分页功能
    学习Python的心路历程
    Python基础---协程
    Python基础---线程
    Python基础---python中的进程操作
    Python基础---进程相关基础
    Python基础---并发编程(操作系统的发展史)
    Python基础---网络编程3
    Python基础---网络编程2
    Python基础---面向对象3
  • 原文地址:https://www.cnblogs.com/10087622blog/p/7443079.html
Copyright © 2011-2022 走看看