Nginx error: connect() failed (110: Connection timed out) while connecting to upstream

    Reposted from

    Author: 栈木头
    Link: https://www.jianshu.com/p/f0f05c02e93a

    Background
    While load-testing an application service, Nginx began reporting errors after roughly one minute of sustained requests. It took some time to track down the cause, but the problem was eventually located; the process is summarized below.

    Load-testing tool
    The tests were run with siege, which makes it easy to specify the number of concurrent users and the test duration, and reports results clearly: successful requests, failures, throughput, and other performance figures.

    Test parameters
    A single endpoint, 100 concurrent users, sustained for 1 minute.
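
    For reference, a siege invocation matching these parameters might look like the following (the URL is a placeholder for the endpoint under test):

      # 100 concurrent users, run for 1 minute
      siege -c 100 -t 1M http://192.168.xx.xx/guide/v1/activities/1107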

    Load-testing tool errors

    The server is now under siege...
    [error] socket: unable to connect sock.c:249: Connection timed out
    [error] socket: unable to connect sock.c:249: Connection timed out
    

      

    Nginx error.log errors

    2018/11/21 17:31:23 [error] 15622#0: *24993920 connect() failed (110: Connection timed out) while connecting to upstream, client: 192.168.xx.xx, server: xx-qa.xx.com, request: "GET /guide/v1/activities/1107 HTTP/1.1", upstream: "http://192.168.xx.xx:8082/xx/v1/activities/1107", host: "192.168.86.90"
    
    2018/11/21 18:21:09 [error] 4469#0: *25079420 connect() failed (110: Connection timed out) while connecting to upstream, client: 192.168.xx.xx, server: xx-qa.xx.com, request: "GET /guide/v1/activities/1107 HTTP/1.1", upstream: "http://192.168.xx.xx:8082/xx/v1/activities/1107", host: "192.168.86.90"
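
    A quick way to count how many of these errors accumulated during a run (the log path shown is the common default and may differ on your system):

      grep -c 'connect() failed (110' /var/log/nginx/error.log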
    

      

    Troubleshooting

    1. Seeing "timed out", the first instinct was a performance problem in the application service, leaving it unable to respond under concurrent load. However, checking the application service's logs showed no errors at all.

    2. Watching the application service's CPU load (via docker stats <container-id> for the Docker container) showed CPU usage rising under concurrent load with no other anomalies, which is normal. Continued observation, however, showed that once the load-test errors began, CPU load on the application host dropped and request logs stopped appearing in the application logs. So the failure to respond most likely originated one hop earlier in the chain, namely Nginx.

    3. Inspect the TCP connection state on the Nginx server during the test:

      # Count current connections involving port 80 (note: this grep loosely matches any line containing "80")
      netstat -nat|grep -i "80"|wc -l
      5407
      
      # Break down current TCP connections by state
      netstat -na | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
      LISTEN 12
      SYN_RECV 1
      ESTABLISHED 454
      FIN_WAIT1 1
      TIME_WAIT 5000
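
    On newer systems, ss(8) gives the same view and is faster on large socket tables; a minimal equivalent (-H, which suppresses the header row, requires a reasonably recent iproute2):

      # Summary counts of sockets by state
      ss -s

      # Count sockets currently in TIME-WAIT
      ss -H -tan state time-wait | wc -l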
      

        

    Two things stood out in the TCP connection state:

    1. There were over 5,000 connections in total
    2. The TIME_WAIT count stopped growing once it reached 5000

    Analysis of these two points:

    1. In theory, a test with 100 concurrent users should hold only about 100 connections. The likely cause was siege itself creating 5,000 connections during the test, which the configuration check below confirms:

      # Check the siege configuration
      vim ~/.siege/siege.conf
      
      # Mystery solved: siege's connection directive defaults to "close", meaning that during a
      # sustained test each request closes its connection on completion and a new one is created
      # for the next request. That explains why the Nginx server showed 5,000+ TCP connections
      # during the test rather than 100.
      
      # Connection directive. Options "close" and "keep-alive" Starting with
      # version 2.57, siege implements persistent connections in accordance 
      # to RFC 2068 using both chunked encoding and content-length directives
      # to determine the page size. 
      #
      # To run siege with persistent connections set this to keep-alive. 
      #
      # CAUTION:        Use the keep-alive directive with care.
      # DOUBLE CAUTION: This directive does not work well on HPUX
      # TRIPLE CAUTION: We don't recommend you set this to keep-alive
      # ex: connection = close
      #     connection = keep-alive
      #
      connection = close
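
    As an optional cross-check (not something the original test did), setting connection = keep-alive in ~/.siege/siege.conf and re-running should hold the connection count near the concurrency level (about 100) instead of climbing past 5,000.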
      

        

    2. To analyze why TIME_WAIT stopped at 5000, it helps to first be clear about what the TCP TIME-WAIT state means:

      TIME-WAIT: wait long enough to be sure the remote TCP received the acknowledgment of its connection-termination request. TCP must guarantee that, under all possible circumstances, all data is delivered correctly. When a socket is closed, the side that actively closes it enters TIME_WAIT, while the passive side proceeds to CLOSED; this is what guarantees all data has been transferred.

    From this definition it follows that when the load tool closes a connection, the corresponding socket on the Nginx server is not CLOSED immediately but enters TIME-WAIT. Many articles describe how an excess of TIME-WAIT sockets leads to dropped connections, which matches exactly what was observed during this test.
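
    A rough sanity check on the numbers (the close rate here is an assumed round figure, not a measurement): on Linux a socket remains in TIME-WAIT for 60 seconds (the kernel constant TCP_TIMEWAIT_LEN). If the test closes on the order of 100 connections per second, steady state holds roughly 100 × 60 = 6,000 sockets in TIME-WAIT, comfortably above a cap of 5,000.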

    # Inspect the kernel settings on the Nginx server
    cat /etc/sysctl.conf 
    # sysctl settings are defined through files in
    # /usr/lib/sysctl.d/, /run/sysctl.d/, and /etc/sysctl.d/.
    #
    # Vendors settings live in /usr/lib/sysctl.d/.
    # To override a whole file, create a new file with the same in
    # /etc/sysctl.d/ and put new settings there. To override
    # only specific settings, add a file with a lexically later
    # name in /etc/sysctl.d/ and put new settings there.
    #
    # For more information, see sysctl.conf(5) and sysctl.d(5).
    net.ipv6.conf.all.disable_ipv6 = 1
    net.ipv6.conf.default.disable_ipv6 = 1
    net.ipv6.conf.lo.disable_ipv6 = 1
    
    vm.swappiness = 0
    net.ipv4.neigh.default.gc_stale_time=120
    
    
    # see details in https://help.aliyun.com/knowledge_detail/39428.html
    net.ipv4.conf.all.rp_filter=0
    net.ipv4.conf.default.rp_filter=0
    net.ipv4.conf.default.arp_announce = 2
    net.ipv4.conf.lo.arp_announce=2
    net.ipv4.conf.all.arp_announce=2
    
    
    # see details in https://help.aliyun.com/knowledge_detail/41334.html
    net.ipv4.tcp_max_tw_buckets = 5000
    net.ipv4.tcp_syncookies = 1
    net.ipv4.tcp_max_syn_backlog = 1024
    net.ipv4.tcp_synack_retries = 2
    kernel.sysrq = 1
    fs.file-max = 65535
    net.ipv4.ip_forward = 1
    net.ipv4.tcp_fin_timeout = 30
    net.ipv4.tcp_max_syn_backlog = 10240
    net.ipv4.tcp_keepalive_time = 1200
    net.ipv4.tcp_synack_retries = 3
    net.ipv4.tcp_syn_retries = 3
    net.ipv4.tcp_max_orphans = 8192
    net.ipv4.tcp_max_tw_buckets = 5000
    net.ipv4.tcp_window_scaling = 0
    net.ipv4.tcp_sack = 0
    net.ipv4.tcp_timestamps = 0
    net.ipv4.tcp_syncookies = 1
    net.ipv4.tcp_tw_reuse = 1
    net.ipv4.tcp_tw_recycle = 1
    net.ipv4.ip_local_port_range = 1024 65000
    net.ipv4.icmp_echo_ignore_all = 0

    Note the setting net.ipv4.tcp_max_tw_buckets = 5000: it caps the number of TIME_WAIT sockets the system will hold at once. Past the limit, TIME_WAIT sockets are destroyed immediately and a warning is printed, which is exactly why the observed TIME_WAIT count stopped growing at 5000.
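
    The live value can be confirmed at runtime, since sysctl reads it straight from the kernel:

      sysctl net.ipv4.tcp_max_tw_buckets
      net.ipv4.tcp_max_tw_buckets = 5000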

      

    Optimization
    Based on information gathered online, the following Linux kernel parameters were tuned (see the sketch after the list for applying them):

    net.ipv4.tcp_syncookies = 1  Enable SYN cookies. When the SYN backlog overflows, cookies are used to handle the excess, which protects against small-scale SYN floods. Default is 0 (off).
    
    net.ipv4.tcp_tw_reuse = 1  Enable reuse: allow TIME-WAIT sockets to be reused for new outbound TCP connections. Default is 0 (off).
    
    net.ipv4.tcp_tw_recycle = 1  Enable fast recycling of TIME-WAIT sockets. Default is 0 (off). (Caution: this option misbehaves for clients behind NAT and was removed entirely in Linux 4.12.)
    
    net.ipv4.tcp_fin_timeout = 30  How long a socket stays in FIN-WAIT-2 when the close was initiated locally.
    
    net.ipv4.tcp_keepalive_time = 1200  How often TCP sends keepalive probes when keepalive is enabled. The default is 2 hours; here it is lowered to 20 minutes.
    
    net.ipv4.ip_local_port_range = 1024 65000  The port range used for outbound connections. The default (32768 to 61000) is fairly small; widened here to 1024-65000.
    
    net.ipv4.tcp_max_syn_backlog = 8192  The length of the SYN queue. Default is 1024; raising it to 8192 accommodates more pending connections.
    
    net.ipv4.tcp_max_tw_buckets = 5000  The maximum number of TIME_WAIT sockets kept simultaneously; beyond this, TIME_WAIT sockets are destroyed immediately and a warning is printed. Default is 180000, lowered here to 5000.
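
    A minimal sketch of applying these values on the Nginx host (assumes root; edit the file first, then reload):

      # persist the tuned keys in /etc/sysctl.conf, then load them into the running kernel
      sysctl -p

      # spot-check that a value took effect
      sysctl net.ipv4.tcp_fin_timeout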
    

      



