@zhongdao · 2019-04-14

Analysis and Handling of Excessive CLOSE_WAIT Connections



1. The mechanism behind CLOSE_WAIT (part 1)

Source (see References): 从问题看本质: 研究TCP close_wait的内幕

When the client initiates socket.close()

Suppose we have a client and a server. When the client actively calls socket.close(), what happens at the TCP level? See the figure below.

[Figure: TCP state transitions when the client initiates the close]

The client first sends a FIN to the server and enters the FIN_WAIT_1 state. When the server receives the FIN, it returns an ACK and the server's state becomes CLOSE_WAIT.

The server then needs to send its own FIN to the client, at which point the server enters LAST_ACK. Once the client returns the final ACK, the server's socket is closed successfully.

From this we can see that when the client actively closes a connection, the client itself never enters CLOSE_WAIT; it is the server that ends up in the CLOSE_WAIT state.
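
As a small illustration (my own sketch, not from the source article), here is a trivial Go client that performs the active close described above; the address is an assumption and should point at some running TCP server:

    package main

    import "net"

    func main() {
        // Dial a server and immediately perform the active close. This client
        // walks through FIN_WAIT_1 / FIN_WAIT_2 / TIME_WAIT; it is the server
        // that sits in CLOSE_WAIT until its application calls Close().
        conn, err := net.Dial("tcp", "127.0.0.1:5000") // illustrative address
        if err != nil {
            panic(err)
        }
        conn.Close()
    }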

When the server initiates socket.close()

So what happens when the server actively calls socket.close()?

[Figure: TCP state transitions when the server initiates the close]

From the figure we can see that if the server actively closes the connection, the client may enter CLOSE_WAIT, and if the client never sends its own FIN, it will remain in CLOSE_WAIT indefinitely (later we will see that there are parameters that affect how long this situation lasts).

Conclusion

Whichever side actively closes the connection, the other side may enter CLOSE_WAIT, unless that side reaches a timeout and closes the connection itself.

Server-side settings

Suppose our Tomcat serves both browsers and other apps, and we set the connection keep-alive time to 10 minutes. The consequence is that if a browser opens a page and never closes it, the corresponding socket on the server cannot be closed either, and the file descriptor it occupies cannot serve other requests. Once concurrency gets high, the server's resources are quickly exhausted and new requests can no longer get in. What if we instead set the keep-alive time to something short, say 15 s? Then for the other apps accessing this server, as soon as a socket sees no new request within 15 s the server closes it, and the client app's side will accumulate large numbers of sockets in CLOSE_WAIT.

So if you run into this situation, the suggestion is to split the deployment: serve browsers from a dedicated JVM instance with keep-alive kept at 15 s, and serve the other applications in your architecture from another JVM instance with a longer keep-alive, say 1 hour. That way a connection established by a client app only leaves its socket in CLOSE_WAIT if the connection has not been reused within an hour. Choosing different keep-alive times for different application scenarios helps improve the performance of the system.
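
The article's example is about Tomcat and the JVM; as a rough sketch of the same idea in Go (the language used by the example code later in this article), the HTTP keep-alive window can be tuned per deployment with the standard library's Server.IdleTimeout. The ports and timeout values below are illustrative assumptions:

    package main

    import (
        "io"
        "net/http"
        "time"
    )

    func main() {
        mux := http.NewServeMux()
        mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            io.WriteString(w, "ok\n")
        })

        // Browser-facing instance: idle keep-alive connections are closed
        // by the server after 15 seconds.
        browserFacing := &http.Server{
            Addr:        ":8080", // illustrative port
            Handler:     mux,
            IdleTimeout: 15 * time.Second,
        }

        // Internal instance for applications that reuse connections:
        // idle connections are kept open for up to an hour.
        internalFacing := &http.Server{
            Addr:        ":8081", // illustrative port
            Handler:     mux,
            IdleTimeout: time.Hour,
        }

        go internalFacing.ListenAndServe()
        panic(browserFacing.ListenAndServe())
    }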

2. The mechanism behind CLOSE_WAIT (part 2, with example code)

Sources (see References):
This is strictly a violation of the TCP specification
TCP: About FIN_WAIT_2, TIME_WAIT and CLOSE_WAIT

How it arises


Time to raise the curtain of doubt. Here is what happens.

The listening application leaks sockets, they are stuck in CLOSE_WAIT TCP state forever. These sockets look like (127.0.0.1:5000, 127.0.0.1:some-port). The client socket at the other end of the connection is (127.0.0.1:some-port, 127.0.0.1:5000), and is properly closed and cleaned up.

When the client application quits, the (127.0.0.1:some-port, 127.0.0.1:5000) socket enters the FIN_WAIT_1 state and then quickly transitions to FIN_WAIT_2. The FIN_WAIT_2 state should move on to TIME_WAIT once the client receives a FIN packet, but this never happens. The FIN_WAIT_2 eventually times out. On Linux this is 60 seconds, controlled by the net.ipv4.tcp_fin_timeout sysctl.

This is where the problem starts. The (127.0.0.1:5000, 127.0.0.1:some-port) socket is still in CLOSE_WAIT state, while (127.0.0.1:some-port, 127.0.0.1:5000) has been cleaned up and is ready to be reused. When this happens the result is a total mess. One part of the socket won't be able to advance from the SYN_SENT state, while the other part is stuck in CLOSE_WAIT. The SYN_SENT socket will eventually give up failing with ETIMEDOUT.

    sysctl -a | grep ipv4 | grep timeout
    kernel.hung_task_timeout_secs = 120
    net.ipv4.route.gc_timeout = 300
    net.ipv4.tcp_fin_timeout = 60
    net.ipv4.tcp_thin_linear_timeouts = 0

Example code exhibiting the problem

    // This is a trivial TCP server leaking sockets.
    package main

    import (
        "fmt"
        "net"
        "time"
    )

    // handle never reads from the connection and never returns, so the
    // deferred Close is never executed: when the peer closes its end,
    // this side stays in CLOSE_WAIT forever.
    func handle(conn net.Conn) {
        defer conn.Close()
        for {
            time.Sleep(time.Second)
        }
    }

    func main() {
        IP := ""
        Port := 5000
        listener, err := net.Listen("tcp4", fmt.Sprintf("%s:%d", IP, Port))
        if err != nil {
            panic(err)
        }
        i := 0
        for {
            if conn, err := listener.Accept(); err == nil {
                i += 1
                if i < 800 {
                    go handle(conn)
                } else {
                    // After 800 accepted connections, close new ones immediately.
                    conn.Close()
                }
            } else {
                panic(err)
            }
        }
    }

Reproducing CLOSE_WAIT

Start the server

    # go build listener.go && ./listener &
    # ss -n4tpl 'sport = :5000'
    State      Recv-Q Send-Q Local Address:Port   Peer Address:Port
    LISTEN     0      128    *:5000               *:*       users:(("listener",pid=15158,fd=3))

Start the client, using nc

    ss -n4tpl 'sport = :5000'
    State      Recv-Q Send-Q Local Address:Port   Peer Address:Port
    LISTEN     0      128    *:5000               *:*                 users:(("listener",pid=15158,fd=3))
    ESTAB      0      0      127.0.0.1:5000       127.0.0.1:47810     users:(("listener",pid=15158,fd=5))

We can see that a socket connection has been established; the client's port is 47810.

Kill the client

    kill `pidof nc`

The server-side connection has entered CLOSE_WAIT.

    ss -n4tp | grep 5000
    CLOSE-WAIT 1      0      127.0.0.1:5000       127.0.0.1:47810     users:(("listener",pid=15158,fd=5))

Notes on the TCP design

It seems that the design decisions made by the BSD Socket API have unexpected long lasting consequences. If you think about it - why exactly the socket can automatically expire the FIN_WAIT state, but can't move off from CLOSE_WAIT after some grace time. This is very confusing... And it should be! The original TCP specification does not allow automatic state transition after FIN_WAIT_2 state! According to the spec FIN_WAIT_2 is supposed to stay running until the application on the other side cleans up.

Let me leave you with the tcp(7) manpage describing the tcp_fin_timeout setting:

    tcp_fin_timeout (integer; default: 60)
        This specifies how many seconds to wait for a final FIN packet before
        the socket is forcibly closed. This is strictly a violation of the TCP
        specification, but required to prevent denial-of-service attacks.

I think now we understand why automatically closing FIN_WAIT_2 is strictly speaking a violation of the TCP specification.

3. How to handle CLOSE_WAIT

If you find that connections associated with a given process tend to stay in CLOSE_WAIT, it means that this process does not perform the active close after the passive close. When writing a program that communicates over TCP, you should detect when the remote host has closed the connection and close the socket properly. If you fail to do so, the socket will remain in CLOSE_WAIT until the process itself goes away.
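
In Go, for example, the fix for the leaking handler shown earlier is simply to read from the connection and close it once the peer has closed its end. This is a minimal sketch, not taken from the referenced article:

    package main

    import (
        "fmt"
        "net"
    )

    // handle reads until the peer closes the connection (Read returns an
    // error such as io.EOF once the FIN has arrived); the deferred Close
    // then runs, so the socket never lingers in CLOSE_WAIT.
    func handle(conn net.Conn) {
        defer conn.Close()
        buf := make([]byte, 4096)
        for {
            if _, err := conn.Read(buf); err != nil {
                return // peer closed or the connection failed; release the socket
            }
        }
    }

    func main() {
        listener, err := net.Listen("tcp4", fmt.Sprintf("%s:%d", "", 5000))
        if err != nil {
            panic(err)
        }
        for {
            conn, err := listener.Accept()
            if err != nil {
                panic(err)
            }
            go handle(conn)
        }
    }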

So basically, CLOSE_WAIT means that the operating system knows the remote application has closed the connection and is waiting for the local application to do the same. You should therefore not try to tune TCP parameters to solve this, but instead check the application that owns the connection on the local host. Since there is no CLOSE_WAIT timeout, a connection can stay in this state forever (or at least until the program finally closes the connection or the process exits or is killed).

If you cannot fix the application (or have it fixed), the workaround is to kill the process that holds the connection open. Of course, there is still a risk of losing data, since the local endpoint may still have data in its buffers waiting to be sent. Also, if many applications run in the same process (as is the case with Java Enterprise applications), killing the owning process is not always an option.

I have not tried using tcpkill, killcx or cutter to force-close CLOSE_WAIT connections, but it may be an option if you cannot kill or restart the process holding the connection.

4. Listing the IP/port pairs of CLOSE_WAIT connections

    netstat -tulnap | grep CLOSE_WAIT | sed -e 's/::ffff://g' | awk '{print $4,$5}' | sed 's/:/ /g'

Example output:

    172.26.59.197 8088 54.241.136.34 44690
    172.26.59.197 8088 171.48.17.77 47220
    172.26.59.197 8088 54.241.136.34 57828
    172.26.59.197 8088 157.230.119.239 55920
    172.26.59.197 8088 157.230.119.239 59650
    172.26.59.197 8088 157.230.119.239 44418
    172.26.59.197 8088 157.230.119.239 47634
    172.26.59.197 8088 157.230.119.239 34940

Each line is one CLOSE_WAIT connection, shown as a local address/port followed by the remote address/port. The example output above was taken on the server side.
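
As an alternative to the netstat pipeline, here is a small Go sketch (my own addition, not part of the original article) that reads /proc/net/tcp directly and prints the local/remote address pairs of IPv4 sockets in CLOSE_WAIT (state code 08 in that file):

    package main

    import (
        "bufio"
        "fmt"
        "net"
        "os"
        "strconv"
        "strings"
    )

    // parseAddr converts a /proc/net/tcp address such as "0100007F:1388"
    // (little-endian hex IPv4 plus hex port) into "127.0.0.1 5000".
    func parseAddr(s string) string {
        parts := strings.SplitN(s, ":", 2)
        ipBits, _ := strconv.ParseUint(parts[0], 16, 32)
        port, _ := strconv.ParseUint(parts[1], 16, 16)
        ip := net.IPv4(byte(ipBits), byte(ipBits>>8), byte(ipBits>>16), byte(ipBits>>24))
        return fmt.Sprintf("%s %d", ip, port)
    }

    func main() {
        f, err := os.Open("/proc/net/tcp")
        if err != nil {
            panic(err)
        }
        defer f.Close()

        scanner := bufio.NewScanner(f)
        scanner.Scan() // skip the header line
        for scanner.Scan() {
            fields := strings.Fields(scanner.Text())
            // Fields: sl local_address rem_address st ...; st 08 == CLOSE_WAIT.
            if len(fields) > 3 && fields[3] == "08" {
                fmt.Println(parseAddr(fields[1]), parseAddr(fields[2]))
            }
        }
    }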

5. A Perl script that kills CLOSE_WAIT connections

Source code:
https://github.com/rghose/kill-close-wait-connections/blob/master/kill_close_wait_connections.pl

    apt-get install libnet-rawip-perl libnet-pcap-perl libnetpacket-perl
    git clone https://github.com/rghose/kill-close-wait-connections.git
    cd kill-close-wait-connections
    mv kill_close_wait_connections.pl /usr/bin/kill_close_wait_connections
    chmod +x /usr/bin/kill_close_wait_connections

A copy of the script has also been placed at http://39.106.122.67/ctorrent/kill_close_wait_connections.pl, so it is not necessary to download it via git.

Preparation on Ubuntu

    apt-get install libnet-rawip-perl libnet-pcap-perl libnetpacket-perl

Preparation on CentOS

    yum -y install perl-Net-Pcap libpcap-devel perl-NetPacket
    curl -L http://cpanmin.us | perl - --sudo App::cpanminus
    cpanm Net::RawIP
    cpanm Net::Pcap
    cpanm NetPacket

Installation

    wget http://39.106.122.67/ctorrent/kill_close_wait_connections.pl
    mv kill_close_wait_connections.pl /usr/bin/kill_close_wait_connections
    chmod +x /usr/bin/kill_close_wait_connections

Run

    kill_close_wait_connections

6. Other commands and notes for killing TCP connections

Source 1

Kill an active TCP connection
https://gist.github.com/amcorreia/10204572

Contents of "Kill an active TCP connection"

Some notes on killing a TCP connection...

Info gathering

(remember to be root!)

Motivations

CLOSE_WAIT related

Source 2

Kill tcp connection with tcpkill on CentOS
https://gist.github.com/vdw/09efee4f264bb2630345

Contents of "Kill tcp connection with tcpkill on CentOS"

Install tcpkill

    yum -y install dsniff --enablerepo=epel

View connections

    netstat -tnpa | grep 'ESTABLISHED.*sshd'

Block with iptables

    iptables -A INPUT -s IP-ADDRESS -j DROP

Kill connection

    tcpkill -i eth0 -9 port 50185

Block brute forcing - iptables rules

    iptables -L -n
    iptables -I INPUT -p tcp --dport 22 -i eth0 -m state --state NEW -m recent --set
    iptables -I INPUT -p tcp --dport 22 -i eth0 -m state --state NEW -m recent --update --seconds 600 --hitcount 3 -j DROP
    iptables -A INPUT -p tcp --dport 22 -m state --state NEW -m recent --set --name ssh --rsource
    iptables -A INPUT -p tcp --dport 22 -m state --state NEW -m recent ! --rcheck --seconds 600 --hitcount 3 --name ssh --rsource -j ACCEPT
    service iptables save
    service iptables restart

7. References

从问题看本质: 研究TCP close_wait的内幕
https://www.cnblogs.com/zengkefu/p/5655016.html

This is strictly a violation of the TCP specification
https://blog.cloudflare.com/this-is-strictly-a-violation-of-the-tcp-specification/

listener.go (example code accompanying the Cloudflare post)
https://github.com/cloudflare/cloudflare-blog/blob/master/2016-08-time-out/listener.go

TCP: About FIN_WAIT_2, TIME_WAIT and CLOSE_WAIT
https://benohead.com/tcp-about-fin_wait_2-time_wait-and-close_wait/

Removing CLOSE_WAIT connections
http://rahul-ghose.blogspot.com/2014/11/removing-closewait-connections.html

kill-close-wait-connections
https://github.com/rghose/kill-close-wait-connections

Kill an active TCP connection
https://gist.github.com/amcorreia/10204572

[命令行] curl查询公网出口IP
https://blog.csdn.net/orangleliu/article/details/51994513
