  • Long-term TCP sessions & MPTCP

    https://github.com/multipath-tcp/mptcp/issues/153


    The following issue is fixed with https://github.com/multipath-tcp/mptcp/commit/133537deb63d04e1dfb5af7fd82ed51ba243e518


    wapsi commented on 20 Nov 2016

    I'm using SSH port tunneling with MPTCP, and I've noticed that after several hours or days MPTCP stops working (traffic doesn't go through all available interfaces / gateways anymore, it uses only one path). Restarting this long-running SSH/TCP session fixes the issue and MPTCP "starts to work again".

    What could cause this? Is there any way to debug this problem? Is there any way to tell MPTCP to look up new paths again, or something similar? I see that there are some stats available under /proc/net/mptcp_net/ and /proc/net/mptcp_fullmesh, but is there anything like echo 1 > /proc/net/mptcp_net/discover_paths_again?
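    For reference, a minimal sketch of inspecting that state, assuming the v0.91 /proc layout (file names can differ between kernel builds):

    # local addresses known to the fullmesh path manager
    cat /proc/net/mptcp_fullmesh
    # one line per active MPTCP connection, similar in spirit to /proc/net/tcp
    cat /proc/net/mptcp_net/mptcp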

    My MPTCP settings:
    [ 0.412909] MPTCP: Stable release v0.91.2
    kernel.osrelease = 4.1.35.mptcp
    net.ipv4.tcp_allowed_congestion_control = lia reno cubic
    net.ipv4.tcp_available_congestion_control = lia reno balia wvegas cubic olia
    net.ipv4.tcp_congestion_control = lia (tried other ones too but the issue remains)
    net.core.wmem_max = 115343360
    net.core.rmem_max = 115343360
    net.ipv4.tcp_rmem = 10240 87380 115343360
    net.ipv4.tcp_wmem = 10240 87380 115343360
    net.mptcp.mptcp_binder_gateways =
    net.mptcp.mptcp_checksum = 0
    net.mptcp.mptcp_debug = 0
    net.mptcp.mptcp_enabled = 1
    net.mptcp.mptcp_path_manager = fullmesh
    net.mptcp.mptcp_scheduler = default
    net.mptcp.mptcp_syn_retries = 10
    net.mptcp.mptcp_version = 1

    @cpaasch
    Owner

    cpaasch commented on 22 Nov 2016

    Hello,

    Do you have a packet trace of this behavior? It might be that a NAT on the path is timing out.

    @titer

    Hi,

    I see the same behavior here, using MPTCP to aggregate two DSL links. My local gateway is connected to both DSL routers (NAT'ed in both cases) and maintains a long-running, MPTCP-enabled OpenVPN connection to a relay router, through which traffic gets routed. I am fairly happy with that MPTCP setup btw; it has been running for a couple of years now and has proved effective at hiding glitches on either DSL link and at aggregating bandwidth with about 90% efficiency.

    Sometimes a subflow dies, e.g. after one of the DSL routers restarts and ends up with a new public IP address. My current workaround is to run a background task that detects that and bounces OpenVPN - but if there is a better way to handle it, I am interested.
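    A minimal sketch of such a watchdog (the expected subflow count, the relay address:port, and the restart command are placeholders, not the actual script):

    #!/bin/sh
    # Watchdog sketch: bounce OpenVPN when subflows towards the relay die.
    EXPECTED=4                    # 2 local x 2 remote addresses
    PEER="203.0.113.10:1194"      # relay router IP:port (placeholder)
    COUNT=$(netstat -n | grep ^tcp | grep ESTABLISHED | grep -c " $PEER ")
    if [ "$COUNT" -lt "$EXPECTED" ]; then
        systemctl restart openvpn@relay    # or however you restart OpenVPN
    fi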

    Running Debian kernel 4.1.35.mptcp on both endpoints

    @cpaasch
    Owner

    cpaasch commented on 22 Nov 2016

    Can you give the below patch a try? (didn't test it at all! just compiled ;))

    You might have to tweak the tcp_retries* sysctls to get a faster subflow timeout.
    When loading the path manager you have to set the module parameter create_on_err to 1. Module parameters are in /sys/module/mptcp_fullmesh/parameters.
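    For example, with the patch below applied, something like this (a sketch; the parameter is also writable at runtime because it is registered with mode 0644):

    # load the path manager with subflow re-creation on timeout enabled
    modprobe mptcp_fullmesh create_on_err=1
    # or flip it on a running system
    echo 1 > /sys/module/mptcp_fullmesh/parameters/create_on_err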

    diff --git a/include/net/mptcp.h b/include/net/mptcp.h
    index cb5e4cf76b23..e66b8aa295ca 100644
    --- a/include/net/mptcp.h
    +++ b/include/net/mptcp.h
    @@ -230,6 +230,7 @@ struct mptcp_pm_ops {
     	void (*release_sock)(struct sock *meta_sk);
     	void (*fully_established)(struct sock *meta_sk);
     	void (*new_remote_address)(struct sock *meta_sk);
    +	void (*subflow_error)(struct sock *meta_sk, struct sock *sk);
     	int  (*get_local_id)(sa_family_t family, union inet_addr *addr,
     			     struct net *net, bool *low_prio);
     	void (*addr_signal)(struct sock *sk, unsigned *size,
    diff --git a/net/mptcp/mptcp_ctrl.c b/net/mptcp/mptcp_ctrl.c
    index 6045ba160225..853310cbc5d9 100644
    --- a/net/mptcp/mptcp_ctrl.c
    +++ b/net/mptcp/mptcp_ctrl.c
    @@ -610,13 +610,13 @@ EXPORT_SYMBOL(mptcp_select_ack_sock);
     static void mptcp_sock_def_error_report(struct sock *sk)
     {
     	const struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
    +	struct sock *meta_sk = mptcp_meta_sk(sk);
     
     	if (!sock_flag(sk, SOCK_DEAD))
     		mptcp_sub_close(sk, 0);
     
     	if (mpcb->infinite_mapping_rcv || mpcb->infinite_mapping_snd ||
     	    mpcb->send_infinite_mapping) {
    -		struct sock *meta_sk = mptcp_meta_sk(sk);
     
     		meta_sk->sk_err = sk->sk_err;
     		meta_sk->sk_err_soft = sk->sk_err_soft;
    @@ -633,6 +633,9 @@ static void mptcp_sock_def_error_report(struct sock *sk)
     			tcp_done(meta_sk);
     	}
     
    +	if (mpcb->pm_ops->subflow_error)
    +		mpcb->pm_ops->subflow_error(meta_sk, sk);
    +
     	sk->sk_err = 0;
     	return;
     }
    diff --git a/net/mptcp/mptcp_fullmesh.c b/net/mptcp/mptcp_fullmesh.c
    index 71eb2d4ad2d4..61fda6e1be3e 100644
    --- a/net/mptcp/mptcp_fullmesh.c
    +++ b/net/mptcp/mptcp_fullmesh.c
    @@ -95,6 +95,10 @@ static int num_subflows __read_mostly = 1;
     module_param(num_subflows, int, 0644);
     MODULE_PARM_DESC(num_subflows, "choose the number of subflows per pair of IP addresses of MPTCP connection");
     
    +static int create_on_err __read_mostly = 0;
    +module_param(create_on_err, int, 0644);
    +MODULE_PARM_DESC(create_on_err, "recreate the subflow upon a timeout");
    +
     static struct mptcp_pm_ops full_mesh __read_mostly;
     
     static void full_mesh_create_subflows(struct sock *meta_sk);
    @@ -1370,6 +1374,24 @@ static void full_mesh_create_subflows(struct sock *meta_sk)
     	}
     }
     
    +static void full_mesh_subflow_error(struct sock *meta_sk, struct sock *sk)
    +{
    +	const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
    +
    +	if (!create_on_err)
    +		return;
    +
    +	if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv ||
    +	    mpcb->send_infinite_mapping ||
    +	    mpcb->server_side || sock_flag(meta_sk, SOCK_DEAD))
    +		return;
    +
    +	if (sk->sk_err != ETIMEDOUT)
    +		return;
    +
    +	full_mesh_create_subflows(meta_sk);
    +}
    +
     /* Called upon release_sock, if the socket was owned by the user during
      * a path-management event.
      */
    @@ -1799,6 +1821,7 @@ static struct mptcp_pm_ops full_mesh __read_mostly = {
     	.release_sock = full_mesh_release_sock,
     	.fully_established = full_mesh_create_subflows,
     	.new_remote_address = full_mesh_create_subflows,
    +	.subflow_error = full_mesh_subflow_error,
     	.get_local_id = full_mesh_get_local_id,
     	.addr_signal = full_mesh_addr_signal,
     	.add_raddr = full_mesh_add_raddr,
    
    @wapsi

    wapsi commented on 22 Nov 2016

    Hmmm... A packet trace is quite difficult to capture, because it can take anywhere from 1 hour to 2 days until this happens. The capture file will be HUGE...

    Yes, I have NAT between these MPTCP boxes. Here are the NAT TCP timeout settings (it's a Linux box):

    [root@firewall ~]# sysctl -a|grep conntrack_tcp_timeout
    net.netfilter.nf_conntrack_tcp_timeout_close = 10
    net.netfilter.nf_conntrack_tcp_timeout_close_wait = 60
    net.netfilter.nf_conntrack_tcp_timeout_established = 432000
    net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 120
    net.netfilter.nf_conntrack_tcp_timeout_last_ack = 30
    net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 300
    net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 60
    net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 120
    net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
    net.netfilter.nf_conntrack_tcp_timeout_unacknowledged = 300
    

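    A quick way to see whether the NAT is actually forgetting idle subflows is to watch its conntrack table (a sketch; needs conntrack-tools, and port 22 is a placeholder for the SSH server port):

    # on the NAT box: list tracked TCP flows towards the SSH port
    conntrack -L -p tcp 2>/dev/null | grep 'dport=22'
    # the number after the protocol is the remaining timeout in seconds;
    # once an idle subflow's entry expires, the NAT silently drops it even
    # though both endpoints still show the connection as ESTABLISHED
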
    And I've opened the SSH tunnel with ServerAliveInterval 10 and ServerAliveCountMax 3 on the client side, and ClientAliveInterval 10, ClientAliveCountMax 3 and TCPKeepAlive yes on the server side, so if I understand those settings correctly, they should avoid TCP timeout issues.
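    For reference, those options map to something like this (the host alias is a placeholder):

    # client side, ~/.ssh/config
    Host mptcp-server
        ServerAliveInterval 10
        ServerAliveCountMax 3

    # server side, /etc/ssh/sshd_config
    ClientAliveInterval 10
    ClientAliveCountMax 3
    TCPKeepAlive yes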

    Here are some stats from netstat commands (I have 3 gateways and 3 subflows, and I exclude ^mptcp connections from this list because I want to list the subflows):

    $ netstat -n|grep " (SSH Server IP):(SSH server port) "|grep ^tcp|grep ESTABLISHED$|grep -c " (eth0 IP):"
    9
    $ netstat -n|grep " (SSH Server IP):(SSH server port) "|grep ^tcp|grep ESTABLISHED$|grep -c " (eth1 IP):"
    9
    $ netstat -n|grep " (SSH Server IP):(SSH server port) "|grep ^tcp|grep ESTABLISHED$|grep -c " (eth2 IP):"
    9
    

    And after several hours those counts look something like:

    $ netstat -n|grep " (SSH Server IP):(SSH server port) "|grep ^tcp|grep ESTABLISHED$|grep -c " (eth0 IP):"
    7
    $ netstat -n|grep " (SSH Server IP):(SSH server port) "|grep ^tcp|grep ESTABLISHED$|grep -c " (eth1 IP):"
    5
    $ netstat -n|grep " (SSH Server IP):(SSH server port) "|grep ^tcp|grep ESTABLISHED$|grep -c " (eth2 IP):"
    4
    

    So some of the subflows have really dropped out. Now, if I restart the SSH sessions, all the TCP subflows get established again. Another approach is to run the following commands:

    $ ip link set dev eth0 multipath off ; sleep 1 ; ip link set dev eth0 multipath on
    $ ip link set dev eth1 multipath off ; sleep 1 ; ip link set dev eth1 multipath on
    $ ip link set dev eth2 multipath off ; sleep 1 ; ip link set dev eth2 multipath on
    

    And then new subflows are established again using all available gateways.
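    The same workaround as a single loop over all multipath interfaces (the same commands as above, just condensed):

    for dev in eth0 eth1 eth2; do
        ip link set dev "$dev" multipath off
        sleep 1
        ip link set dev "$dev" multipath on
    done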

    Update: I'll try with the patch you just sent.

    @cpaasch
    Owner

    cpaasch commented on 22 Nov 2016

    Unfortunately, the keepalives are not a safe solution (in today's implementation of MPTCP in Linux), because we chose to keep only the MPTCP connection alive. That means TCP keepalives are sent on at most one single subflow, so the other subflows time out.

    The keepalive handling is probably something we should rethink.
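    For reference, the kernel-side knobs that shape those keepalives (the values are only illustrative; per the above, they still protect at most one subflow):

    sysctl -w net.ipv4.tcp_keepalive_time=60     # idle time before the first probe
    sysctl -w net.ipv4.tcp_keepalive_intvl=10    # interval between probes
    sysctl -w net.ipv4.tcp_keepalive_probes=3    # unanswered probes before the connection is killed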

    @wapsi

    Just tested with your patch applied and the create_on_err parameter set:

    $ cat /sys/module/mptcp_fullmesh/parameters/create_on_err
    1
    

    sysctl mptcp settings used:

    net.mptcp.mptcp_binder_gateways = 
    net.mptcp.mptcp_checksum = 0
    net.mptcp.mptcp_debug = 0
    net.mptcp.mptcp_enabled = 1
    net.mptcp.mptcp_path_manager = fullmesh
    net.mptcp.mptcp_scheduler = default
    net.mptcp.mptcp_syn_retries = 10
    net.mptcp.mptcp_version = 1
    

    and some TCP subflows still get disconnected (after ~5 hours). Again, if I run:

    $ ip link set dev eth0 multipath off ; sleep 1 ; ip link set dev eth0 multipath on
    $ ip link set dev eth1 multipath off ; sleep 1 ; ip link set dev eth1 multipath on
    $ ip link set dev eth2 multipath off ; sleep 1 ; ip link set dev eth2 multipath on
    

    the situation will be fixed and new subflows will be opened using all available gateways.

    I applied your patch only on the client side; I assumed it is only needed there.

    @cpaasch
    Owner

    cpaasch commented on 23 Nov 2016

    Yes, it's only needed on the client side.

    You should also change the tcp_retries sysctls to get faster timeouts (with the default net.ipv4.tcp_retries2 = 15, a dead subflow only hits ETIMEDOUT after roughly 15 minutes, and that timeout is what triggers the re-creation in the patch):

    sysctl -w net.ipv4.tcp_retries2=3
    

    Please also take a packet trace to check whether you really get a timeout.

    @cpaasch
    Owner

    cpaasch commented on 29 Nov 2016

    @wapsi & @titer - Do you have an update?

    cpaasch added the enhancement label and removed the question label on 29 Nov 2016

    @wapsi

    wapsi commented on 29 Nov 2016

    I'm not able to get a valid packet trace at the moment. If I try to capture with tcpdump, the file gets so huge before the first subflow drops that it doesn't fit on my MPTCP router's HDD. Do you have any tips on how to do this "sensibly"?

    @cpaasch
    Owner

    cpaasch commented on 2 Dec 2016

    @wapsi You can limit the size of the packet trace with the option -s 150. Then, if that's not enough, add -C 100 -W 10 -w capture. This limits each file to 100MB and overwrites the oldest file when rotating.
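    Putting those options together (the interface and the capture filter are placeholders for your setup):

    # rotating ring buffer: 10 files x 100MB, 150-byte snaplen per packet
    tcpdump -i eth0 -s 150 -C 100 -W 10 -w capture 'tcp port 22'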

    @djbobo

    djbobo commented on 18 Dec 2016

    Hi,
    I'm seeing the same behavior on long running tcp connections.

    Server with two interfaces - public (routable) IPv4
    Client with two interfaces - masqueraded IPv4

    The initial connection originates from the client (behind NAT), and the full mesh is established as expected, in this case 2x2. However, after some time one subflow drops and the connection remains at 3 subflows.

    Using OpenVPN (TCP) instead of SSH.

    I was using your Debian kernel build from https://dl.bintray.com/cpaasch/deb. Building the patched kernel now and will get back.

    @cpaasch
    Owner

    cpaasch commented on 19 Dec 2016

    It would be good if someone can test the patch and confirm whether it really solves the problem.

    @djbobo

    I've been running the patched kernel for 9 hours now.
    I'd like to wait a little longer before I confirm.

    Everything looks good so far.

    matttbe added a commit that referenced this issue on 1 Feb

    @cpaasch
    Owner

    cpaasch commented on 10 Feb

    Fixed with 133537d

    cpaasch closed this on 10 Feb

