https://github.com/multipath-tcp/mptcp/issues/153
The following issue is fixed with https://github.com/multipath-tcp/mptcp/commit/133537deb63d04e1dfb5af7fd82ed51ba243e518
wapsi commented on 20 Nov 2016

I'm using SSH port tunneling over MPTCP and I've noticed that after several hours or days MPTCP stops working (traffic doesn't go through all available interfaces/gateways anymore; it uses only one path). Restarting the long-running SSH/TCP session fixes the issue and MPTCP starts working again.

What could cause this? Is there any way to debug this problem? Is there any way to tell MPTCP to look up new paths again, or something similar? I see that there are some stats available under /proc/net/mptcp_net/ and /proc/net/mptcp_fullmesh, but is there anything like echo 1 > /proc/net/mptcp_net/discover_paths_again?

My MPTCP settings:

[ 0.412909] MPTCP: Stable release v0.91.2
kernel.osrelease = 4.1.35.mptcp
net.ipv4.tcp_allowed_congestion_control = lia reno cubic
net.ipv4.tcp_available_congestion_control = lia reno balia wvegas cubic olia
net.ipv4.tcp_congestion_control = lia (tried other ones too but the issue remains)
net.core.wmem_max = 115343360
net.core.rmem_max = 115343360
net.ipv4.tcp_rmem = 10240 87380 115343360
net.ipv4.tcp_wmem = 10240 87380 115343360
net.mptcp.mptcp_binder_gateways =
net.mptcp.mptcp_checksum = 0
net.mptcp.mptcp_debug = 0
net.mptcp.mptcp_enabled = 1
net.mptcp.mptcp_path_manager = fullmesh
net.mptcp.mptcp_scheduler = default
net.mptcp.mptcp_syn_retries = 10
net.mptcp.mptcp_version = 1
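For anyone reproducing this, the state mentioned above can be inspected like so; the proc paths follow the references in this issue and their exact contents vary by MPTCP release:

# Path-manager and connection state exposed by the MPTCP kernel:
cat /proc/net/mptcp_fullmesh
cat /proc/net/mptcp_net/mptcp

# List the TCP subflows only; on an MPTCP kernel, netstat also prints
# "mptcp" meta-connection lines, which are filtered out here:
netstat -n | grep -v '^mptcp' | grep ESTABLISHED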
cpaasch added the question label on 22 Nov 2016
cpaasch commented on 22 Nov 2016
Hello, do you have a packet-trace of this behavior? It might be that you have a NAT on the path that is timing out.
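If the NAT device is itself a Linux box, the relevant conntrack timeout can be read directly; the sysctl name below is the standard nf_conntrack one, and the 5-day figure is the usual default, not a value confirmed in this thread:

# How long the NAT keeps an idle, established TCP mapping alive.
# An idle subflow older than this silently vanishes from the NAT table.
sysctl net.netfilter.nf_conntrack_tcp_timeout_established
# typical default: net.netfilter.nf_conntrack_tcp_timeout_established = 432000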
titer commented on 22 Nov 2016
Hi, I see the same behavior here, using MPTCP to aggregate two DSL links. My local gateway is connected to both DSL routers (NAT'ed in both cases) and maintains a long-running, MPTCP-enabled OpenVPN connection to a relay router, through which traffic gets routed. I am fairly happy with that MPTCP setup, by the way: it has been running for a couple of years now and has proved effective at hiding glitches from either DSL line and at aggregating bandwidth with about 90% efficiency.

Sometimes a subflow dies, e.g. after one of the DSL routers restarts and ends up with a new public IP address. My current workaround is to run a background task that detects that and bounces OpenVPN (sketched below), but if there is a better way to handle it, I am interested. Running Debian kernel 4.1.35.mptcp on both endpoints.
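A minimal sketch of such a watchdog, assuming placeholder addresses and an init-script restart; none of the names below come from titer's actual setup:

#!/bin/sh
# Hypothetical subflow watchdog (all values are placeholders).
RELAY=198.51.100.7   # relay router terminating the OpenVPN tunnel
PORT=1194            # OpenVPN TCP port
EXPECTED=2           # one subflow per DSL link
while sleep 60; do
    # Count established TCP subflows toward the relay.
    count=$(netstat -tn | grep -c "${RELAY}:${PORT} *ESTABLISHED")
    if [ "$count" -lt "$EXPECTED" ]; then
        # A subflow died (e.g. a DSL router came back with a new
        # public IP): bounce OpenVPN so all subflows are rebuilt.
        /etc/init.d/openvpn restart
    fi
done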
cpaasch commented on 22 Nov 2016
Can you give the patch below a try? (I didn't test it at all, just compiled it ;)) You might have to tweak the tcp_retries* sysctls to get a faster subflow timeout.
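For context, the retransmission sysctls referred to above can be tightened like this; the value is illustrative, not one recommended in the thread:

# tcp_retries2 bounds how long an established connection keeps
# retransmitting before being declared dead. The default of 15
# works out to roughly 15 minutes with exponential backoff;
# 8 brings that down to a couple of minutes.
sysctl -w net.ipv4.tcp_retries2=8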
wapsi commented on 22 Nov 2016
Hmmm... A packet trace is quite difficult to take because it could take 1 hour or 2 days before this happens, and the capture file will be HUGE... Yes, I have NAT between these MPTCP boxes. Here are the NAT TCP timeout settings (it's a Linux box):
And I've opened the SSH tunnel using the settings ServerAliveInterval 10 and ServerAliveCountMax 3 on the client side, and ClientAliveInterval 10, ClientAliveCountMax 3 and TCPKeepAlive yes on the server side (spelled out in the config sketch after this comment). If I understand those settings correctly, they should avoid TCP timeout issues. Here are some stats from netstat commands (I have 3 gateways and 3 subflows, and I exclude ^mptcp connections from this list because I want to list the subflows):
And after several hours those are something like:
So some of the subflows have really dropped out. Now if I restart the SSH sessions, all the TCP subflows get established again. Another approach is to run the following commands:
And then new subflows are established again, using all available gateways. Update: I'll try the patch you just sent.
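For reference, the keepalive settings wapsi describes map onto the following OpenSSH configuration (standard option names; the file locations are the usual defaults):

# Client side (~/.ssh/config): probe the server every 10 seconds,
# give up after 3 unanswered probes (~30 s).
Host *
    ServerAliveInterval 10
    ServerAliveCountMax 3

# Server side (/etc/ssh/sshd_config):
ClientAliveInterval 10
ClientAliveCountMax 3
TCPKeepAlive yes

As cpaasch explains below, these keep only one subflow alive under the current Linux MPTCP implementation, so they don't prevent the remaining subflows from timing out in the NAT.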
cpaasch commented on 22 Nov 2016
The keepalives unfortunately are not a safe solution in today's implementation of MPTCP in Linux, because we chose to keep only the MPTCP connection alive. That means TCP keepalives are sent on at most one single subflow, so the other subflows time out. The keepalive handling is probably something we should rethink.
wapsi commented on 22 Nov 2016
Just tested with your patch applied and the create_on_err parameter set:
MPTCP sysctl settings used:
and still some TCP subflows get disconnected (after ~5 hours). Again, if I run:
the situation is fixed and new subflows are opened using all available gateways. I used your patch only on the client side; I assumed it is only necessary there.
cpaasch commented on 23 Nov 2016
Yes, it's only needed on the client side. You should also change the tcp_retries sysctls to get faster timeouts:
Please also take a packet-trace to see if you really get a timeout.
cpaasch commented on 29 Nov 2016
cpaasch added the enhancement label and removed the question label on 29 Nov 2016
wapsi commented on 29 Nov 2016
I'm not able to get a valid packet-trace at the moment. If I try to do this with tcpdump, the capture file gets so huge before the first subflow drops that it doesn't fit on my MPTCP router's HDD. Do you have any tips on how to do this sensibly?
cpaasch commented on 2 Dec 2016
@wapsi You can limit the size of the packet-trace with the option
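tcpdump's ring-buffer options bound the disk usage; an invocation like the following (interface, filter, and sizes are illustrative) captures headers only and overwrites the oldest file once ten 100 MB files exist:

# -s 100: keep only the first 100 bytes of each packet (headers)
# -C 100 -W 10: rotate through at most ten ~100 MB capture files
tcpdump -i eth0 -s 100 -C 100 -W 10 -w /tmp/mptcp-trace.pcap port 22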
djbobo commented on 18 Dec 2016
Hi. Server with two interfaces, each with a public (routable) IPv4 address. The initial connection originates from the client (behind NAT); the full mesh is established as expected, in this case 2x2. However, after some time one subflow drops and it remains at 3 subflows. Using OpenVPN (TCP) instead of SSH, and your Debian kernel build from https://dl.bintray.com/cpaasch/deb
cpaasch commented on 19 Dec 2016
It would be good if someone can test the patch and confirm whether it really solves the problem.
djbobo commented on 19 Dec 2016
I've been running the patched kernel for 9 hours. Everything looks good so far.