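Summary: This article analyzes Google's improvement to the TCP fast recovery algorithm, namely Proportional Rate Reduction (PRR) for TCP.
Kernel version: 3.2.12
Author: zhangskd @ csdn blog
Patch: Proportional Rate Reduction for TCP.
This patch is included in kernel versions from 3.2 onward.
Patch description
The following is the description of this patch by its submitter, Nandita Dukkipati: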
This patch implements Proportional Rate Reduction (PRR) for TCP.
PRR is an algorithm that determines TCP's sending rate in fast recovery.
PRR avoids excessive window reductions and aims for the actual congestion window size
to be as close as possible to the window determined by the congestion control algorithms.
PRR also improves accuracy of the amount of data sent during loss recovery.
This patch implements the recommended flavor of PRR called PRR-SSRB
(Proportional rate reduction with slow start reduction bound) and replaces the existing rate
halving algorithm. PRR improves upon the existing Linux fast recovery under a number of
conditions including:
1) burst losses where the losses implicitly reduce the amount of outstanding data (pipe)
below the ssthresh value selected by the congestion control algorithm and,
2) losses near the end of short flows where application runs out of data to send.
As an example, with the existing rate halving implementation a single loss event can cause
a connection carrying short web transactions to go into the slow start mode after the recovery.
This is because during recovery Linux pulls the congestion window down to packets_in_flight + 1
on every ACK. A short Web response often runs out of new data to send and its pipe reduces to
zero by the end of recovery when all its packets are drained from the network. Subsequent HTTP
responses using the same connection will have to slow start to raise cwnd to ssthresh. PRR on
the other hand aims for the cwnd to be as close as possible to ssthresh by the end of recovery.
Patch implementation
The core code is shown below; see the corresponding patch for the complete code.
@include/linux/tcp.h
struct tcp_sock {
    ...
    /* Congestion window at start of Recovery,
     * i.e. the cwnd before entering Recovery.
     */
    u32 prior_cwnd;

    /* Number of newly delivered packets to receiver in Recovery.
     * In effect this measures data_rate_at_the_receiver,
     * the rate at which data leaves the network.
     */
    u32 prr_delivered;

    /* Total number of pkts sent during Recovery.
     * In effect this measures sending_rate,
     * the rate at which data enters the network.
     */
    u32 prr_out;
    ...
};
@net/ipv4/tcp_input.c
static inline void tcp_complete_cwr(struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);

    /* Do not moderate cwnd if it's already undone in cwr or recovery. */
    if (tp->undo_marker) {
        if (inet_csk(sk)->icsk_ca_state == TCP_CA_CWR)
            tp->snd_cwnd = min(tp->snd_cwnd, tp->snd_ssthresh);
        else /* PRR */
            tp->snd_cwnd = tp->snd_ssthresh; /* avoid entering slow start unnecessarily */
        tp->snd_cwnd_stamp = tcp_time_stamp;
    }
    tcp_ca_event(sk, CA_EVENT_COMPLETE_CWR);
}
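The effect of tcp_complete_cwr() on the final window can be sketched in user space. This is only an illustration under stated assumptions: the enum, the function name cwnd_at_end_of_recovery and its parameters are hypothetical stand-ins for the kernel fields shown above, and the timestamp update and the congestion control event are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the TCP_CA_CWR and TCP_CA_Recovery states. */
enum ca_state { CA_CWR, CA_RECOVERY };

/* Returns the cwnd left in place when CWR or Recovery completes.
 * undo_marker == 0 models "the reduction was already undone".
 */
static uint32_t cwnd_at_end_of_recovery(enum ca_state state, uint32_t undo_marker,
                                        uint32_t snd_cwnd, uint32_t snd_ssthresh)
{
    assert(snd_ssthresh >= 1);      /* assumption: a window of at least one segment */

    if (!undo_marker)
        return snd_cwnd;            /* do not moderate cwnd */
    if (state == CA_CWR)            /* CWR: never raise cwnd, only cap it */
        return snd_cwnd < snd_ssthresh ? snd_cwnd : snd_ssthresh;
    return snd_ssthresh;            /* PRR: end recovery exactly at ssthresh */
}
```

With these assumptions, a recovery that ends with cwnd = 1 and ssthresh = 7 leaves the connection at 7 rather than forcing a slow start from 1.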
/* This function implements the PRR algorithm, specifically the PRR-SSRB
 * (proportional rate reduction with slow start reduction bound) as described in
 * http://www.ietf.org/id/draft-mathis-tcpm-proportional-rate-reduction-01.txt.
 * It computes the number of packets to send (sndcnt) based on packets newly
 * delivered:
 * 1) If the packets in flight is larger than ssthresh, PRR spreads the cwnd
 *    reductions across a full RTT.
 * 2) If packets in flight is lower than ssthresh (such as due to excess losses
 *    and/or application stalls), do not perform any further cwnd reductions, but
 *    instead slow start up to ssthresh.
 */
static void tcp_update_cwnd_in_recovery(struct sock *sk, int newly_acked_sacked,
                                        int fast_rexmit, int flag)
{
    struct tcp_sock *tp = tcp_sk(sk);
    int sndcnt = 0; /* number of packets that may be sent for this ACK */
    int delta = tp->snd_ssthresh - tcp_packets_in_flight(tp);

    if (tcp_packets_in_flight(tp) > tp->snd_ssthresh) {
        /* Main idea: sending_rate = CC_reduction_factor * data_rate_at_the_receiver.
         * Reduce pipe proportionally, using the reduction factor chosen by the
         * congestion control algorithm, so that pipe converges to snd_ssthresh.
         */
        u64 dividend = (u64) tp->snd_ssthresh * tp->prr_delivered + tp->prior_cwnd - 1;
        sndcnt = div_u64(dividend, tp->prior_cwnd) - tp->prr_out;
    } else {
        /* tp->prr_delivered - tp->prr_out is first used to undo the earlier
         * reductions of pipe, i.e. to restore packet conservation.
         * After that, tp->prr_delivered < tp->prr_out, because we are now in slow
         * start and the number of packets in the network grows: for each ACK,
         * sndcnt = newly_acked_sacked + 1, which grows pipe by 1 (slow start).
         * delta bounds the result so that pipe converges to snd_ssthresh.
         */
        sndcnt = min_t(int, delta,
                       max_t(int, tp->prr_delivered - tp->prr_out,
                             newly_acked_sacked) + 1);
    }

    sndcnt = max(sndcnt, (fast_rexmit ? 1 : 0));
    tp->snd_cwnd = tcp_packets_in_flight(tp) + sndcnt;
}
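To experiment with the per-ACK computation outside the kernel, the same arithmetic can be written as a small user-space function. This is a sketch under assumptions, not the kernel code: struct prr_state and prr_sndcnt are hypothetical names, in_flight stands in for tcp_packets_in_flight(), and the caller is assumed to have already added the newly delivered packets to prr_delivered before the call.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space mirror of the tcp_sock fields used by PRR. */
struct prr_state {
    uint32_t snd_ssthresh;
    uint32_t prior_cwnd;
    uint32_t prr_delivered;   /* assumed to include the current ACK already */
    uint32_t prr_out;
    uint32_t in_flight;       /* stand-in for tcp_packets_in_flight() */
};

/* Number of packets that may be sent in response to one ACK. */
static int prr_sndcnt(const struct prr_state *s, int newly_acked_sacked, int fast_rexmit)
{
    int sndcnt;
    int delta = (int)s->snd_ssthresh - (int)s->in_flight;

    assert(s->prior_cwnd > 0);

    if (s->in_flight > s->snd_ssthresh) {
        /* Proportional part: ceil(ssthresh * delivered / prior_cwnd) - out. */
        uint64_t dividend = (uint64_t)s->snd_ssthresh * s->prr_delivered
                            + s->prior_cwnd - 1;
        sndcnt = (int)(dividend / s->prior_cwnd) - (int)s->prr_out;
    } else {
        /* Slow start part, bounded by delta so pipe does not pass ssthresh. */
        int banked = (int)s->prr_delivered - (int)s->prr_out;
        int limit = (banked > newly_acked_sacked ? banked : newly_acked_sacked) + 1;
        sndcnt = delta < limit ? delta : limit;
    }

    /* Always allow the fast retransmit itself; never return a negative count. */
    if (sndcnt < (fast_rexmit ? 1 : 0))
        sndcnt = fast_rexmit ? 1 : 0;
    return sndcnt;
}
```

The new cwnd would then be in_flight + prr_sndcnt(...), which is how the kernel function above turns the per-ACK quota into a window.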
@tcp_ack()
/* count the number of new bytes that the current acknowledgement indicates have
 * been delivered to the receiver.
 * newly_acked_sacked = delta(snd.una) + delta(SACKed)
 */
newly_acked_sacked = (prior_packets - tp->packets_out) +
                     (tp->sacked_out - prior_sacked);
...
tcp_fastretrans_alert(sk, prior_packets - tp->packets_out, newly_acked_sacked, flag);
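The following sketch works through that sum with plain integers. The function name count_newly_delivered is hypothetical, and the four arguments are assumed to be snapshots of packets_out and sacked_out taken before and after the ACK was processed.

```c
#include <assert.h>

/* delta(snd.una) + delta(SACKed), both measured in packets. */
static int count_newly_delivered(int prior_packets, int packets_out,
                                 int prior_sacked, int sacked_out)
{
    /* An ACK can only remove packets from the retransmit queue. */
    assert(prior_packets >= packets_out);

    /* The second term may be negative: a packet that was SACKed earlier and is
     * now cumulatively ACKed leaves sacked_out, so it is not counted twice.
     */
    return (prior_packets - packets_out) + (sacked_out - prior_sacked);
}
```

For example, if an ACK cumulatively acknowledges two packets, one of which had already been SACKed, packets_out drops from 10 to 8 and sacked_out drops from 3 to 2, so only one packet is counted as newly delivered.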
Background
Old fast recovery algorithms
There are two widely deployed algorithms used to adjust the congestion window during fast recovery:
(1) standard algorithm described in RFC3517
(2) non-standard algorithm implemented in Linux, rate halving
Shortcomings of the old algorithms
Linux suffers from excessive congestion window reductions while RFC3517 transmits
large bursts under high losses.
In practice both can be either too conservative or too aggressive, resulting in a long recovery time or
in excessive retransmissions.
The new fast recovery algorithm
A new fast recovery algorithm, proportional rate reduction (PRR).
PRR has been approved to become the default Linux fast recovery algorithm for Linux 3.x.
Fast recovery in RFC standards
(1) Algorithm
Algorithm: RFC 3517 fast recovery
On entering recovery:
    // cwnd used during and after recovery.
    cwnd = ssthresh = FlightSize / 2
    // Retransmit first missing segment.
    fast_retransmit()
    // Transmit more if cwnd allows.
    Transmit MAX(0, cwnd - pipe)
For every ACK during recovery:
    update_scoreboard()
    pipe = (RFC 3517 pipe algorithm)
    Transmit MAX(0, cwnd - pipe)
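The two window rules in this pseudocode can be sketched in C as follows. The struct and function names are hypothetical, windows are counted in segments rather than bytes, and the scoreboard and pipe computation of RFC 3517 are not modelled; pipe is simply passed in.

```c
#include <assert.h>

/* Hypothetical container for the two window variables. */
struct rfc3517_state {
    int cwnd;
    int ssthresh;
};

/* On entering recovery: cwnd = ssthresh = FlightSize / 2. */
static struct rfc3517_state rfc3517_enter_recovery(int flight_size)
{
    struct rfc3517_state s;

    assert(flight_size >= 0);
    s.ssthresh = flight_size / 2;
    s.cwnd = s.ssthresh;
    return s;
}

/* For every ACK: the sender may transmit MAX(0, cwnd - pipe) segments. */
static int rfc3517_may_send(const struct rfc3517_state *s, int pipe)
{
    int room = s->cwnd - pipe;

    return room > 0 ? room : 0;
}
```

With a FlightSize of 20 segments the window drops to 10 at once, so nothing can be sent while pipe is still 12, and three segments can be sent in one step once pipe has fallen to 7.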
(2) Drawbacks
Standard can be either too aggressive or too conservative.
1. Half RTT silence
The algorithm waits for half of the received ACKs to pass by before transmitting anything after
the first fast retransmit. This is because cwnd is brought down to ssthresh in one step, so it
takes cwnd - ssthresh ACKs for pipe to go below cwnd, creating a silent period for half of an
RTT. This design wastes precious opportunities to transmit which sometimes result in nothing
being sent during recovery. This in turn increases the chances of timeouts.
2. Aggressive and bursty retransmissions
The standard can transmit large bursts on a single received ACK. This is because cwnd - pipe
can be arbitrarily large under burst losses or inaccurate estimation of losses. Furthermore, the
more losses there are, the larger the bursts transmitted by the standard.
Note that both problems occur in the context of heavy losses, wherein greater than or equal to
half of the cwnd is lost. Such heavy losses are surprisingly common for both Web and YouTube.
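Both problems can be reproduced with a deliberately simplified model. The assumptions are mine, not RFC 3517's: pipe starts at FlightSize, every ACK removes exactly one segment from pipe, and when a burst of losses is detected all lost segments leave pipe at once. The function names are hypothetical.

```c
#include <assert.h>

/* Number of ACKs after the fast retransmit that allow nothing to be sent. */
static int rfc3517_silent_acks(int flight_size)
{
    int cwnd = flight_size / 2;
    int pipe = flight_size;
    int silent = 0;

    assert(flight_size >= 2);
    for (;;) {
        pipe--;                     /* one ACK arrives */
        if (cwnd - pipe > 0)
            break;                  /* first ACK that opens the window */
        silent++;
    }
    return silent;
}

/* Segments released by a single ACK when `lost` segments are marked lost at once. */
static int rfc3517_burst(int flight_size, int lost)
{
    int cwnd = flight_size / 2;
    int pipe = flight_size - lost;
    int room;

    assert(lost >= 0 && lost <= flight_size);
    room = cwnd - pipe;
    return room > 0 ? room : 0;
}
```

Under this model a flight of 20 segments gives 10 silent ACKs, about half an RTT, and the burst grows from 5 to 8 segments as the number of losses grows from 15 to 18.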
Fast recovery in Linux
(1) Algorithm
Linux implements the rate halving algorithm in recovery.
When cwnd is reduced, Linux sends data in response to alternate ACKs during recovery,
instead of waiting for cwnd/2 dupacks to pass as specified in the standard.
(2) Drawbacks
A minor problem with rate halving is that it is based on the original Reno TCP that always
halved the cwnd during fast recovery. Several modern congestion control algorithms, such
as CUBIC, reduce the window by less than 50% so unconditionally halving the rate is no
longer appropriate.
While in recovery, Linux prevents bursts by reducing cwnd to pipe+1 on every ACK that reduces
pipe for some reason. This implies that if there is insufficient new data available (e.g., because
the application temporarily stalls), cwnd can become one by the end of recovery. If this happens,
then after recovery the sender will slow start from a very small cwnd even though only one segment
was lost and there was no timeout!
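The collapse to cwnd = 1 can be seen in a small model of rate halving. This is a hypothetical simplification and not the kernel's implementation: cwnd is lowered by one on every second ACK until it reaches a floor (cwnd_min, assumed here to play the role of the target window), and after every ACK it is clamped to pipe + 1.

```c
#include <assert.h>

/* Hypothetical state for the simplified rate-halving model. */
struct rh_state {
    unsigned cwnd;
    unsigned ack_cnt;
    unsigned cwnd_min;   /* assumed floor for the alternate-ACK decrement */
};

/* Process one ACK, given the pipe (packets in flight) left after that ACK. */
static void rate_halving_on_ack(struct rh_state *s, unsigned in_flight)
{
    assert(s->cwnd >= 1);

    s->ack_cnt++;
    if ((s->ack_cnt & 1) == 0 && s->cwnd > s->cwnd_min)
        s->cwnd--;                      /* reduce on alternate ACKs */
    if (s->cwnd > in_flight + 1)
        s->cwnd = in_flight + 1;        /* burst prevention: cwnd <= pipe + 1 */
}

/* Drain `in_flight` packets with no new data to send; return the final cwnd. */
static unsigned rate_halving_drain(struct rh_state *s, unsigned in_flight)
{
    while (in_flight > 0) {
        in_flight--;                    /* each ACK removes one packet */
        rate_halving_on_ack(s, in_flight);
    }
    return s->cwnd;
}
```

Starting from cwnd = 10 with a floor of 5 and ten packets in flight, the floor stops the alternate-ACK reduction at 5, but the pipe + 1 clamp keeps pulling cwnd down as the pipe drains, and the model ends recovery with cwnd = 1.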
The main drawbacks of the fast recovery in Linux are its excessive window reductions and conservative
retransmissions, which occur for the following reasons:
1. Slow start after recovery
Even for a single loss event, a connection carrying short Web responses can complete recovery with a
very small cwnd, such that subsequent responses using the same connection will slow start even
when not otherwise required.
2. Conservative retransmissions
There are at least two scenarios where retransmissions in Linux are overly conservative. In the presence
of heavy losses, when pipe falls below ssthresh, Linux (re)transmits at most one packet per ACK during
the rest of recovery. As a result, recovery is either prolonged or it enters an RTO.
A second scenario is that rate halving assumes every ACK represents one data packet delivered. However,
lost ACKs will cause Linux to retransmit less than half of the congestion window.
PRR-SSRB
PRR-SSRB (proportional rate reduction with slow start reduction bound).
(1) Overview
Some properties of the algorithm:
Spreads out window reduction evenly across the recovery period.
For moderate loss, converges to target cwnd chosen by CC.
Maintains ACK clocking even for large burst losses.
Precision of PRR-SSRB is derived from DeliveredData, which is not an estimator.
Banks the missed opportunities to send if application stalls during recovery.
Less sensitive to errors of the pipe estimator.
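(This is because sndcnt does not depend directly on pipe; it is controlled through prr_delivered and prr_out.)
In short, for every N packets that leave the network (received or buffered by the receiver), beta * N packets are sent (retransmissions or new data).
When the number of packets in the network is conserved:
sndcnt = prr_delivered - prr_out; /* exchange rate of 1 : 1 */
When the number of packets in the network is reduced proportionally:
sndcnt = (snd_ssthresh / prior_cwnd) * prr_delivered - prr_out; /* exchange rate of 1/beta : 1 */
For example, with CUBIC, beta is 0.7. If 10 packets have been received or buffered by the receiver (that is, have left the network), 7 packets may be sent (retransmissions or new data).
The number of packets in the network therefore drops by 3, and over the whole of fast recovery the amount of data in the network gradually falls to 0.7 of its initial value.
This is what proportional reduction means; here the reduction is 30%.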
        <------------------ N packets delivered ------------------
Server                                                              Client
        ------------------ send 0.7 * N packets ----------------->
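In effect, prr_delivered and prr_out are used to measure, in real time, the number of segments in the network (which is why PRR is more accurate: this is a measurement, not a guess).
The process is gradual, a continuous exchange, and in the end both the amount of data in the network and the congestion window converge to snd_ssthresh.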
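The 10-to-7 exchange can be checked with a few lines of C. The sketch assumes that pipe stays above ssthresh for the whole run, that each ACK delivers exactly one packet, and that the sender always has data for every packet it is allowed to send; prr_total_sent is a hypothetical name. The rounding is the same round-up division used in the kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Total packets sent after `delivered_total` packets have left the network. */
static uint32_t prr_total_sent(uint32_t ssthresh, uint32_t prior_cwnd,
                               uint32_t delivered_total)
{
    uint32_t prr_delivered = 0;
    uint32_t prr_out = 0;

    assert(prior_cwnd > 0 && ssthresh <= prior_cwnd);

    while (prr_delivered < delivered_total) {
        uint64_t dividend;
        uint32_t sndcnt;

        prr_delivered++;                                   /* one packet delivered */
        dividend = (uint64_t)ssthresh * prr_delivered + prior_cwnd - 1;
        sndcnt = (uint32_t)(dividend / prior_cwnd) - prr_out;
        prr_out += sndcnt;                                 /* send what is allowed */
    }
    return prr_out;
}
```

With prior_cwnd = 10 and ssthresh = 7 (the CUBIC case above), 10 delivered packets allow 7 to be sent, and after the first 5 deliveries 4 have been sent, so the reduction is spread across the ACKs rather than applied in one step.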
(2) Detailed principle
The PRR algorithm determines the number of segments to be sent per ACK during recovery to balance two goals:
1) a speedy and smooth recovery from losses
2) end recovery at a congestion window close to ssthresh
The foundation of the algorithm is Van Jacobson's packet conservation principle:
segments delivered to the receiver are used as the clock to trigger sending additional segments into the network.
The algorithm consists of two parts:
1) The first part, the proportional part, is active when the number of outstanding segments
(pipe) is larger than ssthresh, which is typically true early during the recovery and under light losses. It gradually
reduces the congestion window clocked by the incoming acknowledgements. The algorithm is patterned after
rate halving, but uses a fraction that is appropriate for the target window chosen by the congestion control algorithm.
For example, when operating with CUBIC congestion control, the proportional part achieves the 30% window
reduction by spacing out seven new segments for every ten incoming ACKs (more precisely, for ACKs reflecting 10
segments arriving at the receiver).
At the end of recovery, prr_delivered approaches RecoverFS and prr_out approaches ssthresh.
2) If pipe becomes smaller than ssthresh (such as due to excess losses or application stalls during recovery), the
second part of the algorithm attempts to inhibit any further congestion window reductions. Instead it performs slow
start to build the pipe back up to ssthresh subject to the availability of new data to send.
It achieves this by first undoing the past congestion window reductions performed by the PRR part, reflected in the
difference prr_delivered - prr_out. Second, it grows the congestion window just like TCP does in slow start algorithm.
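A short simulation shows how the second part rebuilds the pipe. It is a sketch under assumptions: each ACK delivers exactly one segment, the sender always has data to send, pipe starts at or below ssthresh, and ssrb_pipe_after is a hypothetical name.

```c
#include <assert.h>

static int imax(int a, int b) { return a > b ? a : b; }
static int imin(int a, int b) { return a < b ? a : b; }

/* Returns pipe after `acks` ACKs have been processed by the slow start part. */
static int ssrb_pipe_after(int ssthresh, int pipe, int prr_delivered,
                           int prr_out, int acks)
{
    int i;

    assert(ssthresh >= 1 && pipe >= 0 && pipe <= ssthresh);

    for (i = 0; i < acks && pipe > 0; i++) {
        int delta, sndcnt;

        pipe--;                         /* the ACKed segment leaves the network */
        prr_delivered++;

        delta = ssthresh - pipe;        /* room left below ssthresh */
        /* First repay the banked difference, then grow like slow start. */
        sndcnt = imin(delta, imax(prr_delivered - prr_out, 1) + 1);
        sndcnt = imax(sndcnt, 0);

        pipe += sndcnt;
        prr_out += sndcnt;
    }
    return pipe;
}
```

Starting from ssthresh = 10, pipe = 3, prr_delivered = 4 and prr_out = 3, the first ACK releases three segments (two banked plus one), later ACKs release two each, and pipe reaches 10 after six ACKs and stays there because delta caps every further step.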
(3) Test results
RFC 3517 experiences (compared to PRR) :
2.6% more timeouts
3% more retransmissions
29% increase in detected lost retransmissions
Similar transaction times
Linux fast recovery Rate halving (compared to PRR) :
5% more timeouts
3-5% longer transaction times
lower cwnd values at the end of recovery
Possible bad influence of PRR :
PRR achieves good performance with a slight increase of retransmission rate.