Notation
| Symbol | Meaning |
| --- | --- |
| \(F(w)\) | Overall objective function |
| \(w\) | Parameters to optimize |
| \(F_k(w)\) | Objective function of the \(k\)-th client |
| \(p_k\) | Weight of the \(k\)-th client in the overall objective |
| \(E\) | Number of local updates per communication round |
| \(T\) | Total number of iterations; the number of communication rounds is \(T/E\) |
| \(\eta_t\) | Learning rate at time \(t\) |
| \(L\) | Each \(F_k\) is \(L\)-smooth, i.e. \(\nabla^2 F_k(w) \preceq L I\) |
| \(\mu\) | Each \(F_k\) is \(\mu\)-strongly convex, i.e. \(\nabla^2 F_k(w) \succeq \mu I\) |
| \(\sigma_k\) | \(E\Vert\nabla F_k(w_t^k, \xi_t^k) - \nabla F_k(w_t^k)\Vert^2 \leq \sigma_k^2\) |
| \(G\) | \(E\Vert\nabla F_k(w_t^k, \xi_t^k)\Vert^2 \leq G^2\) |
| \(F^*\) | Optimal value of \(F\) |
| \(F_k^*\) | Optimal value obtained by optimizing the \(k\)-th client alone |
| \(\Gamma\) | \(F^* - \sum_k p_k F_k^*\), which measures heterogeneity |
| \(\kappa\) | \(\frac{L}{\mu}\), which can be viewed as a condition number |
| \(w^*\) | Optimal parameters |
| \(\xi_t^k\) | Sample drawn by client \(k\) at time \(t\) for the stochastic gradient step |
Assumptions
- Each \(F_k\) is \(L\)-smooth, for all \(k\).
- Each \(F_k\) is \(\mu\)-strongly convex, for all \(k\).
- The variance of the stochastic gradient computed by each client \(k\) is bounded by \(\sigma_k^2\).
- The expected squared norm of every stochastic gradient is bounded by \(G^2\); a concrete example follows below.
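As a concrete sanity check (my own example, not from the notes): a least-squares objective satisfies these assumptions,
\[
\begin{align*}
F_k(w) = \tfrac{1}{2}\Vert A_k w - b_k\Vert^2, \qquad \nabla^2 F_k(w) = A_k^\top A_k,
\end{align*}
\]
so \(F_k\) is \(L\)-smooth with \(L = \lambda_{\max}(A_k^\top A_k)\) and \(\mu\)-strongly convex with \(\mu = \lambda_{\min}(A_k^\top A_k) > 0\) whenever \(A_k\) has full column rank. Sampling mini-batches of rows gives unbiased stochastic gradients with bounded variance; note that for quadratics the bounded-norm assumption \(G^2\) only holds on a bounded region of iterates.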
Convergence proof under full participation
Introduce two sequences, \(v_t^k\) and \(w_t^k\):
\[
\begin{align*}
v_{t+1}^k &= w_t^k - \eta_t \nabla F_k(w_t^k, \xi_t^k)\\
w_{t+1}^k &= \begin{cases} v_{t+1}^k & \text{for } t+1 \notin I_E\\ \sum_{k=1}^{N} p_k v_{t+1}^k & \text{for } t+1 \in I_E\end{cases}
\end{align*}
\]
Define
\[
\begin{align*}
\bar v_t &= \sum\nolimits_k p_k v_t^k\\
\bar w_t &= \sum\nolimits_k p_k w_t^k\\
\bar g_t &= \sum\nolimits_k p_k \nabla F_k(w_t^k)\\
g_t &= \sum\nolimits_k p_k \nabla F_k(w_t^k, \xi_t^k)
\end{align*}
\]
The updated parameters \(w\) are only exchanged in communication rounds \(t \in I_E\); the variable \(v\) tracks the parameters in rounds where no communication takes place. Under full participation, \(\bar v_t = \bar w_t\) for all \(t\), and \(\bar v_{t+1} = \bar w_t - \eta_t g_t\).
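A minimal simulation of this two-sequence scheme (my own sketch; the quadratic objectives, values of `N`, `E`, the step size, and the noise scale are illustrative assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
N, E, T, eta, d = 4, 5, 100, 0.05, 3   # clients, local steps, iterations, lr, dim
p = np.full(N, 1.0 / N)                # uniform client weights p_k
targets = rng.normal(size=(N, d))      # F_k(w) = 0.5 * ||w - targets[k]||^2

w = np.zeros((N, d))                   # w_t^k for every client
for t in range(T):
    # local SGD step: v_{t+1}^k = w_t^k - eta * grad F_k(w_t^k, xi_t^k)
    noise = rng.normal(scale=0.1, size=(N, d))   # stochastic-gradient noise
    v = w - eta * ((w - targets) + noise)
    if (t + 1) % E == 0:               # t+1 in I_E: synchronize (full participation)
        w = np.tile(p @ v, (N, 1))     # every client receives sum_k p_k v_{t+1}^k
    else:                              # t+1 not in I_E: keep local iterates
        w = v

print("final averaged iterate:", w[0])
print("optimum sum_k p_k targets:", p @ targets)
```

The averaged iterate should approach \(\sum_k p_k\) targets, the minimizer of \(\sum_k p_k F_k\), up to noise of order the step size.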
My understanding:
To prove convergence, we need to show that the parameters converge. Since \(\bar w_t\) is produced by gradient descent, we need to show
\[
E\Vert \bar w_{t+1} - w^*\Vert \leq l(\Vert \bar w_{t} - w^*\Vert)
\]
i.e. the distance between the current iterate and the optimum \(w^*\) is smaller than the distance at the previous iteration, and the function \(l\) can be applied recursively. In other words, the upper bound on the distance between the iterate and the optimum \(w^*\) decreases over time.
The paper bounds \(\bar v - w^*\) rather than \(\bar w - w^*\), because \(\bar v\) aggregates over all clients; under partial participation \(\bar w\) would be biased.
\[
\begin{align*}
E\Vert \bar v_{t+1} - w^*\Vert^2 &= E\Vert \bar w_t - \eta_t g_t - w^* - \eta_t \bar g_t + \eta_t \bar g_t\Vert^2\\
&= E\left(\Vert \bar w_t - \eta_t \bar g_t - w^*\Vert^2 + 2\eta_t \langle \bar w_t - \eta_t \bar g_t - w^*,\ \bar g_t - g_t\rangle + \eta_t^2 \Vert \bar g_t - g_t\Vert^2\right) \label{eq:1}\tag{1}
\end{align*}
\]
The reason for adding and subtracting \(\eta_t \bar g_t\) above is to use \(E(\eta_t g_t - \eta_t \bar g_t) = 0\) to split \eqref{eq:1}.
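Spelling out why the cross term drops (my own added step): conditioned on the randomness up to time \(t\), \(\bar w_t\) and \(\bar g_t\) are fixed, and the stochastic gradient is unbiased, \(E\,g_t = \sum_k p_k E\,\nabla F_k(w_t^k, \xi_t^k) = \bar g_t\), so
\[
\begin{align*}
E\,\langle \bar w_t - \eta_t \bar g_t - w^*,\ \bar g_t - g_t\rangle
= \langle \bar w_t - \eta_t \bar g_t - w^*,\ \bar g_t - E\, g_t\rangle = 0.
\end{align*}
\]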
Continuing with \(E\left(\Vert \bar w_t - \eta_t \bar g_t - w^*\Vert^2\right)\):
\[
\begin{align*}
\Vert \bar w_t - \eta_t \bar g_t - w^*\Vert^2 &= \Vert \bar w_t - w^*\Vert^2 - 2\eta_t\langle \bar w_t - w^*, \bar g_t\rangle + \eta_t^2\Vert \bar g_t\Vert^2 \label{eq:2}\tag{2}
\end{align*}
\]
By \(L\)-smoothness [1],
\[
\begin{align*}
\Vert \nabla F_k(w_t^k)\Vert^2 \leq 2L(F_k(w_t^k) - F_k^*) \label{eq:3}\tag{3}
\end{align*}
\]
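A quick derivation of \eqref{eq:3} (standard; sketched here for completeness): evaluate the smoothness upper bound at one gradient step,
\[
\begin{align*}
F_k^* \leq F_k\!\left(w - \tfrac{1}{L}\nabla F_k(w)\right)
\leq F_k(w) - \tfrac{1}{L}\Vert\nabla F_k(w)\Vert^2 + \tfrac{1}{2L}\Vert\nabla F_k(w)\Vert^2
= F_k(w) - \tfrac{1}{2L}\Vert\nabla F_k(w)\Vert^2,
\end{align*}
\]
and rearrange to get \(\Vert\nabla F_k(w)\Vert^2 \leq 2L(F_k(w) - F_k^*)\).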
Since the squared norm is convex, Jensen's inequality combined with \eqref{eq:3} gives
\[
\begin{align*}
\eta_t^2 \Vert \bar g_t \Vert^2 &\leq \eta_t^2 \sum\nolimits_k p_k \Vert \nabla F_k(w_t^k)\Vert^2\\
&\leq 2L\eta_t^2 \sum\nolimits_k p_k (F_k(w_t^k) - F_k^*)
\end{align*}
\]
Expanding \(-2\eta_t \langle \bar w_t - w^*, \bar g_t\rangle\):
\[
\begin{align*}
-2\eta_t \langle \bar w_t - w^*, \bar g_t\rangle &= -2\eta_t \sum\nolimits_k p_k \langle \bar w_t - w^*, \nabla F_k(w_t^k)\rangle\\
&= -2\eta_t \sum\nolimits_k p_k \langle \bar w_t - w_t^k, \nabla F_k(w_t^k)\rangle - 2\eta_t \sum\nolimits_k p_k \langle w_t^k - w^*, \nabla F_k(w_t^k)\rangle
\end{align*}
\]
By the Cauchy–Schwarz and AM–GM inequalities,
\[
\begin{align*}
-2\langle \bar w_t - w_t^k, \nabla F_k(w_t^k)\rangle \leq \frac{1}{\eta_t}\Vert \bar w_t - w_t^k\Vert^2 + \eta_t \Vert \nabla F_k(w_t^k)\Vert^2
\end{align*}
\]
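Explicitly (my own filling-in): with \(a = \bar w_t - w_t^k\) and \(b = \nabla F_k(w_t^k)\),
\[
\begin{align*}
-2\langle a, b\rangle \leq 2\Vert a\Vert\,\Vert b\Vert
= 2\left(\tfrac{1}{\sqrt{\eta_t}}\Vert a\Vert\right)\left(\sqrt{\eta_t}\,\Vert b\Vert\right)
\leq \tfrac{1}{\eta_t}\Vert a\Vert^2 + \eta_t\Vert b\Vert^2.
\end{align*}
\]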
By \(\mu\)-strong convexity,
\[
\begin{align*}
-\langle w_t^k - w^*, \nabla F_k(w_t^k)\rangle \leq -(F_k(w_t^k)-F_k(w^*)) - \frac{\mu}{2}\Vert w_t^k-w^*\Vert^2
\end{align*}
\]
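This is just the strong-convexity inequality rearranged (added for completeness):
\[
\begin{align*}
F_k(w^*) \geq F_k(w_t^k) + \langle \nabla F_k(w_t^k),\ w^* - w_t^k\rangle + \frac{\mu}{2}\Vert w^* - w_t^k\Vert^2.
\end{align*}
\]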
Therefore \eqref{eq:2} becomes
\[
\begin{align*}
\Vert \bar w_t - \eta_t \bar g_t - w^*\Vert^2 &= \Vert \bar w_t - w^*\Vert^2 - 2\eta_t\langle \bar w_t-w^*, \bar g_t\rangle +\eta_t^2\Vert \bar g_t\Vert^2\\
& \leq \Vert \bar w_t - w^*\Vert^2 + 2L\eta_t^2\sum\nolimits_k p_k(F_k(w_t^k) - F_k^*)\\
&\quad + \eta_t\sum\nolimits_k p_k\left(\frac{1}{\eta_t} \Vert \bar w_t - w_t^k\Vert^2+\eta_t \Vert \nabla F_k(w_t^k)\Vert^2\right)\\
&\quad - 2\eta_t \sum\nolimits_k p_k\left(F_k(w_t^k)-F_k(w^*) + \frac{\mu}{2}\Vert w_t^k-w^*\Vert^2\right)\\
& \leq(1-\mu\eta_t)\Vert \bar w_t- w^*\Vert^2 + \sum\nolimits_k p_k \Vert \bar w_t-w^k_t\Vert^2\\
&\quad + 4L\eta_t^2 \sum\nolimits_k p_k(F_k(w_t^k)-F_k^*) - 2\eta_t\sum\nolimits_k p_k(F_k(w_t^k)-F_k(w^*))
\end{align*}
\]
The last step applies \eqref{eq:3} to \(\eta_t^2\Vert\nabla F_k(w_t^k)\Vert^2\) (giving the combined \(4L\eta_t^2\) coefficient) and Jensen's inequality \(\sum_k p_k\Vert w_t^k - w^*\Vert^2 \geq \Vert \bar w_t - w^*\Vert^2\) to the \(\mu\) term.
Take the last two terms, \(4L\eta_t^2 \sum_k p_k(F_k(w_t^k)-F_k^*) - 2\eta_t\sum_k p_k(F_k(w_t^k)-F_k(w^*))\), define \(\gamma_t = 2\eta_t(1-2L\eta_t)\), and assume \(\eta_t\) is non-increasing in \(t\) (i.e. the learning rate decays) with \(\eta_t \leq \frac{1}{4L}\). Then \(\eta_t \leq \gamma_t \leq 2\eta_t\). Rearranging,
\[
\begin{align*}
&4L\eta_t^2 \sum\nolimits_k p_k(F_k(w_t^k)-F_k^*) - 2\eta_t\sum\nolimits_k p_k(F_k(w_t^k)-F_k(w^*))\\
&=-2\eta_t(1-2L\eta_t)\sum\nolimits_k p_k(F_k(w_t^k)-F_k^*)+2\eta_t\sum\nolimits_k p_k(F_k(w^*)-F_k^*)\\
&=-\gamma_t \sum\nolimits_k p_k(F_k(w_t^k)-F^*) + (2\eta_t - \gamma_t)\sum\nolimits_k p_k(F^*-F_k^*)\\
&=-\gamma_t \sum\nolimits_k p_k(F_k(w_t^k)-F^*) + 4L\eta_t^2\Gamma
\end{align*}
\]
The second equality splits \(F_k(w_t^k) - F_k^*\) into \((F_k(w_t^k) - F^*) + (F^* - F_k^*)\) and uses \(\sum_k p_k F_k(w^*) = F^*\); the third uses \(2\eta_t - \gamma_t = 4L\eta_t^2\) and the definition of \(\Gamma\). To bound \(\sum_k p_k(F_k(w_t^k) - F^*)\):
\[
\begin{align*}
\sum\nolimits_k p_k(F_k(w_t^k)-F^*) & = \sum\nolimits_k p_k (F_k(w_t^k)-F_k(\bar w_t)) + \sum\nolimits_k p_k(F_k(\bar w_t)-F^*)\\
& \geq \sum\nolimits_k p_k \langle \nabla F_k(\bar w_t),\ w_t^k-\bar w_t\rangle + (F(\bar w_t)-F^*)\\
&\geq -\frac 1 2\sum\nolimits_k p_k \left[\eta_t\Vert \nabla F_k(\bar w_t)\Vert^2 + \frac{1}{\eta_t}\Vert w_t^k -\bar w_t\Vert^2\right] + (F(\bar w_t)-F^*)\\
&\geq -\sum\nolimits_k p_k \left[\eta_t L(F_k(\bar w_t) - F_k^*) + \frac{1}{2\eta_t}\Vert w_t^k -\bar w_t\Vert^2\right] + (F(\bar w_t)-F^*)
\end{align*}
\]
The three inequalities use, in order: convexity of \(F_k\), the AM–GM bound above, and \eqref{eq:3}.
Putting it together,
\[
\begin{align*}
&4L\eta_t^2 \sum\nolimits_k p_k(F_k(w_t^k)-F_k^*) - 2\eta_t\sum\nolimits_k p_k(F_k(w_t^k)-F_k(w^*))\\
&\leq \gamma_t \sum\nolimits_k p_k \left[\eta_t L(F_k(\bar w_t) - F_k^*) +\frac{1}{2\eta_t}\Vert w_t^k -\bar w_t\Vert^2\right]-\gamma_t(F(\bar w_t)-F^*) + 4L\eta_t^2\Gamma\\
& = \gamma_t(\eta_tL-1)\sum\nolimits_k p_k(F_k(\bar w_t) - F^*) + (4L\eta_t^2+\gamma_t\eta_tL)\Gamma+\frac{\gamma_t}{2\eta_t}\sum\nolimits_k p_k \Vert w_t^k - \bar w_t\Vert^2\\
&\leq 6L\eta_t^2\Gamma + \sum\nolimits_k p_k \Vert w_t^k - \bar w_t\Vert^2
\end{align*}
\]
The last inequality holds because \(\eta_t L - 1 \leq -\frac{3}{4} < 0\) and \(\sum_k p_k(F_k(\bar w_t) - F^*) = F(\bar w_t) - F^* \geq 0\), so the first term can be dropped, while \(\gamma_t \leq 2\eta_t\) gives \(4L\eta_t^2 + \gamma_t\eta_t L \leq 6L\eta_t^2\) and \(\frac{\gamma_t}{2\eta_t} \leq 1\).
Therefore
\[
\Vert \bar w_t -\eta_t \bar g_t -w^*\Vert^2 \leq (1-\mu \eta_t)\Vert \bar w_t- w^*\Vert^2 + 6L\eta_t^2\Gamma + 2\sum\nolimits_k p_k \Vert w_t^k - \bar w_t\Vert^2
\]
Adding the bounded-variance assumption on the stochastic gradients, i.e.
\[
\begin{align*}
E\Vert g_t - \bar g_t\Vert^2 &= E\left\Vert \sum\nolimits_k p_k\left(\nabla F_k(w_t^k, \xi_t^k) - \nabla F_k(w_t^k)\right)\right\Vert^2\\
& \leq \sum\nolimits_k p_k^2 \sigma_k^2
\end{align*}
\]
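The bound holds because the \(\xi_t^k\) are independent across clients and each term has zero mean, so the cross terms vanish (my own added step): with \(X_k = \nabla F_k(w_t^k, \xi_t^k) - \nabla F_k(w_t^k)\),
\[
\begin{align*}
E\left\Vert \sum\nolimits_k p_k X_k\right\Vert^2
= \sum\nolimits_k p_k^2\, E\Vert X_k\Vert^2 + \sum\nolimits_{j\neq k} p_j p_k \langle E X_j,\ E X_k\rangle
= \sum\nolimits_k p_k^2\, E\Vert X_k\Vert^2.
\end{align*}
\]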
For the divergence term, let \(t_0 \leq t\) be the most recent synchronization step, so that \(t - t_0 \leq E-1\) and \(w_{t_0}^k = \bar w_{t_0}\) for all \(k\). Then
\[
\begin{align*}
E\sum\nolimits_k p_k \Vert \bar w_t - w^k_t\Vert^2 &= E\sum\nolimits_k p_k\Vert w^k_t-\bar w_{t_0}-(\bar w_t - \bar w_{t_0})\Vert^2\\
&\leq E\sum\nolimits_k p_k \Vert w_t^k - \bar w_{t_0}\Vert^2\\
&\leq \sum\nolimits_k p_k\, E\sum\nolimits_{t'=t_0}^{t-1} (E-1)\,\eta_{t'}^2 \Vert \nabla F_k(w_{t'}^k, \xi_{t'}^k)\Vert^2\\
&\leq 4\eta_t^2 (E-1)^2 G^2
\end{align*}
\]
The first inequality uses \(E\Vert X - EX\Vert^2 \leq E\Vert X\Vert^2\); the last uses the bounded gradient norm \(G^2\) together with the step-size condition \(\eta_{t'} \leq \eta_{t_0} \leq 2\eta_t\) for \(t_0 \leq t' \leq t\), which holds for the diminishing step sizes considered since \(t - t_0 \leq E-1\).
Finally, combining the pieces above (and using \(\bar w_{t+1} = \bar v_{t+1}\) under full participation), let \(\Delta_t = E\Vert \bar w_{t}-w^*\Vert^2\). Then
\[
\begin{align*}
\Delta_{t+1}\leq(1-\mu\eta_t)\Delta_t + \eta_t^2 B
\end{align*}
\]
where \(B=\sum\nolimits_k p_k^2 \sigma_k^2+6L\Gamma+8(E-1)^2G^2\).
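From here, a standard induction turns the recursion into an \(O(1/t)\) rate (my own sketch, under the assumed diminishing step size \(\eta_t = \frac{\beta}{t+\gamma}\) with \(\beta > \frac{1}{\mu}\) and \(\gamma > 0\)): one can verify \(\Delta_t \leq \frac{v}{\gamma+t}\) with \(v = \max\left\{\frac{\beta^2 B}{\beta\mu - 1},\ (\gamma+1)\Delta_1\right\}\), since
\[
\begin{align*}
\Delta_{t+1} &\leq \left(1 - \frac{\beta\mu}{t+\gamma}\right)\frac{v}{t+\gamma} + \frac{\beta^2 B}{(t+\gamma)^2}\\
&= \frac{t+\gamma-1}{(t+\gamma)^2}\,v + \left[\frac{\beta^2 B}{(t+\gamma)^2} - \frac{\beta\mu-1}{(t+\gamma)^2}\,v\right]\\
&\leq \frac{v}{t+\gamma+1},
\end{align*}
\]
where the bracket is nonpositive by the choice of \(v\), and \((t+\gamma-1)(t+\gamma+1) \leq (t+\gamma)^2\).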
References
- [1] Francis Bach, *Statistical Machine Learning and Convex Optimization*.