Decentralized Stochastic Proximal Gradient Descent with Variance Reduction over Time-varying Networks

Xuanjie Li, Yuedong Xu, Jessie Hui Wang, Xin Wang and John C. S. Lui

January 25, 2022

arXiv:2112.10389v2 [cs.LG] 23 Jan 2022

Xuanjie Li and Yuedong Xu are with the School of Information Science and Technology, Fudan University, Shanghai 200237, China (e-mail: {lixuanjie20, ydxu}@fudan.edu.cn). J. H. Wang is with the Institute for Network Sciences and Cyberspace, Tsinghua University, Beijing 100084, China, and also with the Beijing National Research Center for Information Science and Technology, Beijing 100084, China (e-mail: jessiewang@tsinghua.edu.cn). Xin Wang is with the Key Laboratory of EMW Information (MoE), Department of Communication Science and Engineering, Fudan University, Shanghai 200433, China (e-mail: xwang11@fudan.edu.cn). John C. S. Lui is with the Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong (e-mail: cslui@cse.cuhk.edu.hk).

Abstract

In decentralized learning, a network of nodes cooperates to minimize an overall objective function that is usually the finite sum of their local objectives, and incorporates a non-smooth regularization term for better generalization ability. The decentralized stochastic proximal gradient (DSPG) method is commonly used to train this type of learning model, but its convergence rate is retarded by the variance of the stochastic gradients. In this paper, we propose a novel algorithm, namely DPSVRG, to accelerate decentralized training by leveraging the variance reduction technique. The basic idea is to introduce an estimator in each node, which tracks the local full gradient periodically, to correct the stochastic gradient at each iteration. By transforming our decentralized algorithm into a centralized inexact proximal gradient algorithm with variance reduction, and controlling the bounds of the error sequences, we prove that DPSVRG converges at the rate $O(1/T)$ for general convex objectives plus a non-smooth term, with $T$ as the number of iterations, while DSPG converges at the rate $O(1/\sqrt{T})$. Our experiments on different applications, network topologies and learning models demonstrate that DPSVRG converges much faster than DSPG, and the loss function of DPSVRG decreases smoothly along the training epochs.

1 Introduction

Decentralized algorithms to solve finite-sum minimization problems are crucial for training machine learning models where data samples are distributed across a network of nodes. Each node communicates with its one-hop neighbors, instead of sending information to a centralized server. The driving forces toward decentralized machine learning are twofold. One is the ever-increasing privacy concern [1-4], especially when personal data is collected by smartphones, cameras and wearable devices. In fear of privacy leakage, a node is inclined to keep and compute on the raw data locally, and to communicate only with other trustworthy nodes. The other is the expensive communication. Traditional distributed training with a centralized parameter server requires that all nodes push gradients and pull parameters, so the ingress and egress links of the server can easily throttle the traffic. By removing this server and balancing the communications in a network, one can reduce the wall-clock time of model training several fold [5].

Consensus-based gradient descent methods [6-9] are widely used in decentralized learning because of their efficacy and simplicity. Each node computes the local gradient to update its parameter with the local data, exchanges this parameter with its neighbors, and then calculates the weighted average as the start of the next training round. To avoid overfitting the data, $\ell_1$ or trace norms are introduced as an additional penalty to the original loss function, yet the new loss function is non-smooth at kinks. Decentralized Proximal Gradient (DPG) [10] leverages a proximal operator to cope with the non-differentiable part. Its stochastic version, Decentralized Stochastic Proximal Gradient (DSPG) [11], reduces the per-iteration computation complexity by using a stochastic gradient rather than the full gradient.
Stochastic gradient methods tend to have potentially large variance, and their performance relies on the tuning of a decaying learning rate sequence. Several variance reduction algorithms have been proposed to solve this problem in the past decade, e.g. SAGA [12], SVRG [13], SCSG [14] and SARAH [15]. Variance reduction is deemed a breakthrough in first-order optimization that accelerates Stochastic Gradient Descent (SGD) to a linear convergence rate under smooth and strongly convex conditions. The variance reduction technique is even more imperative in decentralized machine learning [16] for two reasons. Firstly, local averaging is inefficient at mitigating the noise of the models in decentralized topologies. In light of the limited number of local models available at a node, the averaged model has a much larger variance compared to the centralized setting. Secondly, the intermittent network connectivity in the real world brings temporal variance to the stochastic gradients. Hence, the underlying network topology is time-varying, which introduces further randomness into the estimation of local gradients over time. Owing to the above considerations, variance reduction is widely used in decentralized learning, including DSA [17], Network-SVRG/SARAH [18] and GT-SVRG [19].

Our work also considers time-varying networks, where the link connectivity changes over time. Time-varying networks are ubiquitous in daily life [20]. For instance, a transmission pair is interrupted if one of the nodes moves out of the mutual transmission range in mobile networks. Besides, only a set of non-interfering wireless links can communicate simultaneously, and different sets of links have to be activated in a time-division mode. Time-varying graphs not only allow a dynamic topology, but also are not guaranteed to be connected all the time. Representative works on convex problems include DIGing [21], PANDA [22] and the time-varying AB/push-pull method [23].

In this paper, we propose the Decentralized Proximal Stochastic Variance Reduced Gradient descent (DPSVRG) algorithm to address a fundamental question: can stochastic decentralized learning over temporally changing networks with a general convex but non-smooth loss function be made faster and more robust? DPSVRG introduces a new variable for each node that helps to upper bound the local gradient variance, and updates this variable periodically. During the training, this bound decreases continually and the local gradient variance vanishes as well. In addition, DPSVRG leverages multi-consensus to further speed up convergence, which is applicable to both static and time-varying networks.
Up to now, the only decentralized stochastic proximal algorithm [24] considers static graphs and strongly convex global objectives instead. We rigorously prove that the convergence rate of DPSVRG is $O(1/T)$ for general convex objective functions, in contrast to $O(1/\sqrt{T})$ for DSPG without variance reduction, where $T$ refers to the number of iterations. We reformulate DPSVRG as a centralized inexact proximal gradient algorithm with variance reduction, namely Inexact Prox-SVRG, so that both of them have the same convergence rate. In this process, we delicately transform the parameter differences caused by the decentralized environment into two errors in Inexact Prox-SVRG: a gradient error and a proximal error. By controlling these two errors under some mild conditions, we first prove the $O(1/T)$ convergence rate for Inexact Prox-SVRG with a general convex objective function plus a non-smooth term. We subsequently prove that DPSVRG satisfies these conditions as well and achieves the same convergence rate. The algorithm in [25] is similar to our Inexact Prox-SVRG, but it only considers the proximal errors. The gradient error analysis pertinent to decentralized learning is of equal significance and demands a different algorithm and proof. DPSVRG and the associated Inexact Prox-SVRG are based on general convex functions, while most SVRG-related studies [26-28] target strongly convex ones; in other words, our assumption is weaker and DPSVRG has a broader applicability.

DPSVRG is implemented and evaluated on a testbed with eight nodes. Taking DSPG as our baseline, we conduct comprehensive experiments on four datasets with logistic regression plus an $\ell_1$ regularizer. Experimental results show that DPSVRG has a much faster convergence rate and achieves higher accuracy within the same number of training epochs. We separately evaluate our variance reduction and multi-consensus techniques to show their impact on the convergence properties. DPSVRG also outperforms DSPG with different regularization coefficients and network topologies, thereby exhibiting its robustness.

The remainder of this paper is structured as follows. Section 2 presents the mathematical model. We propose the novel DPSVRG algorithm in Section 3. Section 4 proves the sublinear convergence of the proposed algorithm and Section 5 provides experimental results to validate its efficacy. Section 6 concludes this work.

2 Mathematical Model

In this section, we present the network model and basic assumptions on decentralized machine learning with a class of non-smooth objective functions.

2.1 Network Model

Time-varying graph. Consider an undirected time-varying graph sequence $\mathcal{G}^t = \{\mathcal{V}, \mathcal{E}^t\}$, where $\mathcal{V}$ is the set of nodes with $|\mathcal{V}| = m$ and $\mathcal{E}^t$ is the set of undirected edges at slot $t$, which changes over time. The pairs of nodes connected by an edge are able to communicate with each other. We denote by $N_i^t$ the set of neighbors of node $i$ at time $t$, including node $i$ itself, i.e. $N_i^t = \{j \mid (i, j) \in \mathcal{E}^t\} \cup \{i\}$. The connectivity of graph $\mathcal{G}^t$ plays a key role in decentralized learning, and here we first state a couple of well-accepted assumptions.

Assumption 1 (Graph Connectivity) The graph sequence $\mathcal{G}^t$ is said to be $b$-connected if the node set $\mathcal{V}$ with the union of every $b$ consecutive edge sets $\bigcup_{t=jb}^{(j+1)b-1} \mathcal{E}^t$ is connected, i.e., each node can reach every other node, at every round $j = 0, 1, 2, \cdots$.
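To make the $b$-connectivity condition concrete, the following is a minimal sketch that checks Assumption 1 on a finite prefix of a graph sequence. The helper name is_b_connected and the example edge sets are our own illustration, and the networkx library is assumed to be available.

```python
import networkx as nx

def is_b_connected(edge_sets, num_nodes, b):
    # For every round j, the union of edge sets jb, ..., (j+1)b-1
    # must form a connected graph over all num_nodes nodes.
    for j in range(len(edge_sets) // b):
        union = nx.Graph()
        union.add_nodes_from(range(num_nodes))
        for t in range(j * b, (j + 1) * b):
            union.add_edges_from(edge_sets[t])
        if not nx.is_connected(union):
            return False
    return True

# A 3-node network whose links alternate between two slots: no single
# slot is connected on its own, yet every window of b = 2 slots is.
slots = [[(0, 1)], [(1, 2), (2, 0)]] * 4
print(is_b_connected(slots, num_nodes=3, b=2))  # True
```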
Assumption 1 guarantees that the information of one node can reach any other node in finite time, which is weaker than an assumption of persistent connectivity. We define a non-negative time-varying matrix $W^t \in \mathbb{R}^{m \times m}$, where $w_{ij}^t = 0$ implies that node $i$ and node $j$ are not neighbors at time $t$. We make the following assumption on this matrix.

Assumption 2 (Doubly Stochastic Matrix) Each matrix $W^t$ is doubly stochastic, i.e., it satisfies $W^t \mathbf{1} = \mathbf{1}$ and $(W^t)^T \mathbf{1} = \mathbf{1}$, with $\mathbf{1} = [1, ..., 1]^T$. Any non-zero entry in $W^t$ is no less than a positive constant $\eta$, i.e.

$$w_{ij}^t \begin{cases} \geq \eta, & \text{if } w_{ij}^t > 0 \\ = 0, & \text{otherwise.} \end{cases}$$

Matrices satisfying Assumption 2 are widely utilized in decentralized learning as communication matrices to capture how the parameters from different nodes are aggregated. Specifically, after receiving the parameters from its neighbors, node $i$ uses the weighted sum, specified by $W^t$, to update its out-of-date parameter. The weight threshold in Assumption 2 therefore guarantees that each node always exerts a non-negligible influence on its neighbors. In this way, a node can acquire sufficient information from its neighbors and minimize the global objective cooperatively.

Given a sequence of doubly stochastic matrices spanning from $W^l$ to $W^g$ with $l \leq g$, we denote by $\Phi(l, g)$ the product

$$\Phi(l, g) = W^g W^{g-1} \cdots W^l.$$

Let $\phi_{ij}^{(l,g)}$ be the entry in the $i$-th row and the $j$-th column of $\Phi(l, g)$. Asymptotically, all the elements $\phi_{ij}^{(l,g)}$ will be distributed in the vicinity of $\frac{1}{m}$ as $(g - l) \to \infty$. Formally, we introduce this known result below.

Lemma 1 [6] Given a $b$-connected graph sequence $(\mathcal{V}, \mathcal{E}^t)$ with $|\mathcal{V}| = m$ and a sequence of doubly stochastic matrices $\{A^t\}$, if each matrix $A^t$ satisfies $|\sigma_2(A^t)| < 1$, where $\sigma_2(A^t)$ is the second largest eigenvalue of $A^t$, then the following inequality holds

$$\left| \phi_{ij}^{(l,g)} - \frac{1}{m} \right| \leq \Gamma \gamma^{g-l}, \qquad (1)$$

for all $g, l$ with $g \geq l$. Here $\Gamma = 2(1 + \eta^{-b_0})$, $\gamma = 1 - \eta^{b_0}$ and $b_0 = (m-1)b$.

The above lemma states that all elements of $\Phi(l, g)$ converge to $\frac{1}{m}$ as $(g - l) \to \infty$. With this feature, the matrix $\Phi$ is utilized as the aggregated communication matrix in our subsequent multi-consensus setting, in which the nodes perform multiple communications with changeable matrices $W^t$ within one training iteration.

2.2 Decentralized Optimization Model

Our machine learning model possesses two prominent features that constitute the key novelty of our work.

(1) Handling non-smoothness. Regularization is a de facto standard to resolve the overfitting problem in many regression and classification problems. The basic idea of regularization is to introduce an additional penalty term in the loss function, in which the $\ell_1$ norm or the trace norm of the weights is commonly used. Since these norms are non-differentiable at kink(s), the proximal gradient descent method is usually adopted to train such learning models.

(2) Decentralization. Due to privacy concerns, each node is usually reluctant to share its raw data with any other node. An alternative approach is to perform the training locally and distributedly while sharing the model parameters. When communication between a subset of nodes is not allowed, either because of restricted network connections or social distrust, distributed learning degenerates to decentralized learning on a graph in which no node plays the role of a parameter server. In our decentralized learning, all $m$ nodes in $\mathcal{G}$ jointly solve a global optimization problem.
Each node $i$ possesses a local dataset $D_i$ with $n_i$ data records and a copy $x_i$ of the variable $x \in \mathbb{R}^d$. The local dataset and parameter construct a local loss function $F_i$, and all the nodes cooperatively minimize the average of all the loss functions,

$$\min_{x \in \mathbb{R}^d} F(x) = \frac{1}{m} \sum_{i=1}^{m} F_i(x), \qquad \text{(P1)}$$

where $F_i(x)$ is a composite function, i.e.,

$$F_i(x) = \frac{1}{n_i} \sum_{j=1}^{n_i} f(x; \zeta_i^j) + h(x),$$

and $\zeta_i^j$ denotes the data sampled from the local training dataset $D_i$. Without loss of generality, $f(\cdot)$ refers to the loss of the training objective, and $h(\cdot)$ symbolizes any regularizer that is non-smooth (e.g. the $\ell_1$ norm). To facilitate qualitative studies, we make the following assumptions. Unless otherwise specified, we use $\|\cdot\|$ and $\langle \cdot, \cdot \rangle$ to denote the $\ell_2$ norm of vectors and the inner product of two vectors, respectively, throughout this paper.

Assumption 3 The function $f$ is convex and differentiable, with an $L$-Lipschitz continuous gradient on the open set $\mathbb{R}^d$, i.e., for all $x, y \in \mathbb{R}^d$,

$$f(y; \cdot) \geq f(x; \cdot) + \langle \nabla f(x; \cdot), y - x \rangle, \qquad (2)$$

$$\|\nabla f(y; \zeta_i^l) - \nabla f(x; \zeta_i^l)\| \leq L \|y - x\|, \qquad (3)$$

where $\nabla$ is the derivative operator.

Assumption 4 The function $h$ is lower semi-continuous and convex, and its effective domain $\mathrm{dom}\, h = \{x \in \mathbb{R}^d \mid h(x) < +\infty\}$ is closed. Similarly, for all $x \in \mathrm{dom}\, h$ and $y \in \mathbb{R}^d$, we have

$$h(y) \geq h(x) + \langle \xi, y - x \rangle, \quad \forall \xi \in \partial h(x), \qquad (4)$$

where $\partial$ denotes the subgradient operator.

For the proper convex function $h(\cdot)$, the closedness and the lower semi-continuity have been proved to be equivalent [29]. With Assumption 4, $h(\cdot)$ is a closed function within its effective domain, and we will show that this facilitates the proximal operator in tackling the non-smooth problem.

Example. To better interpret the above assumptions, let us present a concrete example. Consider the decentralized least-squares method plus an $\ell_1$ norm regularizer in regression or classification problems. The loss function takes the form $f(w) = \sum_{i=1}^{N} \|a_i^T w - b_i\|_2^2$ on a dataset consisting of $N$ points $(a_i, b_i)$. The regularization term is $h(w) = \|w\|_1 = \sum_j |w_j|$. It is easy to check that $f$ is convex and $L$-smooth with $L = 2 \max_{i \in \{1,...,N\}} \|a_i a_i^T\|$, and $h$ is lower semi-continuous and convex.

3 DPSVRG Algorithm

In this section, we propose a novel DSPG algorithm with variance reduction in decentralized networks. The procedures of updating and transmitting parameters are elaborated.

3.1 Variance Reduction for Decentralized SGD

In most decentralized algorithms, each node makes use of local data to update the local parameter so as to approximate the local minimum. Considering the high computation cost of computing the local full gradient, stochastic gradient descent is more practical: it calculates an approximation of the actual local gradient from a randomly selected subset of the data. We begin with a smooth objective function and then extend it to a non-smooth function shortly after.

At each iteration $k$, node $i$ first samples a batch of data $B_i \subset D_i$. The average gradient $v_i^{(k)}$ of the objective function $f(x_i^{(k-1)}; \cdot)$ with regard to the data batch $B_i$ is taken as the estimate of the actual local full gradient:

$$v_i^{(k)} = \frac{1}{|B_i|} \sum_{\zeta_i^j \in B_i} \nabla f(x_i^{(k-1)}; \zeta_i^j). \qquad (5)$$

The standard update formula of elementary SGD is

$$x_i^{(k)} = x_i^{(k-1)} - \alpha v_i^{(k)}, \qquad (6)$$

where $\alpha$ denotes the step size. For ease of notation, we rewrite $\frac{1}{|B_i|} \sum_{\zeta_i^j \in B_i} \nabla f(\cdot; \zeta_i^j)$ as $\nabla f_i^{B_i}(\cdot)$, and the local gradient on the dataset $D_i$ is conveniently expressed as $\nabla f_i(\cdot)$. It is worth noting that $v_i^{(k)}$ is an unbiased estimator of $\nabla f_i(x_i^{(k-1)})$, i.e., $\mathbb{E}_{B_i}[v_i^{(k)}] = \nabla f_i(x_i^{(k-1)})$.
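As a quick numerical illustration of this unbiasedness, and of the gradient variance discussed next, the following sketch compares minibatch gradients of a synthetic least-squares loss against the full gradient; the data, batch size and sample counts are our own toy choices rather than anything from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)
x = rng.normal(size=d)

# Full gradient of f(x) = (1/n) * sum_i (a_i^T x - b_i)^2.
full_grad = 2 * A.T @ (A @ x - b) / n

def batch_grad(idx):
    # Minibatch estimate of the full gradient on the sampled rows.
    return 2 * A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)

draws = np.stack([batch_grad(rng.choice(n, size=10)) for _ in range(5000)])
print(np.linalg.norm(draws.mean(axis=0) - full_grad))  # near 0: unbiased
print(draws.std(axis=0).max())                         # clearly nonzero: variance
```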
Root cause of variance. Although $v_i^{(k)}$ is an unbiased estimator of the local gradient, it reflects an inaccurate direction deviating from the local full gradient at some iterations, thus increasing the variance, which may impair the convergence rate. This motivates us to study how to accelerate SGD by mitigating the variance of $v_i^{(k)}$.

Suppressing variance. We describe the basic idea of the stochastic variance reduced gradient (SVRG) approach and generalize it to a decentralized network. In SVRG, the model update is conducted in two loops. The outer loop captures the full local gradient on all the local data, while the inner loop computes the gradient with a random batch and utilizes the full gradient to correct the batch gradient for the purpose of suppressing its variance.

Why does the correction work? The original idea of variance reduction stems from basic statistics. Consider three random variables $X$, $Y$ and $Z$ satisfying $Z = (X - Y) + \mathbb{E}Y$. Obviously $Z$ is an unbiased estimator of $\mathbb{E}X$, and the variance $\mathrm{Var}(Z)$ is smaller than $\mathrm{Var}(X)$ as long as $X$ and $Y$ exhibit a strong positive correlation. In SVRG, $X$ represents the local gradient sampled on a batch of data, i.e. $\nabla f_i^{B_i}(x_i^{(k-1,s)})$, with $k$ and $s$ indicating the inner and outer steps respectively. Given the parameter $\tilde{x}_i^{s-1}$ at the $s$-th outer round, the local full gradient is $\nabla f_i(\tilde{x}_i^{s-1})$ and the stochastic gradient on the batch $B_i$ is $\nabla f_i^{B_i}(\tilde{x}_i^{s-1})$. Here, $Y$ is equivalent to $\nabla f_i^{B_i}(\tilde{x}_i^{s-1})$, and $\mathbb{E}Y$ is $\nabla f_i(\tilde{x}_i^{s-1})$. Although $X$ and $Y$ are different random variables, they are sampled on the same batch of data, and the difference between $x_i^{(k-1,s)}$ and $\tilde{x}_i^{s-1}$ is relatively small, so that they are highly correlated. When introducing the correction $(Y - \mathbb{E}Y)$ in the inner-loop local update, the variance of the local gradient estimator is abated. To summarize, the update steps are given below:

$$v_i^{(k,s)} = \nabla f_i^{B_i}(x_i^{(k-1,s)}) - \left( \nabla f_i^{B_i}(\tilde{x}_i^{s-1}) - \nabla f_i(\tilde{x}_i^{s-1}) \right),$$

$$x_i^{(k,s)} \leftarrow x_i^{(k-1,s)} - \alpha v_i^{(k,s)}. \qquad \text{(Gradient Step)}$$

Properties of SVRG. Due to this variance-reduction mechanism, the convergence rate of DPSVRG is independent of the variance of the local stochastic gradients. The improvement in convergence rate is more evident when the sampled batch $B_i$ is relatively small, which leads to a large gradient variance.

3.2 Decentralized Model Fusion

In our decentralized network, each node $i \in \mathcal{V}$ maintains its local copy $x_i$ and executes the gradient step as mentioned before. Owing to the data disparity across distributed nodes, the parameters $x_i$ descend in diverse directions and reach different values. Our goal is to achieve a consensus of all the local parameters that attains a global optimum $x^*$. To meet this objective, each node communicates with its direct neighbors to exchange updated parameters, and takes the weighted average over its neighborhood prescribed by a time-varying matrix. This process is known as gossip [30]. Given a doubly stochastic matrix $W^{(k,s)}$ at iteration $(k, s)$, a single step of gossip can be stated as

$$x_i^{(k,s)} \leftarrow \sum_{j \in N_i} W_{ij}^{(k,s)} x_j^{(k,s)}, \quad \forall i \in \mathcal{V}. \qquad (7)$$

Multi-consensus. When the communication matrix $W$ is fixed over time, the convergence rate of decentralized learning depends largely on the spectral gap of $W$, and it improves with more communication steps within an iteration round [5, 31, 32]. This insight motivates us to utilize multi-consensus, i.e., the nodes perform the gossip step $k$ times consecutively, where $k$ denotes the inner step; a small numerical sketch follows.
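The following sketch illustrates this effect numerically on a fixed 8-node ring with Metropolis weights, which form a doubly stochastic matrix satisfying Assumption 2; the topology and weights are illustrative assumptions, not the paper's testbed configuration.

```python
import numpy as np

m = 8
W = np.zeros((m, m))
for i in range(m):
    for j in ((i - 1) % m, (i + 1) % m):  # ring neighbors
        W[i, j] = 1.0 / 3.0               # Metropolis weight 1/(1 + max degree)
    W[i, i] = 1.0 - W[i].sum()            # self-weight keeps each row summing to 1

x = np.random.default_rng(1).normal(size=m)  # one scalar parameter per node
for k in (1, 2, 5, 10):
    mixed = np.linalg.matrix_power(W, k) @ x
    # Disagreement with the network average shrinks geometrically in k.
    print(k, np.abs(mixed - x.mean()).max())
```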
With multi-consensus, the consensus step at iteration $(k, s)$ can be presented as follows:

$$x_i^{(k,s)} = \sum_{j=1}^{m} \phi_{ij}^{(k,s)} x_j^{(k,s)}, \qquad \text{(Consensus Step)}$$

where $\phi_{ij}^{(k,s)}$ denotes the entry in the $i$-th row and the $j$-th column of a combination of $k$ different communication matrices, i.e., $\phi_{ij}^{(k,s)} = [W^{(k,s),1} W^{(k,s),2} \cdots W^{(k,s),k}]_{i,j}$. When $k$ is large, Lemma 1 indicates that each element of the matrix $\Phi(k, s)$ approaches $\frac{1}{m}$, which implies that the consensus step impels all the local parameters toward the average parameter. Therefore, our algorithm gets closer to its centralized counterpart as the iterations proceed, which retains a higher convergence rate than decentralized ones. Furthermore, we prove a linear convergence rate with regard to $s$ for our algorithm, which can compensate for the high communication cost introduced by multi-consensus. Indeed, we show that DPSVRG still converges faster than the baseline algorithm given the same number of communication rounds.

3.3 Proximal Operator

Proximal operator. Up to now, we have only focused on the differentiable part $f$ in (P1). A special operation, namely the proximal operator, is needed to handle the non-smooth function $h$. With the properties in Assumption 4, for a point $z \in \mathbb{R}^d$ and a scalar $\alpha$, the proximal operator is defined as

$$\mathrm{prox}_h^{\alpha}\{z\} = \arg\min_{y \in \mathrm{dom}\, h} \left\{ \frac{1}{2\alpha} \|y - z\|^2 + h(y) \right\}. \qquad (8)$$

As an essential part of proximal gradient descent, the proximal operator is applied after the primitive gradient step $z = x - \alpha \nabla f(x)$ to map the original point $x$ into a specific region. For example, by setting $h(y) = \|y\|_1$, Equation (8) yields the vector which is close to $z$ and has small absolute values.

Proximal mapping step. Motivated by the proximal gradient method, we apply the proximal mapping after the consensus step to deal with the non-differentiable part $h$. Node $i$ substitutes $z$ with the outcome of the consensus step $x_i^{(k,s)}$ and executes the following update procedure:

$$x_i^{(k,s)} \leftarrow \mathrm{prox}_h^{\alpha}\{x_i^{(k,s)}\}. \qquad \text{(Proximal Mapping Step)}$$

The following lemmas guarantee the availability of the proximal operator and present several useful properties.

Lemma 2 [10] (First Prox Theorem) Let $h$ be a proper closed and convex function; then $\mathrm{prox}_h$ is a singleton for any $x \in \mathrm{dom}\, h$.

Lemma 3 (Second Prox Theorem) Let $h$ be a proper closed and convex function; then for any $y, z \in \mathrm{dom}\, h$, the following statements are equivalent:
(1) $y = \mathrm{prox}_h^{\alpha}\{z\}$;
(2) $\frac{1}{\alpha}(z - y) \in \partial h(y)$;
(3) $\frac{1}{\alpha} \langle z - y, x - y \rangle \leq h(x) - h(y)$, $\forall x \in \mathrm{dom}\, h$.

Lemma 4 Let $h$ be a proper closed and convex function. For any two points $z_1, z_2 \in \mathrm{dom}\, h$, we have

$$\|\mathrm{prox}_h^{\alpha}\{z_1\} - \mathrm{prox}_h^{\alpha}\{z_2\}\| \leq \|z_1 - z_2\|.$$

The closedness of $\mathrm{dom}\, h$ guarantees the existence of the proximal operator solution, and the First Prox Theorem ensures the uniqueness of this solution. The Second Prox Theorem exhibits some important properties of $h$ which will be utilized in our convergence analysis.

Practicability of the proximal operator. Next, we check whether closed-form expressions of the proximal operators can be obtained and applied in general machine learning tasks. In fact, closed-form expressions of proximal operators are available in many cases of interest. For example, the closed-form expression of the proximal operator of the $\ell_1$ norm is given by

$$\left[ \mathrm{prox}_{\lambda \|\cdot\|_1}^{\alpha}\{y\} \right]_i = \begin{cases} y_i - \alpha\lambda, & y_i > \alpha\lambda \\ 0, & |y_i| \leq \alpha\lambda \\ y_i + \alpha\lambda, & y_i < -\alpha\lambda, \end{cases}$$

where $[\cdot]_i$ denotes the $i$-th element of the vector. Therefore, our proposed algorithm can be generalized to a wide range of machine learning scenarios.
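As a sanity check of this closed form, here is a minimal soft-thresholding implementation; the function name prox_l1 and the test vector are ours.

```python
import numpy as np

def prox_l1(z, alpha, lam):
    # Proximal operator of h(y) = lam * ||y||_1 with step size alpha:
    # elementwise soft-thresholding at the level alpha * lam.
    t = alpha * lam
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

z = np.array([0.8, -0.05, 0.02, -1.3])
print(prox_l1(z, alpha=1.0, lam=0.1))  # [ 0.7 -0.   0.  -1.2]
```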
3.4 Algorithm Design

Bringing the gradient step, the consensus step and the proximal mapping step together, we present the sketch of our DPSVRG algorithm in Algorithm 1, followed by a short simulation sketch. For ease of explanation, we enumerate the extended operations beyond the classical SVRG algorithm [13]:

• We adopt $q_i^{(k,s)}$ and $\hat{q}_i^{(k,s)}$ to denote the intermediate results of the gradient step and the consensus step, respectively.

• Each node, indexed by $i$, only samples a single data record instead of a batch in every training round. The notation $\nabla f_i^{l_i}(x_i)$ stands for the gradient of the function $f$ at $x_i$ with the datum $l_i$.

• The number of inner steps $K_s$ increases exponentially with base $\beta$ and exponent $s$.

• Except for the initial state, $\tilde{x}_i^s$ always differs from $x_i^{(0,s+1)}$.

Algorithm 1 DPSVRG at node $i$
1:  Initiate $x_{init}$
2:  $\tilde{x}_i^0 = x_{init}$, $x_i^{(0,1)} = x_{init}$
3:  for $s = 1, 2, \ldots, S$ do
4:      $K_s = \lceil \beta^s n_0 \rceil$
5:      compute $\nabla f_i(\tilde{x}_i^{s-1})$
6:      for $k = 1, 2, \ldots, K_s$ do
7:          randomly pick $l_i \in \{1, 2, \ldots, n_i\}$
8:          $v_i^{(k,s)} = \nabla f_i^{l_i}(x_i^{(k-1,s)}) - \nabla f_i^{l_i}(\tilde{x}_i^{s-1}) + \nabla f_i(\tilde{x}_i^{s-1})$
9:          $q_i^{(k,s)} = x_i^{(k-1,s)} - \alpha v_i^{(k,s)}$
10:         $\hat{q}_i^{(k,s)} = \sum_{j=1}^{m} \phi_{i,j}^{(k,s)} q_j^{(k,s)}$
11:         $x_i^{(k,s)} = \mathrm{prox}_h^{\alpha}\{\hat{q}_i^{(k,s)}\}$
12:     end for
13:     $\tilde{x}_i^s = \frac{1}{K_s}\sum_{k=1}^{K_s} x_i^{(k,s)}$
14:     $x_i^{(0,s+1)} = x_i^{(K_s,s)}$
15: end for
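For concreteness, here is a minimal single-process simulation of Algorithm 1 that stacks the $m$ local copies as rows of a matrix. The callables stoch_grad, full_grad, mix and prox are placeholders for the per-node gradients, the $k$-round gossip of line 10 and the proximal map of line 11; the whole sketch is our own illustration under these assumptions, not the paper's testbed implementation.

```python
import numpy as np

def dpsvrg(stoch_grad, full_grad, n_i, mix, prox, x_init, m, alpha,
           n0, beta, S, rng=np.random.default_rng(0)):
    # stoch_grad(i, x, l): grad f(x; zeta_i^l); full_grad(i, x): grad f_i(x);
    # n_i[i]: local sample count; mix(Q, k): k gossip rounds on stacked rows;
    # prox(Z, alpha): proximal map of h applied row-wise.
    X = np.tile(x_init, (m, 1))   # local copies x_i (line 2)
    X_tilde = X.copy()
    for s in range(1, S + 1):
        K = int(np.ceil(beta ** s * n0))                             # line 4
        mu = np.stack([full_grad(i, X_tilde[i]) for i in range(m)])  # line 5
        inner_avg = np.zeros_like(X)
        for k in range(1, K + 1):
            Q = np.empty_like(X)
            for i in range(m):
                l = rng.integers(n_i[i])                             # line 7
                v = (stoch_grad(i, X[i], l)
                     - stoch_grad(i, X_tilde[i], l) + mu[i])         # line 8
                Q[i] = X[i] - alpha * v                              # line 9
            X = prox(mix(Q, k), alpha)                               # lines 10-11
            inner_avg += X / K
        X_tilde = inner_avg                                          # line 13
    return X.mean(axis=0)
```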
Qualitatively analyzing the convergence of DPSVRG is very challenging. To address this challenge, we propose to transform it into a centralized inexact proximal gradient method with variance reduction. Inexact proximal gradient methods introduce an error term into the proximal operator and attain an inexact version of the proximal operator, defined below. For a function $h$ satisfying Assumption 4, a point $z \in \mathbb{R}^d$ and a scalar $\alpha$, we claim $x = \mathrm{prox}_{h,\varepsilon}^{\alpha}\{z\}$ if there exists a scalar $\varepsilon$ such that

$$\frac{1}{2\alpha}\|x - z\|^2 + h(x) \leq \min_{y \in \mathrm{dom}\, h}\left\{\frac{1}{2\alpha}\|y - z\|^2 + h(y)\right\} + \varepsilon. \qquad (9)$$

The error term $\varepsilon$ indicates the difference in function values between the inexact proximal operation solution and the exact solution.

Algorithm reformulation. We aim to construct a centralized algorithm, assuming that there is a virtual central node which stores all the training data $D = \{D_i\}$, $i \in [1, m]$, and takes the average $\bar{x} = \frac{1}{m}\sum_{i=1}^{m}x_i$ as its parameter. This virtual node tracks the average of the parameters of the distributed nodes in Algorithm 1 at each training round by executing the centralized Inexact Prox-SVRG algorithm in Algorithm 2. Accordingly, the virtual node solves (P1) in a different form:

$$\min_{x \in \mathbb{R}^d} F(x) = \underbrace{\frac{1}{n}\sum_{j=1}^{n} f(x; \zeta^j)}_{f(x)} + h(x), \qquad \text{(P2)}$$

where $\zeta^j \in D$. In Algorithm 2, the notation $l_{in}$ represents the union set of the samples $l_i$ picked in Algorithm 1, i.e., $l_{in} = \{l_1, ..., l_m\}$, and $f^{l_{in}}(x) = \frac{1}{m}\sum_{j \in l_{in}} f(x; \zeta^j)$. Therefore, we can set up a relationship of function values between the decentralized and centralized algorithms, i.e., $f^{l_{in}}(x) = \frac{1}{m}\sum_{i=1}^{m} f_i^{l_i}(x)$. The terms $e^{(k,s)}$ and $\varepsilon^{(k,s)}$ denote the errors of the gradient computation and the proximal operator, respectively. The variable $\tilde{x}$ is similar to $\tilde{x}_i$ in Algorithm 1, introduced to reduce the gradient variance. Theorem 1 justifies this transformation and guarantees the equivalence of the optimal values. The detailed proof of Theorem 1 is given in Appendix A.

Theorem 1 Algorithm 1 and Algorithm 2 have the same optimal solution if we set (in Algorithm 2)

$$e^{(k,s)} = \frac{1}{m}\sum_{i=1}^{m}\Big[\nabla f_i^{l_i}(x_i^{(k-1,s)}) - \nabla f_i^{l_i}(\bar{x}^{(k-1,s)}) + \nabla f_i^{l_i}(\tilde{x}^{s-1}) - \nabla f_i^{l_i}(\tilde{x}_i^{s-1}) + \nabla f_i(\tilde{x}_i^{s-1}) - \nabla f_i(\tilde{x}^{s-1})\Big], \qquad \text{(10a)}$$

$$\varepsilon^{(k,s)} = \frac{1}{2\alpha}\|\bar{x}^{(k,s)} - y^{(k,s)}\|^2 + \Big\langle \bar{x}^{(k,s)} - y^{(k,s)}, \frac{1}{\alpha}\big(y^{(k,s)} - \bar{q}^{(k,s)}\big) + p \Big\rangle, \qquad \text{(10b)}$$

with $p \in \partial h(\bar{x}^{(k,s)})$, $\bar{q}^{(k,s)} = \frac{1}{m}\sum_{i=1}^{m}\hat{q}_i^{(k,s)} = \frac{1}{m}\sum_{i=1}^{m}q_i^{(k,s)}$ and $y^{(k,s)} = \mathrm{prox}_h^{\alpha}\{\bar{q}^{(k,s)}\}$.

Algorithm 2 Inexact Prox-SVRG
1:  Initiate $x_{init}$
2:  $\tilde{x}^0 = x_{init}$, $x^{(0,1)} = x_{init}$
3:  for $s = 1, 2, 3, \ldots, S$ do
4:      $K_s = \lceil \beta^s n_0 \rceil$
5:      for $k = 1, 2, \ldots, K_s$ do
6:          pick $l_{in} = \{l_1, ..., l_m\}$
7:          $v^{(k,s)} = \nabla f^{l_{in}}(x^{(k-1,s)}) - \nabla f^{l_{in}}(\tilde{x}^{s-1}) + \nabla f(\tilde{x}^{s-1})$
8:          $q^{(k,s)} = x^{(k-1,s)} - \alpha(v^{(k,s)} + e^{(k,s)})$
9:          $x^{(k,s)} = \mathrm{prox}_{h,\varepsilon^{(k,s)}}^{\alpha}\{q^{(k,s)}\}$
10:     end for
11:     $\tilde{x}^s = \frac{1}{K_s}\sum_{k=1}^{K_s} x^{(k,s)}$
12:     $x^{(0,s+1)} = x^{(K_s,s)}$
13: end for

The importance of Theorem 1 lies in the transformation of the complex, decentralized DPSVRG into the Inexact Prox-SVRG algorithm without a network structure. The error sequences $e^{(k,s)}$ and $\varepsilon^{(k,s)}$ in Algorithm 2 capture the gap between the decentralized and centralized proximal gradient algorithms. More specifically, both errors are generated from the dissensus of the local parameters of the distributed nodes, as follows.

• Gradient error ($e^{(k,s)}$): the difference between the average of the distributed gradients and the gradient of the average parameter, i.e., $\frac{1}{m}\sum_{i=1}^{m}v_i^{(k,s)} - v^{(k,s)}$. This deviation decreases as the local copies $x_i^{(k,s)}$ come close to each other. Furthermore, when consensus is achieved, i.e., $x_i = x_j$ and $\tilde{x}_i = \tilde{x}_j$, $\forall i \neq j$, $e^{(k,s)}$ becomes zero and the gradients utilized in the two algorithms are the same from a global view even without the adjustment $e^{(k,s)}$.

• Proximal mapping error ($\varepsilon^{(k,s)}$): mainly describes the discrepancy between the average of the distributed proximal mapping outcomes $\frac{1}{m}\sum_{i=1}^{m}\mathrm{prox}\{q_i^{(k,s)}\}$ and the proximal mapping of the global average parameter $\mathrm{prox}\{\frac{1}{m}\sum_{i=1}^{m}q_i^{(k,s)}\}$. Again, with the local parameters tending towards consensus, $q_i^{(k,s)}$ approaches $\bar{q}^{(k,s)}$ and $\varepsilon^{(k,s)}$ converges to zero.

To emphasize, the error terms are crucial to the convergence of the two algorithms. Moreover, the variance reduction technique renders the influence of the error terms on the upper bound greater than in the classical Inexact Proximal Gradient (IPG) algorithm [33], demanding new techniques to control these errors so as to guarantee the convergence of the two algorithms. In what follows, we show how variance reduction influences these two error terms and the convergence of our algorithm, and we probe into whether the error terms can be controlled so that DPSVRG achieves the same order of convergence rate as Inexact Prox-SVRG.

4 Convergence Analysis

4.1 Convergence of Inexact Prox-SVRG

Inexact Prox-SVRG assumes the existence of a virtual centralized node storing the entire dataset and performing all the computations to solve (P2). For convenience, we consider the situation in which a single data sample, indexed by $l \in [1, n]$, is used at each training round $(k, s)$. Denote by $\mathbb{E}_l$ the expectation over one data sample. To begin with, we introduce the following assumptions on the gradients and the aforementioned errors.

Assumption 5 The $\ell_2$ norm of the difference between $v^{(k,s)}$ and $\nabla f(x^{(k-1,s)})$ is bounded by a constant, i.e.

$$\mathbb{E}_l\|v^{(k,s)} - \nabla f(x^{(k-1,s)})\| \leq M, \quad \forall k, s, l.$$

Assumption 6 For any fixed $s$, $\mathbb{E}\|e^{(k,s)}\|$ and $\sqrt{\varepsilon^{(k,s)}}$ are summable sequences with regard to $k$, i.e.,

$$\sum_{k=1}^{\infty}\mathbb{E}\|e^{(k,s)}\| < \infty; \qquad \sum_{k=1}^{\infty}\sqrt{\varepsilon^{(k,s)}} < \infty. \qquad (11)$$
Assumption 5 bounds the error of the local gradient estimator $v^{(k,s)}$ by a constant $M$. Assumption 6 restricts the accumulated proximal error and the accumulated gradient error to be finite. Given the linkage between Inexact Prox-SVRG and DPSVRG, we prove the convergence of Inexact Prox-SVRG first.

Theorem 2 If Inexact Prox-SVRG satisfies Assumptions 3-6, and $\alpha < \frac{\delta}{L(4\delta+8)}$ with $\delta < 1$, then

$$\mathbb{E}[F(\tilde{x}^S) - F(x^*)] < \rho^S\left[\Big(\frac{\rho}{2n_0} + 1\Big)\mathbb{E}\big[F(x_{init}) - F(x^*)\big] + \frac{\mathbb{E}\|x_{init} - x^*\|^2}{n_0(2\alpha - 8\alpha^2 L)} + \frac{C}{n_0(\rho\beta - 1)}\right]$$

with a constant $C$ and $\rho = \frac{8\alpha L}{1 - 4\alpha L}$. Here $x^*$ refers to the optimal point of (P2) (or (P1)), and the notation $\mathbb{E}$ denotes the expectation over all the data sampled in the training process.

The sketch of proof. Due to the two-loop structure of variance reduction, the convergence analysis is carried out in two stages. The inner-loop analysis shows that the gap between the expected objective value and the optimum tends to decrease as the inner iterations proceed. The outer-loop analysis then demonstrates that this gap decreases to zero as the number of outer loops approaches infinity. The detailed proof is presented in the supplementary material.

Remark 1 As $\alpha < \frac{\delta}{L(4\delta+8)}$ and $\rho = \frac{8\alpha L}{1 - 4\alpha L}$, we have $\rho < \delta < 1$. During $S$ outer iterations, the expected distance from the current objective value to the global optimum decreases exponentially. However, if we take the total number of training rounds $T = \sum_{s=1}^{S}K_s$ into consideration, the convergence rate is $O(1/T)$, which has the same order as the inexact proximal gradient method utilizing the full gradient.

Before presenting the main proof of Theorem 2, we first show some necessary lemmas; their proofs are given in Appendix B. First, we define the $\varepsilon$-subdifferential of a convex function $h$ at $z$, denoted $\partial_\varepsilon h(z)$, as the set of vectors $p$ satisfying

$$h(y) \geq h(z) + \langle p, y - z \rangle - \varepsilon, \quad \forall y. \qquad (12)$$

Lemma 5 Let $h$ be a proper closed and convex function and $x = \mathrm{prox}_{h,\varepsilon}^{\alpha}\{z\}$. Then there exists $g$ such that

$$\frac{1}{\alpha}(z + g - x) \in \partial_\varepsilon h(x), \qquad \text{(13a)}$$

$$\|g\| \leq \sqrt{2\alpha\varepsilon}. \qquad \text{(13b)}$$

Lemma 6 Let $h$ be a proper closed and convex function; then for any two points $z_1, z_2 \in \mathrm{dom}\, h$,

$$\|\mathrm{prox}_{h,\varepsilon}^{\alpha}\{z_1\} - \mathrm{prox}_{h,\varepsilon}^{\alpha}\{z_2\}\| \leq \|z_1 - z_2\| + 3\sqrt{2\alpha\varepsilon}.$$

Lemma 7 The variance of the gradient estimator $v^{(k,s)}$ in Algorithm 2 can be bounded as follows:

$$\mathbb{E}_l\|v^{(k,s)} - \nabla f(x^{(k-1,s)})\|^2 \leq 4L\big(F(x^{(k-1,s)}) - F(x^*)\big) + 4L\big(F(\tilde{x}^{s-1}) - F(x^*)\big).$$

Lemma 6 provides a relationship between the parameters before and after the proximal mapping. Lemma 7 is utilized to bound the variance of the gradient estimator by the function values. Since both $x^{(k-1,s)}$ and $\tilde{x}^{s-1}$ are estimators of $x^*$, their function values decrease during the training; the variance of $v^{(k,s)}$ thus tends to zero as $k$ goes to infinity.

Lemma 8 Assume that the nonnegative sequence $\{u_K\}$ satisfies the recursion

$$u_K^2 \leq S_K + \sum_{k=1}^{K}\lambda_k u_k,$$

with $\{S_K\}$ an increasing sequence, $\lambda_k \geq 0$ for all $k$, and $u_0^2 \leq S_0$. Then for all $K \geq 0$,

$$u_K \leq \frac{1}{2}\sum_{k=1}^{K}\lambda_k + \left(S_K + \Big(\frac{1}{2}\sum_{k=1}^{K}\lambda_k\Big)^2\right)^{1/2}. \qquad (14)$$

Proof 1 (Inner loop analysis) For the moment, we omit the index $s$ in $(k, s)$ and substitute $K_s$ and $\tilde{x}^{s-1}$ with $K$ and $\tilde{x}$, respectively. We begin the proof with the update rules in Algorithm 2. Applying Lemma 5 to $x^{(k)} = \mathrm{prox}_{h,\varepsilon^{(k)}}^{\alpha}\{q^{(k)}\}$ by defining

$$w = \frac{1}{\alpha}\big(q^{(k)} + g - x^{(k)}\big) \in \partial_{\varepsilon^{(k)}} h(x^{(k)}), \qquad (15)$$

with $\|g\| \leq \sqrt{2\alpha\varepsilon^{(k)}}$, and introducing $q^{(k)} = x^{(k-1)} - \alpha(v^{(k)} + e^{(k)})$, yields

$$x^{(k-1)} - x^{(k)} = \alpha(w + v^{(k)} + e^{(k)}) - g \triangleq \alpha\xi^{(k-1)}. \qquad (16)$$

Hence, we have $\xi^{(k-1)} = (w + v^{(k)} + e^{(k)}) - \frac{1}{\alpha}g$.
We start the main analysis by evaluating the distance between the latest updated point and the optimum, $\|x^{(k)} - x^*\|^2$, and bounding it in terms of $\|x^{(k-1)} - x^*\|^2$:

$$\begin{aligned} \|x^{(k)} - x^*\|^2 &= \|x^{(k-1)} - \alpha\xi^{(k-1)} - x^*\|^2 \\ &= \|x^{(k-1)} - x^*\|^2 + \alpha^2\|\xi^{(k-1)}\|^2 - 2\alpha\langle\xi^{(k-1)}, x^{(k-1)} - x^*\rangle \\ &= \|x^{(k-1)} - x^*\|^2 + \alpha^2\|\xi^{(k-1)}\|^2 - 2\alpha\langle\xi^{(k-1)}, x^{(k-1)} - x^{(k)}\rangle - 2\alpha\langle\xi^{(k-1)}, x^{(k)} - x^*\rangle \\ &\overset{(16)}{=} \|x^{(k-1)} - x^*\|^2 \underbrace{- \alpha^2\|\xi^{(k-1)}\|^2 - 2\alpha\langle\xi^{(k-1)}, x^{(k)} - x^*\rangle}. \end{aligned} \qquad (17)$$

The next proposition gives an upper bound on $-\alpha^2\|\xi^{(k-1)}\|^2 - 2\alpha\langle\xi^{(k-1)}, x^{(k)} - x^*\rangle$.

Proposition 1 With Assumptions 3, 5 and $\alpha L \leq 1$, we have

$$\begin{aligned} &-\alpha^2\|\xi^{(k-1)}\|^2 - 2\alpha\langle\xi^{(k-1)}, x^{(k)} - x^*\rangle \\ &\leq -2\alpha\big(F(x^{(k)}) - F(x^*)\big) + 2\alpha\varepsilon^{(k)} + 2\alpha^2\|v^{(k)} - \nabla f(x^{(k-1)})\|^2 \\ &\quad + \big(2\alpha^2\|e^{(k)}\| + 6\alpha\sqrt{2\alpha\varepsilon^{(k)}}\big)M - 2\alpha\Big\langle x^{(k)} - x^*, e^{(k)} - \frac{1}{\alpha}g\Big\rangle \\ &\quad - 2\alpha\big\langle x^{(k)} - x^*, v^{(k)} - \nabla f(x^{(k-1)})\big\rangle. \end{aligned}$$

The proof of Proposition 1 is given in the supplementary file. We apply this proposition to (17) and take the expectation of both sides of (17) w.r.t. the sample batch $l_{in}$. Notice that $\mathbb{E}_{l_{in}}[-2\alpha\langle x - x^*, v^{(k)} - \nabla f(x^{(k-1)})\rangle] = -2\alpha\langle x - x^*, \mathbb{E}_{l_{in}}[v^{(k)} - \nabla f(x^{(k-1)})]\rangle = 0$, since $(x - x^*)$ is deterministic w.r.t. $l_{in}$. Then

$$\begin{aligned} \mathbb{E}_{l_{in}}\|x^{(k)} - x^*\|^2 &\leq \|x^{(k-1)} - x^*\|^2 - 2\alpha\,\mathbb{E}_{l_{in}}\big(F(x^{(k)}) - F(x^*)\big) + 2\alpha^2\,\mathbb{E}_{l_{in}}\|v^{(k)} - \nabla f(x^{(k-1)})\|^2 \\ &\quad + \big(2\alpha^2\,\mathbb{E}_{l_{in}}\|e^{(k)}\| + 6\alpha\sqrt{2\alpha\varepsilon^{(k)}}\big)M + 2\alpha\,\mathbb{E}_{l_{in}}\Big[\|x^{(k)} - x^*\|\Big(\|e^{(k)}\| + \frac{1}{\alpha}\|g\|\Big)\Big] + 2\alpha\varepsilon^{(k)} \\ &\leq \|x^{(k-1)} - x^*\|^2 - 2\alpha\,\mathbb{E}_{l_{in}}\big(F(x^{(k)}) - F(x^*)\big) + 8\alpha^2 L\big(F(x^{(k-1)}) - F(x^*)\big) + 8\alpha^2 L\big(F(\tilde{x}) - F(x^*)\big) \\ &\quad + \big(2\alpha^2\,\mathbb{E}_{l_{in}}\|e^{(k)}\| + 6\alpha\sqrt{2\alpha\varepsilon^{(k)}}\big)M + 2\alpha\,\mathbb{E}_{l_{in}}\|x^{(k)} - x^*\|\Big(\mathbb{E}_{l_{in}}\|e^{(k)}\| + \frac{1}{\alpha}\|g\|\Big) + 2\alpha\varepsilon^{(k)}. \end{aligned} \qquad (18)$$

The second inequality utilizes Lemma 7 for $\mathbb{E}_{l_{in}}\|v^{(k)} - \nabla f(x^{(k-1)})\|^2$. Summing (18) over $k$ from 1 to $K$ and taking the expectation w.r.t. all data in the $K$ iterations bring forth

$$\begin{aligned} 2\alpha\sum_{k=1}^{K}\mathbb{E}\big[F(x^{(k)}) - F(x^*)\big] &\leq \|x^{(0)} - x^*\|^2 - \mathbb{E}\|x^{(K)} - x^*\|^2 + 8\alpha^2 L\sum_{k=1}^{K}\mathbb{E}\big[F(x^{(k-1)}) - F(x^*)\big] + 8\alpha^2 LK\big(F(\tilde{x}) - F(x^*)\big) \\ &\quad + 2\alpha^2 M\sum_{k=1}^{K}\mathbb{E}\|e^{(k)}\| + 6\alpha M\sqrt{2\alpha}\sum_{k=1}^{K}\sqrt{\varepsilon^{(k)}} + \sum_{k=1}^{K}\Big(2\alpha\,\mathbb{E}\|e^{(k)}\| + 2\sqrt{2\alpha\varepsilon^{(k)}}\Big)\mathbb{E}\|x^{(k)} - x^*\| + 2\alpha\sum_{k=1}^{K}\varepsilon^{(k)}, \end{aligned} \qquad (19)$$

where we apply (13b) to enlarge $\|g\|$. With rearrangement of (19) and the assumption $\alpha < \frac{1}{4L}$, we have

$$\begin{aligned} (2\alpha - 8\alpha^2 L)\sum_{k=1}^{K}\mathbb{E}\big[F(x^{(k)}) - F(x^*)\big] &+ \mathbb{E}\|x^{(K)} - x^*\|^2 + 8\alpha^2 L\,\mathbb{E}\big[F(x^{(K)}) - F(x^*)\big] \\ &\leq \|x^{(0)} - x^*\|^2 + 8\alpha^2 L\big(F(x^{(0)}) - F(x^*)\big) + 8\alpha^2 LK\big(F(\tilde{x}) - F(x^*)\big) + 2\alpha^2 M\sum_{k=1}^{K}\mathbb{E}\|e^{(k)}\| \\ &\quad + 6\alpha M\sqrt{2\alpha}\sum_{k=1}^{K}\sqrt{\varepsilon^{(k)}} + 2\alpha\sum_{k=1}^{K}\varepsilon^{(k)} + \sum_{k=1}^{K}\Big(2\alpha\,\mathbb{E}\|e^{(k)}\| + 2\sqrt{2\alpha\varepsilon^{(k)}}\Big)\mathbb{E}\|x^{(k)} - x^*\|. \end{aligned} \qquad (20)$$

We temporarily keep only $\mathbb{E}\|x^{(K)} - x^*\|^2$ on the left side of (20), owing to the positivity of the other two terms, and obtain

$$\begin{aligned} \mathbb{E}\|x^{(K)} - x^*\|^2 &\leq \|x^{(0)} - x^*\|^2 + 8\alpha^2 L\big(F(x^{(0)}) - F(x^*)\big) + 8\alpha^2 LK\big(F(\tilde{x}) - F(x^*)\big) + 2\alpha^2 M\sum_{k=1}^{K}\mathbb{E}\|e^{(k)}\| \\ &\quad + 6\alpha M\sqrt{2\alpha}\sum_{k=1}^{K}\sqrt{\varepsilon^{(k)}} + 2\alpha\sum_{k=1}^{K}\varepsilon^{(k)} + \sum_{k=1}^{K}\Big(2\alpha\,\mathbb{E}\|e^{(k)}\| + 2\sqrt{2\alpha\varepsilon^{(k)}}\Big)\mathbb{E}\|x^{(k)} - x^*\|. \end{aligned} \qquad (21)$$

Next, with this recursion between $\mathbb{E}\|x^{(K)} - x^*\|^2$ and $\mathbb{E}\|x^{(k)} - x^*\|$, we can set up a bound for $\mathbb{E}\|x^{(k)} - x^*\|$ which only contains the objective function values, the squared $\ell_2$ norms of $x^{(k)}$, $x^{(0)}$ and the error terms. To be specific, using Lemma 8 with $\lambda_k = 2\alpha\,\mathbb{E}\|e^{(k)}\| + 2\sqrt{2\alpha\varepsilon^{(k)}}$ and $u_k = \mathbb{E}\|x^{(k)} - x^*\|$, and taking the average of (21), yields the next proposition.

Proposition 2 Let $\theta = 2\alpha - 8\alpha^2 L$. Under Assumption 6, we have

$$\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\big[F(x^{(k)}) - F(x^*)\big] + \frac{\mathbb{E}\|x^{(K)} - x^*\|^2}{K\theta} + \frac{8\alpha^2 L}{K\theta}\mathbb{E}\big[F(x^{(K)}) - F(x^*)\big] \leq \frac{16\alpha^2 L}{\theta}\big(F(\tilde{x}) - F(x^*)\big) + \frac{2\|x^{(0)} - x^*\|^2}{K\theta} + \frac{16\alpha^2 L}{K\theta}\big(F(x^{(0)}) - F(x^*)\big) + \frac{C}{K}, \qquad (22)$$

where $C$ is a constant bounding $\big(3A_K^2 + 2\alpha B_K^2 + A_K + 6\alpha M\sqrt{2\alpha}\,B_K + A_K^2 + 2\alpha^2 M C_K\big)/\theta$, with $A_K = \sum_{k=1}^{K}\big(\alpha\,\mathbb{E}\|e^{(k)}\| + \sqrt{2\alpha\varepsilon^{(k)}}\big)$, $B_K = \sum_{k=1}^{K}\sqrt{\varepsilon^{(k)}}$ and $C_K = \sum_{k=1}^{K}\mathbb{E}\|e^{(k)}\|$.
Proposition 2 presents a rough relationship between the function values at the last state ($x^{(K)}$) and at the initial state ($x^{(0)}$) within one inner loop. Its detailed proof is provided in the supplementary file.

Outer loop analysis. In the following, we consider the outer loop and reinstate the index $s$ in (22). Owing to the convexity of $F(x)$, we have

$$\frac{1}{K_s}\sum_{k=1}^{K_s}\mathbb{E}F(x^{(k,s)}) \geq \mathbb{E}\Big[\frac{1}{K_s}\sum_{k=1}^{K_s}F(x^{(k,s)})\Big] \geq \mathbb{E}\Big[F\Big(\frac{1}{K_s}\sum_{k=1}^{K_s}x^{(k,s)}\Big)\Big] = \mathbb{E}F(\tilde{x}^s).$$

Therefore, taking the expectation w.r.t. all sampled data in $s$ outer rounds, (22) becomes

$$\mathbb{E}\big(F(\tilde{x}^s) - F(x^*)\big) + \frac{\mathbb{E}\|x^{(K_s,s)} - x^*\|^2}{K_s\theta} + \frac{8\alpha^2 L}{K_s\theta}\mathbb{E}\big[F(x^{(K_s,s)}) - F(x^*)\big] \leq \frac{16\alpha^2 L}{\theta}\mathbb{E}\big[F(\tilde{x}^{s-1}) - F(x^*)\big] + \frac{2\,\mathbb{E}\|x^{(0,s)} - x^*\|^2}{K_s\theta} + \frac{16\alpha^2 L}{K_s\theta}\mathbb{E}\big[F(x^{(0,s)}) - F(x^*)\big] + \frac{C}{K_s}. \qquad (23)$$

Equation (23) is a recursion in the variable $s$. Rearranging it with $x^{(0,s+1)} = x^{(K_s,s)}$ yields

$$\mathbb{E}\big(F(\tilde{x}^s) - F(x^*)\big) + \frac{\mathbb{E}\|x^{(0,s+1)} - x^*\|^2}{K_s\theta} + \frac{\rho}{2K_s}\mathbb{E}\big[F(x^{(0,s+1)}) - F(x^*)\big] \leq \rho\left\{\mathbb{E}\big[F(\tilde{x}^{s-1}) - F(x^*)\big] + \frac{\mathbb{E}\|x^{(0,s)} - x^*\|^2}{K_{s-1}\theta} + \frac{\rho}{2K_{s-1}}\mathbb{E}\big[F(x^{(0,s)}) - F(x^*)\big]\right\} + \frac{C}{K_s}, \qquad (24)$$

where $\rho = \frac{16\alpha^2 L}{\theta} < \delta < 1$ with the constant $\delta$; the inequality is due to the condition $\rho\beta \geq 2$. Telescoping (24) over $s$ from 1 to $S$, we obtain

$$\mathbb{E}\big[F(\tilde{x}^S) - F(x^*)\big] + \frac{\mathbb{E}\|x^{(0,S+1)} - x^*\|^2}{K_S\theta} + \frac{\rho}{2K_S}\mathbb{E}\big[F(x^{(0,S+1)}) - F(x^*)\big] < \rho^S\left\{\mathbb{E}\big[F(\tilde{x}^0) - F(x^*)\big] + \frac{\mathbb{E}\|x^{(0,1)} - x^*\|^2}{K_0\theta} + \frac{\rho}{2K_0}\mathbb{E}\big[F(x^{(0,1)}) - F(x^*)\big]\right\} + \frac{C}{n_0}\frac{\rho^S}{\rho\beta - 1}, \qquad (25)$$

where in the inequality we use $\sum_{i=1}^{S}\frac{\rho^{S-i}}{\beta^i} = \rho^S\sum_{i=1}^{S}(\rho\beta)^{-i} < \frac{\rho^S}{\rho\beta - 1}$. We complete the proof of Theorem 2 by using $\tilde{x}^0 = x^{(0,1)} = x_{init}$, $K_0 = n_0$, and dropping the positive terms $\frac{\mathbb{E}\|x^{(0,S+1)} - x^*\|^2}{K_S\theta} + \frac{\rho}{2K_S}\mathbb{E}\big[F(x^{(0,S+1)}) - F(x^*)\big]$.

Note that, in the above analysis, $\alpha$ needs to satisfy $\alpha < \min\{\frac{\delta}{L(4\delta+8)}, \frac{1}{4L}, \frac{1}{L}\} = \frac{\delta}{L(4\delta+8)}$, where the first condition comes from $\rho = \frac{16\alpha^2 L}{\theta} < \delta$.

4.2 Convergence of DPSVRG

DPSVRG is reducible to Inexact Prox-SVRG when the gradient and proximal mapping errors of Inexact Prox-SVRG are carefully chosen (as shown in Theorem 1). The remaining task is to examine whether DPSVRG satisfies the conditions in Theorem 2, so that it achieves the same convergence rate as Inexact Prox-SVRG. Specifically, we prove that under the following assumption, the summability of the error terms in (10) is in line with Assumption 6.

Assumption 7 For each node $i$ in (P1), the $\ell_2$ norms of the gradients of $f_i(x)$ and of the subgradients of $h(x)$ are bounded by constants:

$$\|\nabla f_i(x_i)\| \leq G_f, \quad \forall i \in \{1, ..., m\}, \qquad (26)$$

$$\|z\| \leq G_h, \quad \forall z \in \partial h(x_i). \qquad (27)$$

This assumption confines the gradients of $f$ at each node and the subgradients of $h$. Now we state the convergence of DPSVRG in Theorem 3.

Theorem 3 With Assumptions 1-5 and 7, DPSVRG achieves the identical convergence rate to Inexact Prox-SVRG.

Proof 2 Taking the $\ell_2$ norm and the expectation of both sides of (10a) yields

$$\mathbb{E}\|e^{(k,s)}\| \leq \frac{1}{m}\sum_{i=1}^{m}\Big[\mathbb{E}\|\nabla f_i^{l_i}(x_i^{(k-1,s)}) - \nabla f_i^{l_i}(\bar{x}^{(k-1,s)})\| + \mathbb{E}\|\nabla f_i^{l_i}(\tilde{x}^{s-1}) - \nabla f_i^{l_i}(\tilde{x}_i^{s-1})\| + \mathbb{E}\|\nabla f_i(\tilde{x}_i^{s-1}) - \nabla f_i(\tilde{x}^{s-1})\|\Big]. \qquad (28)$$

Now we process the first term on the right and obtain

$$\mathbb{E}\|\nabla f_i^{l_i}(x_i^{(k-1,s)}) - \nabla f_i^{l_i}(\bar{x}^{(k-1,s)})\| = \frac{1}{n_i}\sum_{j=1}^{n_i}\|\nabla f_i^{j}(x_i^{(k-1,s)}) - \nabla f_i^{j}(\bar{x}^{(k-1,s)})\| \leq L\|x_i^{(k-1,s)} - \bar{x}^{(k-1,s)}\|, \qquad (29)$$

where the inequality applies the smoothness of $f$ in Assumption 3. The second and third terms in (28) can be bounded with the same technique as in (29):

$$\mathbb{E}\|\nabla f_i^{l_i}(\tilde{x}^{s-1}) - \nabla f_i^{l_i}(\tilde{x}_i^{s-1})\| \leq L\|\tilde{x}^{s-1} - \tilde{x}_i^{s-1}\|, \qquad \text{(30a)}$$
i Thus we obtain (k,s) Eke m L X (k−1,s) s−1 k≤ − x̃ k . kxi − x̄(k−1,s) k +2kx̃s−1 i m i=1 | {z } (30b) (31) The first term is bounded in the following way. m m m L X (k−1,s) L X X (k−1,s) (k−1,s) kxi − x̄(k−1,s) k ≤ 2 kx − xj k m i=1 m i=1 j=1 i m m L XX ≤ 2 kq̂i (k−1,s) − qˆj (k−1,s) k m i=1 j=1 m ≤ 2L X kq̂i (k−1,s) − q̄(k−1,s) k m i=1 m m m X X 2L X (k−1,s) (k−1,s) k k ≤ (Γγ kqj k) = 2LΓγ kqi k, m i=1 j=1 i=1 (32) where the second inequality uses nonexpansiveness of proximal operator (Lemma 4) and the last inequality uses Pm Pm (k,s) (k,s) (k,s) (k,s) 1 1 (k,s) · kqj k ≤ ≤ j=1 φi,j − m Lemma 1 with the inequalities kq̂i (k,s) − q̄(k,s) k = −m qj j=1 φi,j qj Pm (k,s) Γγ k j=1 kqj k. Similarly, we give rise to m m m 2L X s−1 2L X X s−1 kx̃i − x̃s−1 k ≤ 2 kx̃ − x̃js−1 k m i=1 m i=1 j=1 i m K K k=1 k=1 m X m X Ks X m = ≤ s s 2L X X 1 X 1 X (k,s−1) (k,s−1) k x − xj k i m2 i=1 j=1 Ks Ks 2L 1 m2 i=1 j=1 Ks 1 = 4LΓ Ks Ks X k=1 γ k (k,s−1) kxi (k,s−1) − xj k k=1 m X kqik,s−1 k. (33) i=1 (k,s) The third inequality utilizes the similar technique in (32). Finally, we bound Eke(k,s) k with the sum of kqi k, i.e., Ks m m X X 4LΓ X (k−1,s) (34) Eke(k,s) k ≤2LΓγ k kqi k+ γk kqk,s−1 k. i K s i=1 i=1 k=1 √ (k,s) Subsequently, we bound ε from (10b) using the similar techniques applied in disposing e(k,s) . With y(k,s) = α (k,s) proxh {q̄ }, we deal with the first term in (10b) in the following way. 2 m m 1 1 1 X (k,s) 1 X (k,s) (k,s) (k,s) 2 kx̄ −y k = x − y 2α 2α m i=1 i m i=1 !2 m 1 1 X (k,s) (k,s) ≤ kx −y k 2α m i=1 i !2 m 1 1 X (k,s) (k,s) ≤ kq̂i − q̄ k 2α m i=1 2 !2 m m m X X X 1 1 1 (k,s) (k,s) k k ≤ Γγ kqj k = Γγ kqi k . (35) 2α m i=1 2α j=1 i=1 The second term of (10b) is handled in the following way. 1 hx̄(k,s) − y(k,s) , (y(k,s) − q̄(k,s) ) + pi α 1 ≤ kx̄(k,s) − y(k,s) k · k (y(k,s) − q̄(k,s) ) + pk α 13 ≤ 2Gh kx̄(k,s) − y(k,s) k m 1 X ≤ 2Gh kq̂i (k,s) − q̄(k,s) k m i=1 m m m X 1 X k X (k,s) (k,s) k Γγ kqj k = 2Gh Γγ kqi k, ≤ 2Gh m i=1 j=1 i=1 (36) where the second inequality comes from kq̄(k,s) − y(k,s) k ≤ αGh by applying Assumption 7. Combining (35) and √ (k,s) (36), ε is bounded by v !2 u m m p u 1 X X (k,s) (k,s) t (k,s) k k ε ≤ Γγ kqi k + 2Gh Γγ kqi k 2α i=1 i=1 v u m m X X u 1 (k,s) (k,s) kqi k. (37) ≤ Γγ k kqi k + t2Gh Γγ k 2α i=1 i=1 √ In what follows, we will show that Eke(k,s) k and ε(k,s) satisfy Assumption 6, i.e., they are summable over k. By introducing the following proposition 3, two error sequences can be bounded by terms related to C0 , C1 and C2 , and thus their summations over k are finite. Pm (1,1) Proposition 3 Consider Algorithm 1. Let C0 = < ∞, C1 = αm(Gg + Gh ) < ∞, C2 = i=1 qi Pm (k,s) s αmβ n0 (Gg + Gh ). Then, ∀k, s, the term i=1 kqi k can be bounded by C0 + C1 k + C2 s, (C0 , C1 , C2 ≥ 0), i.e. m X (k,s) kqi k ≤ C0 + C1 k + C2 s. (38) i=1 Applying Proposition 3 to (34) and (37), we have Ks X Eke(k,s) k ≤ Ks X 2LΓγ k (C0 + C1 (k − 1) + Cs s) k=1 k=1 Ks X K s 1 X γ k (C0 + C1 k + Cs (s − 1)) Ks k=1 k=1 2 ≤2LΓ D0 (C0 − C1 + Cs s) + C1 D12 + 4LΓ D02 (C0 + Cs (s − 1)) + C1 D12 , + 4LΓ (39a) and Ks p X ε(k,s) ≤ k=1 Ks X 1 k=1 + 2α Γγ k (C0 + C1 k + Cs s) Ks q X 2Gh Γγ k (C0 + C1 k + Cs s) k=1 ≤ Γ (D0 (C0 + C2 s) + C1 D1 ) 2α p p p + 2Gh Γ D0 C0 + C2 s + C1 D1 , (39b) P∞ p P∞ p P∞ P∞ with some constants D0 ≥ k=1 γ k and D1 ≥ k=1 γ k k. Then there exists k=1 γ k ≤ D02 and k=1 γ k k ≤ D12 . √ Equations (39a) and (39b) are finite with regard to k for a given s. Therefore, the error sequences El ke(k,s) k and ε(k,s) in Theorem 1 satisfy Assumption 6. 
Consequently, Theorem 2 also holds, and we conclude in turn that DPSVRG converges to the optimum at the same rate as Inexact Prox-SVRG.

5 Experimental Results

5.1 Evaluation Setup

Testbed setting and datasets. Our experiments are conducted on a testbed of 8 servers, each with an 8-core Intel Xeon W2140b CPU at 3.20 GHz, 32 GB of DDR4 RAM and a Mellanox Connect-X3 NIC supporting 10 GbE links. We compare the performance of DPSVRG with the baseline DSPG [11] on four typical datasets, listed in Table 1.

Table 1: Properties of the datasets

| Dataset   | Train size | Feature dimension | Classes |
|-----------|------------|-------------------|---------|
| MNIST     | 60,000     | 784               | 10      |
| CIFAR-10  | 50,000     | 1,024             | 10      |
| Adult     | 30,161     | 30                | 2       |
| Covertype | 100,000    | 54                | 7       |

We consider a decentralized logistic regression model with an $\ell_1$ norm for binary classification problems with data labels $\{0, 1\}$. The global objective across the nodes is expressed as

$$\min_x \; \frac{1}{mn}\sum_{i=1}^{m}\sum_{l=1}^{n}\Big[ -b_l^i \langle d_l^i, x \rangle + \log\big(1 + e^{\langle d_l^i, x \rangle}\big) \Big] + \lambda\|x\|_1, \qquad (40)$$

where $d_l^i$ and $b_l^i$ denote the feature and the target class of datum $\zeta_l^i$, respectively. This objective function satisfies Assumptions 3 and 4. For simplicity, all the data is equally partitioned across the 8 nodes and each node holds $n$ data samples. Unless otherwise specified, the graph connectivity coefficient $b$ is set to 1, implying that the network is connected at every step. The impact of $b$ on the convergence performance is evaluated in Section 5.4.

5.2 Basic Results

We first set $\alpha = 0.01$ and $\lambda = 0.01$ and run DPSVRG and DSPG on the four datasets. The results are displayed in Fig. 1, Fig. 2 and Fig. 3, where the performance is measured by the optimality gap, i.e., $F(x^s) - F(x^*)$. The horizontal coordinate in Fig. 1 and Fig. 3 represents the effective passes through the whole dataset. We execute the centralized gradient method to approximate $F(x^*)$.

[Figure 1: Training loss of DPSVRG and DSPG. (a) MNIST, (b) CIFAR-10, (c) Adult, (d) Covertype.]

Faster convergence rate. In Fig. 1, the two algorithms converge at the same rate at the beginning, but DSPG oscillates after some epochs and is trapped in a neighborhood of $x^*$ in the end, which is known as inexact convergence. This is due to the large variance of the stochastic gradients, which leads the parameters to inaccurate states. One approach to tackle this problem is to adjust the learning rate during training, yet this process is laborious and results in a lower convergence rate. On the contrary, DPSVRG converges smoothly while DSPG keeps oscillating on the MNIST and CIFAR-10 datasets. The convergence rate of DPSVRG is much faster than that of DSPG in terms of the optimality gap, and this advantage shows on all four datasets. The smoothness of DPSVRG provides a consistent relationship between performance and training epochs, which can guide us in reasonably controlling the number of training rounds. In summary, DPSVRG allows maintaining a constant learning rate to achieve a high and steady convergence speed.

Communication-efficiency of multi-consensus. Multi-consensus is used in DPSVRG to speed up the convergence rate by executing communication for multiple rounds in one inner update. One natural question is whether this mechanism raises the communication cost. To illustrate the communication-efficiency of our approach, we compare the optimality gap of the two algorithms versus communication rounds in Fig. 2. The inexact convergence of DSPG is even more evident there and cannot be eliminated by increasing the communication rounds. Clearly, to reach the optimal point, DPSVRG demands less communication than DSPG from a global view.
We then compare DPSVRG with its single-consensus variant to further illustrate the effectiveness of multi-consensus. Fig. 3 shows that DPSVRG with single-consensus converges a little more slowly than DPSVRG in terms of training rounds. Furthermore, comparing DSPG in Fig. 1 with DPSVRG with single-consensus, we observe that DPSVRG converges faster and more smoothly, which demonstrates the effectiveness of the variance reduction technique.

[Figure 2: Training loss of DPSVRG and DSPG with the same communication rounds. (a) MNIST, (b) CIFAR-10, (c) Adult, (d) Covertype.]

[Figure 3: Training loss of single-consensus and multi-consensus DPSVRG. (a) MNIST, (b) CIFAR-10, (c) Adult, (d) Covertype.]

5.3 Robustness Towards the Regularization Coefficient

In order to evaluate the influence of the $\ell_1$ coefficient on the performance of DPSVRG, we conduct another experiment by adjusting $\lambda$ from 0.001 to 0.1 while fixing $\alpha = 0.01$. A larger $\lambda$ yields an optimal point with sparser elements. The objective functions with varied $\lambda$ differ from each other, and the optimal values of these functions are also diverse. Therefore, we use the global loss as the performance metric. From the results in Fig. 4, we observe that the value of $\lambda$ has a very gentle influence on the stability of DPSVRG. On the contrary, different values of $\lambda$ have a great impact on the behavior of DSPG. In particular, when $\lambda$ is large ($\lambda = 0.1$), the oscillation of DSPG becomes more severe (about $2 \times 10^{-3}$) and the final result resides at a relatively high training loss (around $3.965 \times 10^{-1}$). In other words, DPSVRG maintains satisfactory stability under changing $\lambda$ values compared with its baseline counterpart.

5.4 Investigating Graph Connectivity

We study how the performance of our algorithm changes with varied graph connectivity. More specifically, we test three values of $b$: 3, 7 and 50, where a larger value implies a sparser graph. For each specific value, a set of $b$ doubly stochastic matrices is pre-determined, satisfying that only the union of all $b$ matrices is connected. The matrices are sampled periodically in the training process; thus the graph $\mathcal{G}$ is $b$-connected. Fig. 5 shows the optimality gap of DSPG and DPSVRG versus the outer iteration rounds on the MNIST dataset. On sparser communication graphs, both algorithms converge more slowly and their performance gap becomes wider. This result can be clearly observed if we zoom in on the optimality gap over the last 50 iterations: the performance difference between the two approaches increases from $1 \times 10^{-3}$ to $2 \times 10^{-3}$ as $b$ changes from 3 to 50. In addition, DSPG over sparser graphs has more violent oscillations and is trapped in a neighborhood farther from the optimal point (around $1.5 \times 10^{-3}$ when $b = 50$, compared to $0.9 \times 10^{-3}$ when $b = 3$). Meanwhile, the sparsity of the communication graph only influences the convergence speed of DPSVRG and does not prevent it from converging to the optimum. This robustness to different degrees of graph sparsity improves the practicability of DPSVRG in sparse and time-varying networks.

6 Conclusion

In this paper, we propose DPSVRG to efficiently speed up the convergence of the decentralized stochastic proximal gradient descent algorithm. DPSVRG leverages the variance reduction technique to reduce the variance of the gradient estimator and rectify the imprecise update direction. We transform DPSVRG into its centralized equivalent, the Inexact Prox-SVRG algorithm, under mild conditions.
We prove that DPSVRG and Inexact Prox-SVRG have an $O(1/T)$ convergence rate for a general smooth and convex objective function plus a non-smooth regularization term, while DSPG merely converges at an $O(1/\sqrt{T})$ rate. Experimental results demonstrate that DPSVRG converges much faster than DSPG under different time-varying topologies and regularization coefficients. In particular, the training loss of DPSVRG decreases smoothly as the training epochs proceed, while that of DSPG exhibits obvious fluctuations.

[Figure 4: Training loss with different regularization coefficients (λ).]

[Figure 5: Training loss with different graph connectivity coefficients (b).]

A Proof of Theorem 1

In the initial state, we have $\bar{x}^{(0,1)} = x_{init} = x^{(0,1)}$ and $\tilde{x}^0 = \frac{1}{m}\sum_{i=1}^{m}\tilde{x}_i^0$. Hence, if $x^{(k,1)} = \bar{x}^{(k,1)}$, $\forall k \in [1, K_1]$, can be satisfied, we have

$$\tilde{x}^1 = \frac{1}{K_1}\sum_{k=1}^{K_1}x^{(k,1)} = \frac{1}{K_1}\sum_{k=1}^{K_1}\bar{x}^{(k,1)} = \frac{1}{m}\sum_{i=1}^{m}\tilde{x}_i^1,$$

$$x^{(0,2)} = x^{(K_1,1)} = \bar{x}^{(K_1,1)} = \frac{1}{m}\sum_{i=1}^{m}x_i^{(K_1,1)} = \bar{x}^{(0,2)}.$$

In this way, $x$ and $\tilde{x}$ successfully track $\bar{x}$ and $\frac{1}{m}\sum_{i=1}^{m}\tilde{x}_i$ in the first inner loop. By repeating the update rules of Algorithm 2 for $S$ rounds, the two algorithms reach the same optimal point $\tilde{x}^S$. Clearly, given $s$, an explicit way to guarantee $x^{(k,s)} = \bar{x}^{(k,s)}$, $\forall k$, is to ensure $q^{(k,s)} = \bar{q}^{(k,s)}$. Therefore, in the following we use the equations $x^{(k,s)} = \bar{x}^{(k,s)}$ and $q^{(k,s)} = \bar{q}^{(k,s)}$ to deduce the required expressions of the error terms $e^{(k,s)}$ and $\varepsilon^{(k,s)}$.

Consider the updates of $q_i^{(k,s)}$ (lines 7-9, Algorithm 1) and of $q^{(k,s)}$ (lines 7-8, Algorithm 2). Taking the average of $q_i^{(k,s)}$ over $i$ yields the expression of $\bar{q}^{(k,s)}$:

$$\bar{q}^{(k,s)} = \bar{x}^{(k-1,s)} - \alpha\Big(\frac{1}{m}\sum_{i=1}^{m}\nabla f_i^{l_i}(x_i^{(k-1,s)}) - \frac{1}{m}\sum_{i=1}^{m}\nabla f_i^{l_i}(\tilde{x}_i^{s-1}) + \frac{1}{m}\sum_{i=1}^{m}\nabla f_i(\tilde{x}_i^{s-1})\Big).$$

The error term $e^{(k,s)}$ in (10a) is generated from the difference between $\bar{q}^{(k,s)}$ and $q^{(k,s)}$. Next we derive $\varepsilon^{(k,s)}$, starting with the inexact proximal operator in line 9 of Algorithm 2. According to (9), we have

$$\frac{1}{2\alpha}\|x^{(k,s)} - \bar{q}^{(k,s)}\|^2 + h(x^{(k,s)}) \leq \frac{1}{2\alpha}\|y^{(k,s)} - \bar{q}^{(k,s)}\|^2 + h(y^{(k,s)}) + \varepsilon^{(k,s)}, \qquad (41)$$

with $y^{(k,s)} = \mathrm{prox}_h^{\alpha}\{\bar{q}^{(k,s)}\}$, which implies (with Lemma 3)

$$\frac{1}{\alpha}\big(\bar{q}^{(k,s)} - y^{(k,s)}\big) \in \partial h(y^{(k,s)}). \qquad (42)$$

We use the updates in lines 10-11 of Algorithm 1 to construct an inequality similar to (41):

$$\begin{aligned} \frac{1}{2\alpha}\|\bar{x}^{(k,s)} - \bar{q}^{(k,s)}\|^2 + h(\bar{x}^{(k,s)}) &= \frac{1}{2\alpha}\|\bar{x}^{(k,s)} - y^{(k,s)}\|^2 + \frac{1}{2\alpha}\|y^{(k,s)} - \bar{q}^{(k,s)}\|^2 + h(\bar{x}^{(k,s)}) + \frac{1}{\alpha}\big\langle\bar{x}^{(k,s)} - y^{(k,s)}, y^{(k,s)} - \bar{q}^{(k,s)}\big\rangle \\ &\leq \frac{1}{2\alpha}\|y^{(k,s)} - \bar{q}^{(k,s)}\|^2 + h(y^{(k,s)}) + \frac{1}{2\alpha}\|\bar{x}^{(k,s)} - y^{(k,s)}\|^2 + \frac{1}{\alpha}\big\langle\bar{x}^{(k,s)} - y^{(k,s)}, y^{(k,s)} - \bar{q}^{(k,s)}\big\rangle - \big\langle p, y^{(k,s)} - \bar{x}^{(k,s)}\big\rangle, \end{aligned} \qquad (43)$$

where $p$ is a vector with $p \in \partial h(\bar{x}^{(k,s)})$ and the inequality holds due to the convexity of $h$. Comparing (43) with (41) gives the form of $\varepsilon^{(k,s)}$ in (10b), thus completing the proof of Theorem 1.

B Proofs of Lemmas

B.1 Proof of Lemma 5

Given the definition of $\partial_\varepsilon h(z)$ in (12), we note that if a convex function $h$ can be divided into two convex parts $h = h_1 + h_2$, then $\partial_\varepsilon h(z) \subset \partial_\varepsilon h_1(z) + \partial_\varepsilon h_2(z)$.

Applying (12) to $x = \mathrm{prox}_{h,\varepsilon}^{\alpha}\{z\}$, we obtain $0 \in \partial_\varepsilon\{\frac{1}{2\alpha}\|x - z\|^2 + h(x)\}$. Therefore, there exist $t$ and $-t$ such that $t \in \partial_\varepsilon\{\frac{1}{2\alpha}\|x - z\|^2\}$ and $-t \in \partial_\varepsilon h(x)$. According to the definition of the $\varepsilon$-subdifferential, we have, for all $y$,

$$\frac{1}{2\alpha}\|y - z\|^2 \geq \frac{1}{2\alpha}\|x - z\|^2 + \langle t, y - x\rangle - \varepsilon.$$

Unfolding $\frac{1}{2\alpha}\|y - z\|^2$ and applying $-\langle y, z\rangle \leq \|y\|\cdot\|z\|$ and $\langle t, y\rangle \geq -\|y\|\cdot\|t\|$ yields a quadratic inequality in $\|y\|$. To make sure this inequality holds for all $y$, we let the discriminant be nonpositive ($\Delta \leq 0$), i.e.,

$$\Big(\frac{1}{\alpha}\|z\| + \|t\|\Big)^2 \leq -\frac{2}{\alpha}\Big(\frac{1}{2\alpha}\|x - z\|^2 - \langle t, x\rangle - \frac{1}{2\alpha}\|z\|^2 - \varepsilon\Big).$$
Simplifying this inequality with $\frac{2}{\alpha}\|z\|\cdot\|t\| \geq -\frac{2}{\alpha}\langle z, t\rangle$ produces

$$\frac{1}{2\alpha}\|x - (z + \alpha t)\|^2 \leq \varepsilon.$$

Defining $g = x - (z + \alpha t)$, Lemma 5 readily follows.

B.2 Proof of Lemma 6

Let $x_1 = \mathrm{prox}_{h,\varepsilon_1}^{\alpha}\{z_1\}$ and $x_2 = \mathrm{prox}_{h,\varepsilon_2}^{\alpha}\{z_2\}$. According to Lemma 5, we have

$$p_1 = \frac{z_1 + g_1 - x_1}{\alpha}, \quad p_1 \in \partial_{\varepsilon_1}h(x_1), \quad \|g_1\| \leq \sqrt{2\alpha\varepsilon_1},$$

$$p_2 = \frac{z_2 + g_2 - x_2}{\alpha}, \quad p_2 \in \partial_{\varepsilon_2}h(x_2), \quad \|g_2\| \leq \sqrt{2\alpha\varepsilon_2}.$$

With the property of the $\varepsilon$-subdifferential, we obtain, for all $y$,

$$h(y) \geq h(x_1) + \langle p_1, y - x_1\rangle - \varepsilon_1, \qquad \text{(44a)}$$

$$h(y) \geq h(x_2) + \langle p_2, y - x_2\rangle - \varepsilon_2. \qquad \text{(44b)}$$

We replace $y$ with $x_2$ in (44a) and with $x_1$ in (44b) and get

$$h(x_2) \geq h(x_1) + \frac{1}{\alpha}\langle z_1 + g_1 - x_1, x_2 - x_1\rangle - \varepsilon_1,$$

$$h(x_1) \geq h(x_2) + \frac{1}{\alpha}\langle z_2 + g_2 - x_2, x_1 - x_2\rangle - \varepsilon_2.$$

Adding up these two inequalities leads to

$$\langle z_2 - z_1 + g_2 - g_1, x_1 - x_2\rangle + \|x_1 - x_2\|^2 - \alpha(\varepsilon_1 + \varepsilon_2) \leq 0.$$

Therefore,

$$\|x_1 - x_2\| \leq \frac{\|g_1 - g_2 + z_1 - z_2\| + \sqrt{\|g_1 - g_2 + z_1 - z_2\|^2 + 4\alpha(\varepsilon_1 + \varepsilon_2)}}{2} \leq \|z_1 - z_2\| + \sqrt{2\alpha\varepsilon_1} + \sqrt{2\alpha\varepsilon_2} + \sqrt{\alpha(\varepsilon_1 + \varepsilon_2)}.$$

We complete the proof by taking $\varepsilon_1 = \varepsilon_2 = \varepsilon$.

B.3 Proof of Lemma 7

The proof of Lemma 7 is analogous to most of the SVRG literature, with some needed differences:

$$\begin{aligned} \mathbb{E}_{l_{in}}\|v^{(k)} - \nabla f(x^{(k-1)})\|^2 &= \mathbb{E}_{l_{in}}\big\|\nabla f^{l_{in}}(x^{(k-1)}) - \nabla f^{l_{in}}(\tilde{x}^{s-1}) + \nabla f(\tilde{x}^{s-1}) - \nabla f(x^{(k-1)})\big\|^2 \\ &\leq \mathbb{E}_{l_{in}}\|\nabla f^{l_{in}}(x^{(k-1)}) - \nabla f^{l_{in}}(\tilde{x}^{s-1})\|^2 \\ &\leq 2\,\mathbb{E}_{l_{in}}\|\nabla f^{l_{in}}(x^{(k-1)}) - \nabla f^{l_{in}}(x^*)\|^2 + 2\,\mathbb{E}_{l_{in}}\|\nabla f^{l_{in}}(\tilde{x}^{s-1}) - \nabla f^{l_{in}}(x^*)\|^2 \\ &\leq \frac{2}{m}\sum_{l=1}^{m}\mathbb{E}_l\|\nabla f^l(x^{(k-1)}) - \nabla f^l(x^*)\|^2 + \frac{2}{m}\sum_{l=1}^{m}\mathbb{E}_l\|\nabla f^l(\tilde{x}^{s-1}) - \nabla f^l(x^*)\|^2. \end{aligned} \qquad (45)$$

The first inequality follows from $\mathrm{Var}[x] \leq \mathbb{E}[x^2]$, and the notation $l$ in the last formula refers to the sample $l$. Now let $\phi^l(x) = f^l(x) - f^l(x^*) - \langle\nabla f^l(x^*), x - x^*\rangle$. Clearly, the convexity of $f^l(x)$ leads to $\phi^l(x) \geq 0$, and $\nabla\phi^l$ is $L$-Lipschitz as well. Then we acquire

$$\begin{aligned} 0 = \min_x \phi^l(x) &\leq \min_{x,\mu}\ \phi^l\big(x - \mu\nabla\phi^l(x)\big) \\ &\leq \min_{x,\mu}\ \phi^l(x) + \big\langle\nabla\phi^l(x), -\mu\nabla\phi^l(x)\big\rangle + \frac{L}{2}\|\mu\nabla\phi^l(x)\|^2 \\ &= \min_{x,\mu}\ \phi^l(x) - \mu\|\nabla\phi^l(x)\|^2 + \frac{L}{2}\mu^2\|\nabla\phi^l(x)\|^2 \\ &= \min_x\ \phi^l(x) - \frac{1}{2L}\|\nabla\phi^l(x)\|^2, \end{aligned} \qquad (46)$$

which means $\|\nabla\phi^l(x)\|^2 \leq 2L\phi^l(x)$, or in other words,

$$\|\nabla f^l(x) - \nabla f^l(x^*)\|^2 \leq 2L\big(f^l(x) - f^l(x^*) - \langle\nabla f^l(x^*), x - x^*\rangle\big).$$

Assume that the $n$ data samples are i.i.d.; then we have

$$\begin{aligned} \mathbb{E}_l\|\nabla f^l(x) - \nabla f^l(x^*)\|^2 &= \frac{1}{n}\sum_{j=1}^{n}\|\nabla f^j(x) - \nabla f^j(x^*)\|^2 \\ &\leq 2L\Big(\frac{1}{n}\sum_{j=1}^{n}f^j(x) - \frac{1}{n}\sum_{j=1}^{n}f^j(x^*) - \frac{1}{n}\sum_{j=1}^{n}\langle\nabla f^j(x^*), x - x^*\rangle\Big) \\ &= 2L\big(f(x) - f(x^*) - \langle\nabla f(x^*), x - x^*\rangle\big) \\ &= 2L\big(f(x) - f(x^*) + \langle u, x - x^*\rangle\big) \\ &\leq 2L\big(f(x) - f(x^*) + h(x) - h(x^*)\big) \\ &= 2L\big(F(x) - F(x^*)\big), \end{aligned} \qquad (47)$$

where $f(x) = \frac{1}{n}\sum_{j=1}^{n}f^j(x)$, $F(x) = f(x) + h(x)$ and $u \in \partial h(x^*)$ satisfies $\nabla f(x^*) + u = 0$, with $x^*$ the optimal point of (P2). The last inequality is due to the convexity of $h(x)$. We obtain Lemma 7 by substituting (47) into (45).

B.4 Proof of Lemma 8

The result is obviously true for $K = 0$. By induction, supposing (14) is true for $K - 1$, we need to prove that it is also true for $K$. From the recursion, we successively obtain the following inequalities:

$$u_K^2 \leq S_K + \sum_{k=1}^{K}\lambda_k u_k = S_K + \sum_{k=1}^{K-1}\lambda_k u_k + \lambda_K u_K, \qquad \text{(48a)}$$

$$u_K^2 - \lambda_K u_K \leq S_K + \sum_{k=1}^{K-1}\lambda_k u_k, \qquad \text{(48b)}$$

$$\Big(u_K - \frac{1}{2}\lambda_K\Big)^2 \leq S_K + \sum_{k=1}^{K-1}\lambda_k u_k + \frac{1}{4}\lambda_K^2, \qquad \text{(48c)}$$
Taking dK−1 as the variable, we obtain ∆|dK−1