A Unified Momentum-based Paradigm of Decentralized SGD for Non-Convex Models and Heterogeneous Data

  1. Research Background

In recent years, with the rise of the Internet of Things and edge computing, distributed machine learning, and in particular the decentralized training paradigm, has developed rapidly. In practical scenarios, however, non-convex objective functions and data heterogeneity have become two major bottlenecks limiting the efficiency and performance of distributed training.

Non-convex objective functions are ubiquitous in deep learning models; they may have many local optima, which can reduce model accuracy and destabilize the training process. Meanwhile, in distributed environments the data distributions held by the participating nodes differ (i.e., data heterogeneity), and this data bias degrades convergence and generalization performance, posing another challenge that urgently needs to be addressed.

  2. Paper Source

This paper was published in the renowned journal Artificial Intelligence, Volume 332, 2024, by authors from the School of Computer Science and Technology, Shanghai University of Electric Power.

  3. Research Work

3.1 Overall Framework

The authors propose a unified momentum-based paradigm (UMP) that includes two decentralized SGD algorithms: D-Sum and GT-DSum. The former provides convergence guarantees for non-convex optimization problems, while the latter builds on D-Sum and introduces Gradient Tracking (GT) to further mitigate the impact of data heterogeneity.

3.2 D-Sum Algorithm

Algorithm Flow:
1) Each node initializes its model parameters and momentum buffer.
2) In each communication round, each node performs K local model updates (e.g., SGD or momentum SGD) on its own data.
3) After the K local updates, the nodes perform a model-averaging aggregation step.
4) Proceed to the next round. (A minimal code sketch of one round follows.)
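To make this flow concrete, below is a minimal NumPy sketch of one communication round in this style: K local momentum-SGD steps followed by a mixing (averaging) step over the topology. Everything here is an illustrative assumption rather than the paper's exact update rule: the plain heavy-ball step, the placeholder function local_gradient, the mixing matrix W, and the choice not to average the momentum buffers are not taken from the paper.

```python
import numpy as np

def dsum_round(x, u, W, local_gradient, lr=0.01, beta=0.9, K=5):
    """One communication round of a D-Sum-style scheme (illustrative sketch only).

    x : (n, d) array of per-node model parameters
    u : (n, d) array of per-node momentum buffers
    W : (n, n) doubly stochastic mixing matrix of the communication topology
    local_gradient(i, x_i) -> (d,) stochastic gradient on node i's local data
    """
    n = x.shape[0]
    # Step 2 of the flow: K local momentum-SGD updates per node
    for _ in range(K):
        for i in range(n):
            g = local_gradient(i, x[i])
            u[i] = beta * u[i] + g      # heavy-ball momentum buffer (assumed form)
            x[i] = x[i] - lr * u[i]     # local model update
    # Step 3 of the flow: model-averaging aggregation over the topology
    x = W @ x
    return x, u
```

In the actual D-Sum algorithm, the heavy-ball step above is replaced by the unified momentum update described under Algorithm Innovation below, and repeating this routine for T rounds gives the full training loop.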

Algorithm Innovation: A unified momentum update equation is introduced that covers classical momentum methods (e.g., Polyak's heavy-ball momentum and Nesterov's accelerated gradient) as special cases, and a non-convex convergence analysis is provided.
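For intuition, one common way a unified stochastic momentum update is written in the optimization literature is shown below; the parameterization the paper actually uses may differ, so treat this as an assumed stand-in. With iterate $x_t$, stochastic gradient $g_t$, step size $\eta$, momentum $\beta$, and an interpolation parameter $s$:

```latex
\begin{aligned}
y_{t+1}     &= x_t - \eta\, g_t,\\
y^{s}_{t+1} &= x_t - s\,\eta\, g_t,\\
x_{t+1}     &= y_{t+1} + \beta\left(y^{s}_{t+1} - y^{s}_{t}\right).
\end{aligned}
```

Under this formulation, s = 0 recovers Polyak's heavy-ball method, s = 1 recovers Nesterov's accelerated gradient, and beta = 0 reduces to plain SGD, which is exactly the kind of unification the UMP paradigm aims for.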

3.3 GT-DSum Algorithm

Algorithm Flow:
1) Each node initializes its model parameters, momentum buffer, and gradient tracker.
2) In each communication round, each node corrects its local gradients using the tracker and performs K local model updates.
3) After the K local updates, the nodes aggregate models, momentum buffers, and trackers across nodes.
4) The tracker's difference term is updated and the next round begins. (A sketch of one round follows.)
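As with D-Sum, a minimal sketch may help make the gradient-tracking idea concrete. The tracker y below estimates the global gradient direction and is used to correct each local gradient; the variable names, the form of the correction, and how often the trackers are refreshed are all illustrative assumptions, not the exact GT-DSum recursion from the paper.

```python
import numpy as np

def gt_dsum_round(x, u, y, g_prev, W, local_gradient, lr=0.01, beta=0.9, K=5):
    """One round of a gradient-tracking variant (illustrative sketch only).

    y      : (n, d) per-node gradient trackers (estimates of the global gradient)
    g_prev : (n, d) most recent local gradients, kept for the tracker's difference term
    The remaining arguments are as in dsum_round above.
    """
    n = x.shape[0]
    # Step 2: K local updates, correcting each local gradient with the tracker
    for _ in range(K):
        for i in range(n):
            g = local_gradient(i, x[i])
            corrected = g + (y[i] - g_prev[i])  # pull the local direction toward the tracked one
            u[i] = beta * u[i] + corrected
            x[i] = x[i] - lr * u[i]
            g_prev[i] = g
    # Step 3: aggregate models, momentum buffers, and trackers across neighbors
    x, u, y = W @ x, W @ u, W @ y
    # Step 4: update the tracker's difference term with the change in local gradients
    g_new = np.stack([local_gradient(i, x[i]) for i in range(n)])
    y = y + (g_new - g_prev)
    g_prev = g_new
    return x, u, y, g_prev
```

In gradient-tracking methods the trackers are typically initialized to the first local gradients, which is what allows them to follow the network-wide average gradient over time.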

Algorithm Innovation: Building on D-Sum, gradient tracking is integrated so that each node's update direction gradually aligns with the global optimization direction, thereby mitigating the impact of data heterogeneity.

3.4 Theoretical Analysis

For both algorithms, the authors rigorously derive convergence upper bounds for non-convex, smooth distributed optimization problems that are comparable to the convergence rate of classical SGD. Notably, the convergence bound of GT-DSum depends only on the initial data bias, not on the degree of data heterogeneity throughout the training process.
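For readers unfamiliar with this style of result, a non-convex, smooth analysis of this kind typically bounds the average squared gradient norm at the averaged iterate. Purely as an illustration of the shape of such a statement (the constants and extra terms in the paper's actual theorems differ), with n nodes, K local steps per round, and T rounds one expects something like:

```latex
\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left\lVert \nabla f(\bar{x}_t) \right\rVert^{2}
\;\le\; \mathcal{O}\!\left(\frac{1}{\sqrt{nKT}}\right) \;+\; \text{higher-order terms},
```

i.e., the same leading rate as centralized SGD on the total number of stochastic gradients. The paper's claim about GT-DSum is that its higher-order terms involve only the initial data bias, not a heterogeneity bound that must hold throughout training.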

  4. Experimental Section

To evaluate the practical performance of the methods, the authors conducted extensive experiments on common models and datasets and in dynamic environments. The results show that, compared with existing decentralized baselines, D-Sum and GT-DSum improve model accuracy by up to 35.8% and 57.6%, respectively, across varying degrees of non-IID data. In particular, GT-DSum demonstrates superior generalization performance in addressing data bias.

  5. Research Significance

The main innovation of this work lies in proposing a unified momentum-based paradigm (UMP) that simultaneously addresses non-convex optimization and data heterogeneity problems. Compared to existing methods, the UMP paradigm has the following advantages:

1) Unification: By adjusting parameters, it can cover various classical momentum methods such as heavy-ball momentum and Nesterov acceleration.

2) Theoretical Guarantees: It provides the first convergence analysis for momentum methods in distributed non-convex optimization problems.

3) Practical Performance: It significantly improves model accuracy and robustness in heterogeneous data environments.

4) Paradigm Innovation: It provides new insights for designing new algorithms for distributed non-convex optimization and data bias problems.

This work provides new theoretical foundations and a systematic algorithmic framework for efficiently addressing two core challenges in distributed machine learning.