Abstract:
Due to limited resources and data privacy concerns, the last decade has witnessed the rapid development of Distributed Machine Learning (DML) at the network edge. Among existing DML paradigms, Federated Learning (FL) is a promising one, since each client trains its local model without sharing raw data with others. A community of clients with a common interest can collaborate to derive a high-performance model by periodically synchronizing the parameters of their local models with the help of a coordination server. However, FL suffers from the straggler problem at the network edge, which makes the synchronization among clients inefficient and slows down the convergence of the learning process. To alleviate the straggler problem, in this paper we propose Chronos, a method that accelerates FL through training volume tuning. More specifically, Chronos is a resource-aware method that adaptively adjusts the amount of data each client uses for training (i.e., its training volume) in each iteration, so as to eliminate the synchronization waiting time caused by heterogeneous and dynamic computing and communication resources. In addition, we theoretically analyze the convergence of Chronos in a non-convex setting and, in turn, utilize the results in the algorithm design of Chronos to guarantee convergence. Extensive experiments show that, compared with benchmark algorithms (i.e., BSP and SSP), Chronos improves convergence speed by up to 6.4×.
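To make the idea of training volume tuning concrete, the following is a minimal sketch, not the actual Chronos algorithm: given rough per-client estimates of per-sample computation time and per-round communication time (the names `per_sample_time`, `comm_time`, and `deadline` are hypothetical), each client is assigned as many samples as it can process within a common per-iteration time budget, so that fast and slow clients finish at roughly the same time.

```python
# Illustrative sketch only (not Chronos itself): assign per-client training
# volumes so heterogeneous clients finish an iteration at about the same time.

def tune_training_volumes(per_sample_time, comm_time, deadline, min_volume=1):
    """Return a training volume (number of samples) per client such that the
    estimated computation plus communication time stays within the deadline."""
    volumes = {}
    for client, t_sample in per_sample_time.items():
        # Time left for local computation after accounting for communication.
        compute_budget = max(deadline - comm_time[client], 0.0)
        # Largest number of samples this client can process within that budget.
        volumes[client] = max(int(compute_budget // t_sample), min_volume)
    return volumes

# Example: the faster client is assigned a larger volume than the slower one,
# so both finish near the 10-second deadline and neither waits on the other.
per_sample_time = {"client_A": 0.02, "client_B": 0.08}  # seconds per sample
comm_time = {"client_A": 1.0, "client_B": 2.0}          # seconds per round
print(tune_training_volumes(per_sample_time, comm_time, deadline=10.0))
```

In this sketch the volumes are derived from static estimates; a resource-aware method like the one described in the abstract would update them each iteration as the observed computing and communication conditions change.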