On the optimality of averaging in distributed statistical learning

Jonathan D. Rosenblatt, Boaz Nadler

Research output: Contribution to journal › Article › peer-review

34 Scopus citations


A common approach to statistical learning with Big-data is to randomly split it among m machines and learn the parameter of interest by averaging the m individual estimates. In this paper, focusing on empirical risk minimization or equivalently M-estimation, we study the statistical error incurred by this strategy. We consider two large-sample settings: First, a classical setting where the number of parameters p is fixed, and the number of samples per machine n→∞. Second, a high-dimensional regime where both p, n→∞ with p/n→κ ∈ (0, 1). For both regimes and under suitable assumptions, we present asymptotically exact expressions for this estimation error. In the fixed-p setting, we prove that to leading order averaging is as accurate as the centralized solution. We also derive the second-order error terms, and show that these can be non-negligible, notably for nonlinear models. The high-dimensional setting, in contrast, exhibits a qualitatively different behavior: Data splitting incurs a first-order accuracy loss, which increases linearly with the number of machines. The dependence of our error approximations on the number of machines traces an interesting accuracy-complexity tradeoff, allowing the practitioner an informed choice on the number of machines to deploy. Finally, we confirm our theoretical analysis with several simulations.
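The split-and-average strategy described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' simulation code: it assumes ordinary least squares as the empirical risk minimizer, synthetic Gaussian data, and hypothetical function names (`ols`, `averaged_estimate`, `simulate`).

```python
import numpy as np


def ols(X, y):
    """Least-squares fit: the ERM estimate for a linear model (assumed here)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def averaged_estimate(X, y, m, rng):
    """Randomly split the rows among m machines, fit each part, average the fits."""
    idx = rng.permutation(len(y))        # random assignment of samples
    parts = np.array_split(idx, m)       # m roughly equal index blocks
    return np.mean([ols(X[part], y[part]) for part in parts], axis=0)


def simulate(n=200, p=5, m=10, seed=0):
    """Return squared estimation errors (centralized, averaged) on synthetic data.

    n is the number of samples per machine, p the number of parameters,
    m the number of machines, matching the notation of the abstract.
    """
    rng = np.random.default_rng(seed)
    N = n * m                            # total sample size
    theta = np.ones(p)                   # true parameter (arbitrary choice)
    X = rng.standard_normal((N, p))
    y = X @ theta + rng.standard_normal(N)
    central = ols(X, y)                  # fit on all N samples at once
    averaged = averaged_estimate(X, y, m, rng)
    err_central = float(np.sum((central - theta) ** 2))
    err_averaged = float(np.sum((averaged - theta) ** 2))
    return err_central, err_averaged
```

Calling `simulate` with p small relative to n, and then with p/n closer to 1, gives a rough feel for the two regimes the abstract contrasts; a single run is noisy, so averaging the errors over many seeds would be needed before drawing any conclusion.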

Original language: English
Pages (from-to): 379-404
Number of pages: 26
Journal: Information and Inference
Issue number: 4
State: Published - 1 Dec 2016


  • Big-data
  • Distributed algorithms
  • Empirical risk minimization
  • High-order asymptotics
  • M-estimation
  • Machine learning

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Analysis
  • Applied Mathematics
  • Statistics and Probability
  • Numerical Analysis

