Abstract
A common approach to statistical learning with Big-data is to randomly split the data among m machines and learn the parameter of interest by averaging the m individual estimates. In this paper, focusing on empirical risk minimization, or equivalently M-estimation, we study the statistical error incurred by this strategy. We consider two large-sample settings: first, a classical setting where the number of parameters p is fixed and the number of samples per machine n→∞; second, a high-dimensional regime where both p, n→∞ with p/n→κ ∈ (0, 1). For both regimes and under suitable assumptions, we present asymptotically exact expressions for this estimation error. In the fixed-p setting, we prove that to leading order averaging is as accurate as the centralized solution. We also derive the second-order error terms and show that these can be non-negligible, notably for nonlinear models. The high-dimensional setting, in contrast, exhibits a qualitatively different behavior: data splitting incurs a first-order accuracy loss, which increases linearly with the number of machines. The dependence of our error approximations on the number of machines traces an interesting accuracy-complexity tradeoff, allowing the practitioner to make an informed choice on the number of machines to deploy. Finally, we confirm our theoretical analysis with several simulations.
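To make the split-and-average strategy concrete, the following is a minimal numerical sketch (not taken from the paper) using logistic regression as the M-estimation problem. The helper `fit_logistic`, the data-generating model, and all parameter choices (n, p, m) are illustrative assumptions, not the authors' experimental setup.

```python
import numpy as np

def fit_logistic(X, y, n_iter=50):
    """Illustrative M-estimator: logistic-regression MLE via Newton's method."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))            # predicted probabilities
        grad = X.T @ (mu - y) / n                        # gradient of the empirical risk
        W = mu * (1.0 - mu)                              # Hessian weights
        H = (X * W[:, None]).T @ X / n + 1e-8 * np.eye(p)  # small ridge for stability
        beta -= np.linalg.solve(H, grad)
    return beta

rng = np.random.default_rng(0)
n_total, p, m = 20000, 10, 8                             # total samples, parameters, machines
beta_true = rng.normal(size=p) / np.sqrt(p)
X = rng.normal(size=(n_total, p))
y = (rng.uniform(size=n_total) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)

# Centralized estimate: fit on all the data at once.
beta_central = fit_logistic(X, y)

# Split-and-average: fit on each machine's shard, then average the m local estimates.
shards = np.array_split(np.arange(n_total), m)
beta_avg = np.mean([fit_logistic(X[idx], y[idx]) for idx in shards], axis=0)

print("centralized error:", np.linalg.norm(beta_central - beta_true))
print("averaged error:   ", np.linalg.norm(beta_avg - beta_true))
```

In this fixed-p, large-n regime the two errors are typically close to first order, consistent with the paper's claim; the gap grows when p is comparable to the per-machine sample size n_total/m.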
| Original language | English |
| --- | --- |
| Pages (from-to) | 379-404 |
| Number of pages | 26 |
| Journal | Information and Inference |
| Volume | 5 |
| Issue number | 4 |
| DOIs | |
| State | Published - 1 Dec 2016 |
Keywords
- Big-data
- Distributed algorithms
- Empirical risk minimization
- High-order asymptotics
- M-estimation
- Machine learning
ASJC Scopus subject areas
- Analysis
- Statistics and Probability
- Numerical Analysis
- Computational Theory and Mathematics
- Applied Mathematics