## Abstract

Data is continuously generated by modern data sources, and a recent challenge in machine learning has been to develop techniques that perform well in an incremental (streaming) setting. A variety of offline machine learning tasks are known to be feasible under differential privacy, where generic constructions exist that, given a large enough input sample, perform tasks such as PAC learning, Empirical Risk Minimization (ERM), regression, etc. In this paper, we investigate the problem of private machine learning where, as is common in practice, the data is not given at once but rather arrives incrementally over time. We introduce the problems of private incremental ERM and private incremental regression, where the general goal is to always maintain a good empirical risk minimizer for the history observed, under differential privacy. Our first contribution is a generic transformation of private batch ERM mechanisms into private incremental ERM mechanisms, based on the simple idea of invoking the private batch ERM procedure at regular time intervals. We take this construction as a baseline for comparison. We then provide two mechanisms for the private incremental regression problem. Our first mechanism is based on privately constructing a noisy incremental gradient function, which is then used in a modified projected gradient procedure at every timestep. This mechanism has an excess empirical risk of ≈ √d, where d is the dimensionality of the data. While the results of Bassily et al. [2] show that this bound is tight in the worst case, we show that certain geometric properties of the input and constraint set can be used to derive significantly better results for certain interesting regression problems. Our second mechanism, which achieves this, is based on the idea of projecting the data to a lower-dimensional space using random projections, and then adding privacy noise in this low-dimensional space.
The mechanism overcomes the issues of adaptivity inherent with the use of random projections in online streams, and uses recent developments in high-dimensional estimation to achieve an excess empirical risk bound of ≈ T^{1/3}W^{2/3}, where T is the length of the stream and W is the sum of the Gaussian widths of the input domain and the constraint set that we optimize over.
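The first mechanism's core step, a projected gradient update driven by a noisy gradient, can be illustrated with a minimal batch sketch. This is not the paper's actual mechanism: the function name, noise scale, step size, and ℓ₂-ball constraint set below are illustrative assumptions, and the paper calibrates the noise to the differential-privacy budget and maintains the gradient incrementally over the stream.

```python
import numpy as np

def noisy_projected_gradient(X, y, noise_scale=0.1, eta=0.1, steps=300, radius=1.0, seed=0):
    """Sketch of noisy projected gradient descent for least-squares regression.

    Each iteration perturbs the squared-loss gradient with Gaussian noise
    (noise_scale is illustrative, not privacy-calibrated) and projects the
    iterate back onto an l2 ball of the given radius (the constraint set).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ theta - y) / n               # gradient of (1/2n)||X theta - y||^2
        grad += rng.normal(scale=noise_scale, size=d)  # additive Gaussian privacy noise
        theta -= eta * grad                            # gradient step
        norm = np.linalg.norm(theta)
        if norm > radius:                              # project onto the l2 ball
            theta *= radius / norm
    return theta
```

With small noise and well-conditioned data, the iterate converges to a neighborhood of the (constrained) least-squares solution, with the size of the neighborhood governed by the noise scale; this is the trade-off the excess-empirical-risk bounds quantify.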

Original language | English |
---|---|
Title of host publication | PODS 2017 - Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems |
Publisher | Association for Computing Machinery |
Pages | 167-182 |
Number of pages | 16 |
ISBN (Electronic) | 9781450341981 |
DOIs | |
State | Published - 9 May 2017 |
Externally published | Yes |
Event | 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2017 - Chicago, United States. Duration: 14 May 2017 → 19 May 2017 |

### Publication series

Name | Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems |
---|---|
Volume | Part F127745 |

### Conference

Conference | 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2017 |
---|---|
Country/Territory | United States |
City | Chicago |
Period | 14/05/17 → 19/05/17 |

## ASJC Scopus subject areas

- Software
- Information Systems
- Hardware and Architecture