TY - GEN
T1 - Publishing differentially private medical events data
AU - Shaked, Sigal
AU - Rokach, Lior
N1 - Funding Information:
This work was supported in part by Deutsche Telekom Labs.
Publisher Copyright:
© IFIP International Federation for Information Processing 2016.
PY - 2016/8/23
Y1 - 2016/8/23
N2 - Sequential data has been widely collected in the past few years; in the public health domain it appears as collections of medical events such as lab results, electronic chart records, or hospitalization transactions. Publicly available sequential datasets for research purposes promise new insights, such as understanding patient types and recognizing emerging diseases. Unfortunately, the publication of sequential data presents a significant threat to users’ privacy. Since data owners prefer to avoid such risks, much of the collected data is currently unavailable to researchers. Existing anonymization techniques that aim at preserving sequential patterns lack two important features: handling long sequences and preserving occurrence times. In this paper, we address this challenge by employing an ensemble of Markovian models trained on the source data. The ensemble takes several optional periodicity levels into consideration. Each model captures transitions between times and states according to shorter parts of the sequence, which is eventually reconstructed. Anonymity is provided by utilizing only elements of the model that guarantee differential privacy. Furthermore, we develop a solution for generating differentially private sequential data, which will bring us one step closer to publicly available medical datasets via sequential data. We applied this method to two real medical events datasets and obtained encouraging results, demonstrating that the proposed method can be used to publish high-quality anonymized data.
AB - Sequential data has been widely collected in the past few years; in the public health domain it appears as collections of medical events such as lab results, electronic chart records, or hospitalization transactions. Publicly available sequential datasets for research purposes promise new insights, such as understanding patient types and recognizing emerging diseases. Unfortunately, the publication of sequential data presents a significant threat to users’ privacy. Since data owners prefer to avoid such risks, much of the collected data is currently unavailable to researchers. Existing anonymization techniques that aim at preserving sequential patterns lack two important features: handling long sequences and preserving occurrence times. In this paper, we address this challenge by employing an ensemble of Markovian models trained on the source data. The ensemble takes several optional periodicity levels into consideration. Each model captures transitions between times and states according to shorter parts of the sequence, which is eventually reconstructed. Anonymity is provided by utilizing only elements of the model that guarantee differential privacy. Furthermore, we develop a solution for generating differentially private sequential data, which will bring us one step closer to publicly available medical datasets via sequential data. We applied this method to two real medical events datasets and obtained encouraging results, demonstrating that the proposed method can be used to publish high-quality anonymized data.
KW - Clustering
KW - Data synthetization
KW - Differential privacy
KW - Markov model
KW - Medical events
KW - Privacy preserving data publishing
KW - Sequential patterns
UR - http://www.scopus.com/inward/record.url?scp=84984852101&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-45507-5_15
DO - 10.1007/978-3-319-45507-5_15
M3 - Conference contribution
AN - SCOPUS:84984852101
SN - 9783319455068
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 219
EP - 235
BT - Availability, Reliability, and Security in Information Systems - IFIP WG 8.4, 8.9, TC 5 International Cross-Domain Conference, CD-ARES 2016 and Workshop on Privacy Aware Machine Learning for Health Data Science, PAML 2016 Salzburg, Proceedings
A2 - Kieseberg, Peter
A2 - Weippl, Edgar
A2 - Holzinger, Andreas
A2 - Buccafurri, Francesco
A2 - Tjoa, A. Min
PB - Springer Verlag
T2 - IFIP WG 8.4, 8.9, TC 5 International Cross-Domain Conference, CD-ARES 2016 and Workshop on Privacy Aware Machine Learning for Health Data Science, PAML 2016
Y2 - 31 August 2016 through 2 September 2016
ER -