TY - JOUR
T1 - Subpopulation-specific synthetic electronic health records can increase mortality prediction performance
AU - Perets, Oriel
AU - Rappoport, Nadav
N1 - Publisher Copyright: © 2025 The Author(s). Published by Oxford University Press on behalf of the American Medical Informatics Association.
PY - 2025/8/1
Y1 - 2025/8/1
N2 - Objective: To address biased representation in Electronic Health Records (EHRs) across subpopulations (SPs), which leads to predictive models underperforming for underrepresented groups, we propose a framework to enhance equitable predictive performance. Materials and Methods: We developed a framework using generative adversarial networks (GANs) to create SP-specific synthetic data, which augments the original training datasets. Subsequently, we employed an ensemble approach, training distinct prediction models tailored to each SP. Results: The proposed framework was evaluated on two datasets derived from the MIMIC database, achieving a performance improvement in Receiver Operating Characteristics Area Under Curve (ROCAUC) ranging from 8% to 31% for underrepresented SPs. Discussion: The results indicate that targeted synthetic data augmentation and SP-specific model training significantly mitigate the performance disparities observed in conventional predictive models trained on imbalanced EHR data. Conclusion: Our novel GAN-based framework, combined with an ensemble prediction approach, effectively enhances predictive equity across SPs. The code and ensemble models developed in this study are publicly available, supporting further research and practical adoption of equitable predictive analytics in healthcare.
KW - electronic health records
KW - generative adversarial networks
KW - mortality prediction
KW - subpopulation health
KW - synthetic data
UR - https://www.scopus.com/pages/publications/105013151197
U2 - 10.1093/jamiaopen/ooaf091
DO - 10.1093/jamiaopen/ooaf091
M3 - Article
C2 - 40799931
AN - SCOPUS:105013151197
SN - 2574-2531
VL - 8
JO - JAMIA Open
JF - JAMIA Open
IS - 4
M1 - ooaf091
ER -