TY - JOUR
T1 - Redundancy-weighting the PDB for detailed secondary structure prediction using deep-learning models
AU - Keasar, Chen
AU - Sidi, Tomer
N1 - Funding Information:
The authors are grateful for support by the Israel Science Foundation (ISF) [1122/14].
Publisher Copyright:
© The Author(s) 2020. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
PY - 2020/3/31
Y1 - 2020/3/31
N2 - Motivation: The Protein Data Bank (PDB), the ultimate source for data in structural biology, is inherently imbalanced. To alleviate biases, virtually all structural biology studies use nonredundant (NR) subsets of the PDB, which include only a fraction of the available data. An alternative approach, dubbed redundancy-weighting (RW), down-weights redundant entries rather than discarding them. This approach may be particularly helpful for machine-learning (ML) methods that use the PDB as their source for data. Methods for secondary structure prediction (SSP) have greatly improved over the years with recent studies achieving above 70% accuracy for eight-class (DSSP) prediction. As these methods typically incorporate ML techniques, training on RW datasets might improve accuracy, as well as pave the way toward larger and more informative secondary structure classes. Results: This study compares the SSP performances of deep-learning models trained on either RW or NR datasets. We show that training on RW sets consistently results in better prediction of 3- (HCE), 8- (DSSP) and 13-class (STR2) secondary structures.
UR - http://www.scopus.com/inward/record.url?scp=85087320482&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btaa196
DO - 10.1093/bioinformatics/btaa196
M3 - Article
C2 - 32186698
AN - SCOPUS:85087320482
SN - 1367-4803
VL - 36
SP - 3733
EP - 3738
JO - Bioinformatics
JF - Bioinformatics
IS - 12
ER -