TY - JOUR
T1 - A 3D sequence-independent representation of the Protein Data Bank
AU - Fischer, Daniel
AU - Tsai, Chung Jung
AU - Nussinov, Ruth
AU - Wolfson, Haim
N1 - Funding Information:
We thank Drs Nickolai Alexandrov, Robert Jernigan and, in particular, Jacob Maizel, for helpful discussions, encouragement and interest. We thank the personnel at the Frederick Cancer Research and Development Center for their assistance. The research of R.Nussinov has been sponsored by the National Cancer Institute, DHHS, under Contract no. 1-CO-74102 with Program Resources, Inc. The research of H.J.Wolfson has been supported in part by a grant from the Israel Science Foundation administered by the Israel Academy of Sciences. The research of R.Nussinov in Israel has been supported in part by grant no. 91-00219 from the US-Israel Binational Science Foundation (BSF), and by a grant from the Israel Science Foundation administered by the Israel Academy of Sciences. This work formed part of the Ph.D. Thesis of D.Fischer, Tel Aviv University, and was partially carried out when visiting at the Laboratory of Mathematical Biology, NCI. The contents of this publication do not necessarily reflect the views or policies of the DHHS, nor does mention of trade names, commercial products or organizations imply endorsement by the US Government.
PY - 1995/10/1
Y1 - 1995/10/1
N2 - Here we address the following questions. How many structurally different entries are there in the Protein Data Bank (PDB)? How do the proteins populate the structural universe? To investigate these questions a structurally non-redundant set of representative entries was selected from the PDB. Construction of such a dataset is not trivial: (i) the considerable size of the PDB requires a large number of comparisons (there were more than 3250 structures of protein chains available in May 1994); (ii) the PDB is highly redundant, containing many structurally similar entries, not necessarily with significant sequence homology, and (iii) there is no clear-cut definition of structural similarity. The latter depends on the criteria and methods used. Here, we analyze structural similarity ignoring protein topology. To date, representative sets have been selected either by hand, by sequence comparison techniques which ignore the three-dimensional (3D) structures of the proteins or by using sequence comparisons followed by linear structural comparison (i.e. the topology, or the sequential order of the chains, is enforced in the structural comparison). Here we describe a 3D sequence-independent automated and efficient method to obtain a representative set of protein molecules from the PDB which contains all unique structures and which is structurally non-redundant. The method has two novel features. The first is the use of strictly structural criteria in the selection process without taking into account the sequence information. To this end we employ a fast structural comparison algorithm which requires on average ~2 s per pairwise comparison on a workstation. The second novel feature is the iterative application of a heuristic clustering algorithm that greatly reduces the number of comparisons required. We obtain a representative set of 220 chains with resolution better than 3.0 Å, or 268 chains including lower resolution entries, NMR entries and models.
The resulting set can serve as a basis for extensive structural classification and studies of 3D recurring motifs and of sequence-structure relationships. The clustering algorithm succeeds in classifying into the same structural family chains with no significant sequence homology, e.g. all the globins in one single group, all the trypsin-like serine proteases in another or all the immunoglobulin-like folds into a third. In addition, unexpected structural similarities of interest have been automatically detected between pairs of chains. A cluster analysis of the representative structures demonstrates the way the 'structural universe' is populated.
KW - Geometric hashing
KW - Non-redundant dataset of protein structures
KW - Protein structural classification
KW - Protein structure comparison
KW - Sequence-order dependence
KW - Sequence-structure relationship
UR - http://www.scopus.com/inward/record.url?scp=0029564871&partnerID=8YFLogxK
U2 - 10.1093/protein/8.10.981
DO - 10.1093/protein/8.10.981
M3 - Article
AN - SCOPUS:0029564871
SN - 1741-0126
VL - 8
SP - 981
EP - 997
JO - Protein Engineering, Design and Selection
JF - Protein Engineering, Design and Selection
IS - 10
ER -