Abstract
… driven our data but not necessarily our knowledge. In fact, the gap between the generated data and our functional understanding is widening. This thesis focuses on developing and using methodologies that structure the data in order to gain knowledge at different levels. The most successful methods for functional annotation of proteins use sequence alignment, homology detection, and inference techniques, referred to as "guilt by association". Generalizing such approaches calls for charting the protein space by clustering or classification. If successful, each resulting group of proteins would represent a "family". Classification into families is a critical component of structural and functional genomics. No consensus exists on how many protein families make up the entire protein space. While a solid definition of a family is feasible for single-domain proteins, defining families for multi-domain proteins is less straightforward. According to OrthoMCL clustering, there are ~30,000 main orthologous groups in addition to thousands of rare and peculiar single proteins.
Classifying the entire protein space into families serves not only as a method for large-scale protein annotation but also as support for functional and structural genomics initiatives. Prominent examples of sequence-based protein classification are SYSTERS, CluSTr and ProtoNet. The shared theme of these resources is the hierarchical nature of the protein families. Furthermore, while all of them use BLAST-based statistical distance metrics for the clustering, the implementation, sensitivity, coverage of the space, notion of distance, and consequently the depth of the hierarchical representation differ for each of the underlying algorithms and resources. The advantage of hierarchical clustering over other clustering methods is that the granularity level can be refined. Whereas near the leaves (i.e., single proteins) we observe groups of proteins that are alike, at higher levels we find less fragmented groups that are, on average, much larger. Other classification methods are based on defining a statistical model for a family, as in the case of the functional domains in Pfam, or on structural domains, as in CATH and SCOP. However, all these resources are semi-automated and thus depend on prior knowledge and manual curation effort.
The first part of the thesis (Chapter 2) focuses on the classification of viral and host proteins to find events of mimicry, in which viruses acquired sequences from their hosts. We used two definitions of related sequences: a high-conservation definition, based on the UniRef90 classification, and a highly diverse definition, based on Pfam family and domain classifications. We detected hundreds of overlooked events in which sequences were acquired by viruses. In addition, we were able to demonstrate many cases in which the virus shortened the protein, either by shortening the linkers between domains or by eliminating domains completely, while the retained domains themselves resisted such shortening.
The input for the ProtoNet clustering scheme is an all-against-all distance matrix of the sequences. For pairwise distances we used BLAST-derived statistical E-scores with a non-standard, relaxed threshold of 100. The classification provided by ProtoNet is based on bottom-up, unsupervised, agglomerative hierarchical clustering. Specifically, it provides a full range of cluster granularity, from single proteins to huge clusters that carry minimal biological coherence (the root clusters). Although the clustering was done in an unsupervised manner, based on sequences only, the gathered information complements various expert-based databases. For example, InterPro (IPR) is composed of 25,000 models for families and domains, and from a structural perspective there are ~2,600 superfamilies according to the CATH classification (recall that many proteins have no structural support and thus cannot be compared). The Gene Ontology (GO), which describes gene products in terms of biological processes, cellular components, and molecular functions in a species-independent manner, is currently the most exhaustive collection of multi-resolution annotations. Each such type of protein information is integrated into the ProtoNet platform in the following manner: the annotations of each single protein (a leaf in the clustering tree) are extracted from the databases, and the information is then diffused up the clustering tree. For each internal node, every annotation carried by any of its members is associated with statistical measures such as specificity, sensitivity, and correspondence score, among others.
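The clustering and annotation-diffusion steps described above can be illustrated with a small sketch. It is not the ProtoNet implementation: the average-linkage merging rule, the toy distances, the protein names, and the function names `average_linkage_tree` and `annotation_stats` are all assumptions made for illustration. The three measures are computed under the assumption that specificity is the fraction of cluster members carrying the annotation, sensitivity is the fraction of annotated proteins captured by the cluster, and the correspondence score is TP / (TP + FP + FN).

```python
from itertools import combinations

def average_linkage_tree(names, dist):
    """Bottom-up agglomerative clustering (average linkage is an assumption).

    Returns every node of the tree, leaves first and root last,
    each node being the frozenset of the proteins it contains.
    """
    def avg(a, b):
        # Mean pairwise distance between the members of two clusters.
        return sum(dist[frozenset((x, y))] for x in a for y in b) / (len(a) * len(b))

    clusters = [frozenset([n]) for n in names]
    nodes = list(clusters)
    while len(clusters) > 1:
        # Merge the two closest clusters into a new internal node.
        a, b = min(combinations(clusters, 2), key=lambda pair: avg(*pair))
        merged = a | b
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        nodes.append(merged)
    return nodes

def annotation_stats(cluster, annotated):
    """Score how well one cluster matches the set of proteins carrying an annotation."""
    tp = len(cluster & annotated)   # annotated proteins inside the cluster
    fp = len(cluster - annotated)   # cluster members lacking the annotation
    fn = len(annotated - cluster)   # annotated proteins outside the cluster
    return {
        "specificity": tp / len(cluster),
        "sensitivity": tp / len(annotated),
        "correspondence": tp / (tp + fp + fn),
    }

# Hypothetical toy data: five proteins and made-up pairwise distances.
names = ["A", "B", "C", "D", "E"]
close = {("A", "B"): 1.0, ("A", "C"): 2.0, ("B", "C"): 2.0, ("D", "E"): 1.0}
dist = {frozenset(p): close.get(p, 9.0) for p in combinations(names, 2)}

nodes = average_linkage_tree(names, dist)

# Hypothetical annotation carried by proteins A, B and C.
annotated = frozenset("ABC")
best = max(nodes, key=lambda c: annotation_stats(c, annotated)["correspondence"])
print(sorted(best), annotation_stats(best, annotated))
```

Scoring every node against every annotation in this way is what lets a single tree be read at several granularities: an annotation that is specific to a small subtree and one that spans a large cluster each find their best-matching node.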
In the following section (Chapter 3), we describe a test case for the usability of the ProtoNet platform in annotating a new genome. We extend the classification principles of the ProtoNet hierarchical clustering approach to annotate not only new individual proteins but also the whole newly sequenced proteome of Daphnia pulex, the first fully sequenced crustacean genome. The complete proteome includes 30,550 putative proteins. However, about 10,000 of them have no known homologues. By applying a post-clustering method to the D. pulex proteome, 98.7% (26,625 sequences) of the full-length proteins were successfully mapped to 13,880 ProtoNet stable clusters, and only 1.3% remained unmapped. Functional annotations were successfully assigned to 86% of the proteins. Most proteins (61%) were mapped to only 2,953 clusters that contain Daphnia's duplicated genes. We focused on the functionality of the maximally amplified paralogs. Cuticle structure components and a variety of ion channel protein families were associated with a maximal level of gene amplification. We focused on gene amplification as a leading strategy of Daphnia in coping with environmental toxicity and changing conditions. We concluded that clustering remains fundamental in revealing new insights, even in cases where the subject sequences lack annotations.
In an unpublished work (Chapter 4, under review) we used the ProtoNet platform to cluster the whole proteome sets of 17 insects (including representatives of ants, bees, wasps, flies, mosquitoes, beetles and lice) and a crustacean proteome as an out-group. We used only fully sequenced species to avoid misinterpretation stemming from missing knowledge. That is, to determine that a species does not have a homolog of a specific protein shared by other insects, we must be certain that we have scanned all of its coding capacity (i.e., the complete proteome). We used ProtoNet's clustering to cluster about 300,000 proteins from these species. We then defined protein families according to a cut in the clustering tree. We noted that the Hymenoptera clade (ant, bee and wasp) has a significantly higher turnover rate of families than the Diptera clade (fly and mosquito). For this task, no annotations were involved. When these protein families were assessed from a functional perspective, we found that the ants and the bees had a drastic expansion of protein families involved in nucleic acid metabolism and dynamics, including transposases, integrases, DNA repair and more. We speculate that these two phenomena, the fast dynamics of families (high turnover rate) and the overrepresentation of nucleic acid associated functions in the Hymenoptera, are strongly related. In Chapter 5 we illustrate the applicability of the approach to the study of insect proteomes (ProtoBug) as a web tool that is open to the community. In conclusion, this thesis provides new approaches and methodologies towards (i) understanding evolutionary trends; (ii) closing the gap between genomic data and knowledge using automatic methods; and (iii) clustering and classification of newly related genomes.
Michal Linial (Supervisor)