Description
Bacterial small RNAs (sRNAs) are pivotal in post-transcriptional regulation, affecting functions such as virulence, metabolism, and gene expression by binding specific mRNA targets. Identifying these targets is crucial to understanding sRNA regulation across species. Despite recent advances, high-throughput (HT) experimental methods remain technically challenging and detect sRNA-target interactions only under specific environmental conditions. Computational approaches, especially machine learning (ML), are therefore essential for identifying strong candidates for biological validation. In this study, we hypothesize that ML models trained on large-scale interaction data from specific conditions can accurately predict new interactions in unseen conditions within the same bacterial strain. To test this, we developed models from two families: (1) graph neural networks (GNNs), including GraphRNA and kGraphRNA, which learn transformed representations of interacting sRNA-mRNA pairs via graph relationships, and (2) decision forests, sInterRF (Random Forest) and sInterXGB (XGBoost), which use various interaction features for prediction. We also propose Summation Ensemble Models (SEM) that combine scores from multiple models. Across three seen-to-unseen condition evaluations, our models, particularly kGraphRNA, significantly improved the area under the ROC curve (AUC) and the area under the precision-recall curve (PR-AUC) compared to sRNARFTarget, CopraRNA, and RNAup. The SEM model combining GraphRNA and CopraRNA outperformed CopraRNA alone on a low-throughput (LT) interactions test set (HT-to-LT).
This dataset provides the EcoCyc metadata for all the sRNAs and mRNAs of Escherichia coli K12 MG1655 (NC_000913), together with the HT and LT interaction datasets used in our study (see: Data/Datasets). In addition, we provide the prediction scores of our models (kGraphRNA, GraphRNA, sInterRF, and sInterXGB) for any sRNA-mRNA pair of Escherichia coli K12 MG1655 (NC_000913).
We also provide the true labels and the CopraRNA p-value scores computed for all possible pairs. Note that prediction scores are not provided for sRNA-mRNA pairs that were used to train the models, i.e., all the labeled interactions (HT and LT) and the randomly sampled negative interactions (see our paper for more details). For convenience, each CSV file contains the scores of a single sRNA with the following information: accession IDs, locus tags, and names of the sRNA and the mRNA; CopraRNA p-value (if available); the prediction scores of the kGraphRNA, GraphRNA, sInterRF, and sInterXGB models; true label (if available), 1 for interaction and 0 for non-interaction; and whether the sRNA-mRNA pair was sampled for the train set as a random negative sample (true or false).
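As a rough illustration of how one per-sRNA CSV file might be used, the Python sketch below reads a file, skips pairs without prediction scores (the training pairs), and ranks the remaining mRNAs by a sum of min-max-normalized model scores. The column names (`mrna_name`, `kGraphRNA`, `GraphRNA`, `sInterRF`, `sInterXGB`, `is_train_random_negative`) are assumptions made for this example and should be checked against the actual file headers. The summed score is only a simple stand-in for the idea of combining model scores; it is not necessarily the SEM definition used in the paper.

```python
import csv
import io

# Hypothetical column names: verify against the header of the real CSV files.
SCORE_COLUMNS = ["kGraphRNA", "GraphRNA", "sInterRF", "sInterXGB"]
TRAIN_NEG_COLUMN = "is_train_random_negative"
MRNA_COLUMN = "mrna_name"


def load_rows(handle):
    """Read one per-sRNA file, keeping only rows that carry all model scores."""
    rows = []
    for row in csv.DictReader(handle):
        # Pairs sampled as random negatives for training have no scores.
        if (row.get(TRAIN_NEG_COLUMN) or "").strip().lower() == "true":
            continue
        try:
            scores = {col: float(row[col]) for col in SCORE_COLUMNS}
        except (TypeError, ValueError):
            continue  # empty score cell, e.g. a labeled pair used for training
        rows.append({"mrna": row[MRNA_COLUMN], **scores})
    return rows


def min_max(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_by_summed_scores(rows, columns):
    """Rank mRNAs by the sum of normalized scores (assumed combination rule)."""
    if not rows:
        return []
    normalized = {col: min_max([r[col] for r in rows]) for col in columns}
    totals = [sum(normalized[col][i] for col in columns) for i in range(len(rows))]
    ranked = sorted(zip(rows, totals), key=lambda pair: pair[1], reverse=True)
    return [(row["mrna"], total) for row, total in ranked]


if __name__ == "__main__":
    # Toy data with made-up names and values, only to show the call pattern.
    toy = io.StringIO(
        "mrna_name,kGraphRNA,GraphRNA,sInterRF,sInterXGB,is_train_random_negative\n"
        "geneA,0.9,0.8,0.7,0.6,false\n"
        "geneB,0.1,0.2,0.3,0.4,false\n"
        "geneC,,,,,true\n"
    )
    for name, total in rank_by_summed_scores(load_rows(toy), SCORE_COLUMNS):
        print(name, round(total, 3))
```

With a real file, `open(path, newline="")` would replace the in-memory toy data. The CopraRNA p-value is left out of the sum here because lower p-values indicate stronger candidates, so it would need to be transformed before being combined with the model scores.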
Date made available | 2024 |
---|---|
Publisher | ZENODO |