TY - GEN
T1 - Predicting G-quadruplexes from DNA sequences using multi-kernel convolutional neural networks
AU - Barshai, Mira
AU - Orenstein, Yaron
N1 - Funding Information:
We gratefully acknowledge the support of Intel Corporation for giving access to the Intel®AI DevCloud platform used for part of this work.
Publisher Copyright:
© 2019 Copyright held by the owner/author(s).
PY - 2019/9/4
Y1 - 2019/9/4
N2 - G-quadruplexes are nucleic acid secondary structures that form within guanine-rich DNA or RNA sequences. G-quadruplex formation can affect chromatin architecture and gene regulation and has been associated with genomic instability, genetic diseases and cancer progression. G-quadruplex formation in a DNA template can be assessed using polymerase stop assays, which measure polymerase stalling at G-quadruplex sites. An experimental technique, called G4-seq, was developed by combining features of the polymerase stop assay with Illumina next-generation sequencing. The experimental data produced by this technique provides unprecedented details on where and at what intensity do G-quadruplexes form in the human genome. Still, running the experimental protocol on a whole genome is an expensive and time-consuming process. Thus, it is highly desirable to have a computational method to predict G-quadruplex formation of new DNA sequences or whole genomes. Here, we present a new method, called G4detector, to predict G-quadruplexes from DNA sequences based on multi-kernel convolutional neural networks. To test G4detector, we compiled novel high-throughput in vitro and in vivo benchmarks. On these data, we show that G4detector outperforms extant methods for the same task on all benchmark datasets. We visualize the most important features of G4detector models and discover that G-quadruplex formation is highly depended on G-tracts length, their spacing and nucleotide composition between them. The code and benchmarks are publicly available on github.com/OrensteinLab/G4detector.
AB - G-quadruplexes are nucleic acid secondary structures that form within guanine-rich DNA or RNA sequences. G-quadruplex formation can affect chromatin architecture and gene regulation and has been associated with genomic instability, genetic diseases and cancer progression. G-quadruplex formation in a DNA template can be assessed using polymerase stop assays, which measure polymerase stalling at G-quadruplex sites. An experimental technique, called G4-seq, was developed by combining features of the polymerase stop assay with Illumina next-generation sequencing. The experimental data produced by this technique provides unprecedented details on where and at what intensity do G-quadruplexes form in the human genome. Still, running the experimental protocol on a whole genome is an expensive and time-consuming process. Thus, it is highly desirable to have a computational method to predict G-quadruplex formation of new DNA sequences or whole genomes. Here, we present a new method, called G4detector, to predict G-quadruplexes from DNA sequences based on multi-kernel convolutional neural networks. To test G4detector, we compiled novel high-throughput in vitro and in vivo benchmarks. On these data, we show that G4detector outperforms extant methods for the same task on all benchmark datasets. We visualize the most important features of G4detector models and discover that G-quadruplex formation is highly depended on G-tracts length, their spacing and nucleotide composition between them. The code and benchmarks are publicly available on github.com/OrensteinLab/G4detector.
KW - Convolutional neural networks
KW - G-quadruplex
UR - http://www.scopus.com/inward/record.url?scp=85073154143&partnerID=8YFLogxK
U2 - 10.1145/3307339.3342133
DO - 10.1145/3307339.3342133
M3 - Conference contribution
AN - SCOPUS:85073154143
T3 - ACM-BCB 2019 - Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics
SP - 357
EP - 365
BT - ACM-BCB 2019 - Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics
PB - Association for Computing Machinery, Inc
T2 - 10th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM-BCB 2019
Y2 - 7 September 2019 through 10 September 2019
ER -