TY - GEN
T1 - FiSSC
T2 - 15th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM-BCB 2024
AU - Tziony, Ido
AU - Mandl, Jonathan
AU - Shapira, Kobi
AU - Eisenberg, Eli
AU - Porat, Ely
AU - Orenstein, Yaron
N1 - Publisher Copyright:
© 2024 Copyright held by the owner/author(s).
PY - 2024/12/16
Y1 - 2024/12/16
AB - High-throughput sequencing (HTS) is the most established technique to measure transcript abundance. HTS reads often contain uncertain or low-quality base calls that introduce ambiguity in determining the underlying sequence. In many applications, these unresolved nucleotides are handled by looking at the consensus sequence of all HTS reads. However, this approach is not applicable where sequence heterogeneity is of biological relevance. To gauge the biological complexity of a set of HTS reads in the face of unresolved base calls, one may apply the parsimony principle, i.e., find a smallest set of sequences that cover all ambiguous reads. Yet no method to date solves this problem optimally. Here, we present FiSSC, a new method to find a smallest sequence cover of a set of ambiguous reads. We first prove that the problem is NP-hard. We then present filtering steps that preserve the optimal solution size, and an integer-linear-programming formulation, which together form FiSSC. We tested FiSSC on A-to-I RNA editing datasets with binary ambiguities. FiSSC outperformed all baseline methods and achieved optimal results in all but one dataset. We expect FiSSC to advance the study of sequence variation and biological complexity of ambiguous reads in various biological domains.
KW - ILP
KW - independent set
KW - NP-hard
KW - sequence cover
UR - http://www.scopus.com/inward/record.url?scp=85216426864&partnerID=8YFLogxK
U2 - 10.1145/3698587.3701363
DO - 10.1145/3698587.3701363
M3 - Conference contribution
AN - SCOPUS:85216426864
T3 - ACM-BCB 2024 - 15th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics
BT - ACM-BCB 2024 - 15th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics
PB - Association for Computing Machinery, Inc
Y2 - 22 November 2024 through 25 November 2024
ER -