Character sets of strings

  • Gilles Didier
  • Thomas Schmidt
  • Jens Stoye
  • Dekel Tsur

    Research output: Contribution to journal › Article › peer-review

    27 Scopus citations

    Abstract

    Given a string S over a finite alphabet Σ, the character set (also called the fingerprint) of a substring S′ of S is the subset C ⊆ Σ of the symbols occurring in S′. The study of the character sets of all the substrings of a given string (or a given collection of strings) appears in several domains, such as rule induction for natural language processing and comparative genomics. Several computational problems concerning the character sets of a string arise from these applications, especially: (1) Output all the maximal locations of substrings having a given character set. (2) Output, for each character set C occurring in a given string (or a given collection of strings), all the maximal locations of C. Denoting by n the total length of the considered string or collection of strings, we solve the first problem in Θ(n) time using Θ(n) space. We present two algorithms solving the second problem. The first one runs in Θ(n²) time using Θ(n) space. The second algorithm has Θ(n |Σ| log |Σ|) time and Θ(n) space complexity and is an adaptation of an algorithm by Amir et al. [A. Amir, A. Apostolico, G.M. Landau, G. Satta, Efficient text fingerprinting via Parikh mapping, J. Discrete Algorithms 26 (2003) 1-13].
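
To make problem (1) concrete, here is a minimal Python sketch. It is not the authors' algorithm, only a simple illustration written under two assumptions: a location is taken to be a 0-based inclusive pair (start, end), and a location is treated as maximal when it cannot be extended to the left or right without leaving the given character set. The function names `character_set` and `maximal_locations` are hypothetical. The idea is to cut the string into maximal runs of symbols from the set and keep the runs that use every symbol of the set; with hash-based sets this takes expected linear time.

```python
def character_set(s):
    """Return the character set (fingerprint) of a string."""
    return frozenset(s)


def maximal_locations(s, charset):
    """Return the maximal locations of `charset` in `s`.

    Locations are 0-based inclusive (start, end) pairs (an assumption of
    this sketch). A run of symbols from `charset` that is bounded by
    foreign symbols or by the string boundaries is reported only if it
    contains every symbol of `charset`.
    """
    target = frozenset(charset)
    if not target:
        return []

    locations = []
    start = None   # start index of the current run, or None outside a run
    seen = set()   # symbols of `target` seen in the current run

    for i, ch in enumerate(s):
        if ch in target:
            if start is None:
                start = i
                seen = set()
            seen.add(ch)
        else:
            # A foreign symbol closes the current run, if there is one.
            if start is not None and len(seen) == len(target):
                locations.append((start, i - 1))
            start = None

    # Close a run that reaches the end of the string.
    if start is not None and len(seen) == len(target):
        locations.append((start, len(s) - 1))
    return locations


if __name__ == "__main__":
    text = "abcabdab"
    print(maximal_locations(text, {"a", "b"}))       # [(0, 1), (3, 4), (6, 7)]
    print(maximal_locations(text, {"a", "b", "c"}))  # [(0, 4)]
```

In the second call, the run "ab" at positions 6 to 7 is skipped because it lacks the symbol c, so its character set is {a, b} rather than {a, b, c}.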

    Original language: English
    Pages (from-to): 330-340
    Number of pages: 11
    Journal: Journal of Discrete Algorithms
    Volume: 5
    Issue number: 2 SPEC. ISS.
    State: Published - 1 Jan 2007

    Keywords

    • Character sets
    • Combinatorial algorithms on words
    • Comparative genomics
    • Fingerprints
    • Natural language processing

    ASJC Scopus subject areas

    • Theoretical Computer Science
    • Discrete Mathematics and Combinatorics
    • Computational Theory and Mathematics
