Joint ICE-TCS and GSSI virtual seminar: Nicola Prezza (LUISS, Rome)
Regular Languages meet Prefix Sorting
Date and time: 11 June 2020, 13:30 GMT/14:30 BST/15:30 Italian time
Title: Regular Languages meet Prefix Sorting
Speaker: Nicola Prezza (LUISS, Rome) WWW: https://
Abstract: Indexing strings via prefix (or suffix) sorting is, arguably, one of the most successful algorithmic techniques developed in the last decades. Can indexing be extended to languages? This problem finds important recent applications in bioinformatics, where the goal has recently shifted towards indexing collections of millions of genomes representing entire populations (rather than just a single genome). In this work, we initiate the study of Wheeler languages, that is, the sub-class of regular languages accepted by an automaton whose states can be prefix-sorted. We first characterize this family as the natural extension of regular languages endowed with the co-lexicographic ordering: the sorted prefixes of strings belonging to a Wheeler language are partitioned into a finite number of co-lexicographic intervals, each formed by elements from a single Myhill-Nerode equivalence class. We proceed by proving several results related to Wheeler automata, including a determinization algorithm for Wheeler NFAs which only doubles the number of states, polynomial-time sorting and minimization algorithms for Wheeler DFAs, and an algorithm that converts any acyclic DFA into the smallest equivalent Wheeler DFA. The latter contribution, in particular, is the first minimum-size solution to the well-studied problem of indexing deterministic-acyclic graphs for linear-time pattern matching queries, and can be directly applied to index graphs representing populations of genomes within compressed space.