# Lower bounds for text indexing with mismatches and differences

Research paper by **Vincent Cohen-Addad, Laurent Feuilloley, Tatiana Starikovskaya**

Indexed on: **21 Dec '18**Published on: **21 Dec '18**Published in: **arXiv - Computer Science - Data Structures and Algorithms**

Join Sparrho today to stay on top of science

Discover, organise and share research that matters to you

Join Sparrho today to stay on top of science

Discover, organise and share research that matters to you

Join for free

#### Abstract

In this paper we study lower bounds for the fundamental problem of text
indexing with mismatches and differences. In this problem we are given a long
string of length $n$, the "text", and the task is to preprocess it into a data
structure such that given a query string $Q$, one can quickly identify
substrings that are within Hamming or edit distance at most $k$ from $Q$. This
problem is at the core of various problems arising in biology and text
processing. While exact text indexing allows linear-size data structures with
linear query time, text indexing with $k$ mismatches (or $k$ differences) seems
to be much harder: All known data structures have exponential dependency on $k$
either in the space, or in the time bound. We provide conditional and
pointer-machine lower bounds that make a step toward explaining this
phenomenon. We start by demonstrating lower bounds for $k = \Theta(\log n)$. We
show that assuming the Strong Exponential Time Hypothesis, any data structure
for text indexing that can be constructed in polynomial time cannot have
$\mathcal{O}(n^{1-\delta})$ query time, for any $\delta>0$. This bound also
extends to the setting where we only ask for $(1+\varepsilon)$-approximate
solutions for text indexing. However, in many applications the value of $k$ is
rather small, and one might hope that for small~$k$ we can develop more
efficient solutions. We show that this would require a radically new approach
as using the current methods one cannot avoid exponential dependency on $k$
either in the space, or in the time bound for all even $\frac{8}{\sqrt{3}}
\sqrt{\log n} \le k = o(\log n)$. Our lower bounds also apply to the dictionary
look-up problem, where instead of a text one is given a set of strings.