Universal Levenshtein Automata 2

Part 2: Levenshtein Automata for Words

November 25, 2021,
keywords: automata, fuzzy string matching, levenshtein

The series has been put on hold for now. I’m not happy with how the writing is going so for. There’s probably a better way to present the material that I need to think about.

The following list of posts in this series will be updated as I add more posts.

In the previous post, we looked at Levenshtein distances and proved a simple lemma about delete edits. Today we will look at Levenshtein automata for fixed words for a fixed maximal edit distance kk. The automaton A(p,k)A(p,k) will recognize all words tt that are at most kk distance away from tt.

Some Terminology

We will use Σ\Sigma to denote the alphabet that characters of our strings come from. Σ\Sigma does not include the sentinel character $\$.

For a character cc, we will use the notation q0cq1q_0 \xrightarrow{c} q_1 to denote a transition from the state q0q_0 to q1q_1 on input character cc. q0Σq1q_0 \xrightarrow{\Sigma} q_1 will denote the set of transitions {q0cq1cΣ}\{q_0 \xrightarrow{c} q_1\,|\, c \in \Sigma\}.

Levenshtein Automaton for a Fixed Word

Definition: The non-deterministic Levenshtein automaton A(p,k)A(p,k) for a string p=p1pmp=p_1 \cdots p_m with maximal edit distance kk is defined as follows.

  1. The states of the automaton are {ij0im,0jk}\{i^j\,|\,0 \leq i \leq m, 0 \leq j \leq k\}. Note: iji^j is not exponentiation. It’s just syntax.
  2. 000^0 is the only start state.
  3. The final states are mjm^j for all 0jk0 \leq j \leq k.
  4. The transitions are:
    • (i1)jpiij{(i-1)}^j \xrightarrow{p_i} i^j (for matches).
    • ijΣij+1i^j \xrightarrow{\Sigma} i^{j+1} (for inserts).
    • ijΣ(i+1)j+1i^j \xrightarrow{\Sigma} {(i+1)}^{j+1} (for substitutions).
    • ijε(i+1)j+1i^j \xrightarrow{\varepsilon} {(i+1)}^{j+1} (for deletes).

Example: The figure below shows the automaton A(abac,2)A(\texttt{abac},2).

Levenshtein Automaton for the word abac.

It is easy to see that any path from 000^0 to 4j4^j into a stream of matches and character edits and that the atomaton allows for at most 22 character edits. Rigorous proofs of this can be found in the literature.

Removing Epsilon Transitions

Using the lemma about delete edits from the previous post, we can easily remove the epsilon transitions. We simply replace the epsilon transitions with the following and add some additional final states.

  • (il1)jpiij+l{(i-l-1)}^j \xrightarrow{p_i} i^{j+l} (for deletes).

We add the states from which we could traverse to some mjm^j via epsilon transitions to the set of final states.

  • The final states are (ml)j{(m-l)}^j for all 0j,lk0 \leq j,l \leq k where j+lkj+l \leq k.

Example: The figure below shows the automaton for abac with k=2k=2 without epsilon transitions.

Levenshtein Automaton for the word abac without epsilon transitions.

Single Column of Final States

For removing epsilon transitions, we had to add additional final states in case the input text was too short. We can go back to final states being in a single column if we use the sentinel character to signal the end of the text and the end of the input.

For simplicity, we extend the alphabet to Σ=Σ{$}\Sigma' = \Sigma \cup \{\$\}. We will add some additional states (m+1)j(m+1)^{j} for 0jk0\leq j \leq k.

Given an input string tt of length nm+kn \leq m + k, we will feed extra $\$ characters to the automaton till we have fed exactly m+km + k characters.

Example: The figure below shows the automaton for abac with k=2k=2 without epsilon transitions which allows for the sentinel character as an input.

Levenshtein Automaton for the word abac with end character.