Universal Levenshtein Automata 1

Part 1: Levenshtein Distance and Delete Edits

November 25, 2021,
keywords: automata, fuzzy string matching, levenshtein

My recent discovery of (Non-deterministic) Universal Levenshtein Automata (NULA) have been really engaging creatively. In this series of posts, I am going to collect together some definitions, theorems, and proofs about Levenshtein Distances and NULAs, and then provide a second set of definitions so that we can use bitwise operations for the implementation of NULAs.

The series has been put on hold for now. I’m not happy with how the writing is going so for. There’s probably a better way to present the material that I need to think about.

The following list of posts in this series will be updated as I add more posts.

Levenshtein Distance

Definition: The Levenshtein Distance between two strings is the minimal number of character edits required to change one string into the other. A character edit is inserting a character, deleting a character, or substituting a character for another.

Example: The Levenshtein distance between "abc" and "abd" is 1. The character edit substitutes 'c' with 'd'.

Example: The Levenshtein distance between "abc" and "abdc" is 1. The character edit inserts 'd' before 'c'.

Example: The Levenshtein distance between "abc" and "ac" is 1. The character edit delete the 'b'.

Example: The Levenshtein distance between "abc" and "acd" is 2. There are a few ways of doing two edits to "abc" to get "acd". One way is to substitute the 'b' for 'c' and the 'c' for 'd'. Another way is to delete the 'b' and insert the 'd'. This shows that the character edits involved need not be unique.

In some areas of applications, it is useful to consider more kinds of edits. For example, in spell checking inputs from a keyboard, you might consider swapping two adjacent characters (transpositions) as an edit, and in optical character recognition you might consider splitting one character into two and merging two characters into one as edits.

Some Terminology

Let tt be a string of length nn. We will represents the ii-th character of tt as tit_i. So t=t1t2tnt = t_1 t_2 \cdots t_n.

We will represent character edits as follows.

  1. Substituting the ii-th character will be represented as sis_i.
  2. Deleting the ii-th character will be represented as εi\varepsilon_i.
  3. An insert edit will be represented as iki_k, where kk is the kk-th insert edit.

We will not use ss or ii for strings, and it will be clear from context whether we are using ii as an index or as an insert edit.

We will use $\$ as a sentinel character when needed, often to represent the end of strings. Replacting $\$ will not be allowed for substitution edits.

Example: Let tt be the string "abc". Then t1t_1 = 'a', t2t_2 = 'b', and t3t_3 = 'c'. One way of minimally editing tt to get "acd" can be represented as t1s2s3t_1 s_2 s_3, where s2s_2 = 'c' and s3s_3 = 'd'. Another way of minimally editing tt to get "acd" can be represented as t1ε2t3i1t_1 \varepsilon_2 t_3 i_1, where i1i_1 = 'd'.

A Lemma About Delete Edits

Lemma: Let pp and tt be two strings. Given a set of minimal character edits to transform pp to tt, we can rearrange the edits so that every maximal sequence of delete edits are followed by a character match or are at end of the string.

Proof: We will shift maximal sequences of delete edits to the right till they are followed by a character match or they are at the end of the string.

Suppose we have the sequence εkεk+1εk+jin\varepsilon_k \varepsilon_{k+1} \cdots \varepsilon_{k + j} i_n in tt, i.e. we are deleting jj characters starting at character pkp_k and then inserting a character. This can be rewritten as inεkεk+1εk+ji_n \varepsilon_k \varepsilon_{k+1} \cdots \varepsilon_{k + j}. This changes the edits so that we are inserting a character first, and then deleting jj characters starting at pkp_k.

Suppose we have the sequence εkεk+1εk+jsk+j+1\varepsilon_k \varepsilon_{k+1} \cdots \varepsilon_{k + j} s_{k + j + 1} in tt with sk+j+1pk+j+1s_{k + j + 1} \neq p_{k + j + 1}. Here we are deleting jj characters starting at character pkp_k and then substituting the k+j+1k+j+1-th character. This can be rewritten as skεk+1εk+2εk+j+1s_{k} \varepsilon_{k+1} \varepsilon_{k+2} \cdots \varepsilon_{k + j + 1}. This changes the edits so that we are substituting the kk-th character, and then deleting jj characters starting at pk+1p_{k+1}. Note that sks_k cannot equal pkp_k since that would mean we found a smaller set of edits than the minimal set of edits we started with.

The above two steps can be repeated for maximal sequences of delete edits in tt until all maximal sequences of delete edits are followed by a match pkp_k or are at the end of the string. \square

When we allow transpositions as character edits, we must also allow sequences of delete edits to be followed by a transposition.

Update We do not need to worry about delete edits followed by a transposition. A delete followed by a transposition is two edits, but so is a substitution followed by a match followed by a delete. If our string is tt = abc, and we are matching with cb, we can write cb as s1t2ε3s_1 t_2 \varepsilon_3. Longer delete sequences followed by a transposition εiεi+1εi+kti+k+2ti+k+1\varepsilon_i \varepsilon_{i+1} \cdots \varepsilon_{i+k} t_{i+k+2} t_{i+k+1}, can be similarly thought of as εiεi+1εi+(k1)si+kti+k+1εi+k+2\varepsilon_i \varepsilon_{i+1} \cdots \varepsilon_{i+(k-1)} s_{i+k} t_{i+k+1} \varepsilon_{i+k+2}, where si+k=ti+k+2s_{i+k} = t_{i+k+2}.