Levenshtein Distance

Suppose we want to transform one string into another. How might we describe the distance between the two strings mathematically? The most common distance metric for strings is the Levenshtein distance. For two strings $a$ and $b$ , the Levenshtein distance between the first $i$ characters of $a$ and the first $j$ characters of $b$ is given by

$\displaystyle\operatorname{lev}_{a, b} (i, j) = \begin{cases} \max (i, j) & \text{if} \min (i, j) = 0, \\ \min\begin{cases}\operatorname{lev}_{a, b} (i-1, j) + 1 \\ \operatorname{lev}_{a, b} (i, j - 1) + 1 \\ \operatorname{lev}_{a, b} (i-1, j-1) + \operatorname{1}_{(a_i \neq b_j)}\end{cases} & \text{otherwise.} \end{cases}$

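where $\operatorname{1}_{(a_i \neq b_j)}$ is the indicator function, equal to $0$ when $a_i = b_j$ and equal to $1$ otherwise. The distance between the full strings is $\operatorname{lev}_{a, b} (|a|, |b|)$ . The Levenshtein distance counts the minimum number of single-character edits needed to transform one string into the other, where each edit is one of three operations: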

The insertion of a single character
The deletion of a single character
The substitution of one character for another
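To make the recurrence concrete, here is a minimal sketch that transcribes it almost line for line into a memoized recursive function. The name `lev_recursive` is hypothetical and chosen only for illustration. Without the cache the recursion would revisit the same $(i, j)$ pairs exponentially often, so this version is meant for short strings only.

```python
from functools import lru_cache


def lev_recursive(a, b):
    """Levenshtein distance by direct transcription of the recurrence."""

    @lru_cache(maxsize=None)
    def lev(i, j):
        # Base case: one prefix is empty, so the other prefix must be
        # inserted or deleted in full.
        if min(i, j) == 0:
            return max(i, j)
        # Indicator term: 0 if a_i == b_j (1-based indices), 1 otherwise.
        cost = 0 if a[i - 1] == b[j - 1] else 1
        return min(
            lev(i - 1, j) + 1,         # delete a_i
            lev(i, j - 1) + 1,         # insert b_j
            lev(i - 1, j - 1) + cost,  # substitute a_i by b_j (free on a match)
        )

    return lev(len(a), len(b))


print(lev_recursive("kitten", "sitting"))  # 3
```

The three branches of `min` line up with the deletion, insertion, and substitution operations listed above.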

We can compute the Levenshtein distance iteratively, keeping only one row of the dynamic-programming table in memory at a time:

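def levenshtein(s1, s2):
    # Make s1 the shorter string so that each row is as short as possible.
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    # Distances from the empty prefix of s2 to every prefix of s1.
    distances = list(range(len(s1) + 1))
    for index2, char2 in enumerate(s2):
        # Distance from the first index2 + 1 characters of s2 to the empty prefix of s1.
        newDistances = [index2 + 1]
        for index1, char1 in enumerate(s1):
            if char1 == char2:
                # Characters match: carry the diagonal value over at no cost.
                newDistances.append(distances[index1])
            else:
                # One edit on top of the cheapest neighbor: diagonal, above, or left.
                newDistances.append(1 + min((distances[index1],
                                            distances[index1+1],
                                            newDistances[-1])))
        distances = newDistances
    return distances[-1]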

and use it like so:

In [2]: levenshtein('philosophy', 'mathematics')
Out[2]: 11

In [3]: %timeit levenshtein('philosophy', 'mathematics')
27 µs ± 725 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Note that this does not require the intermediate strings in the transformation to be valid words; it merely calculates the length of the shortest path between two strings under the insertion, deletion, and substitution operations.
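To see such a shortest path explicitly, the sketch below keeps the full table instead of a single row and then walks back from the bottom-right corner, recording one operation per step. The function name `edit_script` and the tuple format of the operations are assumptions made for this illustration. When several shortest paths exist, it returns only one of them.

```python
def edit_script(a, b):
    """Return one shortest list of edit operations that turns a into b."""
    m, n = len(a), len(b)
    # table[i][j] holds the Levenshtein distance between a[:i] and b[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        table[i][0] = i
    for j in range(n + 1):
        table[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            table[i][j] = min(table[i - 1][j] + 1,
                              table[i][j - 1] + 1,
                              table[i - 1][j - 1] + cost)

    # Walk back from the bottom-right corner to the top-left corner.
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if a[i - 1] == b[j - 1] else 1
            if table[i][j] == table[i - 1][j - 1] + cost:
                # Diagonal step: a substitution, or a free match (not recorded).
                if cost:
                    ops.append(("substitute", a[i - 1], b[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and table[i][j] == table[i - 1][j] + 1:
            ops.append(("delete", a[i - 1], None))
            i -= 1
        else:
            ops.append(("insert", None, b[j - 1]))
            j -= 1
    ops.reverse()
    return ops


for op in edit_script("kitten", "sitting"):
    print(op)
```

The number of operations returned equals the Levenshtein distance, and nothing in the walk checks whether the intermediate strings are real words.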

The set of all strings, together with the Levenshtein distance, forms a metric space. A metric space is a set $M$ together with a distance function $\operatorname{d}: M \times M \to \mathbb{R}$ . The set $M$ and the metric $\operatorname{d}$ form a metric space if they satisfy the following criteria for all $x, y, z \in M$ :

The distance cannot be negative: $\operatorname{d}(x, y) \geq 0$
The distance is zero only between identical points: $\operatorname{d}(x, y) = 0$ if and only if $x = y$
The distance function is symmetric: $\operatorname{d}(x, y) = \operatorname{d}(y, x)$
The distance function satisfies the triangle inequality: $\operatorname{d}(x, z) \leq \operatorname{d}(x, y) + \operatorname{d}(y, z)$

In terms of the Levenshtein distance between two strings, the last item means that the shortest path from $a$ to $c$ cannot be longer than a path that passes through any intermediate string $b$ ( $a \to b \to c$ ).
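As a sanity check rather than a proof, the four criteria can be tested by brute force on a handful of strings. The sketch below repeats a compact version of `levenshtein` so that it runs on its own. The word list is an arbitrary sample chosen for illustration.

```python
from itertools import product


def levenshtein(s1, s2):
    # Compact copy of the row-by-row function, repeated so this block is self-contained.
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = list(range(len(s1) + 1))
    for index2, char2 in enumerate(s2):
        new_distances = [index2 + 1]
        for index1, char1 in enumerate(s1):
            if char1 == char2:
                new_distances.append(distances[index1])
            else:
                new_distances.append(1 + min(distances[index1],
                                             distances[index1 + 1],
                                             new_distances[-1]))
        distances = new_distances
    return distances[-1]


# Arbitrary sample of strings (an assumption for this sketch).
words = ["", "cat", "cart", "art", "tar"]

for x, y in product(words, repeat=2):
    d = levenshtein(x, y)
    assert d >= 0                    # non-negativity
    assert (d == 0) == (x == y)      # zero exactly when the strings are equal
    assert d == levenshtein(y, x)    # symmetry

for x, y, z in product(words, repeat=3):
    # Triangle inequality: a detour through y is never shorter.
    assert levenshtein(x, z) <= levenshtein(x, y) + levenshtein(y, z)

print("all four criteria hold on the sample")
```

Passing on a finite sample does not establish the property in general; it only illustrates what each criterion says about concrete strings.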

There are other string metrics, but the Levenshtein distance is the canonical one.

Another technique is to convert each word to a vector in a continuous vector space, and then take the distance between the vectors. This requires a fairly large data set in order to learn vectors that preserve the meanings of the words. Note, however, that this technique measures the semantic distance between the words rather than their edit distance.
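The contrast with edit distance can be sketched with a toy example. Everything here is an assumption for illustration: the three-dimensional vectors are hand-written, not produced by any trained model, and cosine distance is just one common choice of distance between vectors.

```python
import math


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two equal-length, nonzero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical hand-written vectors, NOT the output of a real embedding model.
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

# "cat" and "car" differ by a single substitution, yet are far apart here;
# "cat" and "dog" share no aligned characters, yet are close.
print(cosine_distance(toy_vectors["cat"], toy_vectors["dog"]))
print(cosine_distance(toy_vectors["cat"], toy_vectors["car"]))
```

In a real setting the vectors would come from a model trained on a large corpus, which is where the data requirement mentioned above comes from.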