Suppose we want to transform one string into another. How might we describe the distance between the two words mathematically? The most common form of distance metric for strings is the Levenshtein distance between the first characters of and the first characters of , given by
where is the indicator function equal to when and equal to otherwise. The Levenshtein distance corresponds to three possible operations on a string to transform it into another:
We can compute the Levenshtein distance as follows
def levenshtein(s1,s2):
if len(s1) > len(s2):
s1, s2 = s2, s1
distances = range(len(s1) + 1)
for index2, char2 in enumerate(s2):
newDistances = [index2+1]
for index1, char1 in enumerate(s1):
if char1 == char2:
newDistances.append(distances[index1])
else:
newDistances.append(1 + min((distances[index1],
distances[index1+1],
newDistances[-1])))
distances = newDistances
return distances[-1]
and use it like so
In [2]: levenshtein('philosophy', 'mathematics')
Out[2]: 11
In [3]: %timeit levenshtein('philosophy', 'mathematics')
27 µs ± 725 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Note that this does not require that the intermediate words in the transformation be valid, this merely calculates the length of the shortest path between two words under the insertion, deletion, and substitution operations.
The Levenshtein distance forms a metric space . A metric space is a set and a distance operator , i.e. a function . If a set and a distance operator satisfy the following criteria for all then the set and the metric form a metric space.
In terms of the Levenshtein distance between two strings, the last item means that the path from to cannot be longer than a path that goes through a point in between and . ( )
There are other string metrics but the Levenshtein distance is the canonical one.
Another technique is to convert each word to a vector in a continuous vector space, and then take the distance between the vectors. This requires a fairly large data set to convert the words to vectors in ways that preserve their meanings. However, this technique doesn't calculate the edit distance between the words, but rather their semantic distance.