zoukankan      html  css  js  c++  java
  • 字符串的距离

    原文地址:http://davefancher.com/2012/04/22/string-distances-9/

    One of my first projects at Leaf has been trying to match some data read from an OCR solution against what is in the database.  As sophisticated as OCR algorithms have become though it’s still not reliable enough to guarantee 100% accurate results every time due to the number of variations in type faces, artifacts introduced through scanning or faxing the document, or any number of other factors.

    Most of the documents I’ve been working with have been pretty clean and I’ve been able to get an exact match automatically.  One of my samples though, has some security features that intentionally obfuscate some of the information I care about.  This naturally makes getting an exact match difficult.  Amazingly though, the OCR result was about 80% accurate so there was still some hope.

    One of my coworkers suggested that I look at some of the string distance algorithms to see if any of them could help us get closer to an exact match.  He pointed me at the Levenshtein algorithm so I took a look at that along with the Hamming and Damerau-Levenshtein algorithms.

    For the uninitiated (like myself a week ago), these algorithms provide a way to determine the distance between two strings.  The distance is essentially a measurement of string similarity. In other words, they calculate how many steps are required to transform one string into another.

    I want to look briefly at each of these and show some examples. Note that each of these algorithms are case sensitive but modifying them to ignore case is trivial.

    Hamming Distance in SQL SERVER

    create function HamDist(@value1 char(8000), @value2 char(8000))
    returns int
    as
    begin
        declare @distance int
        declare @i int
        declare @len int
    
        select @distance = 0, 
               @i =1, 
               @len = case when len(@value1) > len(@value2) 
                           then len(@value1) 
                           else len(@value2) end
    
        if (@value1 is null) or (@value2 is null)
            return null
    
        while (@i <= @len)
            select @distance = @distance + 
                               case when substring(@value1,@i,1) != substring(@value2,@i,1) 
                                    then 1 
                                    else 0 end,
                   @i = @i +1
    
        return @distance
    end
    

      

    Hamming Distance in linq

    Named for the American mathematician Richard Hamming, the Hamming distance is identifies how many characters differ in equal length strings.  Implementing this algorithm is pretty easy, especially with LINQ.

    public static partial class StringDistance
    {
    	public static int GetHammingDistance(string source, string target)
    	{
    		if(source.Length != target.Length)
    		{
    			throw new Exception("Strings must be equal length");
    		}
    
    		int distance =
    			source.ToCharArray()
    			.Zip(target.ToCharArray(), (c1, c2) => new { c1, c2 })
    			.Count(m => m.c1 != m.c2);
    
    		return distance;
    	}
    }
    

    To compute the Hamming distance we just project the character arrays from each string into a new collection (using Zip) and count the characters that don’t match.

    StringDistance.GetHammingDistance("abcde", "abcde").Dump(); // 0
    StringDistance.GetHammingDistance("abcde", "abcdz").Dump(); // 1
    StringDistance.GetHammingDistance("abcde", "abcyz").Dump(); // 2
    StringDistance.GetHammingDistance("abcde", "abxyx").Dump(); // 3
    StringDistance.GetHammingDistance("abcde", "awxyz").Dump(); // 4
    StringDistance.GetHammingDistance("abcde", "vwxyz").Dump(); // 5

    As you can see, when the strings are identical the Hamming distance is zero and when they’re completely different they’re equal to the length of the strings.

    There was some potential for using the Hamming distance in my project since each of the values I’m checking are supposed to be the same length but the OCR library threw a wrench in that plan by not recognizing some characters or mistaking smudges for additional characters. Since I couldn’t rely on equal length strings I moved on to the Levenshtein distance.

    Levenshtein Distance 

    The Levenshtein distance is named for the Russian scientist, Vladimir Levenshtein.  Calculating the Levenshtein distance is a bit more involved than the Hamming distance but since it applies to variable length strings it is applicable in more situations.  The Levenshtein distance takes into account not only character alterations but also insertions and deletions.

    public static partial class StringDistance
    {
        public static int GetLevenshteinDistance(string source, string target)
        {
            var bounds = new { Height = source.Length + 1, Width = target.Length + 1 };
    
            int[,] matrix = new int[bounds.Height, bounds.Width];
    
            for(int height = 0; height < bounds.Height; height++) { matrix[height, 0] = height; };
            for(int width = 0; width < bounds.Width; width++) { matrix[0, width] = width; };
    
            for(int height = 1; height < bounds.Height; height++)
            {
                for(int width = 1; width < bounds.Width; width++)
                {
                    int cost = (source[height - 1] == target[width - 1]) ? 0 : 1;
                    int insertion = matrix[height, width - 1] + 1;
                    int deletion = matrix[height - 1, width] + 1;
                    int substitution = matrix[height - 1, width - 1] + cost;
    
                    int distance = Math.Min(insertion, Math.Min(deletion, substitution));
    
                    matrix[height, width] = distance;
                }
            }
    
            return matrix[bounds.Height - 1, bounds.Width - 1];
        }
    }

    In order to calculate the distance, this algorithm uses a matrix to track the differences as it works over the two strings. Insertions are detected by looking at the slot to the left of the current position whereas deletions are detected by looking at the slot above the current position. Substitution detection requires examining the slot above and to the left of the current position. The final distance is the number stored in the bottom right slot.

    StringDistance.GetLevenshteinDistance("abcde", "abcde").Dump(); // 0
    StringDistance.GetLevenshteinDistance("abcde", "abcdz").Dump(); // 1
    StringDistance.GetLevenshteinDistance("abcde", "abcyz").Dump(); // 2
    StringDistance.GetLevenshteinDistance("abcde", "abxyx").Dump(); // 3
    StringDistance.GetLevenshteinDistance("abcde", "awxyz").Dump(); // 4
    StringDistance.GetLevenshteinDistance("abcde", "vwxyz").Dump(); // 5

    The above examples illustrate how the Levenshtein distance and the Hamming distance are identical for equal length strings but how does it work for strings of differing lengths?

    StringDistance.GetLevenshteinDistance("abcdefg", "abcde").Dump(); // 2
    StringDistance.GetLevenshteinDistance("abcdefg", "zxyde").Dump(); // 5

    The Levenshtein distance is a good fit for my project but consider these examples:

    StringDistance.GetLevenshteinDistance("abcde", "bacde").Dump(); // 2
    StringDistance.GetLevenshteinDistance("abcde", "baced").Dump(); // 4

    The major limitation of the Levenshtein distance is that it doesn’t take character transpositions into account. Does it really take two operations to transform “abcde” into “bacde” or just one? Frederick Damerau asked this same question and gave us the Damerau-Levenshtein distance.

    Damerau-Levenshtein Distance

    Frederick Damerau adapted Levenshtein’s process to account properly for transpositions of adjacent characters. There are two ways to calculate the Damerau-Levenshtein distance but the one we’ll look at here is known as optimal string alignment.

    The optimal string alignment algorithm is really just an extension of the Levenshtein distance algorithm.

    public static partial class StringDistance
    {
        public static int GetDamerauLevenshteinDistance(string source, string target)
        {
            var bounds = new { Height = source.Length + 1, Width = target.Length + 1 };
    
            int[,] matrix = new int[bounds.Height, bounds.Width];
    
            for(int height = 0; height < bounds.Height; height++) { matrix[height, 0] = height; };
            for(int width = 0; width < bounds.Width; width++) { matrix[0, width] = width; };
    
            for(int height = 1; height < bounds.Height; height++)
            {
                for(int width = 1; width < bounds.Width; width++)
                {
                    int cost = (source[height - 1] == target[width - 1]) ? 0 : 1;
                    int insertion = matrix[height, width - 1] + 1;
                    int deletion = matrix[height - 1, width] + 1;
                    int substitution = matrix[height - 1, width - 1] + cost;
    
                    int distance = Math.Min(insertion, Math.Min(deletion, substitution));
    
                    if (height > 1 && width > 1 && source[height - 1] == target[width - 2] && source[height - 2] == target[width - 1]) {
                        distance = Math.Min(distance, matrix[height - 2, width - 2] + cost);
                    }
    
                    matrix[height, width] = distance;
                }
            }
    
            return matrix[bounds.Height - 1, bounds.Width - 1];
        }
    }

    In this variation we do an additional check to handle transpositions.

    StringDistance.GetDamerauLevenshteinDistance("abcde", "abcde").Dump(); // 0
    StringDistance.GetDamerauLevenshteinDistance("abcde", "abcdz").Dump(); // 1
    StringDistance.GetDamerauLevenshteinDistance("abcde", "abcyz").Dump(); // 2
    StringDistance.GetDamerauLevenshteinDistance("abcde", "abxyx").Dump(); // 3
    StringDistance.GetDamerauLevenshteinDistance("abcde", "awxyz").Dump(); // 4
    StringDistance.GetDamerauLevenshteinDistance("abcde", "vwxyz").Dump(); // 5
    // Transpositions
    StringDistance.GetDamerauLevenshteinDistance("abcde", "bacde").Dump(); // 1
    StringDistance.GetDamerauLevenshteinDistance("abcde", "baced").Dump(); // 2

    All of the results are the same except for the transpositions which show one and two operations respectively.

    Parallelization Considerations

    You may have noticed that these algorithms (particularly the Levenshtein variations) don’t lend themselves well to parallelization due to their dependency on previous steps.  I did decide though that although the algorithms aren’t really parallelizable, matching against a known set certainly is.  Furthermore, parallelizing the matching is trivial with PLINQ.

    var distances =
        GetStringsToMatch()
        .AsParallel()
        .Select (
            x => new
            {
                Value = x,
                Distance = StringDistance.GetLevenshteinDistance("abcde", x)
            }
        );

    Wrapping Up

    We’ve looked at algorithms for calculating the Hamming distance, Levenshtein distance, and Damerau-Levenshtein distance.  In my project I’m leaning toward using the traditional Levenshtein distance since I shouldn’t have to consider transpositions  and the data isn’t guaranteed to be the same length.  I also decided to make each of these algorithms available through extension methods on the string class to make them more accessible to other developers.

    To those more knowledgeable about these things, are there other variations or techniques I should consider instead?

  • 相关阅读:
    18寒假第一测
    P1932 A+B A-B A*B A/B A%B Problem
    POJ 2823 Sliding Window
    堆排序
    [LeetCode] Candy
    [LeetCode] Single Number II
    [LeetCode] Single Number
    [LeetCode] Binary Tree Postorder Traversal
    [LeetCode] Binary Tree Preorder Traversal
    [LeetCode] Insertion Sort List
  • 原文地址:https://www.cnblogs.com/hun_dan/p/2594794.html
Copyright © 2011-2022 走看看