zoukankan      html  css  js  c++  java
  • Levenshtein Distance

     原文地址:http://www.dotnetperls.com/levenshtein

    Strings may be different yet very similar. With the Levenshtein distance algorithm, we measure similarity and match approximate strings with fuzzy logic. Many projects need this logic.

    Levenshtein distance computations
    
    Words:                ant, aunt
    Levenshtein distance: 1
    Note:                 Only 1 edit is needed.
    		      The 'u' must be added at index 2.
    
    Words:                Samantha, Sam
    Levenshtein distance: 5
    Note:                 The final 5 letters must be removed.
    
    Words:                Flomax, Volmax
    Levenshtein distance: 3
    Note:                 The first 3 letters must be changed
    		      Drug names are commonly confused.
    

      

      

    Levenshtein  in  t-sql

    SET QUOTED_IDENTIFIER ON 
    GO
    SET ANSI_NULLS ON 
    GO
    
    CREATE FUNCTION edit_distance_within(@s nvarchar(4000), @t nvarchar(4000), @d int)
    RETURNS int
    AS
    BEGIN
      DECLARE @sl int, @tl int, @i int, @j int, @sc nchar, @c int, @c1 int,
        @cv0 nvarchar(4000), @cv1 nvarchar(4000), @cmin int
      SELECT @sl = LEN(@s), @tl = LEN(@t), @cv1 = '', @j = 1, @i = 1, @c = 0
      WHILE @j <= @tl
        SELECT @cv1 = @cv1 + NCHAR(@j), @j = @j + 1
      WHILE @i <= @sl
      BEGIN
        SELECT @sc = SUBSTRING(@s, @i, 1), @c1 = @i, @c = @i, @cv0 = '', @j = 1, @cmin = 4000
        WHILE @j <= @tl
        BEGIN
          SET @c = @c + 1
          SET @c1 = @c1 - CASE WHEN @sc = SUBSTRING(@t, @j, 1) THEN 1 ELSE 0 END
          IF @c > @c1 SET @c = @c1
          SET @c1 = UNICODE(SUBSTRING(@cv1, @j, 1)) + 1
          IF @c > @c1 SET @c = @c1
          IF @c < @cmin SET @cmin = @c
          SELECT @cv0 = @cv0 + NCHAR(@c), @j = @j + 1
        END
        IF @cmin > @d BREAK
        SELECT @cv1 = @cv0, @i = @i + 1
      END
      RETURN CASE WHEN @cmin <= @d AND @c <= @d THEN @c ELSE -1 END
    END
    GO
    

      

    Levenshtein algorithm

    First, credit at the conceptual level goes to Vladimir Levenshtein, a Russian scientist. This code uses a two-dimensional array instead of a jagged array because the space required will only have one width and one height. The two-dimensional array requires fewer allocations upon the managed heap and may be faster in this context.

    Program that implements the algorithm [C#]
    using System;
    
    /// <summary>
    /// Contains approximate string matching
    /// </summary>
    static class LevenshteinDistance
    {
        /// <summary>
        /// Compute the distance between two strings.
        /// </summary>
        public static int Compute(string s, string t)
        {
    	int n = s.Length;
    	int m = t.Length;
    	int[,] d = new int[n + 1, m + 1];
    
    	// Step 1
    	if (n == 0)
    	{
    	    return m;
    	}
    
    	if (m == 0)
    	{
    	    return n;
    	}
    
    	// Step 2
    	for (int i = 0; i <= n; d[i, 0] = i++)
    	{
    	}
    
    	for (int j = 0; j <= m; d[0, j] = j++)
    	{
    	}
    
    	// Step 3
    	for (int i = 1; i <= n; i++)
    	{
    	    //Step 4
    	    for (int j = 1; j <= m; j++)
    	    {
    		// Step 5
    		int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
    
    		// Step 6
    		d[i, j] = Math.Min(
    		    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
    		    d[i - 1, j - 1] + cost);
    	    }
    	}
    	// Step 7
    	return d[n, m];
        }
    }
    
    class Program
    {
        static void Main()
        {
    	Console.WriteLine(LevenshteinDistance.Compute("aunt", "ant"));
    	Console.WriteLine(LevenshteinDistance.Compute("Sam", "Samantha"));
    	Console.WriteLine(LevenshteinDistance.Compute("flomax", "volmax"));
        }
    }
    

      

    Output 1 5 3

    The Levenshtein method is static—this Compute method doesn't need to store state or instance data, which means you can declare it as static. This can also improve performance, avoiding callvirt instructions.

    Tip:You can verify that the above implementation is the standard version of Levenshtein by looking at one of the textbooks you were supposed to read.

    Static classes. This algorithm is stateless, which means it doesn't store instance data and therefore can be put in a static class. Static classes are easier to add to new projects than separate methods.

    Static Class

    Usage

    Continuing on, we see how you can call the method in your C# programs. You will often want to compare multiple strings with the Levenshtein algorithm. The example here shows how you can compare strings in a loop; we use a List of string[] arrays.

    Program that calls Levenshtein in loop [C#]
    static void Main()
    {
        List<string[]> l = new List<string[]>
        {
        new string[]{"ant", "aunt"},
        new string[]{"Sam", "Samantha"},
        new string[]{"clozapine", "olanzapine"},
        new string[]{"flomax", "volmax"},
        new string[]{"toradol", "tramadol"},
        new string[]{"kitten", "sitting"}
        };
    
        foreach (string[] a in l)
        {
        int cost = Compute(a[0], a[1]);
        Console.WriteLine("{0} -> {1} = {2}",
            a[0],
            a[1],
            cost);
        }
    }
    Output
    
    ant -> aunt = 1
    Sam -> Samantha = 5
    clozapine -> olanzapine = 3
    flomax -> volmax = 3
    toradol -> tramadol = 3
    kitten -> sitting = 3

    Resource

    You can visit an excellent page about the Levenshtein distance and many implementations of it. The page and its links provides a more detailed reference.

    Levenshtein Distance [External]

    Summary

    We saw the famous Levenshtein Distance algorithm, optimized for the C# language. This code implements approximate string matching. The difference between two strings is not represented as true or false, but as an integer indicating the number of steps needed to get from one to the other.

    As a reminder:The brilliance of the algorithm comes from Dr. Levenshtein.

    Algorithms

  • 相关阅读:
    [易学原创作品]QQ/MSN实现原理: C# UDP穿越NAT,UDP打洞,UDP Hole Punching,P2P
    工作流设计
    jquery 固定行列可编辑表格jfixed
    oracle 数据库实体生成工具
    c#做的查看sybase数据库表
    简单实用ORM框架
    Js 弹出框 返回值的两种常用方法
    数据结构 Jimmy
    js中的对象方法中this指向问题
    js中构造函数中this指向问题
  • 原文地址:https://www.cnblogs.com/hun_dan/p/2594872.html
Copyright © 2011-2022 走看看