zoukankan      html  css  js  c++  java
  • How to calculate distance similarity measure of given 2 strings?

    How to calculate distance similarity measure of given 2 strings?

    问题

    I need to calculate the similarity between 2 strings. So what exactly do I mean? Let me explain with an example:

    • The real word: hospital
    • Mistaken word: haspita

    Now my aim is to determine how many characters I need to modify the mistaken word to obtain the real word. In this example, I need to modify 2 letters. So what would be the percent? I take the length of the real word always. So it becomes 2 / 8 = 25% so these 2 given string DSM is 75%.

    How can I achieve this with performance being a key consideration?

    回答1

    I just addressed this exact same issue a few weeks ago. Since someone is asking now, I'll share the code. In my exhaustive tests my code is about 10x faster than the C# example on Wikipedia even when no maximum distance is supplied. When a maximum distance is supplied, this performance gain increases to 30x - 100x +. Note a couple key points for performance:

    • If you need to compare the same words over and over, first convert the words to arrays of integers. The Damerau-Levenshtein algorithm includes many >, <, == comparisons, and ints compare much faster than chars.
    • It includes a short-circuiting mechanism to quit if the distance exceeds a provided maximum
    • Use a rotating set of three arrays rather than a massive matrix as in all the implementations I've see elsewhere
    • Make sure your arrays slice accross the shorter word width.

    Code (it works the exact same if you replace int[] with String in the parameter declarations:

    /// <summary>
    /// Computes the Damerau-Levenshtein Distance between two strings, represented as arrays of
    /// integers, where each integer represents the code point of a character in the source string.
    /// Includes an optional threshhold which can be used to indicate the maximum allowable distance.
    /// </summary>
    /// <param name="source">An array of the code points of the first string</param>
    /// <param name="target">An array of the code points of the second string</param>
    /// <param name="threshold">Maximum allowable distance</param>
    /// <returns>Int.MaxValue if threshhold exceeded; otherwise the Damerau-Leveshteim distance between the strings</returns>
    public static int DamerauLevenshteinDistance(int[] source, int[] target, int threshold) {
    
        int length1 = source.Length;
        int length2 = target.Length;
    
        // Return trivial case - difference in string lengths exceeds threshhold
        if (Math.Abs(length1 - length2) > threshold) { return int.MaxValue; }
    
        // Ensure arrays [i] / length1 use shorter length 
        if (length1 > length2) {
            Swap(ref target, ref source);
            Swap(ref length1, ref length2);
        }
    
        int maxi = length1;
        int maxj = length2;
    
        int[] dCurrent = new int[maxi + 1];
        int[] dMinus1 = new int[maxi + 1];
        int[] dMinus2 = new int[maxi + 1];
        int[] dSwap;
    
        for (int i = 0; i <= maxi; i++) { dCurrent[i] = i; }
    
        int jm1 = 0, im1 = 0, im2 = -1;
    
        for (int j = 1; j <= maxj; j++) {
    
            // Rotate
            dSwap = dMinus2;
            dMinus2 = dMinus1;
            dMinus1 = dCurrent;
            dCurrent = dSwap;
    
            // Initialize
            int minDistance = int.MaxValue;
            dCurrent[0] = j;
            im1 = 0;
            im2 = -1;
    
            for (int i = 1; i <= maxi; i++) {
    
                int cost = source[im1] == target[jm1] ? 0 : 1;
    
                int del = dCurrent[im1] + 1;
                int ins = dMinus1[i] + 1;
                int sub = dMinus1[im1] + cost;
    
                //Fastest execution for min value of 3 integers
                int min = (del > ins) ? (ins > sub ? sub : ins) : (del > sub ? sub : del);
    
                if (i > 1 && j > 1 && source[im2] == target[jm1] && source[im1] == target[j - 2])
                    min = Math.Min(min, dMinus2[im2] + cost);
    
                dCurrent[i] = min;
                if (min < minDistance) { minDistance = min; }
                im1++;
                im2++;
            }
            jm1++;
            if (minDistance > threshold) { return int.MaxValue; }
        }
    
        int result = dCurrent[maxi];
        return (result > threshold) ? int.MaxValue : result;
    }

     Where Swap is:

    static void Swap<T>(ref T arg1,ref T arg2) {
        T temp = arg1;
        arg1 = arg2;
        arg2 = temp;
    }
    

    回答2

     https://gist.github.com/Davidblkx/e12ab0bb2aff7fd8072632b396538560

    莱文斯坦距离,又称Levenshtein距离,是编辑距离的一种。指两个字串之間,由一个转成另一个所需的最少编辑操作次数。 

    What you are looking for is called edit distance or Levenshtein distance. The wikipedia article explains how it is calculated, and has a nice piece of pseudocode at the bottom to help you code this algorithm in C# very easily.

    Here's an implementation from the first site linked below:

    private static int  CalcLevenshteinDistance(string a, string b)
        {
        if (String.IsNullOrEmpty(a) && String.IsNullOrEmpty(b)) {
            return 0;
        }
        if (String.IsNullOrEmpty(a)) {
            return b.Length;
        }
        if (String.IsNullOrEmpty(b)) {
            return a.Length;
        }
        int  lengthA   = a.Length;
        int  lengthB   = b.Length;
        var  distances = new int[lengthA + 1, lengthB + 1];
        for (int i = 0;  i <= lengthA;  distances[i, 0] = i++);
        for (int j = 0;  j <= lengthB;  distances[0, j] = j++);
    
        for (int i = 1;  i <= lengthA;  i++)
            for (int j = 1;  j <= lengthB;  j++)
                {
                int  cost = b[j - 1] == a[i - 1] ? 0 : 1;
                distances[i, j] = Math.Min
                    (
                    Math.Min(distances[i - 1, j] + 1, distances[i, j - 1] + 1),
                    distances[i - 1, j - 1] + cost
                    );
                }
        return distances[lengthA, lengthB];
        }
    
  • 相关阅读:
    angularjs1-8,cacheFactory,sce
    angularjs1-7,http,location
    angularjs1-7,供应商
    angularjs1-6,自定义服务
    UI设计师不可不知的安卓屏幕知识-安卓100分享
    k8s ingres 的安装与使用
    get、put、post、delete含义与区别
    [影像技术与PACS] 从技术角度看国内部份PACS厂商
    UML类图符号 各种关系说明以及举例
    理解 Delphi 的类(八)
  • 原文地址:https://www.cnblogs.com/chucklu/p/15469406.html
Copyright © 2011-2022 走看看