zoukankan      html  css  js  c++  java
  • [笔记]Delphi实现获取字符串相似度

    维基百科对字符串相似度(Damerau–Levenshtein distance)的定义是:

    In information theory and computer science, the Damerau–Levenshtein distance (named after Frederick J. Damerau and Vladimir I. Levenshtein) is a "distance" (string metric) between two strings, i.e., finite sequence of symbols, given by counting the minimum number of operations needed to transform one string into the other, where an operation is defined as an insertion, deletion, or substitution of a single character, or atransposition of two adjacent characters. In his seminal paper[1], Damerau not only distinguished these four edit operations but also stated that they correspond to more than 80% of all human misspellings. Damerau's paper considered only misspellings that could be corrected with at most one edit operation. The corresponding edit distance, i.e., dealing with multiple edit operations, known as the Levenshtein distance, was introduced by Levenshtein,[2] but it did not include transpositions in the set of basic operations. The name Damerau–Levenshtein distance is used to refer to the edit distance that allows multiple edit operations including transpositions, although it is not clear whether the term Damerau–Levenshtein distanceis sometimes used in some sources as to take into account non-adjacent transpositions or not.

    简单翻译下,两个字符序列的DL距离,就是从一个变换到另一个的最小操作次数。这个变换包括插入一个字符删除一个字符替换一个字符、或互换两个相邻字符

    而所谓“编辑距离(edit distance,或叫Levenshtein distance)”,并不包含互换两个相邻字符

    主要应用是在字符拼写检查上,当然也可以用在其他地方,比方不少输入法就提供类似的校正功能(搜狗拼音输入法即实现了此功能)。

    看起来简单,实现还是有一定困难的,好在有牛人已经做好相应的函数,如 KambizHow to match two strings approximately 中提供了两个函数:

    计算DL距离的函数DamerauLevenshteinDistance(Str1, Str2)

    function DamerauLevenshteinDistance(const Str1, Str2: string): Integer;
    var
      LenStr1, LenStr2: Integer;
      I, J, T, Cost, Minimum: Integer;
      pStr1, pStr2, S1, S2: PChar;
      D, RowPrv2, RowPrv1, RowCur, Temp: PIntegerArray;
    begin
      LenStr1 := Length(Str1);
      LenStr2 := Length(Str2);
    
      // to save some space, make sure the second index points to the shorter string
      if LenStr1 < LenStr2 then begin
        T := LenStr1;
        LenStr1 := LenStr2;
        LenStr2 := T;
        pStr1 := PChar(Str2);
        pStr2 := PChar(Str1);
      end
      else begin
        pStr1 := PChar(Str1);
        pStr2 := PChar(Str2);
      end;
    
      // to save some time and space, look for exact match
      while (LenStr2 <> 0) and (pStr1^ = pStr2^) do begin
        Inc(pStr1);
        Inc(pStr2);
        Dec(LenStr1);
        Dec(LenStr2);
      end;
    
      // when one string is empty, length of the other is the distance
      if LenStr2 = 0 then begin
        Result := LenStr1;
        Exit;
      end;
    
      // calculate the edit distance
      T := LenStr2 + 1;
      GetMem(D, 3 * T * SizeOf(Integer));
      FillChar(D^, 2 * T * SizeOf(Integer), 0);
      RowCur := D;
      RowPrv1 := @D[T];
      RowPrv2 := @D[2 * T];
      S1 := pStr1;
    
      for I := 1 to LenStr1 do begin
        Temp := RowPrv2;
        RowPrv2 := RowPrv1;
        RowPrv1 := RowCur;
        RowCur := Temp;
        RowCur[0] := I;
        S2 := pStr2;
    
        for J := 1 to LenStr2 do begin
          Cost := Ord(S1^ <> S2^);
          Minimum := RowPrv1[J - 1] + Cost;                 // substitution
          T := RowCur[J - 1] + 1;                           // insertion
    
          if T < Minimum then Minimum := T;
    
          T := RowPrv1[J] + 1;                              // deletion
    
          if T < Minimum then Minimum := T;
    
          if (I <> 1) and (J <> 1) and (S1^ = (S2 - 1)^) and (S2^ = (S1 - 1)^) then begin
            T := RowPrv2[J - 2] + Cost;                     // transposition
    
            if T < Minimum then Minimum := T;
          end;
    
          RowCur[J] := Minimum;
          Inc(S2);
        end;
    
        Inc(S1);
      end;
    
      Result := RowCur[LenStr2];
      FreeMem(D);
    end;

    还有计算字符串相似度的函数 StringSimilarityRatio(Str1, Str2, IgnoreCase): Double;

    返回值在0到1之间,0表示不相似,1表示完全相似。

    function StringSimilarityRatio(const Str1, Str2: string; IgnoreCase: Boolean): Double;
    var
      MaxLen: Integer;
      Distance: Integer;
    begin
      Result := 1.0;
    
      if Length(Str1) > Length(Str2) then
        MaxLen := Length(Str1)
      else
        MaxLen := Length(Str2);
    
      if MaxLen <> 0 then begin
        if IgnoreCase then
          Distance := DamerauLevenshteinDistance(LowerCase(Str1), LowerCase(Str2))
        else
          Distance := DamerauLevenshteinDistance(Str1, Str2);
    
        Result := Result - (Distance / MaxLen);
      end;
    end;

    后来data man 参考一个德国人的ApproxStrUtils单元(该单元计算的是L距离,不是DL距离)给出一个据说效率更高的DL距离函数遗憾的是调用它会有“Invalid Pointer Operation”的报错,还没有Debug出问题所在,所以暂时先用前一个版本吧。

    function DamerauLevenshteinDistance2(const Str1, Str2: string): Integer;
      function Min(const A, B, C: Integer): Integer; inline;
      begin
        Result := A;
        if B < A then
          Result := B;
        if C < Result then
          Result := C;
      end;
    
    var
      LenStr1, LenStr2: Integer;
      I, J, T, Cost, PrevCost: Integer;
      pStr1, pStr2, S1, S2: PChar;
      D: PIntegerArray;
    begin
      LenStr1 := Length(Str1);
      LenStr2 := Length(Str2);
    
      // to save some space, make sure the second index points to the shorter string
      if LenStr1 < LenStr2 then begin
        T := LenStr1;
        LenStr1 := LenStr2;
        LenStr2 := T;
        pStr1 := PChar(Str2);
        pStr2 := PChar(Str1);
      end
      else begin
        pStr1 := PChar(Str1);
        pStr2 := PChar(Str2);
      end;
    
      // to save some time and space, look for exact match
      while (LenStr2 <> 0) and (pStr1^ = pStr2^) do begin
        Inc(pStr1);
        Inc(pStr2);
        Dec(LenStr1);
        Dec(LenStr2);
      end;
    
      while (LenStr2 <> 0) and ((pStr1 + LenStr1 - 1)^ = (pStr2 + LenStr2 - 1)^) do begin
        Dec(LenStr1);
        Dec(LenStr2);
      end;
    
      if LenStr2 = 0 then begin
        Result := LenStr1;
        Exit;
      end;
    
      // calculate the edit distance
      T := LenStr2 + 1;
      GetMem(D, T * SizeOf(Integer));
    
      for I := 0 to T do D[I] := I;
    
      S1 := pStr1;
      for I := 1 to LenStr1 do begin
        PrevCost := I - 1;
        Cost := I;
        S2 := pStr2;
    
        for J := 1 to LenStr2 do begin
          if (S1^ = S2^) or ((I > 1) and (J > 1) and (S1^ = (S2 - 1)^) and (S2^ = (S1 - 1)^)) then
            Cost := PrevCost
          else
            Cost := 1 + min(Cost, PrevCost, D[J]);
    
          PrevCost := D[J];
          D[J] := Cost;
          Inc(S2);
        end;
    
        Inc(S1);
      end;
    
      Result := D[LenStr2];
      FreeMem(D);
    end;

    参考文献:

    1. Damerau–Levenshtein_distance
      http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
    2. How to match two strings approximately
      http://www.delphiarea.com/articles/how-to-match-two-strings-approximately/
    3. Fuzzy string matching
      www.delphiarea.com/articles/how-to-match-two-strings-approximately
    4. Fuzzy search in strings
      http://www.gausi.de/approxstrutils-en.html
  • 相关阅读:
    动态代理练习2全站压缩流输出[response动态代理]
    动态代理练习3自定义数据库连接池[connection动态代理]
    类加载器
    Mysql中的数据类型对应于Java中什么数据类型
    使用网上流传的一个数据库连接池在Proxy.newProxyInstance处引起 java.lang.ClassCastException 问题的解决方法
    动态代理练习1全站字符编码过滤[request动态代理]
    用cocos2dx将helloworld实现出来了
    多线程的自动管理(线程池)
    互斥对象
    多线程的自动管理(定时器)
  • 原文地址:https://www.cnblogs.com/journeyonmyway/p/2112750.html
Copyright © 2011-2022 走看看