  • String matching algorithms

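    Terminology: the text S is the string being searched, and the pattern P is the string being searched for. For example, when looking for ab in cbabce, cbabce is the text and ab is the pattern. Let n be the length of S and k the length of P.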

    Brute-force algorithm

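    The most obvious approach: align P with the start of S and compare the characters one by one. As soon as a mismatch is found, stop, shift the pattern one position to the right, and start the next comparison from the pattern's first character. In the worst case nearly every character of the pattern is compared at every alignment, which costs O(n·k) comparisons.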
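
The brute-force procedure above can be sketched in a few lines. This is a minimal illustration, not code from the original post; the class name `BruteForceSearch`, the method name `search`, and the choice to return -1 when there is no match are all assumptions made for the example.

```java
public class BruteForceSearch {
    /** Returns the index of the first occurrence of p in s, or -1 if there is none. */
    public static int search(String s, String p) {
        int n = s.length();
        int k = p.length();
        // Try every alignment of p against s, from left to right.
        for (int i = 0; i + k <= n; i++) {
            int j = 0;
            // Compare from the first character of p and stop at the first mismatch.
            while (j < k && s.charAt(i + j) == p.charAt(j)) {
                j++;
            }
            if (j == k) {
                return i; // every character of p matched at alignment i
            }
            // Otherwise the loop shifts p right by one position and starts over.
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(search("cbabce", "ab")); // prints 2
    }
}
```

Nothing learned from a failed alignment is reused here, which is exactly the waste the algorithms below try to avoid.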

    KMP algorithm (1970)

    KMP stands for Knuth-Morris-Pratt, the initials of the three inventors' surnames.

    Key idea: on a mismatch, use a "partial match table" to skip as many positions that cannot match as possible.

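    Main procedure: first compute the "partial match table" from P. Then slide P from left to right along S; at each alignment, compare characters from left to right. When a mismatched character is found, the part of P before that character has already matched; call it the matched prefix. Look up the partial match value of that matched prefix in the table and compute how far P should shift: shift = number of characters in the matched prefix - partial match value. If no character has matched yet, P simply shifts by one position.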

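    Partial match value: the length of the longest string that is both a proper prefix and a proper suffix. For example, the proper prefixes of ABD are A and AB, and its proper suffixes are BD and D; they have nothing in common, so the partial match value of ABD is 0.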
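
To make the definition concrete, the sketch below computes a partial match value directly from it, by comparing proper prefixes with proper suffixes of the same length, longest first. It is a deliberately naive illustration; the class name `PartialMatchValue` and the method name `of` are hypothetical and do not come from the original post.

```java
public class PartialMatchValue {
    /** Length of the longest string that is both a proper prefix and a proper suffix of s. */
    public static int of(String s) {
        // Start with the longest candidate; len < s.length() keeps both pieces proper.
        for (int len = s.length() - 1; len > 0; len--) {
            String prefix = s.substring(0, len);
            String suffix = s.substring(s.length() - len);
            if (prefix.equals(suffix)) {
                return len;
            }
        }
        return 0; // no common element
    }

    public static void main(String[] args) {
        System.out.println(of("ABD"));    // prints 0
        System.out.println(of("ABCDAB")); // prints 2
    }
}
```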

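    Partial match table: compute the partial match value of every prefix of P; together these values form P's partial match table.

    What "partial match" really means: sometimes the beginning and the end of a string repeat. For example, "ABCDAB" contains "AB" twice, so its partial match value is 2 (the length of "AB"). When the pattern moves, shifting the first "AB" right by 4 positions (string length - partial match value) brings it to where the second "AB" was.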

    See also: http://www.ruanyifeng.com/blog/2013/05/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm.html
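
The table-driven procedure described above can be sketched as follows. This is an illustrative version under stated assumptions, not the author's code: the class name `KmpTableSearch` and the methods `buildTable` and `search` are hypothetical, the table is built with the usual linear-time fallback loop rather than from the definition, and -1 is returned when there is no match.

```java
import java.util.Arrays;

public class KmpTableSearch {
    /** table[i] is the partial match value of the prefix p[0..i]. */
    public static int[] buildTable(String p) {
        int[] table = new int[p.length()];
        int len = 0; // partial match value of the previous prefix
        for (int i = 1; i < p.length(); i++) {
            // Fall back to shorter candidates until p[i] extends one of them.
            while (len > 0 && p.charAt(i) != p.charAt(len)) {
                len = table[len - 1];
            }
            if (p.charAt(i) == p.charAt(len)) {
                len++;
            }
            table[i] = len;
        }
        return table;
    }

    /** Returns the index of the first occurrence of p in s, or -1 if there is none. */
    public static int search(String s, String p) {
        int n = s.length();
        int k = p.length();
        if (k == 0) {
            return 0;
        }
        int[] table = buildTable(p);
        int i = 0; // current alignment of p within s
        int j = 0; // number of pattern characters already known to match
        while (i + k <= n) {
            while (j < k && s.charAt(i + j) == p.charAt(j)) {
                j++;
            }
            if (j == k) {
                return i;
            }
            if (j == 0) {
                i++; // nothing matched: shift by one
            } else {
                // shift = characters in the matched prefix - its partial match value
                int shift = j - table[j - 1];
                i += shift;
                j = table[j - 1]; // these characters still match after the shift
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(buildTable("ABCDABD")));       // prints [0, 0, 0, 0, 1, 2, 0]
        System.out.println(search("BBC ABCDAB ABCDABCDABDE", "ABCDABD")); // prints 15
    }
}
```

Because `j` is carried over after each shift, characters of S that have already matched are never compared again.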

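    Java implementation (from https://algs4.cs.princeton.edu/53substring/KMP.java.html). Note that this version precomputes a DFA over the alphabet instead of the partial match table described above: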

    KMP.java
    
    
    /******************************************************************************
     *  Compilation:  javac KMP.java
     *  Execution:    java KMP pattern text
     *  Dependencies: StdOut.java
     *
     *  Reads in two strings, the pattern and the input text, and
     *  searches for the pattern in the input text using the
     *  KMP algorithm.
     *
     *  % java KMP abracadabra abacadabrabracabracadabrabrabracad
     *  text:    abacadabrabracabracadabrabrabracad 
     *  pattern:               abracadabra          
     *
     *  % java KMP rab abacadabrabracabracadabrabrabracad
     *  text:    abacadabrabracabracadabrabrabracad 
     *  pattern:         rab
     *
     *  % java KMP bcara abacadabrabracabracadabrabrabracad
     *  text:    abacadabrabracabracadabrabrabracad 
     *  pattern:                                   bcara
     *
     *  % java KMP rabrabracad abacadabrabracabracadabrabrabracad 
     *  text:    abacadabrabracabracadabrabrabracad
     *  pattern:                        rabrabracad
     *
     *  % java KMP abacad abacadabrabracabracadabrabrabracad
     *  text:    abacadabrabracabracadabrabrabracad
     *  pattern: abacad
     *
     ******************************************************************************/
    
    /**
     *  The {@code KMP} class finds the first occurrence of a pattern string
     *  in a text string.
     *  <p>
     *  This implementation uses a version of the Knuth-Morris-Pratt substring search
     *  algorithm. The version takes time proportional to <em>n</em> + <em>m R</em>
     *  in the worst case, where <em>n</em> is the length of the text string,
     *  <em>m</em> is the length of the pattern, and <em>R</em> is the alphabet size.
     *  It uses extra space proportional to <em>m R</em>.
     *  <p>
     *  For additional documentation,
     *  see <a href="https://algs4.cs.princeton.edu/53substring">Section 5.3</a> of
     *  <i>Algorithms, 4th Edition</i> by Robert Sedgewick and Kevin Wayne.
     */
    public class KMP {
        private final int R;       // the radix
        private int[][] dfa;       // the KMP automaton
    
        private char[] pattern;    // either the character array for the pattern
        private String pat;        // or the pattern string
    
        /**
         * Preprocesses the pattern string.
         *
         * @param pat the pattern string
         */
        public KMP(String pat) {
            this.R = 256;
            this.pat = pat;
    
            // build DFA from pattern
            int m = pat.length();
            dfa = new int[R][m]; 
            dfa[pat.charAt(0)][0] = 1; 
            for (int x = 0, j = 1; j < m; j++) {
                for (int c = 0; c < R; c++) 
                    dfa[c][j] = dfa[c][x];     // Copy mismatch cases. 
                dfa[pat.charAt(j)][j] = j+1;   // Set match case. 
                x = dfa[pat.charAt(j)][x];     // Update restart state. 
            } 
        } 
    
        /**
         * Preprocesses the pattern string.
         *
         * @param pattern the pattern string
         * @param R the alphabet size
         */
        public KMP(char[] pattern, int R) {
            this.R = R;
            this.pattern = new char[pattern.length];
            for (int j = 0; j < pattern.length; j++)
                this.pattern[j] = pattern[j];
    
            // build DFA from pattern
            int m = pattern.length;
            dfa = new int[R][m]; 
            dfa[pattern[0]][0] = 1; 
            for (int x = 0, j = 1; j < m; j++) {
                for (int c = 0; c < R; c++) 
                    dfa[c][j] = dfa[c][x];     // Copy mismatch cases. 
                dfa[pattern[j]][j] = j+1;      // Set match case. 
                x = dfa[pattern[j]][x];        // Update restart state. 
            } 
        } 
    
        /**
         * Returns the index of the first occurrence of the pattern string
         * in the text string.
         *
         * @param  txt the text string
         * @return the index of the first occurrence of the pattern string
         *         in the text string; n (the length of the text) if no such match
         */
        public int search(String txt) {
    
            // simulate operation of DFA on text
            int m = pat.length();
            int n = txt.length();
            int i, j;
            for (i = 0, j = 0; i < n && j < m; i++) {
                j = dfa[txt.charAt(i)][j];
            }
            if (j == m) return i - m;    // found
            return n;                    // not found
        }
    
        /**
         * Returns the index of the first occurrence of the pattern string
         * in the text string.
         *
         * @param  text the text string
         * @return the index of the first occurrence of the pattern string
         *         in the text string; n (the length of the text) if no such match
         */
        public int search(char[] text) {
    
            // simulate operation of DFA on text
            int m = pattern.length;
            int n = text.length;
            int i, j;
            for (i = 0, j = 0; i < n && j < m; i++) {
                j = dfa[text[i]][j];
            }
            if (j == m) return i - m;    // found
            return n;                    // not found
        }
    
    
        /** 
         * Takes a pattern string and an input string as command-line arguments;
         * searches for the pattern string in the text string; and prints
         * the first occurrence of the pattern string in the text string.
         *
         * @param args the command-line arguments
         */
        public static void main(String[] args) {
            String pat = args[0];
            String txt = args[1];
            char[] pattern = pat.toCharArray();
            char[] text    = txt.toCharArray();
    
            KMP kmp1 = new KMP(pat);
            int offset1 = kmp1.search(txt);
    
            KMP kmp2 = new KMP(pattern, 256);
            int offset2 = kmp2.search(text);
    
            // print results
            StdOut.println("text:    " + txt);
    
            StdOut.print("pattern: ");
            for (int i = 0; i < offset1; i++)
                StdOut.print(" ");
            StdOut.println(pat);
    
            StdOut.print("pattern: ");
            for (int i = 0; i < offset2; i++)
                StdOut.print(" ");
            StdOut.println(pat);
        }
    }
    
    
    Copyright © 2000–2017, Robert Sedgewick and Kevin Wayne.
    Last updated: Tue Feb 6 02:05:56 EST 2018.

    Boyer-Moore algorithm (1977)

    Key idea: after every failed match attempt, skip as many positions that cannot match as possible.

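    Main procedure: slide P from left to right along S; at each alignment, compare characters from right to left. When a mismatch is found (say the character in S is c), search the not-yet-compared part of P from right to left for an occurrence of c: 1. if there is none, shift P to just past c and start the next attempt; 2. otherwise, shift P so that this occurrence lines up with c. In addition, if part of the suffix had already matched when the mismatch occurred (the "good suffix"), that information can be used to skip even more positions in some cases (the search still gives the correct result without the good-suffix rule); see the article listed below.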
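
The character-based shift described above can be sketched as follows. This is a simplified illustration that leaves out the good-suffix rule entirely; the class name `BoyerMooreSketch` and the method name `search` are hypothetical, -1 is returned when there is no match, and the occurrence of c is found with a plain backward scan (`String.lastIndexOf`) instead of a precomputed table.

```java
public class BoyerMooreSketch {
    /** Returns the index of the first occurrence of p in s, or -1 if there is none. */
    public static int search(String s, String p) {
        int n = s.length();
        int k = p.length();
        if (k == 0) {
            return 0;
        }
        int i = 0; // current alignment of p within s
        while (i + k <= n) {
            // Compare from the last character of p backwards.
            int j = k - 1;
            while (j >= 0 && s.charAt(i + j) == p.charAt(j)) {
                j--;
            }
            if (j < 0) {
                return i; // all characters matched
            }
            char c = s.charAt(i + j); // the mismatched character in S
            // Rightmost occurrence of c in the not-yet-compared part p[0..j-1], or -1.
            int pos = p.lastIndexOf(c, j - 1);
            // pos == -1: p moves to just past c (shift j + 1);
            // otherwise p[pos] is lined up with c (shift j - pos). Either way the shift is >= 1.
            i += j - pos;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(search("HERE IS A SIMPLE EXAMPLE", "EXAMPLE")); // prints 17
    }
}
```

In this example the first attempt fails on the character S, which does not occur in EXAMPLE, so the pattern jumps a full 7 positions after a single comparison.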

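    Complexity: O(n+k) is the bound commonly cited for the full algorithm with the good-suffix rule; with only the character-based shift described above, the worst case is O(n·k). In practice, the larger k is (that is, the longer the pattern), the faster the search tends to run, because more positions that cannot match are skipped and fewer comparisons are made.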

    See also: http://www.ruanyifeng.com/blog/2013/05/boyer-moore_string_search_algorithm.html

  • Original article: https://www.cnblogs.com/z-sm/p/11934551.html