32 String Matching
32.1-2
Suppose that all characters in the pattern P are different. Show how to accelerate NAIVE-STRING-MATCHER to run in time O(n) on an n-character text T.
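Naive-Search-Distinct(T,P)
n = T.length
m = P.length
s = 0
while s ≤ n - m
    j = 0
    while j < m and T[s+j+1] == P[j+1]
        j = j + 1
    if j == m
        print "Pattern occurs with shift" s
    s = s + max(j, 1)
Because all characters of P are different, once P[1..j] has matched T[s+1..s+j], none of T[s+2..s+j] can equal P[1], so the shifts s+1, ..., s+j-1 cannot be valid and the search can resume at shift s+j. Every character of the text takes part in at most one successful comparison and at most one failed comparison, so the algorithm effectively scans the text once and its running time is O(n).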
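The skipping rule can be sketched in C as follows. This is a minimal illustration, not a reference implementation: the function name `match_distinct` and the output buffer `shifts` are assumptions made for the example, indices are 0-based, and the caller is assumed to guarantee that all characters of the pattern are distinct.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts the occurrences of pat in text, assuming (without checking) that
 * all characters of pat are pairwise distinct.  The first max_shifts
 * 0-based shifts are stored in shifts[]. */
size_t match_distinct(const char *text, const char *pat,
                      size_t *shifts, size_t max_shifts)
{
    size_t n = strlen(text), m = strlen(pat);
    size_t count = 0, s = 0;

    assert(m > 0);
    while (s + m <= n) {
        size_t j = 0;
        /* extend the match as far as possible */
        while (j < m && text[s + j] == pat[j])
            j++;
        if (j == m) {
            if (count < max_shifts)
                shifts[count] = s;
            count++;
        }
        /* skip the j matched characters; always advance by at least one */
        s += (j > 0) ? j : 1;
    }
    return count;
}
```

For example, searching for `abd` in `abcabcabd` visits the shifts 0, 2, 3, 5 and 6 only.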
32.1-3
Suppose that pattern P and text T are randomly chosen strings of length m and n, respectively, from the d-ary alphabet Σ_d = {0, 1, ..., d-1}, where d ≥ 2. Show that the expected number of character-to-character comparisons made by the implicit loop in line 4 of the naive algorithm is
$$(n - m + 1)\,\frac{1 - d^{-m}}{1 - d^{-1}} \le 2(n - m + 1)$$
over all executions of this loop. (Assume that the naive algorithm stops comparing characters for a given shift once it finds a mismatch or matches the entire pattern.) Thus, for randomly chosen strings, the naive algorithm is quite efficient.
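Let X be the number of comparisons that the implicit loop in line 4 performs for one shift. Each comparison succeeds with probability 1/d, so the probability that the loop performs exactly i comparisons is:
- Pr{X = i} = (1/d)^{i-1} (1 - 1/d), if i < m
- Pr{X = m} = (1/d)^{m-1} (1 - 1/d) + (1/d)^m, if i = m
The expected number of comparisons in one iteration of the for loop is therefore:
$$\begin{aligned}
E[X] &= \sum_{i=1}^{m-1} i\,d^{-(i-1)}\,(1 - d^{-1}) + m\,d^{-(m-1)}\,(1 - d^{-1}) + m\,d^{-m} \\
&= 1 - \frac{1}{d} + \frac{2}{d} - \frac{2}{d^2} + \frac{3}{d^2} - \frac{3}{d^3} + \cdots + \frac{m}{d^{m-1}} - \frac{m}{d^m} + \frac{m}{d^m} \\
&= 1 + \frac{1}{d} + \frac{1}{d^2} + \cdots + \frac{1}{d^{m-1}} \\
&= \frac{1 - d^{-m}}{1 - d^{-1}} \\
&\le 2
\end{aligned}$$
The last step uses d ≥ 2, which gives 1/(1 - 1/d) ≤ 2. By linearity of expectation over the n-m+1 shifts, the expected total number of comparisons made by the loop in line 4 is (n-m+1) (1 - d^{-m}) / (1 - d^{-1}) ≤ 2(n-m+1).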
32.1-4
Suppose we allow the pattern P to contain occurrences of a gap character ◊ that can match an arbitrary string of characters (even one of zero length). For example, the pattern ab◊ba◊c occurs in the text cabccbacbacab as c [ab] cc [ba] cba [c] ab (the gaps match cc and cba) and as c [ab] ccbac [ba] [c] ab (the gaps match ccbac and the empty string).
Note that the gap character may occur an arbitrary number of times in the pattern but not at all in the text. Give a polynomial-time algorithm to determine whether such a pattern P occurs in a given text T, and analyze the running time of your algorithm.
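The exercise only asks whether the pattern P occurs in the text, which simplifies the problem considerably. Using the gap characters in P as separators, split the pattern into gap-free substrings {P1, P2, ...} and look for them in T one after another, making sure that P(i+1) is found after the end of Pi. Taking the leftmost occurrence of each substring is safe, because it leaves the longest possible remainder of T for the substrings that follow. The pseudocode is as follows: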
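Gap-Naive-Search(T,P)
n = T.length
m = P.length
i = 1    // next unread position in P
j = 1    // next unread position in T
while i ≤ m
    // skip the gap characters in front of the next substring
    while i ≤ m and P[i] == gap
        i = i + 1
    if i > m
        return true
    // find the next gap-free substring P[i..i+k-1] to be matched
    k = 0
    while i + k ≤ m and P[i+k] != gap
        k = k + 1
    s = Naive-Search(T[j..n], P[i..i+k-1])    // shift relative to T[j..n]
    if s == -1
        return false
    i = i + k
    j = j + s + k    // continue right after the occurrence just found
return true
Naive-Search(T,P)
n = T.length
m = P.length
for s = 0 to n - m
    j = 0
    while j < m and T[s+j+1] == P[j+1]
        j = j + 1
    if j == m
        return s
return -1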
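Running time: the two while loops nested in the outer loop (skipping gap characters and measuring the next substring) only advance through the pattern, so over the whole execution they traverse P once and cost O(m) in total. For the calls to Naive-Search(T,P), a call on T[j..n] with a substring of length k_i makes at most (n-j-k_i+2)·k_i comparisons, so all calls together cost at most ∑(n-j-k_i+2)·k_i ≤ n·∑k_i ≤ nm, which is O(mn). The total running time is therefore O(m) + O(mn) = O(mn).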
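The greedy left-to-right search for the gap-free substrings can be sketched in C as below. This is only an illustration under stated assumptions: the names `find_segment` and `gap_match` are made up for the example, indices are 0-based, and the gap character is passed in by the caller (the test uses `*` as a stand-in).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns the index just past the leftmost occurrence of seg[0..k-1] in
 * text[from..n-1], or -1 if there is no occurrence. */
static long find_segment(const char *text, size_t n, size_t from,
                         const char *seg, size_t k)
{
    for (size_t s = from; s + k <= n; s++) {
        if (memcmp(text + s, seg, k) == 0)
            return (long)(s + k);
    }
    return -1;
}

/* Decides whether pat, which may contain the gap character, occurs in text. */
bool gap_match(const char *text, const char *pat, char gap)
{
    size_t n = strlen(text), m = strlen(pat);
    size_t i = 0;   /* next unread position in pat  */
    size_t j = 0;   /* next unread position in text */

    assert(strchr(text, gap) == NULL);   /* the gap must not occur in text */
    while (i < m) {
        while (i < m && pat[i] == gap)   /* skip gap characters */
            i++;
        if (i == m)
            break;
        size_t k = 0;                    /* length of the next segment */
        while (i + k < m && pat[i + k] != gap)
            k++;
        long next = find_segment(text, n, j, pat + i, k);
        if (next < 0)
            return false;
        j = (size_t)next;                /* continue after the occurrence */
        i += k;
    }
    return true;
}
```

With the example from the exercise, `gap_match("cabccbacbacab", "ab*ba*c", '*')` finds ab, then ba, then c, each to the right of the previous one.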
32.2-1
Working modulo q = 11, how many spurious hits does the Rabin-Karp matcher encounter in the text T = 3141592653589793 when looking for the pattern P = 26?
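The following program computes the answer: 1 valid hit and 3 spurious hits. The spurious hits are the windows 15, 59 and 92, which, like 26, are congruent to 4 modulo 11.
#include <stdio.h>

int main(void) {
    const char T[] = "3141592653589793";   /* the text */
    const int n = 16;                      /* text length */
    const int m = 2;                       /* pattern length */
    const int q = 11;                      /* modulus */
    const int p = 26 % q;                  /* hash of the pattern P = 26 */
    int valid = 0;
    int spurious = 0;
    for (int s = 0; s + m <= n; s++) {     /* stop at the last full window */
        int t = ((T[s] - '0') * 10 + (T[s + 1] - '0')) % q;
        if (t == p) {
            if (T[s] == '2' && T[s + 1] == '6')
                valid++;
            else
                spurious++;
        }
    }
    printf("valid hit:%d,spurious hit:%d\n", valid, spurious);
    return 0;
}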
32.2-2
How would you extend the Rabin-Karp method to the problem of searching a text string for an occurrence of any one of a given set of k patterns? Start by assuming that all k patterns have the same length. Then generalize your solution to allow the patterns to have different lengths.
Matching k patterns of the same length
Rabin-Karp-Search(T[1..n],P[1..k][1..m],d)
q = a prime larger than m
c = d^(m-1) mod q    // run a loop multiplying by d mod q
for j = 1 to k
    fp[j] = 0
ft = 0
for i = 1 to m    // preprocessing
    ft = (d*ft + T[i]) mod q
    for j = 1 to k
        fp[j] = (d*fp[j] + P[j][i]) mod q
for s = 0 to n - m    // matching
    for j = 1 to k
        if fp[j] == ft    // run a loop to compare strings
            if P[j][1..m] == T[s+1..s+m]
                print "Pattern P[j] occurs with shift" s
    if s < n - m
        ft = (d*(ft - T[s+1]*c) + T[s+m+1]) mod q
Worst-case running time: O(k) + O(mk) + O(km(n-m+1)) = O(km(n-m+1))
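A compact C sketch of the same-length case is shown below. It is an illustration only: the function name `rk_multi`, the cap `MAX_PATTERNS`, and the convention that the caller supplies the radix `d` and modulus `q` are assumptions made for the example, and the arithmetic assumes `d * q` fits in an `unsigned long`. Every hash hit is verified with `memcmp`, so spurious hits never reach the result.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PATTERNS 16   /* illustrative cap, not part of the exercise */

/* Counts the (shift, pattern) pairs at which one of the k patterns, all of
 * length m, occurs in text. */
size_t rk_multi(const char *text, const char *const pats[], size_t k,
                size_t m, unsigned long d, unsigned long q)
{
    size_t n = strlen(text), hits = 0;
    unsigned long fp[MAX_PATTERNS], ft = 0, c = 1;

    assert(m > 0 && k <= MAX_PATTERNS && q > 1);
    if (n < m)
        return 0;

    for (size_t i = 1; i < m; i++)       /* c = d^(m-1) mod q */
        c = (c * d) % q;
    for (size_t j = 0; j < k; j++) {     /* hash every pattern */
        fp[j] = 0;
        for (size_t i = 0; i < m; i++)
            fp[j] = (d * fp[j] + (unsigned char)pats[j][i]) % q;
    }
    for (size_t i = 0; i < m; i++)       /* hash the first text window */
        ft = (d * ft + (unsigned char)text[i]) % q;

    for (size_t s = 0; s + m <= n; s++) {
        for (size_t j = 0; j < k; j++) {
            /* verify every hash hit to rule out spurious hits */
            if (fp[j] == ft && memcmp(pats[j], text + s, m) == 0)
                hits++;
        }
        if (s + m < n) {                 /* roll the window one step right */
            unsigned long lead = ((unsigned char)text[s] * c) % q;
            ft = (d * ((ft + q - lead) % q) + (unsigned char)text[s + m]) % q;
        }
    }
    return hits;
}
```

Choosing a tiny modulus such as q = 11 produces many spurious hits but the same result, which makes it a convenient way to exercise the verification step.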
Matching k patterns of different lengths (patterns longer than T are assumed to have been discarded)
Rabin-Karp-Search(T[1..n],P[1..k],d)
q = a prime larger than every pattern length
for i = 1 to k
    m[i] = P[i].length
    c[i] = d^(m[i]-1) mod q    // run a loop multiplying by d mod q
    fp[i] = 0
    ft[i] = 0
for i = 1 to k    // preprocessing
    for j = 1 to m[i]
        ft[i] = (d*ft[i] + T[j]) mod q
        fp[i] = (d*fp[i] + P[i][j]) mod q
for s = 0 to n - min{m[1..k]}    // matching
    for j = 1 to k
        if s ≤ n - m[j] and fp[j] == ft[j]    // run a loop to compare strings
            if P[j][1..m[j]] == T[s+1..s+m[j]]
                print "Pattern P[j] occurs with shift" s
        if s < n - m[j]
            ft[j] = (d*(ft[j] - T[s+1]*c[j]) + T[s+m[j]+1]) mod q
Worst-case running time, with M = max{m[1..k]} and m_min = min{m[1..k]}: O(k) + O(kM) + O(kM(n-m_min+1)) = O(kM(n-m_min+1))
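32.2-3 (solved with hash functions)
Show how to extend the Rabin-Karp method to handle the problem of looking for a given m * m pattern in an n * n array of characters. (The pattern may be shifted vertically and horizontally, but it may not be rotated.)
Hash functions can be used here to represent the whole matrix.
The overall idea is to split an m*m block into m rows and to treat each row with the hash of Rabin-Karp-Search. The block is moved along a band of m rows one column at a time, so the row hashes can be updated incrementally from the previous ones; when the block moves down to the next band of rows, the hashes are recomputed from scratch.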
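Rabin-Karp-Search(T[1..n][1..n],P[1..m][1..m],d)
q = a prime larger than m
c = d^(m-1) mod q    // run a loop multiplying by d mod q
for i = 1 to m    // hash every row of the pattern once
    fp[i] = 0
    for j = 1 to m
        fp[i] = (d*fp[i] + P[i][j]) mod q
for s1 = 0 to n - m    // vertical shift
    for i = 1 to m    // preprocessing: hash the first m columns of rows s1+1..s1+m
        ft[i] = 0
        for j = 1 to m
            ft[i] = (d*ft[i] + T[s1+i][j]) mod q
    for s2 = 0 to n - m    // horizontal shift, matching
        if fp[1..m] == ft[1..m]    // compare the m row hashes
            if P[1..m][1..m] == T[s1+1..s1+m][s2+1..s2+m]
                print "Pattern occurs with shift" s1, s2
        if s2 < n - m
            for i = 1 to m
                ft[i] = (d*(ft[i] - T[s1+i][s2+1]*c) + T[s1+i][s2+m+1]) mod q
Running time: hashing the pattern rows costs O(m^2). For each of the n-m+1 vertical shifts, initializing the row hashes costs O(m^2), and each of the n-m+1 horizontal shifts costs O(m) for comparing and rolling the m row hashes, plus O(m^2) whenever all hashes agree and the block has to be verified. The worst case is therefore O(m^2 (n-m+1)^2); when verifications are rare, the total is O((n-m+1)(m^2 + m(n-m+1))).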
32.2-4 (not completed)
32.3-1
Construct the string-matching automaton for the pattern P = aabab and illustrate its operation on the text string T = aaababaabaababaab.
the transition function
q | a | b |
---|---|---|
0 | 1 | 0 |
1 | 2 | 0 |
2 | 2 | 3 |
3 | 4 | 0 |
4 | 2 | 5 |
5 | 1 | 0 |
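Operation (the column i = 0 shows the initial state, before any character is read):
i | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
T[i] | | a | a | a | b | a | b | a | a | b | a | a | b | a | b | a | a | b |
state | 0 | 1 | 2 | 2 | 3 | 4 | 5 | 1 | 2 | 3 | 4 | 2 | 3 | 4 | 5 | 1 | 2 | 3 |
The automaton reaches the accepting state 5 after reading T[6] and T[14], so the pattern occurs with shifts 1 and 9.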
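The trace can be reproduced with a few lines of C. The sketch below hard-codes the transition table for P = aabab given above (column 0 for input a, column 1 for input b); the names `DELTA` and `run_automaton` are assumptions made for the example, and the reported shifts follow the usual convention that an occurrence ending at text position i has shift i - m.

```c
#include <assert.h>
#include <stddef.h>

/* Transition table for P = aabab over {a, b}:
 * DELTA[q][0] is the next state on input 'a', DELTA[q][1] on input 'b'. */
static const int DELTA[6][2] = {
    {1, 0}, {2, 0}, {2, 3}, {4, 0}, {2, 5}, {1, 0}
};

/* Runs the automaton on text, stores the first max_shifts shifts at which
 * the pattern occurs, and returns the number of occurrences. */
size_t run_automaton(const char *text, size_t *shifts, size_t max_shifts)
{
    const size_t m = 5;   /* pattern length, also the accepting state */
    size_t count = 0;
    int q = 0;            /* current state */

    for (size_t i = 0; text[i] != '\0'; i++) {
        assert(text[i] == 'a' || text[i] == 'b');
        q = DELTA[q][text[i] - 'a'];
        if (q == (int)m) {
            /* i is 0-based, so the occurrence ends at position i + 1 */
            if (count < max_shifts)
                shifts[count] = i + 1 - m;
            count++;
        }
    }
    return count;
}
```

Running it on T = aaababaabaababaab reports two occurrences.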
32.3-2
Draw a state-transition diagram for a string-matching automaton for the pattern ababbabbababbababbabb over the alphabet ∑ = {a,b}.
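Code that computes the transition function directly from its definition (the diagram can be drawn from the printed table):
#include <stdio.h>

int main(void) {
    char P[23] = " ababbabbababbababbabb";   /* pattern, 1-indexed: P[1..21] */
    char Q[4] = " ab";                       /* alphabet, 1-indexed: Q[1..2] */
    int A[22][3];                            /* A[q][p] = delta(q, Q[p]) */
    int m = 21;                              /* pattern length */
    int n = 2;                               /* alphabet size */
    for (int q = 0; q <= m; q++) {
        for (int p = 1; p <= n; p++) {
            /* try k = min(m, q + 1) down to 1; the first k for which
               P[1..k] is a suffix of P[1..q]Q[p] is delta(q, Q[p]) */
            int k = ((m + 1) < (q + 2)) ? m + 1 : q + 2;
            int sum = 0;
            do {
                k--;
                /* compare P[1..k-1] with P[q-k+2..q], then P[k] with Q[p] */
                int i = 1;
                while (i <= k - 1 && P[i] == P[q - k + i + 1])
                    i++;
                if (i == k && P[k] == Q[p]) {
                    sum = k;
                    break;
                }
            } while (k > 1);
            A[q][p] = sum;
        }
        printf("%2d | %2d | %2d\n", q, A[q][1], A[q][2]);
    }
    return 0;
}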
32.3-3
We call a pattern P nonoverlappable if P_k ⊐ P_q (P_k is a suffix of P_q) implies k = 0 or k = q. Describe the state-transition diagram of the string-matching automaton for a nonoverlappable pattern.
The state transition function looks like a straight line, with all other edges going back to either the initial vertex (if it is not the first letter of the pattern) or the second vertex (if it is the first letter of the pattern). If it were to go back to any later state, that would mean that some suffix of what we had constructed so far (which was a prefix of P) was a prefix of the copy of P that we are next trying to find.
32.3-4
Given two patterns P and P', describe how to construct a finite automaton that determines all occurrences of either pattern. Try to minimize the number of states in your automaton.
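At first I did not quite understand the statement of this problem. If it only asked for all occurrences of one of the two patterns, the automaton of that single pattern would be enough, and the exercise would be pointless.
Here is a solution found online:
We can construct the automaton as follows: Let Pk be the longest prefix which both P and P' have in common. Create the automaton F for Pk as usual. Add an arrow labeled P[k+1] from state k to a chain of states k+1, k+2, . . . , |P|, and draw the appropriate arrows, so that δ(q, a) = σ(Pka). Next, add an arrow labeled P'[k + 1] from state k to a chain of states (k + 1)', (k + 2)', . . . , |P'|'. Draw the appropriate arrows so that δ(q, a) = σ(P'ka).
Suppose state k moves to state k+1 on character a (matching P) and to state (k+1)' on character b (matching P'). The transition that the automaton for P alone would take from state k on b is then lost, which can cause occurrences of P to be missed; symmetrically, occurrences of P' can be missed. An example is P = abaab, P' = ababa, T = ababaab: the occurrence ab[abaab] is not found, while [ababa]ab is.
This shows a property of this automaton: if the text contains several substrings that match P or P' and these substrings overlap, only one of them is recognized; if there is only one matching substring, it is always recognized. This is probably what the exercise intends.
Some solutions also try to share the common suffixes of P and P'. This is hard to get right: states in the shared suffix behave identically when the match succeeds, but not necessarily when it fails. If the fallback target may lie in the part where P and P' differ, which of the two should be chosen? Either choice can lose matches (and here the loss is not caused by overlap; even an isolated occurrence may be missed, which is unacceptable).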
32.3-5
Given a pattern P containing gap characters (see Exercise 32.1-4), show how to build a finite automaton that can find an occurrence of P in a text T in O(n) matching time, where n = |T|.
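This problem uses the same idea as 32.1-4 (see the approach described there). With finite automata the construction is even simpler: build a string-matching automaton for each gap-free substring of P and connect them in their original order.
The connection works as follows: for two consecutive substrings A and B, the accepting state of A is identified with the start state of B, so that once A has been found the automaton never returns to the states of A and simply keeps looking for B. The pattern occurs in T as soon as the accepting state of the last substring is reached. Each character of T causes exactly one transition, so the matching time is O(n).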
32.4-1
Compute the prefix function π for the pattern ababbabbabbababbabb.
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 2 | 0 | 1 | 2 | 0 | 1 | 2 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
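Three approaches are possible here: apply the definition of π directly, trace the algorithm by hand, or write the algorithm as a program and run it.
#include <stdio.h>

int main(void) {
    char P[21] = " ababbabbabbababbabb";   /* pattern, 1-indexed: P[1..19] */
    int A[20];                             /* A[q] = pi(q) */
    int m = 19;                            /* pattern length */
    int k = 0;
    A[1] = 0;
    for (int q = 2; q <= m; q++) {
        while (k > 0 && P[k + 1] != P[q])
            k = A[k];
        if (P[k + 1] == P[q])
            k++;
        A[q] = k;
    }
    for (int q = 1; q <= m; q++)
        printf("%d ", A[q]);
    printf("\n");
    return 0;
}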
32.4-2
Give an upper bound on the size of π*(q) as a function of q. Give an example to show that your bound is tight.
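An upper bound is |π*(q)| ≤ q, because π*(q) ⊆ {0, 1, ..., q-1}.
The bound is tight. Take P = aaaaaaaaaa...a, where the length of P is m: then π*(q) = {0,1,...,q-1} for every q.
In particular π*(m) = {0,1,2,3,...,m-1}, whose size is m, the length of P.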
32.4-3
Explain how to determine the occurrences of pattern P in the text T by examining the π function for the string PT (the string of length m+n that is the concatenation of P and T).
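Let π be the prefix function of PT and m = P.length. P occurs in T with shift q - 2m exactly for the positions q in {q | m ∈ π*(q) && q >= 2m}: m ∈ π*(q) says that P is a suffix of (PT)_q, so P occurs in PT with shift q - m, and q >= 2m guarantees that this occurrence lies entirely inside T.
The test has to be m ∈ π*(q), not π(q) >= m or m = π(q). Neither of those two conditions identifies exactly the substrings that match P, for example:
- for m = π(q), take P = aa, T = aaaaaaa: π(q) keeps growing past m, so occurrences are missed.
- for π(q) >= m, take P = ab, T = ababababab: π(q) >= m also holds at positions that end in a, where P does not occur.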
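The membership test m ∈ π*(q) can be evaluated for all q in linear time, because π(q) < q lets the answer for q be derived from the answer for π(q). The C sketch below illustrates this; the names `match_via_pi` and `ok` are assumptions made for the example, `ok[q]` records whether m belongs to {q} ∪ π*(q), reported shifts are relative to T, and the function simply returns 0 if memory allocation fails.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Finds the occurrences of pat in text from the prefix function of the
 * concatenation PT.  Stores the first max_shifts shifts and returns the
 * number of occurrences. */
size_t match_via_pi(const char *pat, const char *text,
                    size_t *shifts, size_t max_shifts)
{
    size_t m = strlen(pat), n = strlen(text), len = m + n;
    size_t count = 0, k = 0;

    assert(m > 0);
    char *pt = malloc(len + 1);
    size_t *pi = malloc((len + 1) * sizeof *pi);
    unsigned char *ok = malloc(len + 1);
    if (pt == NULL || pi == NULL || ok == NULL) {
        free(pt); free(pi); free(ok);
        return 0;
    }
    memcpy(pt, pat, m);
    memcpy(pt + m, text, n + 1);      /* also copies the terminator */

    pi[0] = 0;
    ok[0] = 0;
    for (size_t q = 1; q <= len; q++) {   /* q is 1-based as in the text */
        if (q == 1) {
            pi[q] = 0;
        } else {
            while (k > 0 && pt[k] != pt[q - 1])
                k = pi[k];
            if (pt[k] == pt[q - 1])
                k++;
            pi[q] = k;
        }
        /* ok[q]: m is q itself or is reachable by iterating pi from q */
        ok[q] = (q == m) || (q > m && ok[pi[q]]);
        if (q >= 2 * m && ok[q]) {
            if (count < max_shifts)
                shifts[count] = q - 2 * m;
            count++;
        }
    }
    free(pt); free(pi); free(ok);
    return count;
}
```

Both counterexamples above are handled: for P = aa, T = aaaaaaa it reports the shifts 0 to 5, and for P = ab, T = ababababab the shifts 0, 2, 4, 6 and 8.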
32.4-4
Use an aggregate analysis to show that the running time of KMP-MATCHER is O(n).
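The KMP algorithm:
1. KMP-Matcher(T,P)
2. n = T.length
3. m = P.length
4. π = COMPUTE-PREFIX-FUNCTION(P)
5. q = 0
6. for i = 1 to n /*scan the text from left to right*/
7.     while q > 0 and P[q+1] != T[i]
8.         q = π[q]
9.     if P[q+1] == T[i]
10.         q = q + 1
11.     if q == m
12.         print "Pattern occurs with shift" i-m
13.         q = π[q]
- Before the for loop of line 6 starts, q = 0. Inside the loop, only the if statement of lines 9~10 can increase q, and only by 1. So q increases at most once per iteration, and at most n times in total.
- From COMPUTE-PREFIX-FUNCTION(P) we know that π(q) < q; this follows both from the code (the running-time argument) and from the meaning of π (the correctness argument). Hence every execution of the body of the while loop in lines 7~8 strictly decreases q, and so does the assignment in line 13.
- q is never negative. It follows that q can decrease at most n times in total; that is, over all iterations of the for loop in lines 6~13, the body of the while loop in lines 7~8 is executed at most n times.
The rest of the for loop costs O(1) per iteration, so the running time of the algorithm is O(n) + O(n) = O(n).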
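The aggregate bound can be observed directly by counting the iterations of the inner while loop. The C sketch below is an instrumented matcher written for that purpose; the name `kmp_count` and the output parameter `while_steps` are assumptions made for the example, and the function returns 0 if memory allocation fails.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* KMP matcher that returns the number of occurrences of pat in text and
 * reports in *while_steps how often the body of the inner while loop ran. */
size_t kmp_count(const char *text, const char *pat, size_t *while_steps)
{
    size_t n = strlen(text), m = strlen(pat);
    size_t matches = 0, q = 0, k = 0;

    assert(m > 0);
    *while_steps = 0;
    size_t *pi = malloc((m + 1) * sizeof *pi);
    if (pi == NULL)
        return 0;

    pi[0] = 0;
    pi[1] = 0;
    for (size_t i = 2; i <= m; i++) {      /* prefix function, 1-based */
        while (k > 0 && pat[k] != pat[i - 1])
            k = pi[k];
        if (pat[k] == pat[i - 1])
            k++;
        pi[i] = k;
    }
    for (size_t i = 0; i < n; i++) {       /* scan the text left to right */
        while (q > 0 && pat[q] != text[i]) {
            q = pi[q];                     /* strictly decreases q */
            (*while_steps)++;
        }
        if (pat[q] == text[i])
            q++;                           /* the only place q grows */
        if (q == m) {
            matches++;
            q = pi[q];
        }
    }
    free(pi);
    return matches;
}
```

For any input, the value written to `while_steps` stays at or below the text length n, in line with the argument above.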
32.4-5 (not completed)
Use a potential function to show that the running time of KMP-MATCHER is Θ(n).
32.4-6
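Show how to improve KMP-MATCHER by replacing the occurrence of π in line 7 (but not line 12) by π', where π' is defined recursively for q = 1,2,...,m-1 by the equation
$$\pi'[q] = \begin{cases} 0 & \text{if } \pi[q] = 0, \\ \pi'[\pi[q]] & \text{if } \pi[q] \ne 0 \text{ and } P[\pi[q]+1] = P[q+1], \\ \pi[q] & \text{if } \pi[q] \ne 0 \text{ and } P[\pi[q]+1] \ne P[q+1]. \end{cases}$$
Explain why the modified algorithm is correct, and explain in what sense this change constitutes an improvement.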
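π' replaces π only in line 7 (line numbers here follow the exercise's numbering of KMP-MATCHER). To prove correctness, it is enough to show that the value of q after the while loop of lines 6~7 has finished is the same before and after the modification.
The purpose of the while loop in lines 6~7 is: when P[q+1] != T[i], find the largest k such that k < q, Pk is a suffix of Pq, and P[k+1] == T[i]. In terms of the implementation, it walks through π*[q] in decreasing order, looking for a k with P[k+1] == T[i].
The main difference between π' and π is that π' only selects the largest k that satisfies k < q, Pk is a suffix of Pq, and P[k+1] != P[q+1]. So π'*[q] is a subset of π*[q] from which the values with P[k+1] == P[q+1] have been removed.
In other words, when π is replaced by π', the walk through π'*[q] no longer considers the values k with P[k+1] == P[q+1]. Since we already know that P[q+1] != T[i], these are exactly the values that would fail the comparison anyway. The change therefore does not affect the result of the algorithm; it is an optimization of the original algorithm that skips comparisons known in advance to fail.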
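Because π[q] < q, the recursive definition of π' can be evaluated in a single left-to-right pass once π is known. The C sketch below shows one way to do it; the names `compute_pi` and `compute_pi_prime` are assumptions made for the example, arrays are indexed 1..m with entry 0 unused, and copying π[m] into the last entry is a convention of this sketch only, since π' is not defined for q = m.

```c
#include <assert.h>
#include <stddef.h>

/* Standard prefix function: fills pi[1..m] and sets pi[0] to 0. */
void compute_pi(const char *pat, size_t m, size_t *pi)
{
    size_t k = 0;

    assert(m > 0);
    pi[0] = 0;
    pi[1] = 0;
    for (size_t q = 2; q <= m; q++) {
        while (k > 0 && pat[k] != pat[q - 1])
            k = pi[k];
        if (pat[k] == pat[q - 1])
            k++;
        pi[q] = k;
    }
}

/* Fills pp[1..m-1] with pi' as defined in the exercise. */
void compute_pi_prime(const char *pat, size_t m,
                      const size_t *pi, size_t *pp)
{
    assert(m > 0);
    pp[0] = 0;
    for (size_t q = 1; q < m; q++) {
        size_t k = pi[q];
        if (k == 0)
            pp[q] = 0;
        else if (pat[k] == pat[q])   /* P[pi[q]+1] == P[q+1], 1-based */
            pp[q] = pp[k];           /* k < q, so pp[k] is already known */
        else
            pp[q] = k;
    }
    pp[m] = pi[m];                   /* convention of this sketch only */
}
```

For P = ababab, π is 0 0 1 2 3 4 while π' is 0 for every q < 6: after a mismatch at P[q+1], every shorter border is followed by the same character and would fail in the same way.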
32.4-7 (not completed)
Give a linear-time algorithm to determine whether a text T is a cyclic rotation of another string T'. For example, arc and car are cyclic rotations of each other.
32.4-8
Give an O(m|Σ|)-time algorithm for computing the transition function δ for the string-matching automaton corresponding to a given pattern P. (Hint: Prove that δ(q,a) = δ(π(q),a), if q = m or P[q + 1] != a.)
1. π = COMPUTE-PREFIX-FUNCTION(P)
2. for a ∈ Σ do
3.     δ(0,a) = 0
4. end for
5. δ(0,P[1]) = 1
6. for a ∈ Σ do
7.     for q = 1 to m do
8.         if q == m or P[q+1] != a then
9.             δ(q,a) = δ(π[q],a)
10.         else
11.             δ(q,a) = q + 1
12.         end if
13.     end for
14. end for
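Lines 2~5 assign δ(0,a) for every a ∈ Σ: δ(0,a) is 1 if a = P[1] and 0 otherwise, which is clearly correct.
The loops of line 6 and line 7 iterate over the alphabet and over the states. Since π[q] < q, the value δ(π[q],a) used in line 9 has always been computed already. COMPUTE-PREFIX-FUNCTION takes O(m) time and the nested loops take O(m|Σ|) time. The test in line 8 is discussed case by case:
- When the test is false, i.e. P[q+1] == a, the automaton moves on to state q + 1, which is easy to see.
- When the test is true and q == m, see the proof of δ(m,a) = δ(π(m),a) in the theorem below.
- When the test is true with q != m && P[q+1] != a, the proof of δ(m,a) = δ(π(m),a) can be reused with m replaced by q. The proof does not use the property Pm = P; it only needs σ(Pqa) ≤ q, which holds here because P[q+1] != a. So it is valid in the general case as well. The theorem and proof below give my own way of understanding this case.
Theorem. Let δ be the transition function of the string-matching automaton for the pattern P, defined as above, with P.length = m. Then δ(m,a) = δ(π(m),a).
Proof.
- δ(m,a) = σ(Pma), and δ(π(m),a) = σ(Pπ(m)a).
- By the definition π(q) = max{k: k < q && Pk is a suffix of Pq}, Pπ(m) is a suffix of Pm, so Pπ(m)a is a suffix of Pma. By the definition σ(x) = max{k: Pk is a suffix of x}, the σ of a string is never smaller than the σ of one of its suffixes, hence σ(Pma) ≥ σ(Pπ(m)a).
- Let r = σ(Pma). If r = 0, then σ(Pπ(m)a) ≥ r holds trivially. Otherwise Pr is a suffix of Pma, so P[r] = a and r - 1 ∈ {k: Pk is a suffix of Pm}. Since r - 1 < m and π(m) = max{k: k < m && Pk is a suffix of Pm}, we get π(m) ≥ r - 1. Both P(r-1) and Pπ(m) are suffixes of Pm, so the shorter one, P(r-1), is a suffix of Pπ(m). Together with P[r] = a, this makes Pr a suffix of Pπ(m)a, that is, r ∈ {k: Pk is a suffix of Pπ(m)a}. Therefore σ(Pπ(m)a) ≥ σ(Pma).
- From σ(Pπ(m)a) ≥ σ(Pma) and σ(Pma) ≥ σ(Pπ(m)a) it follows that δ(m,a) = δ(π(m),a).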
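The table-filling algorithm translates almost line by line into C. The sketch below is an illustration under stated assumptions: the name `compute_delta`, the row-major layout `delta[q * sigma + a]`, and the representation of the alphabet as the `sigma` consecutive characters starting at `base` are choices made for the example, and all pattern characters are assumed to lie in that alphabet.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Fills delta[(m + 1) * sigma] for the pattern pat over the alphabet
 * {base, base + 1, ..., base + sigma - 1}.  delta[q * sigma + a] is the
 * state reached from state q on the a-th letter.  Returns 0 on success
 * and -1 if memory allocation fails. */
int compute_delta(const char *pat, char base, size_t sigma, size_t *delta)
{
    size_t m = strlen(pat), k = 0;

    assert(m > 0 && sigma > 0);
    size_t *pi = malloc((m + 1) * sizeof *pi);
    if (pi == NULL)
        return -1;

    pi[0] = 0;
    pi[1] = 0;
    for (size_t q = 2; q <= m; q++) {          /* prefix function, 1-based */
        while (k > 0 && pat[k] != pat[q - 1])
            k = pi[k];
        if (pat[k] == pat[q - 1])
            k++;
        pi[q] = k;
    }

    for (size_t a = 0; a < sigma; a++)         /* state 0 */
        delta[a] = 0;
    delta[(size_t)(pat[0] - base)] = 1;

    for (size_t q = 1; q <= m; q++) {
        for (size_t a = 0; a < sigma; a++) {
            if (q == m || (size_t)(pat[q] - base) != a)
                delta[q * sigma + a] = delta[pi[q] * sigma + a];
            else
                delta[q * sigma + a] = q + 1;  /* a == P[q+1] */
        }
    }
    free(pi);
    return 0;
}
```

For P = aabab over {a, b} this reproduces the transition table of Exercise 32.3-1. The loop order (states outside, letters inside) differs from the pseudocode but does the same O(m|Σ|) work, since row q only depends on row π[q] < q.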