String automata: Directed Acyclic Word Graph

Trie, suffix tree, suffix automaton. These structures come up in situations such as:

An AJAX search box that responds as the user types and shows a list of candidates.
Counting keyword frequencies in a search engine.

Suffix tree: every path from the root to a leaf spells out one suffix.

From this simple description alone, we can already solve the following problems, at least conceptually:

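P: Check whether string o occurs in string S.
A: If o occurs in S, then o must be a prefix of some suffix of S. Build the suffix tree of S, then search for o the same way you would search for a string in a trie.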
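The idea can be sketched with an uncompressed suffix trie, which is simpler than a real suffix tree but takes O(n^2) space. This is only an illustration under that assumption; the names `TrieNode`, `suffix_trie_build` and `suffix_trie_contains` are made up for the example.

```c
#include <stdlib.h>

/* Hypothetical node type: an uncompressed suffix trie over bytes. */
typedef struct TrieNode {
    struct TrieNode *child[256];
} TrieNode;

/* Insert every suffix of s. Returns the root, or NULL if allocation fails
   (a sketch: nodes already allocated are not freed in that case). */
TrieNode *suffix_trie_build(const char *s)
{
    TrieNode *root = calloc(1, sizeof *root);
    if (!root) return NULL;
    for (const char *suffix = s; *suffix; ++suffix) {
        TrieNode *node = root;
        for (const char *p = suffix; *p; ++p) {
            unsigned char c = (unsigned char)*p;
            if (!node->child[c]) {
                node->child[c] = calloc(1, sizeof *node);
                if (!node->child[c]) return NULL;
            }
            node = node->child[c];
        }
    }
    return root;
}

/* Returns 1 if o labels a path starting at the root, that is,
   if o is a prefix of some suffix of the indexed string. */
int suffix_trie_contains(const TrieNode *root, const char *o)
{
    const TrieNode *node = root;
    for (; *o && node; ++o)
        node = node->child[(unsigned char)*o];
    return node != NULL;
}
```

For example, with `TrieNode *t = suffix_trie_build("banana");`, the call `suffix_trie_contains(t, "nan")` returns 1 and `suffix_trie_contains(t, "nab")` returns 0.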

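P: Count how many times a given string T occurs in string S.
A: If T occurs twice in S, then S has two suffixes that start with T. In general, the number of leaves below the node for T is the number of occurrences.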
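A rough sketch of the counting argument, again on an uncompressed suffix trie rather than a real suffix tree. Instead of walking the subtree to count leaves, each node keeps the number of suffixes that pass through it, which gives the same number when every suffix ends in its own leaf. `CountNode`, `count_trie_build` and `count_occurrences` are illustrative names, not from the original post.

```c
#include <stdlib.h>

typedef struct CountNode {
    struct CountNode *child[256];
    int suffixes;   /* number of suffixes whose path passes through this node */
} CountNode;

/* Insert every suffix of s, counting visits per node.
   Returns NULL if allocation fails (sketch: no cleanup). */
CountNode *count_trie_build(const char *s)
{
    CountNode *root = calloc(1, sizeof *root);
    if (!root) return NULL;
    for (const char *suffix = s; *suffix; ++suffix) {
        CountNode *node = root;
        for (const char *p = suffix; *p; ++p) {
            unsigned char c = (unsigned char)*p;
            if (!node->child[c] && !(node->child[c] = calloc(1, sizeof *node)))
                return NULL;
            node = node->child[c];
            node->suffixes++;
        }
    }
    return root;
}

/* Number of (possibly overlapping) occurrences of t in the indexed string. */
int count_occurrences(const CountNode *root, const char *t)
{
    const CountNode *node = root;
    if (!*t) return 0;   /* assumption: the empty pattern is not counted */
    for (; *t && node; ++t)
        node = node->child[(unsigned char)*t];
    return node ? node->suffixes : 0;
}
```

On "banana", `count_occurrences` gives 2 for "ana" (the suffixes "anana" and "ana") and 3 for "a".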

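P: Find the longest repeated substring of string S.
A: Same idea. A non-leaf node has at least two leaves below it, so its path label occurs at least twice. Find the deepest non-leaf node (measured by string depth); its path label is the answer.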
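As a sketch of "deepest node reached by at least two suffixes", the function below builds an uncompressed suffix trie with per-node visit counts and remembers the deepest node whose count reaches 2. It is quadratic in time and space and only meant to illustrate the principle; `RepNode` and `longest_repeated` are hypothetical names.

```c
#include <stdlib.h>

typedef struct RepNode {
    struct RepNode *child[256];
    int suffixes;   /* number of suffixes passing through this node */
} RepNode;

static void rep_free(RepNode *n)
{
    if (!n) return;
    for (int i = 0; i < 256; ++i) rep_free(n->child[i]);
    free(n);
}

/* Returns a pointer into s where one longest repeated substring starts and
   stores its length in *len (0 if nothing repeats). NULL on allocation failure. */
const char *longest_repeated(const char *s, size_t *len)
{
    RepNode *root = calloc(1, sizeof *root);
    const char *best = s;
    *len = 0;
    if (!root) return NULL;
    for (const char *suffix = s; *suffix; ++suffix) {
        RepNode *node = root;
        size_t depth = 0;
        for (const char *p = suffix; *p; ++p) {
            unsigned char c = (unsigned char)*p;
            if (!node->child[c] && !(node->child[c] = calloc(1, sizeof *node))) {
                rep_free(root);
                return NULL;
            }
            node = node->child[c];
            ++depth;
            /* A count of 2 means two suffixes share this prefix. */
            if (++node->suffixes >= 2 && depth > *len) {
                *len = depth;
                best = suffix;
            }
        }
    }
    rep_free(root);
    return best;
}
```

For "banana" this reports length 3 and a pointer to "ana".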

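P: Find the longest common substring of two strings S1 and S2.
A: A generalized suffix tree stores all suffixes of _multiple_ strings. Insert S1# and S2$ into one generalized suffix tree, then proceed as above.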

(A longest substring common to s1 and s2 will be the path-label of an internal node with the
greatest string depth in the suffix tree which has leaves labelled with suffixes from both the
strings.)
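A minimal sketch of that idea, assuming an uncompressed trie in place of a real generalized suffix tree. Each node records which of the two strings reached it, so the terminators # and $ are not needed here; the deepest node reached by both strings spells a longest common substring. `GstNode`, `gst_add` and `longest_common` are names invented for the example.

```c
#include <stdlib.h>

typedef struct GstNode {
    struct GstNode *child[256];
    unsigned owners;   /* bit 0: reached by a suffix of s1, bit 1: by a suffix of s2 */
} GstNode;

static void gst_free(GstNode *n)
{
    if (!n) return;
    for (int i = 0; i < 256; ++i) gst_free(n->child[i]);
    free(n);
}

/* Insert all suffixes of s, tagging visited nodes with `bit`. Whenever a node
   owned by both strings is deeper than the best so far, record it.
   Returns 0 on success, -1 on allocation failure. */
static int gst_add(GstNode *root, const char *s, unsigned bit,
                   const char **best, size_t *len)
{
    for (const char *suffix = s; *suffix; ++suffix) {
        GstNode *node = root;
        size_t depth = 0;
        for (const char *p = suffix; *p; ++p) {
            unsigned char c = (unsigned char)*p;
            if (!node->child[c] && !(node->child[c] = calloc(1, sizeof *node)))
                return -1;
            node = node->child[c];
            node->owners |= bit;
            ++depth;
            if (node->owners == 3u && depth > *len) {
                *len = depth;
                *best = suffix;
            }
        }
    }
    return 0;
}

/* Returns a pointer into s2 where one longest common substring starts and
   stores its length in *len (0 if there is none). NULL on allocation failure. */
const char *longest_common(const char *s1, const char *s2, size_t *len)
{
    GstNode *root = calloc(1, sizeof *root);
    const char *best = s2;
    *len = 0;
    if (!root) return NULL;
    if (gst_add(root, s1, 1u, &best, len) || gst_add(root, s2, 2u, &best, len))
        best = NULL;
    gst_free(root);
    return best;
}
```

With s1 = "xabcdy" and s2 = "zbcdw" the result has length 3 and points at "bcd" inside s2.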

Suffix automaton: an auxiliary index structure that recognizes all substrings of a text.

The code below is a direct translation of Algorithm A in [1]:

/* Directed Acyclic Word Graph */
#include <stdlib.h>
#include <string.h>

typedef struct State {
    struct State *first[26];   /* primary edges, one slot per letter 'a'..'z' */
    struct State *second[26];  /* secondary edges */
    struct State *suffix;      /* suffix pointer */
} State;

State *sink, *source;

State *new_state(void)
{
    State *s = malloc(sizeof *s);
    if (s) {
        memset(s, 0, sizeof *s);
    }
    return s;   /* NULL on allocation failure; the callers below do not check */
}

/* state:
     parent    -- [x] with xa = tail(wa)
     child     -- [tail(wa)]
     new child -- [tail(wa)]_{wa}
*/
State *split(State *parent, int a)
{
    int i;
    /* current state, child, new child */
    State *cs, *c = parent->second[a], *nc = new_state();   /* S1 */

    /* S2: the secondary edge parent -a-> child becomes a primary edge to nc */
    parent->first[a] = nc;
    parent->second[a] = NULL;
    /* S3: every outgoing edge of child, primary or secondary,
       is copied to nc as a secondary edge */
    for (i = 0; i < 26; ++i) {
        nc->second[i] = c->first[i] ? c->first[i] : c->second[i];
    }
    nc->suffix = c->suffix;   /* S4 */
    c->suffix = nc;           /* S5 */
    for (cs = parent; cs != source; ) {   /* S6, S7 */
        cs = cs->suffix;                  /* S7.a */
        /* every edge into child is labelled a, so only slot a is checked */
        if (cs->second[a] == c) {
            cs->second[a] = nc;           /* S7.b */
        } else {
            break;                        /* S7.c */
        }
    }
    return nc;   /* S8 */
}

/* state:
     new sink -- [wa]
*/
void update(int a)
{
    /* suffix state, current state, new sink */
    State *ss = NULL, *cs = sink, *ns = new_state();   /* U1, U2 */

    sink->first[a] = ns;
    while (cs != source && ss == NULL) {   /* U3 */
        cs = cs->suffix;                   /* U3.a */
        if (!cs->first[a] && !cs->second[a]) {
            cs->second[a] = ns;            /* U3.b.1 */
        } else if (cs->first[a]) {
            ss = cs->first[a];             /* U3.b.2 */
        } else {
            ss = split(cs, a);             /* U3.b.3 */
        }
    }
    if (ss == NULL) { ss = source; }   /* U4 */
    ns->suffix = ss;
    sink = ns;                         /* U5 */
}

/* Builds the DAWG of w, which must consist of 'a'..'z' only.
   Returns 0 on success, -1 if w contains any other character. */
int build_dawg(const char *w)
{
    const char *p;
    for (p = w; *p; ++p) {
        if (*p < 'a' || *p > 'z') return -1;
    }
    sink = source = new_state();
    for (; *w; ++w) { update(*w - 'a'); }
    return 0;
}

I am still working on understanding the algorithm, and this code has not been tested.
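One way to try out the construction is to follow edges from the source and see whether a pattern can be read to the end. The block below is a condensed, self-contained restatement of Algorithm A as I read it from [1], plus a hypothetical checker `dawg_accepts`. `dawg_build` and `dawg_accepts` are names added for this sketch, input is assumed to be lowercase 'a'..'z', and allocation failures are not handled.

```c
#include <stdlib.h>

typedef struct State {
    struct State *first[26], *second[26];   /* primary / secondary edges */
    struct State *suffix;
} State;

static State *source, *sink;

/* Sketch: the result of calloc is not checked. */
static State *new_state(void) { return calloc(1, sizeof(State)); }

static State *split(State *parent, int a)
{
    State *c = parent->second[a], *nc = new_state(), *cs;
    parent->first[a] = nc;          /* secondary edge becomes primary, to nc */
    parent->second[a] = NULL;
    for (int i = 0; i < 26; ++i)    /* copy all of child's edges as secondary */
        nc->second[i] = c->first[i] ? c->first[i] : c->second[i];
    nc->suffix = c->suffix;
    c->suffix = nc;
    for (cs = parent; cs != source; ) {
        cs = cs->suffix;
        if (cs->second[a] == c) cs->second[a] = nc;
        else break;
    }
    return nc;
}

static void update(int a)
{
    State *ss = NULL, *cs = sink, *ns = new_state();
    sink->first[a] = ns;
    while (cs != source && !ss) {
        cs = cs->suffix;
        if (cs->first[a]) ss = cs->first[a];
        else if (cs->second[a]) ss = split(cs, a);
        else cs->second[a] = ns;
    }
    ns->suffix = ss ? ss : source;
    sink = ns;
}

/* Returns 0 on success, -1 if w has a character outside 'a'..'z'. */
int dawg_build(const char *w)
{
    for (const char *p = w; *p; ++p)
        if (*p < 'a' || *p > 'z') return -1;
    sink = source = new_state();
    for (; *w; ++w) update(*w - 'a');
    return 0;
}

/* Returns 1 if p can be read from the source, i.e. p is a substring of w. */
int dawg_accepts(const char *p)
{
    const State *s = source;
    for (; *p && s; ++p) {
        if (*p < 'a' || *p > 'z') return 0;
        int a = *p - 'a';
        s = s->first[a] ? s->first[a] : s->second[a];
    }
    return s != NULL;
}
```

After `dawg_build("abcbc")`, the checker accepts "cbc" and "bcbc" and rejects "cc" and "ca". This input triggers `split` twice, so it exercises both kinds of edge.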

[1] A. Blumer et al., "The Smallest Automaton Recognizing the Subwords of a Text", 1985.

https://cbse.soe.ucsc.edu/sites/default/files/smallest_automaton1985.pdf

Copyright notice: this is an original blog post and may not be reproduced without permission.

Original article: https://www.cnblogs.com/mfrbuaa/p/4739130.html