zoukankan html css js c++ java

C++里创建 Trie字典树（中文词典）（二）（插入、查找、导入、导出）

　　萌新做词典第二篇，做得不好，还请指正，谢谢大佬！

　　做好了插入与遍历功能之后，我发现最基本的查找功能没有实现，同时还希望能够把内存的数据存入文件保存下来，并可以从文件中导入词典。此外，数据的路径是存在配置文件中的。甚至，还想尝试类似自动补全的功能。当然了，是做一个比较low的补全，比如传入“编程”，能够得到”软件“、”学习“、”学习网站“、”入门“四个字符串。但是传入“编”不会得到“程”，因为“编程”没有录入词典。于是对原有的类进行了修改，并添加了一些函数。

　　先放上运行结果吧：

 1 test1
 2 ID : 2    char : 件    word : 编程软件
 3 ID : 3    char : 习    word : 编程学习
 4 ID : 4    char : 站    word : 编程学习网站
 5 ID : 1    char : 门    word : 编程入门
 6 
 7 test2
 8 ID : 1    char : 门    word : 编程入门
 9 ID : 3    char : 习    word : 编程学习
10 ID : 4    char : 站    word : 编程学习网站
11 ID : 2    char : 件    word : 编程软件
12 find ID : 3    word : 编程学习

　　test1就是上一步里，把word的id打印了出来，并导出到文件中。

　　test2是从文件中导入，然后打出结果，并查找“编程学习”，如果存在就打出id。

　　首先，新增了一个配置文件类DictionaryConf，如下：

DictionaryConf.h

 1 #ifndef __DICTIONARYCONF_H__
 2 #define __DICTIONARYCONF_H__
 3 
 4 #include <string>
 5 
 6 namespace ccx{
 7 
 8 using std::string;
 9 
10 class DictionaryConf
11 {
12     public:
13         DictionaryConf();
14         string getDictionaryPath();
15     private:
16         void GetConf();
17         string _DictionaryPath;
18 };
19 
20 }
21 
22 #endif

　　它的功能很简单，在构造的时候读取文件配置信息，对外只提供一个接口，就是获取一个字符串。它的实现如下：

DictionaryConf.cc

 1 #include "DictionaryConf.h"
 2 #include <json/json.h>
 3 #include <stdlib.h>
 4 #include <fstream>
 5 #include <iostream>
 6 
 7 namespace ccx{
 8 
 9 using std::ifstream;
10 using std::cout;
11 using std::endl;
12 
13 DictionaryConf::DictionaryConf()
14 {
15     GetConf();
16 }
17 
18 void DictionaryConf::GetConf()
19 {
20     ifstream ifs;
21     ifs.open("DictionaryConf.json");
22     if(!ifs.good())
23     {
24         cout << "open DictionaryConf.json error" << endl;
25         exit(EXIT_FAILURE); 
26     }
27     
28     Json::Value root;
29     Json::Reader reader;
30 
31     if(!reader.parse(ifs, root, false))
32     {
33         cout << "DictionaryConf json read error" << endl;
34         exit(EXIT_FAILURE);
35     }
36     
37     _DictionaryPath = root["DictionaryPath"].asString();
38     
39     ifs.close();
40 }
41 
42 string DictionaryConf::getDictionaryPath()
43 {
44     return _DictionaryPath;
45 }
46 
47 }

　　配置文件是写成json格式的，方便读取：

DictionaryConf.json

1 {"DictionaryPath" : "Dictionary.json"}

　　然后是修改已有的Dictionary类，增加它的功能（总觉得是不是要变胖接口了0。0）

Dictionary.h

 1 #ifndef __DICTIONARY_H__
 2 #define __DICTIONARY_H__
 3 
 4 #include "DictionaryData.h"
 5 #include "DictionaryConf.h"
 6 
 7 #include <memory>
 8 #include <vector>
 9 #include <list>
10 
11 namespace ccx{
12 
13 using std::shared_ptr;
14 using std::vector;
15 using std::list;
16 
17 class Dictionary
18 {
19     typedef unordered_map<string, pDictElem>::iterator WordIt;
20     public:
21         Dictionary();
22         void push(const string & word);
23         void push(vector<string> & words);
24         int search(const string & word);//查找，返回词的ID，如果没找到返回-1
25     private:
26         void AddWord(const string & word, int wordId);
27         void splitWord(const string & word, vector<string> & characters);//把词拆成字
28         pDictElem _dictionary;
29         DictionaryConf _conf;    
30 
31 //遍历用
32     public:
33         void next();
34         string getCurChar();
35         string getCurWord();
36         int getCurWordId();
37         void resetPoint();
38         bool isEnd();
39     private:
40         void nextWord();
41         pDictElem _pcur;
42         WordIt _itcur;
43 
44 //用list实现栈，遍历时方便
45         list<WordIt> _stackWord;
46         list<pDictElem> _stackDict;
47 
48 //导入导出
49     public:
50         void leading_in();
51         void leading_out();
52 };
53 
54 }
55 
56 #endif

　　这又是一个最终的类。比之前的多了一个查找的函数，一个Add函数，和导入导出的函数。

　　Add函数是原来的push改个名字，原因就是从我这里导出到文件的词是包含他的ID的，导入的时候的ID是固定的，所以改成在添加的时候要传入ID。对于提供的接口push，调用了Add函数。

　　改动过的Add和push如下：

 1 void Dictionary::AddWord(const string & word, int wordId)
 2 {
 3     vector<string> characters;
 4     splitWord(word, characters);
 5     
 6     vector<string>::iterator it_char;
 7     it_char = characters.begin();    
 8     pDictElem root;
 9     root = _dictionary;
10     for(; it_char != characters.end(); ++it_char)
11     {
12         WordIt it_word;
13         it_word = root->_words.find(*it_char);
14         
15         if(it_word == root->_words.end())
16         {
17             pair<string, pDictElem> temp;
18             temp.first = *it_char;
19             pDictElem dictemp(new DictElem);
20             dictemp->_word = *it_char;
21             dictemp->_wordId = 0;
22             temp.second = dictemp;
23             root->_words.insert(temp);
24             root = dictemp;
25         }else{
26             root = it_word->second;
27         }
28     }
29     if(!root->_wordId)
30     {
31         root->_wordId = wordId;
32     }
33 }

　　总感觉29行的if可以去掉呢--！但是没有去试

 1 void Dictionary::push(const string & word)
 2 {
 3     ++(_dictionary->_wordId);
 4     AddWord(word, _dictionary->_wordId);
 5 }
 6 
 7 
 8 void Dictionary::push(vector<string> & words)
 9 {
10     int size = words.size();
11     for(int i = 0; i < size; ++i)
12     {
13         push(words[i]);
14     }
15 }

　　对root的wordID的自增放到了这里。这样，导和的时候就可以调用Add函数了。（好吧我就是懒得重新写一个插入的函数0。0）

　　接下来是查找的函数，比较简单，直接把Add拷过来改了一下：

 1 int Dictionary::search(const string & word)
 2 {
 3     vector<string> characters;
 4     splitWord(word, characters);
 5     
 6     vector<string>::iterator it_char;
 7     it_char = characters.begin();    
 8     pDictElem root;
 9     root = _dictionary;
10     for(; it_char != characters.end(); ++it_char)
11     {
12         WordIt it_word;
13         it_word = root->_words.find(*it_char);
14         
15         if(it_word == root->_words.end())
16         {
17             return -1;
18         }else{
19             root = it_word->second;
20         }
21     }
22     return root->_wordId;
23 }

　　关于以上，有另外一个想法：不管是在插入还是查找，都是要先“查找”，看看有没有那个节点。对于查找操作，只要返回有或者没有就行了；而对于插入，如果没有就返回它的位置，从那里开始添加节点，这样似乎又可以复用一些代码了 0。0

　　最后就是导入和导出了。

　　关于这块，写之前一直在纠结，是要遍历所有词，把词完整地存下来。采用json的格式，对于“编程学习”与“编程训练”：

　　直接存进文件，后面带上ID的效果：

1 [
2     { "word" : "编程学习" , "id" : 1},
3     { "word" : "编程训练" , "id" : 2)
4 ]

　　还是把树按节点存下来，

（后来发现其实这是一个序列化与反序列化的问题——2016.12.4 23:54 注）

 1 [
 2     {"key" : "编" , "flag" : 1, "id" : 0 , "value" : 
 3 　　　　{"key" : "程" , "flag" : 1 , "id" : 0 , "value" : 
 4 　　　　　　{"key" : "学" , "flag" : 1 , "id" : 0, "value" : 
 5 　　　　　　　　{"key" : "习" , "flag" : 0 , "id" : 1, "value" : "NULL"
 6 　　　　　　　　}
 7 　　　　　　},
 8 　　　　　　{"key" : "训" , "flag" : 1 , "id" : 0, "value" : 
 9 　　　　　　　　{"key" : "练" , "flag" : 0 , "id" : 2, "value" : "NULL"
10 　　　　　　　　}
11 　　　　　　}
12 　　　　}
13 　　}
14 ]

　　大致是这样一层嵌套一层的结构。虽然这种结构可以用栈来实现，即把Json::Value型数据入栈，但是后来想想这样虽然一次只存一个结点，好像很节约空间，但是每个节点还要存一些额外的信息，逻辑似乎比较复杂，于是决定用第一种方法来实现，比较简单粗爆。

　　代码如下：

 1 void Dictionary::leading_out()
 2 {
 3     Json::Value root;
 4     Json::FastWriter writer;
 5 
 6     resetPoint();
 7     
 8     while(!isEnd())
 9     {
10         Json::Value elem;
11         elem["Word"] = getCurWord();
12         elem["WordId"] = getCurWordId();
13         root.append(elem);
14         next();
15     }
16 
17     string words;
18     words = writer.write(root);
19     
20     ofstream ofs;
21     const char * path = _conf.getDictionaryPath().c_str();
22     ofs.open(path);
23     if(!ofs.good())
24     {
25         cout << "open Dictionary.json error(leading_out)" << endl;
26         ofs.open("Dictionary.tmp");
27         if(!ofs.good())
28         {
29             exit(EXIT_FAILURE);
30         }
31     }
32     
33     ofs << words;
34     ofs.close();
35 }

　　存入文件的效果如下：

1 [{"Word":"编程软件","WordId":2},{"Word":"编程学习","WordId":3},{"Word":"编程学习网站","WordId":4},{"Word":"编程入门","WordId":1}]

　　只有一行（不要在意id顺序）。

　　之前想过一种方法能够让它以多行的形式存到文件中，做了如下处理：

 1     .
 2     .
 3     .
 4 ofs << "[" << endl;
 5 int size = root.size();
 6 for(int i = 0; i < size; ++i)
 7 {
 8     string json_file = writer.write(root[i]);
 9     int len = json_file.size();
10     json_file[len - 1] = ' ';
11     ofs << "	" << json_file;
12     if(i != size -1)
13     {
14         ofs << "," << endl;
15     }
16     }
17 ofs << endl << "]" << endl;
18 ofs.close();
19     .
20     .
21     .

　　一项一项地取出，把最后的' '替换成空格，然后自己控制输出格式。

　　导入格式要与导出格式对应，也使用json，如下：

 1 void Dictionary::leading_in()
 2 {
 3     ifstream ifs;
 4     const char * path = _conf.getDictionaryPath().c_str();
 5     ifs.open(path);
 6     if(!ifs.good())
 7     {
 8         cout << "open Dictionary.json error(leading_in)" << endl;
 9     }else{
10         Json::Value root;
11         Json::Reader reader;
12         
13         if(!reader.parse(ifs, root, false))
14         {
15             cout << "json read Dictionary.json error" << endl;
16         }else{
17             int size = root.size();
18             for(int i = 0; i < size; ++i)
19             {
20                 string word = root[i]["Word"].asString();
21                 int wordId = root[i]["WordId"].asInt();
22                 AddWord(word, wordId);
23                 ++(_dictionary->_wordId);
24             }
25         }
26     }
27 }

　　这里要说明一下，导出和导入打开文件如果失败了没有exit的原因：

　　对于导出，如果打开一个文件失败了直接exit，那内存中的东西就全部丢失了，于是就采用在当路径下创建一个临时文件来存要存的内容。

　　对于导入，打开文件失败只是说明路径有问题或者文件本身有问题，修改一下就可以了，或者只是没有导入词典而已，程序中如果已经添加了一些词，此时退出也会使内容丢失。

　　当然，目前的机制还是有问题的，比如词ID的设置，如果在内存中已有词的情况下导入新词，可能会使ID错乱。对于导出，或许可以设置一个机制：任何位置要退出的时候都把内容存入临时文件中，可能会更好。

　　做到这里，突然又想实现词的补全功能：传入“编程”，能传出“学习”、“学习网站”等的集合，这个做好了再补上吧！

查看全文

相关阅读:
solr中facet、group查询
 Solr开发文档
 Java的Properties类和读取.properties文件
 CentOS RPM安装MySQL-5.6
C#中的 IList, ICollection ,IEnumerable 和 IEnumerator
Linq to Xml
Linq查询
 XDocument和XmlDocument的区别
 关于扩展方法
 今天刚刚开通了园子

原文地址：https://www.cnblogs.com/chinxi/p/6131109.html