第一次个人编程作业

zoukankan html css js c++ java

第一次个人编程作业
一、GitHub链接
GitHub链接
二、设计思路以及实现方法
本来是打算使用c++的，但是后面发现python的实现更加方便，可以调用库。
关于分词的方法和相似度的计算，查阅了一些资料后最后选用了jieba分词以及jaccard系数。使用jieba是因为python直接调用库方便实现，使用jaccard是因为在对比了各个分词之后发现jaccard更加适合文本类型的相似度计算。
代码设计图：
- jieba分词法：
  支持四种分词模式：
  * 精确模式，试图将句子最精确地切开，适合文本分析；
  jieba.cut(str, cut_all=False)
  * 全模式，把句子中所有的可以成词的词语都扫描出来, 速度非常快，但是不能解决歧义；
  jieba.cut(str, cut_all=True)
  * 搜索引擎模式，在精确模式的基础上，对长词再次切分，提高召回率，适合用于搜索引擎分词。
  jieba.cut_for_search(str)
  * paddle模式，利用PaddlePaddle深度学习框架，训练序列标注（双向GRU）网络模型实现分词。同时支持词性标注。
  jieba.cut(str,use_paddle=True)
  这次代码中我选择了精准模式。
- jaccard相似度：
  给定两个集合A,B，Jaccard 系数定义为A与B交集的大小与A与B并集的大小的比值，定义如下：
三、模块介绍
- 输入输出
```
      f1 = open(argv[1],'r')
      f2 = open(argv[2],'r')
      f3 = open(argv[3],'w')

      f1_text=f1.read()
      f2_text=f2.read()
      f3.write("...")

      #print (f1.read())
      f1.close()
      f2.close()
      f3.close()
```
- jieba分词
  
  def jieba_list(text): items="" s="" for i in range(0,len(text)): #存储中文 if 'u4e00' <= text[i] <= 'u9fff': s += text[i] elif text[i] == '。': if s != "": items += s s = "" if s != "": items += s s = "" #print(items) test_items = jieba.lcut(items, cut_all=True) return test_items
- jaccard相似度
  
  def jaccard(text1,text2): #将分词去重 delete_text1 = set(text1) delete_text2 = set(text2) #print(delete_text1) #print(delete_text2) #记录相交分词的个数 temp = 0 for i in delete_text1: if i in delete_text2: temp += 1 fenmu = len(delete_text2) + len(delete_text1) - temp # 并集 jaccard_coefficient = float(temp / fenmu) # 交集 return jaccard_coefficient
四、结果输出

D:study编程专用pythonlearn1>python jaccard.py D:studysim_0.8s1.txt D:studysim_0.8s1.txt
Building prefix dict from the default dictionary ...
Loading model from cache C:Users57457AppDataLocalTempjieba.cache
Loading model cost 2.501 seconds.
Prefix dict has been built successfully.
1.0

D:study编程专用pythonlearn1>python jaccard.py D:studysim_0.8orig.txt D:studysim_0.8orig_0.8_add.txt
Building prefix dict from the default dictionary ...
Loading model from cache C:Users57457AppDataLocalTempjieba.cache
Loading model cost 1.772 seconds.
Prefix dict has been built successfully.
0.4635416666666667

D:study编程专用pythonlearn1>python jaccard.py D:studysim_0.8orig.txt D:studysim_0.8orig_0.8_del.txt
Building prefix dict from the default dictionary ...
Loading model from cache C:Users57457AppDataLocalTempjieba.cache
Loading model cost 1.799 seconds.
Prefix dict has been built successfully.
0.6505073280721533

D:study编程专用pythonlearn1>python jaccard.py D:studysim_0.8orig.txt D:studysim_0.8orig_0.8_dis_1.txt
Building prefix dict from the default dictionary ...
Loading model from cache C:Users57457AppDataLocalTempjieba.cache
Loading model cost 1.834 seconds.
Prefix dict has been built successfully.
0.9166666666666666

D:study编程专用pythonlearn1>python jaccard.py D:studysim_0.8orig.txt D:studysim_0.8orig_0.8_dis_3.txt
Building prefix dict from the default dictionary ...
Loading model from cache C:Users57457AppDataLocalTempjieba.cache
Loading model cost 2.310 seconds.
Prefix dict has been built successfully.
0.8378220140515222

D:study编程专用pythonlearn1>python jaccard.py D:studysim_0.8orig.txt D:studysim_0.8orig_0.8_dis_7.txt
Building prefix dict from the default dictionary ...
Loading model from cache C:Users57457AppDataLocalTempjieba.cache
Loading model cost 1.915 seconds.
Prefix dict has been built successfully.
0.7338842975206612

D:study编程专用pythonlearn1>python jaccard.py D:studysim_0.8orig.txt D:studysim_0.8orig_0.8_dis_10.txt
Building prefix dict from the default dictionary ...
Loading model from cache C:Users57457AppDataLocalTempjieba.cache
Loading model cost 2.036 seconds.
Prefix dict has been built successfully.
0.6769558275678552

D:study编程专用pythonlearn1>python jaccard.py D:studysim_0.8orig.txt D:studysim_0.8orig_0.8_dis_15.txt
Building prefix dict from the default dictionary ...
Loading model from cache C:Users57457AppDataLocalTempjieba.cache
Loading model cost 1.846 seconds.
Prefix dict has been built successfully.
0.4877932024892293

D:study编程专用pythonlearn1>python jaccard.py D:studysim_0.8orig.txt D:studysim_0.8orig_0.8_mix.txt
Building prefix dict from the default dictionary ...
Loading model from cache C:Users57457AppDataLocalTempjieba.cache
Loading model cost 2.271 seconds.
Prefix dict has been built successfully.
0.6966233766233766

D:study编程专用pythonlearn1>python jaccard.py D:studysim_0.8orig.txt D:studysim_0.8orig_0.8_rep.txt
Building prefix dict from the default dictionary ...
Loading model from cache C:Users57457AppDataLocalTempjieba.cache
Loading model cost 2.032 seconds.
Prefix dict has been built successfully.
0.3995140576188823
```
  结果分析：和其他人的结果差距较大，根据分析应该是因为主要用的set和list并交集，把重复的字都省略了，导致最后得出的相似度较低。本来打算换成利用了sklearn的CounterVectorizer类和numpy的，但是sklearn库一直下载失败，就作罢了。
```
五、PSP表格
查看全文

相关阅读:
nginx安装
 redis安装配置
 阿里云试用总结
 Starting MySQL....The server quit without updating PID file错误解决办法
 mysql du-master配置
 Centos6.5上的iptables
Swift 函数提前返回
 Google 推出新搜索引擎以查找数据集
 为用户设计的产品，就应该用用户熟悉的语言
 「工具」三分钟了解一款在线流程绘制工具：Whimsical

原文地址：https://www.cnblogs.com/yaningscnblogs/p/13688022.html