  • Fuzhou University Software Engineering 1816 · Assignment 5

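    I. Preface

    Pair partner: 031602428 苏路明 (Su Luming)

    Partner's blog
    This assignment's blog post
    GitHub project

    II. Division of Work

    We agreed on a clear division of work before starting the assignment:
    - Crawler: I wrote a crawler in Java and he wrote one in Python; once both were finished, we chose which one to use.
    - Word-frequency program (Java, split by feature): I implemented the -w and -n options; he implemented -m and the remaining features.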

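    III. PSP Table

    |PSP2.1|Personal Software Process Stages|Estimated time (min)|Actual time (min)|
    |:--|:--|:--|:--|
    |Planning|Planning|100|150|
    |· Estimate|· Estimate how long the task will take|100|150|
    |Development|Development|700|900|
    |· Analysis|· Requirements analysis (including learning new technologies)|150|200|
    |· Design Spec|· Write the design document|40|60|
    |· Design Review|· Design review|100|150|
    |· Coding Standard|· Coding standard (define suitable conventions for the current development)|0|0|
    |· Design|· Detailed design|100|150|
    |· Coding|· Actual coding|0|0|
    |· Code Review|· Code review|0|0|
    |· Test|· Testing (self-testing, fixing code, committing changes)|0|0|
    |Reporting|Reporting|90|160|
    |· Test Report|· Test report|90|130|
    |· Size Measurement|· Measure the workload|5|5|
    |· Postmortem & Process Improvement Plan|· Postmortem and process improvement plan|30|50|
    | |Total|1505|2105|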
    

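    IV. Approach, Design, and Implementation

    Crawler
    The crawler is written in Java and uses the jsoup library.
    1. Pass in the site URL:

        getContent(rooturl);
    2. Crawl each paper page, extract the required information, and write it to result.txt in the required format: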
    
        File file = new File("src/cvpr/result.txt");
        // try-with-resources closes the writer even if a request fails midway
        try (BufferedWriter bufferedWriter = new BufferedWriter(new FileWriter(file))) {
            // maxBodySize(0) lifts jsoup's response size limit; the timeout is in milliseconds
            org.jsoup.nodes.Document document = Jsoup.connect(rooturl).maxBodySize(0)
                                                .timeout(1000000)
                                                .get();
            // Each paper title on the index page links to that paper's detail page
            Elements elements = document.select("[class=ptitle]");
            Elements hrefs = elements.select("a[href]");
            int count = 0;
            for (Element element : hrefs) {
                String url = element.absUrl("href");
                org.jsoup.nodes.Document document2 = Jsoup.connect(url).maxBodySize(0)
                                                    .timeout(1000000)
                                                    .get();
                Elements elements2 = document2.select("[id=papertitle]");
                String title = elements2.text();
                // Separate consecutive papers with two blank lines
                if (count != 0)
                    bufferedWriter.write("\n" + "\n" + "\n");
                bufferedWriter.write(count + "\n");
                bufferedWriter.write("Title: " + title + "\n");
                Elements elements3 = document2.select("[id=abstract]");
                String Abstract = elements3.text();
                bufferedWriter.write("Abstract: " + Abstract);
                count++;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
     
    
    **Code organization and internal design (class diagram)**
    - **Class diagram:**
    ![](https://img2018.cnblogs.com/blog/1473263/201810/1473263-20181010142417425-1619477545.png)
    The structure and interfaces of the main functions are as follows:
    

    public static String Read(String pathname) // reads the file and preprocesses it
    public static void FindWordArray(List<String> tempLists, int len, String wordsLine) // finds the phrases that satisfy the requirements
    public static void WordCount(List<String> tempLists, int weight) // accumulates weighted counts
    public static void SortMap(Map<String,Integer> oldmap, int wordline, int wordcount, int characterscount, int flagN) // sorts and writes the output

    
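    - **Flowchart:**
    ![](https://img2018.cnblogs.com/blog/1473263/201810/1473263-20181010135852642-1938576913.png)
    - **Key points of the algorithm and their implementation:**
    
        **1. Read the text and preprocess it by converting the content to lowercase.**
        **2. Use Pattern and Matcher to extract the crawled Title and Abstract content, reading line by line with readLine() and counting characters and lines along the way.**
        **3. Check each word for validity according to the assignment rules; invalid words are skipped.**
        **4. Split the text into phrases, counting words at the same time.**
        **5. Depending on whether w is 1, apply the corresponding weights when counting words or phrases.**
        **6. Sort the results and output them as the assignment requires.**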
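
Step 3 can be sketched as a small helper. This is a minimal sketch, assuming the rule implied by the code later in this post: a word is valid when it has at least four characters and its first four characters are letters. The class name `WordRules` and the method name `isValidWord` are hypothetical and do not appear in the project.

```java
final class WordRules {
    /** Returns true if the token has at least 4 characters and its first 4 are letters. */
    static boolean isValidWord(String token) {
        if (token == null || token.length() < 4) {
            return false;
        }
        for (int i = 0; i < 4; i++) {
            if (!Character.isLetter(token.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidWord("file123")); // starts with four letters
        System.out.println(isValidWord("123file")); // starts with digits
        System.out.println(isValidWord("abc"));     // too short
    }
}
```

Pulling the check into one method would also remove the long repeated `Character.isLetter(...)` condition from the title and abstract branches.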
    
    
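    ##V. Extra-Credit Design and Showcase
    1. Crawl author information and generate a ranking of the top CVPR 2018 authors (authors sorted by number of associated papers).
    [Author ranking by number of associated papers](https://pan.baidu.com/s/1a1tzsiIL7Qw3nVeP6TcrYA)
    ![](https://img2018.cnblogs.com/blog/1474721/201810/1474721-20181012150622356-219586908.png)
    
    2. Crawl the CVPR papers from 2014 to 2018, output them by year, and analyze the trend in paper counts (some links returned 404, so some papers are missing).
    [2014-2018 papers](https://pan.baidu.com/s/1a1tzsiIL7Qw3nVeP6TcrYA)
    Trend chart (pending)
    
    3. Word cloud of the top researchers across the five years (some links returned 404, so some terms are missing).
    ![](https://img2018.cnblogs.com/blog/1474721/201810/1474721-20181012150310180-1317256387.png)
    
    4. Word cloud of popular terms in papers from the last five years
    ![](https://img2018.cnblogs.com/blog/1474721/201810/1474721-20181012150323933-1818306734.png)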
    
    
    ##VI. Key Code Explained
    
    public static String Read(String pathname) throws Exception {
        // Scanner scanner = new Scanner(System.in);
        // String pathname = scanner.nextLine();

        Reader myBufferedReader = new BufferedReader(new FileReader(pathname));

        // Preprocess the text first: copy it into memory, lowercasing ASCII letters
        CharArrayWriter tempStream = new CharArrayWriter();
        int i;
        while ((i = myBufferedReader.read()) != -1) {
            if (i >= 65 && i <= 90) {
                i += 32; // 'A'..'Z' -> 'a'..'z'
            }
            tempStream.write(i);
        }
        myBufferedReader.close();

        // Overwrite the original file with the lowercased content
        Writer myWriter = new FileWriter(pathname);
        tempStream.writeTo(myWriter);
        tempStream.close();
        myWriter.close();
        return pathname;
    }
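
`Read` overwrites the input file with its lowercased content. As a hedged alternative, the sketch below keeps the same ASCII-only lowercasing but returns the text in memory and leaves the source untouched. `TextLoader` and `readLowercase` are hypothetical names, not part of the project.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class TextLoader {
    /** Reads every character and lowercases only ASCII 'A'..'Z', like Read does. */
    static String readLowercase(Reader source) {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            int c;
            while ((c = reader.read()) != -1) {
                if (c >= 'A' && c <= 'Z') {
                    c += 'a' - 'A';
                }
                text.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // A StringReader stands in for a FileReader so the example runs without a file.
        System.out.println(readLowercase(new StringReader("Title: Deep Learning 101")));
    }
}
```

Passing `new FileReader(pathname)` instead of the `StringReader` would read from the real input file.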
    

    String readLine = null;
    // group(2) captures everything after the "title: " / "abstract: " label
    Pattern pattern1 = Pattern.compile("(title): (.*)");
    Pattern pattern2 = Pattern.compile("(abstract): (.*)");
    while ((readLine = bufferedReader.readLine()) != null) {
        Matcher matcher1 = pattern1.matcher(readLine);
        Matcher matcher2 = pattern2.matcher(readLine);
        if (matcher1.find()) {
            characterscount += matcher1.group(2).length();
            wordline++;
            String[] wordsArr1 = matcher1.group(2).split("[^a-zA-Z0-9]"); // split on non-alphanumeric characters
            for (String newword : wordsArr1) {
                if (newword.length() != 0) {
                    // A valid word has at least 4 characters and starts with 4 letters
                    if ((newword.length() >= 4) && (Character.isLetter(newword.charAt(0)) && Character.isLetter(newword.charAt(1)) && Character.isLetter(newword.charAt(2)) && Character.isLetter(newword.charAt(3)))) {
                        wordcount++;
                        if (len == 1)
                            lib.titleLists.add(newword);
                    }
                }
            }

            String wordsLine = matcher1.group(2);
            if (len != 1 || wordsLine.length() < 4) {
                lib.FindWordArray(lib.titleLists, len, wordsLine);
            }
        }
        if (matcher2.find()) {
            characterscount += matcher2.group(2).length();
            wordline++;
            String[] wordsArr2 = matcher2.group(2).split("[^a-zA-Z0-9]"); // split on non-alphanumeric characters
            for (String newword : wordsArr2) {
                if (newword.length() != 0) {
                    if ((newword.length() >= 4) && (Character.isLetter(newword.charAt(0)) && Character.isLetter(newword.charAt(1)) && Character.isLetter(newword.charAt(2)) && Character.isLetter(newword.charAt(3)))) {
                        wordcount++;
                        if (len == 1)
                            lib.abstractLists.add(newword);
                    }
                }
            }

            String AbsLine = matcher2.group(2);
            if (len != 1 || AbsLine.length() < 4) {
                lib.FindWordArray(lib.abstractLists, len, AbsLine);
            }
        }
    }
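
The loop above matches every line against two patterns and then pulls the content out of a capture group. The extraction step on its own can be sketched as follows, assuming the input has already been lowercased and that each record line starts with `title: ` or `abstract: `. `LineExtractor` and `extractContent` are hypothetical names used only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LineExtractor {
    // Anchored so that a label appearing in the middle of a line is not matched.
    private static final Pattern TITLE = Pattern.compile("^title: (.*)$");
    private static final Pattern ABSTRACT = Pattern.compile("^abstract: (.*)$");

    /** Returns the text after "title: " or "abstract: ", or null for any other line. */
    static String extractContent(String line) {
        Matcher title = TITLE.matcher(line);
        if (title.find()) {
            return title.group(1);
        }
        Matcher abs = ABSTRACT.matcher(line);
        if (abs.find()) {
            return abs.group(1);
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(extractContent("title: deep learning"));
        System.out.println(extractContent("0")); // paper index lines carry no content
    }
}
```

Compiling the patterns once as constants avoids recompiling them for every line.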
    

    public static void FindWordArray(List<String> tempLists, int len, String wordsLine) {

        int tempi = 0; // characters consumed in the current word
        int cnti = 0;  // words stored so far; indexes the circular arrays
        int cntt = 0;  // consecutive valid words currently in the window
        String temp = "";
        String[] words = new String[len];
        String[] separators = new String[len];
        for (int i = 0; i < wordsLine.length(); i++) {
            // The first four characters of a new word must be letters
            if (tempi < 4 && Character.isLetter(wordsLine.charAt(i))) {
                tempi++;
                temp = temp + wordsLine.charAt(i);

                // End of line: the current word is complete
                if (i == wordsLine.length() - 1) {
                    words[cnti % len] = temp;
                    cnti++;
                    cntt++;

                    // A complete phrase of len words is available
                    if (cntt == len) {
                        String wordArray = "";
                        for (int j = 0; j < len; j++) {
                            wordArray = wordArray + words[(cnti + j) % len];
                            if (j != len - 1) wordArray = wordArray + separators[(cnti + j) % len];
                        }
                        tempLists.add(wordArray);
                        cntt--;
                    }
                }
            }
            else if (tempi >= 4) {
                tempi++;
                if (Character.isLetter(wordsLine.charAt(i)) || Character.isDigit(wordsLine.charAt(i))) {
                    temp = temp + wordsLine.charAt(i);

                    // End of line: the current word is complete
                    if (i == wordsLine.length() - 1) {
                        words[cnti % len] = temp;
                        cnti++;
                        cntt++;

                        // A complete phrase of len words is available
                        if (cntt == len) {
                            String wordArray = "";
                            for (int j = 0; j < len; j++) {
                                wordArray = wordArray + words[(cnti + j) % len];
                                if (j != len - 1) wordArray = wordArray + separators[(cnti + j) % len];
                            }
                            // Add the phrase to the list
                            tempLists.add(wordArray);
                            cntt--;
                        }
                    }
                }
                else {
                    // A separator ends the current word
                    words[cnti % len] = temp;
                    cnti++;
                    cntt++;

                    if (cntt == len) {
                        String wordArray = "";
                        for (int j = 0; j < len; j++) {
                            wordArray = wordArray + words[(cnti + j) % len];
                            if (j != len - 1) wordArray = wordArray + separators[(cnti + j) % len];
                        }
                        // Add the phrase to the list
                        tempLists.add(wordArray);
                        cntt--;
                    }
                    if (i + 4 >= wordsLine.length())
                        break;
                    tempi = 0;
                    temp = "";

                    // Collect the full separator up to the next letter or digit
                    String tempSeparator = "" + wordsLine.charAt(i);
                    for (int j = 1; j < wordsLine.length() - i; j++) {
                        if (Character.isDigit(wordsLine.charAt(i + j)) || Character.isLetter(wordsLine.charAt(i + j))) {
                            temp = "";
                            separators[(cnti - 1) % len] = tempSeparator;
                            break;
                        }
                        else tempSeparator = tempSeparator + wordsLine.charAt(i + j);
                    }
                }
            }

            // An invalid word appears: reset the window
            else {
                if (i + 4 >= (int) wordsLine.length())
                    break;
                tempi = 0;
                temp = "";
                cnti = 0;
                cntt = 0;
            }
        }
    }

    public static void WordCount(List<String> tempLists, int weight) {
        for (String li : tempLists) {
            if (wordsCount.get(li) != null) {
                wordsCount.put(li, wordsCount.get(li) + weight);
            } else {
                wordsCount.put(li, weight);
            }
        }
    }
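
The get-then-put pattern in `WordCount` can also be written with `Map.merge` (available since Java 8). The sketch below is one possible form rather than the project's code: `WeightedCounter` and `addAll` are hypothetical names, and the weights 10 and 1 in `main` are illustrative values only.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class WeightedCounter {
    /** Adds weight to the running count of every item in the list. */
    static void addAll(Map<String, Integer> counts, List<String> items, int weight) {
        for (String item : items) {
            // Inserts weight if the key is absent, otherwise adds it to the old value.
            counts.merge(item, weight, Integer::sum);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        addAll(counts, Arrays.asList("deep", "learning", "deep"), 10); // e.g. words from titles
        addAll(counts, Arrays.asList("deep"), 1);                      // e.g. words from abstracts
        System.out.println(counts.get("deep") + " " + counts.get("learning"));
    }
}
```

Passing the map in as a parameter, instead of using a static field, also makes the method easier to unit test.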
    

    public static void SortMap(Map<String,Integer> oldmap, int wordline, int wordcount, int characterscount, int flagN) throws IOException {

        ArrayList<Map.Entry<String,Integer>> list = new ArrayList<Map.Entry<String,Integer>>(oldmap.entrySet());

        Collections.sort(list, new Comparator<Map.Entry<String,Integer>>() {
            @Override
            public int compare(Entry<String, Integer> o1, Entry<String, Integer> o2) {
                return o2.getValue() - o1.getValue();  // descending order
            }
        });
        File file = new File("result.txt");
        BufferedWriter bi = new BufferedWriter(new FileWriter(file));
        bi.write("characters: " + characterscount + "\n");
        bi.write("words: " + wordcount + "\n");
        bi.write("lines: " + wordline + "\n");
        // Write at most flagN entries, highest count first
        int flag = 0;
        for (int i = 0; i < list.size(); i++) {
            if (flag >= flagN) break;
            if (list.get(i).getKey().length() >= 4)
                bi.write("<" + list.get(i).getKey() + ">" + ": " + list.get(i).getValue() + "\n");
            flag++;
        }
        bi.close();
    }
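
The comparator in `SortMap` orders entries by count only, so entries with equal counts keep whatever order the map happened to iterate in. If a deterministic order is wanted, ties can be broken by key. The sketch below assumes alphabetical tie-breaking, which this post does not state as a requirement; `TopN` and `top` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TopN {
    /** Returns up to n entries, highest count first; equal counts are ordered by key. */
    static List<Map.Entry<String, Integer>> top(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            // Integer.compare avoids the overflow that plain subtraction can cause.
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        int limit = Math.max(0, Math.min(n, entries.size()));
        return new ArrayList<>(entries.subList(0, limit));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("network", 5);
        counts.put("image", 5);
        counts.put("learning", 9);
        for (Map.Entry<String, Integer> e : top(counts, 2)) {
            System.out.println("<" + e.getKey() + ">: " + e.getValue());
        }
    }
}
```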
    
    
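    ##VII. Performance Analysis and Improvements
    - Improvement ideas
        1. Word counting and phrase counting are done in separate passes, which costs some performance; merging the two would help.
        2. Phrase splitting reads the text character by character and keeps words and separators in circular arrays; switching to regular-expression matching would probably be faster.
        3. On long files, our character count differs from other people's (it seems everyone gets a different answer). After looking for a solution, the cause appears to be non-ASCII characters; fixing this is outside the scope of this assignment.
        4. The remaining parts already performed well in the individual assignment, so we have no further improvement ideas for now.
    - Code coverage
    ![](https://img2018.cnblogs.com/blog/1474721/201810/1474721-20181011012221999-434837612.png)
    
    - Performance test
    ![](https://img2018.cnblogs.com/blog/1474721/201810/1474721-20181011012338814-438969164.png)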
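
Improvement idea 2 could look roughly like the sketch below: a regular expression yields each alphanumeric run together with the separator that follows it, so phrases can be rebuilt with their original separators. This is an illustrative sketch, not the project's implementation and not a measured speedup. It assumes lowercased input and the same validity rule as above (four leading letters); `PhraseExtractor` and `phrases` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PhraseExtractor {
    // Group 1: a run of letters/digits. Group 2: the separator run after it (may be empty).
    private static final Pattern TOKEN = Pattern.compile("([a-z0-9]+)([^a-z0-9]*)");
    private static final Pattern VALID_WORD = Pattern.compile("[a-z]{4}[a-z0-9]*");

    /** Returns every phrase of m consecutive valid words, keeping the original separators. */
    static List<String> phrases(String lowercaseLine, int m) {
        if (m < 1) {
            throw new IllegalArgumentException("m must be at least 1");
        }
        List<String> result = new ArrayList<>();
        List<String> words = new ArrayList<>();      // current run of consecutive valid words
        List<String> separators = new ArrayList<>(); // separators.get(i) follows words.get(i)
        Matcher matcher = TOKEN.matcher(lowercaseLine);
        while (matcher.find()) {
            String word = matcher.group(1);
            if (!VALID_WORD.matcher(word).matches()) {
                // An invalid word breaks the run, so phrases cannot span it.
                words.clear();
                separators.clear();
                continue;
            }
            words.add(word);
            separators.add(matcher.group(2));
            if (words.size() >= m) {
                StringBuilder phrase = new StringBuilder();
                for (int i = words.size() - m; i < words.size(); i++) {
                    phrase.append(words.get(i));
                    if (i < words.size() - 1) {
                        phrase.append(separators.get(i));
                    }
                }
                result.add(phrase.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(phrases("monday tuesday, wednesday 123 thursday", 2));
    }
}
```

Here `123` is not a valid word, so no phrase spans it, and `thursday` alone does not form a two-word phrase.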
    
    ##VIII. Unit Tests
    Below are the unit tests I ran, with a brief description of each and the resulting output.
    ![](https://img2018.cnblogs.com/blog/1473263/201810/1473263-20181012184137577-1083130898.png)
    
    
    ##IX. GitHub Commit History
    ![](https://img2018.cnblogs.com/blog/1474721/201810/1474721-20181011012156540-1796481234.png)
    
    
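    ##X. Module Bugs and Pairing Difficulties, and How We Resolved Them
    - **Problem description**:
    1. The Python crawler sometimes missed part of the content.
    2. When splitting out words with a regular expression, the separators needed in the final output could not be preserved.
    - **What we tried**:
    1.  
        · Debugged the code
        · Searched online for similar problems and their solutions
        · Improved the code
    2.
        · Studied regular expressions in more depth
        · Used a different splitting method that preserves the separators
    - **Was it resolved**:
        · Learned to write a Java crawler from scratch instead
    - **What we learned**:
        · When one approach simply cannot solve a problem, try a different one.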
    
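    ##XI. My Partner
    My partner is really impressive and seems to know everything. Whenever I got stuck or missed something, he would point it out. Super capable.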
    
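    ##Appendix: Learning Progress Bar
    |||||||
    |:--|:--|:--|:--|:--|:--|
    |**Week N**|**New code (lines)**|**Total code (lines)**|**Study time this week (hours)**|**Total study time (hours)**|**Key progress**|
    |1|500|500|10|30|Learned more about Eclipse and became familiar with more new methods|
    |2|0|200|4|10|Got to know the APIs; learned new usages, interfaces, and classes|
    |3|0|300|12|12|Got better at using Axure; learned to use the NABCD model for requirements analysis|
    |4|200|200|10|10|Learned to write a simple Java crawler|
    |5|200|200|10|40|Learned more about Eclipse and became familiar with more new methods|

    ![](https://img2018.cnblogs.com/blog/1473263/201809/1473263-20180921203759224-754094179.jpg)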
  • Original post: https://www.cnblogs.com/T1DE/p/9766097.html