  • Individual Project

    BUAA Advanced Software Engineering

    Project:  Individual Project - Word frequency program

    Ryan Mao (毛宇)-1106116_11061171

    Implement a console application to tally the frequency of words under a directory (2 modes).

    1) Before you implement this project, record your estimate of the time you WILL spend on each component of your program.

    Before I started, I split the assignment into parts based on my understanding of the project.

    First, I had to design my approach to the problem (word frequency tally). I decided to divide the whole process as follows:

    1. Traverse the directory: A function capable of visiting all the files and subfolders under the specified directory. I chose to do it with a breadth-first traversal.

    Since traversing a directory cannot be done without the Windows API (as far as I know), I planned to spend half an hour studying the API.

    2. Statistics: A module that records how many times a certain word has been read. Factors such as the mode and the letter case are taken into consideration. Before I started, I could not work out the details, so I planned to spend two hours on this part.
    3. Sort&&Output: This step ranks the records according to the stated requirement. With the help of the “algorithm” library, I guessed I could finish it in half an hour.
    4. Debug&&improving the performance: The time for this part is rather unpredictable, so I planned to spend no more than 2 hours on it.
    5. Blog: 1.5 hours.

    In total, I estimated 6.5 hours to finish the assignment.

    2) After you had implemented this project, record the ACTUAL time you spent on each component of your program.

    Actually, I spent about 10 hours on this assignment. For each component:

    1. 2 hours (Traverse the directory)

    2. 2.5 hours (Statistics)

    3. 1.5 hours (Sort&&Output)

    4. 2.5 hours (Debug&&improving the performance)

    5. 1.5 hours (Blog)

    3) Describe how much time you spent on improving the performance of your program, and show a performance analysis graph (generated by the VS2012 performance analysis tool). If possible, please show the most costly function in your program.

    I did not spend much time on debugging (about 40 minutes), but I spent more time on improving the performance (1 hour and 50 minutes).

    For the profiling run, I used a 5 MB folder containing 25 English novels.

     

    [Figures: VS2012 performance analysis report, Summary and Function Details views]

    Tracing the result to find the most time-consuming part (as below):

    It shows that the function used to traverse the files costs the most time (which is quite surprising, since I thought it was the easiest part!).

    Let’s go deeper!

     

    The result makes it clear that the function “statistics”, which analyzes each file, costs the most time.

    Deeper…

     

    The result is obvious: we should improve the parts of the source code marked in red.

    Analysis

    For the code I just mentioned:

           if (low_first_table.count(temp) == 0)  // check whether this is the first occurrence
           {
             low_first_table[temp] = word;
           }
           // tally
           count_table[low_first_table[temp]]++;

    It would clearly be better to use the function “find” instead of “count”.

    Also, for the second part, it would be better to store the value of low_first_table[temp] in advance to avoid redundant lookups.

    So it should look like this:

           if (low_first_table.find(temp) == low_first_table.end())  // check whether this is the first occurrence
           {
             low_first_table[temp] = word;
             count_table[word]++;
           }
           // otherwise, tally under the spelling stored earlier
           else
           {
             count_table[low_first_table[temp]]++;
           }
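
    The idea of avoiding redundant lookups can be pushed one step further. The sketch below is not the code from my program: it assumes both tables are std::map keyed by std::string (the declarations are not shown in this post), and the names FirstFormTable, CountTable and tally are hypothetical. It relies on the fact that map::insert leaves an existing entry untouched and returns an iterator to it either way, so the lower-case key is searched only once.

```cpp
#include <map>
#include <string>
#include <utility>

// Assumed table types; the post does not show the real declarations.
typedef std::map<std::string, std::string> FirstFormTable;  // lower-case key -> first spelling seen
typedef std::map<std::string, int> CountTable;              // first spelling -> occurrences

// Record one occurrence of `word`, whose lower-cased form is `temp`.
void tally(FirstFormTable& low_first_table, CountTable& count_table,
           const std::string& temp, const std::string& word) {
    // insert() does nothing if `temp` is already present, and in both cases
    // returns an iterator to the entry, so one search serves both branches.
    std::pair<FirstFormTable::iterator, bool> result =
        low_first_table.insert(std::make_pair(temp, word));

    // result.first->second is the spelling stored for this key.
    ++count_table[result.first->second];
}
```

    Calling tally with ("hello", "Hello") and then ("hello", "heLLO") leaves “Hello” as the stored spelling with a count of 2. Whether this is measurably faster than the version above would still have to be checked with the profiler.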

    Let’s take a look at another possible place to make an improvement: ranking.

    First, std::vector is said to offer the best random-access performance among the containers (C++ Primer, p. 287), so using a vector here is appropriate.

    Second, the function “std::sort” is documented to run in O(N log N) time, which is as good as a comparison-based sorting algorithm can get.

    Third, sorting is indispensable because the requirement asks for ranked output.

    Thus it would be rather tricky to improve this part.
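
    For reference, the ranking step described above can be sketched as follows. This is a minimal illustration, not the code from my program: the names WordCount, by_frequency and rank_words are hypothetical, and the ordering rule (higher count first, ties broken alphabetically) is an assumption, since the exact requirement is not quoted in this post.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> WordCount;  // (word, occurrences)

// Assumed ordering: higher count first; equal counts fall back to
// alphabetical order so the output is deterministic.
bool by_frequency(const WordCount& a, const WordCount& b) {
    if (a.second != b.second) {
        return a.second > b.second;
    }
    return a.first < b.first;
}

// Copy the map entries into a vector (std::sort needs random access,
// which a map does not provide) and sort them.
std::vector<WordCount> rank_words(const std::map<std::string, int>& count_table) {
    std::vector<WordCount> ranked(count_table.begin(), count_table.end());
    std::sort(ranked.begin(), ranked.end(), by_frequency);
    return ranked;
}
```

    If only the top few entries are needed, std::partial_sort could replace std::sort, but that only pays off when the output is much shorter than the word list.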

    The graph after the improvement is:

     

    It’s not very obvious, but we can still observe the improvement:

     

    4) Share your 10 test cases, and how did you make sure your program can produce the correct result. (Programs with incorrect results will get 0 points, regardless of speed.)

    My test cases are designed to cover several exceptional situations, including:

    1. Empty folder (obvious).
    2. Empty file (obvious).
    3. The same word in different upper/lower case, like “Hello”, “hello”, “heLLO”, etc.
    4. The same word with different trailing numbers (to test the extended mode), like Windows98, Windows99, Windows8, etc.
    5. Words separated by non-alphanumeric characters as delimiters, like: Hello~!!World!****China~~I)()()(Love^&^&You
    6. Different words with the same frequency (to test the “sort” function).
    7. Strings with 3 or fewer letters, like Hi, My, etc.
    8. Strings that start with a number, like 123Hello.
    9. Strings interrupted by numbers, like humo99rous.
    10. A large set of test files, as below (to see whether the program crashes):

     

    Besides, I also downloaded the Rost English Words Counter from the Internet to compare its results with mine.
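
    Test cases 3, 4 and 5 come down to how a line is split into tokens and how each token is turned into a comparison key. The sketch below illustrates that under assumptions: split_tokens and make_key are hypothetical names, and the rule that the extended mode ignores trailing digits is inferred from test case 4. It deliberately does not decide which tokens count as words (cases 7 to 9), because that rule is not stated in this post.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Split text into maximal runs of letters and digits; every other
// character is treated as a delimiter (test case 5).
std::vector<std::string> split_tokens(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        const unsigned char ch = static_cast<unsigned char>(text[i]);
        if (std::isalnum(ch)) {
            current += text[i];
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        tokens.push_back(current);  // last token has no trailing delimiter
    }
    return tokens;
}

// Build the key used to decide whether two tokens are "the same word":
// lower-case everything (case 3) and, in the assumed extended mode,
// drop trailing digits (case 4).
std::string make_key(const std::string& token, bool extended_mode) {
    std::string key = token;
    for (std::string::size_type i = 0; i < key.size(); ++i) {
        key[i] = static_cast<char>(std::tolower(static_cast<unsigned char>(key[i])));
    }
    if (extended_mode) {
        std::string::size_type end = key.size();
        while (end > 0 && std::isdigit(static_cast<unsigned char>(key[end - 1]))) {
            --end;
        }
        key.erase(end);
    }
    return key;
}
```

    With these helpers, the string from test case 5 splits into six tokens (Hello, World, China, I, Love, You), and make_key("Windows98", true) gives the same key as make_key("Windows8", true).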

     

    5) Describe what you had learned in this exercise.

    In this exercise, I planned to solve the problem in 6.5 hours. However, it took me almost ten hours. I was surprised to learn how unpredictable software engineering is. In my future studies, I will therefore try to get better at making plans that are well thought out and feasible. It will be hard, since it requires experience and broad knowledge of this field. Anyway, I will try my best to learn this course well.

    What’s more, I also learned how to use the Performance Analysis feature in Visual Studio 2012. It’s an awesome tool for developers who want to improve their projects. I will use it more frequently in the future.

    Finally, I also learned that it is important to understand the user requirements clearly. In this case, I misunderstood the word “delimiter” and wasted a lot of time on debugging. So next time, I will make sure I know what the requirement really means.

    Thanks!

  • Original post: https://www.cnblogs.com/RylynnMao/p/3322064.html