  • [Python Study Notes-008] Removing duplicate text lines with a doubly linked list

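    Python makes processing text files very convenient. When a text file contains long runs of repeated lines, collapsing each run and printing a marker such as "...<repeats X times>..." makes the file much easier to browse. Below we build a doubly linked list in Python to do exactly that. If you are curious about implementing a doubly linked list in Python, it is worth a five-minute read. Have fun :-)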

    01 - Define the list node

    In C, the node would look like this:

    struct node {
        int           lineno;
        char          *line;
        char          *md5;
        int           dupcnt; /* duplicated counter */
        struct node   *prev;
        struct node   *next;
    };

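    In Python 3, such a node can be defined with a dictionary. For example:

    node = {}
    node['lineno'] = index + 1
    node['line'] = line.strip()
    node['md5'] = md5txt
    node['dupcnt'] = 0
    node['prev'] = index - 1   # index of the previous node, not a pointer
    node['next'] = index + 1   # index of the next node, not a pointer

    Because a Python list is already a dynamic array, this saves a lot of work: the nodes live in a list, prev and next hold list indices, and we do not need to think about building the list the way we would in C.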
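
    The dictionary layout above can be wrapped in a small helper. This is only a sketch: the name make_node and its total parameter are hypothetical and do not appear in the original code.

```python
import hashlib


def make_node(index, line, total):
    """Build one node dict; 'prev'/'next' hold list indices, or None at the ends."""
    data = line.strip()
    return {
        'lineno': index + 1,
        'line': data,
        'md5': hashlib.md5(data.encode('utf-8')).hexdigest(),
        'dupcnt': 0,
        'prev': index - 1 if index > 0 else None,
        'next': index + 1 if index < total - 1 else None,
    }


node = make_node(0, 'hello\n', total=2)
print(node['lineno'], node['line'], node['prev'], node['next'])
```

    For the first of two lines, the node has no predecessor and its successor is index 1.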

    02 - Initialize the doubly linked list

    import hashlib


    def init_doubly_linked_list(l_in):
        l_out = []
        index = 0
        for text in l_in:
            # Strip surrounding whitespace (including the trailing newline)
            data = text.strip()
            md5 = hashlib.md5(data.encode(encoding='UTF-8')).hexdigest()

            d_node = {}
            d_node['lineno'] = index + 1
            d_node['line'] = data
            d_node['md5'] = md5
            d_node['dupcnt'] = 0
            d_node['prev'] = index - 1
            d_node['next'] = index + 1
            if index == 0:                # the head node has no predecessor
                d_node['prev'] = None
            if index == len(l_in) - 1:    # the tail node has no successor
                d_node['next'] = None
            l_out.append(d_node)

            index += 1
        return l_out

    Very simple: each node is appended at the tail of the list.
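
    To visualise the links that tail insertion produces, the sketch below computes only the (prev, next) index pairs for a list of n nodes. The helper name link_indices is hypothetical and is not part of the original program.

```python
def link_indices(n):
    """Return the (prev, next) index pair of each node in an n-node list."""
    return [
        (i - 1 if i > 0 else None, i + 1 if i < n - 1 else None)
        for i in range(n)
    ]


print(link_indices(3))  # [(None, 1), (0, 2), (1, None)]
```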

    03 - Unlink the nodes that hold repeated lines

    def omit_doubly_linked_list(l_dll):
        for curr_node in l_dll:
            prev_node_index = curr_node['prev']
            next_node_index = curr_node['next']

            if prev_node_index is None:  # the head node has nothing to compare with
                continue
            prev_node = l_dll[prev_node_index]

            if next_node_index is None:  # the tail node
                next_node = None
            else:
                next_node = l_dll[next_node_index]

            if curr_node['md5'] != prev_node['md5']:
                continue

            # Update dupcnt of previous node
            prev_node['dupcnt'] += 1

            # Remove current node: only the neighbours' links are rewired
            if next_node is not None:
                next_node['prev'] = curr_node['prev']
            prev_node['next'] = curr_node['next']

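    If the md5 of the current line equals that of the previous line, the line is a repeat. It is handled as follows:

    • increment the duplicate counter (dupcnt) of the previous node;
    • unlink the current node from the doubly linked list (we only modify the next of the predecessor and the prev of the successor; the node is not actually deleted, because there is no need).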
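
    The two steps above can be tried on a toy list. This is a minimal sketch under assumptions: the helper names build and unlink_duplicates are hypothetical, and each node carries a plain 'key' field in place of the md5 digest.

```python
def build(keys):
    """Create index-linked nodes, one per key."""
    last = len(keys) - 1
    return [
        {'key': k, 'dupcnt': 0,
         'prev': i - 1 if i > 0 else None,
         'next': i + 1 if i < last else None}
        for i, k in enumerate(keys)
    ]


def unlink_duplicates(nodes):
    """Unlink every node whose key equals the key of its current predecessor."""
    for curr in nodes:
        if curr['prev'] is None:          # head node
            continue
        prev = nodes[curr['prev']]
        if curr['key'] != prev['key']:
            continue
        prev['dupcnt'] += 1               # step 1: count the repeat
        prev['next'] = curr['next']       # step 2: rewire the neighbours
        if curr['next'] is not None:
            nodes[curr['next']]['prev'] = curr['prev']


nodes = build(['a', 'a', 'a', 'b'])
unlink_duplicates(nodes)
print(nodes[0]['dupcnt'], nodes[0]['next'], nodes[3]['prev'])  # 2 3 0
```

    A run of three or more repeats works because each unlinked node hands its own prev to its successor, so every later repeat is compared against the first node of the run.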

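    You may ask why we compare md5 digests instead of comparing the text lines directly. Personally, I think computing the md5 of each line first and then comparing the digests works better, especially when the lines are very long, because an md5 digest is 128 bits and its hexadecimal form is always 32 characters.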
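
    The fixed digest length is easy to confirm with hashlib. In this small sketch the helper name md5_hex is hypothetical; hashlib.md5 and hexdigest are the standard-library calls the article already uses.

```python
import hashlib


def md5_hex(text):
    """Return the hexadecimal md5 digest of a text line."""
    return hashlib.md5(text.encode('utf-8')).hexdigest()


for text in ('ok', 'x' * 10000):
    # Input length varies, digest length does not
    print(len(text), len(md5_hex(text)))
```

    Both lines of output end in 32, whatever the length of the input.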

    04 - Traverse the processed doubly linked list

    def traverse_doubly_linked_list(l_dll):
        l_out = []

        node_index = None
        if len(l_dll) > 0:
            node_index = 0

        while node_index is not None:  # <==> p != NULL
            curr_node = l_dll[node_index]

            msg = '%6d\t%s' % (curr_node['lineno'], curr_node['line'])
            l_out.append(msg)

            #
            # 1) If dupcnt is 0, it means subsequent lines don't repeat current
            #    line, just go to visit the next node
            # 2) If dupcnt >= 1, it means subsequent lines repeat the current line
            #    a) If dupcnt is 1, i.e. only one line repeats, just pick it up
            #    b) else save message like '...<repeats X times>...'
            #
            if curr_node['dupcnt'] == 0:
                node_index = curr_node['next']
                continue
            elif curr_node['dupcnt'] == 1:
                msg = '%6d\t%s' % (curr_node['lineno'] + 1, curr_node['line'])
            else:  # i.e. curr_node['dupcnt'] > 1
                msg = '%s\t...<repeats %d times>...' % (' ' * 6,
                                                        curr_node['dupcnt'])
            l_out.append(msg)

            node_index = curr_node['next']

        return l_out

    • If the dupcnt of the current node is 0, the line after it is different, so just print the current line;
    • If the dupcnt of the current node is 1, exactly one following line repeats it, so print the current line and then the next line, remembering to add one to the line number;
    • If the dupcnt of the current node is N (> 1), the N lines after it repeat it, so print the current line and then ...<repeats N times>....
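
    The three cases can be isolated in a single function that renders one node. This is a sketch only: the name render_node is hypothetical, and the format strings mirror the ones used in the traversal code.

```python
def render_node(lineno, line, dupcnt):
    """Return the output lines for one node, depending on its dupcnt."""
    out = ['%6d\t%s' % (lineno, line)]
    if dupcnt == 1:
        # A single repeat is printed in full, with the next line number
        out.append('%6d\t%s' % (lineno + 1, line))
    elif dupcnt > 1:
        out.append('%s\t...<repeats %d times>...' % (' ' * 6, dupcnt))
    return out


for dupcnt in (0, 1, 7):
    print(render_node(10, 'TPASS', dupcnt))
```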

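    Note: the prev of the head node and the next of the tail node are both defined as None, so we can traverse the list in a C-like way. A typical linked-list traversal in C looks like this: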

    for (p = head; p != NULL; p = p->next)
        /* print p->data */
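
    The same loop can be written in Python over index links. In this minimal sketch the generator name walk is hypothetical; the sample list includes one unlinked node to show that it is skipped.

```python
def walk(nodes, head=0):
    """Yield nodes by following 'next' indices until None (p != NULL)."""
    idx = head if nodes else None
    while idx is not None:
        yield nodes[idx]
        idx = nodes[idx]['next']


nodes = [
    {'line': 'a', 'next': 2},
    {'line': 'a', 'next': 2},     # unlinked duplicate, still stored in the list
    {'line': 'b', 'next': None},
]
print([n['line'] for n in walk(nodes)])  # ['a', 'b']
```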

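    That completes a simple doubly linked list in Python. Its characteristics are:

    • None stands for NULL;
    • the prev pointer of the head node and the next pointer of the tail node are both None;
    • the prev pointer of an interior node is the index of its predecessor;
    • the next pointer of an interior node is the index of its successor.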
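
    These properties can be checked mechanically. The function below is a hypothetical helper, not part of the original program: it follows next from index 0 and verifies that each successor's prev points back.

```python
def check_links(nodes):
    """Return True if the chain starting at index 0 has consistent prev/next links."""
    if not nodes:
        return True
    if nodes[0]['prev'] is not None:      # the head must have no predecessor
        return False
    idx = 0
    for _ in range(len(nodes)):
        nxt = nodes[idx]['next']
        if nxt is None:                   # reached the tail
            return True
        if nodes[nxt]['prev'] != idx:     # the successor must point back
            return False
        idx = nxt
    return False                          # more hops than nodes: a cycle


good = [{'prev': None, 'next': 1}, {'prev': 0, 'next': None}]
bad = [{'prev': None, 'next': 1}, {'prev': None, 'next': None}]
print(check_links(good), check_links(bad))  # True False
```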

    The complete implementation is as follows:

    #!/usr/bin/python3

    import sys
    import hashlib
    import getopt

    TC_LOG_OUTPUT_RAW = False


    def init_doubly_linked_list(l_in):
        #
        # Here is the node definition of the doubly linked list
        #
        #   struct node {
        #       int           lineno;
        #       char          *line;
        #       char          *md5;
        #       int           dupcnt; /* duplicated counter */
        #       struct node   *prev;
        #       struct node   *next;
        #   }
        #
        l_out = []
        index = 0
        for text in l_in:
            data = text.strip()
            md5 = hashlib.md5(data.encode(encoding='UTF-8')).hexdigest()

            d_node = {}
            d_node['lineno'] = index + 1
            d_node['line'] = data
            d_node['md5'] = md5
            d_node['dupcnt'] = 0
            d_node['prev'] = index - 1
            d_node['next'] = index + 1
            if index == 0:
                d_node['prev'] = None
            if index == len(l_in) - 1:
                d_node['next'] = None
            l_out.append(d_node)

            index += 1
        return l_out


    def omit_doubly_linked_list(l_dll):
        #
        # Core algorithm to omit repeated lines saved in the doubly linked list
        #
        #   prev_node = curr_node->prev;
        #   next_node = curr_node->next;
        #
        #   if (curr_node->md5 == prev_node->md5) {
        #       prev_node->dupcnt++;
        #
        #       /* remove current node */
        #       next_node->prev = curr_node->prev;
        #       prev_node->next = curr_node->next;
        #   }
        #
        for curr_node in l_dll:
            prev_node_index = curr_node['prev']
            next_node_index = curr_node['next']

            if prev_node_index is None:  # the head node
                continue
            prev_node = l_dll[prev_node_index]

            if next_node_index is None:  # the tail node
                next_node = None
            else:
                next_node = l_dll[next_node_index]

            if curr_node['md5'] != prev_node['md5']:
                continue

            # Update dupcnt of previous node
            prev_node['dupcnt'] += 1

            # Remove current node
            if next_node is not None:
                next_node['prev'] = curr_node['prev']
            prev_node['next'] = curr_node['next']


    def traverse_doubly_linked_list(l_dll):
        #
        # Core algorithm to traverse the doubly linked list
        #
        #   p = l_dll;
        #   while (p != NULL) {
        #       /* print p->lineno and p->line */
        #
        #       if (p->dupcnt == 0) {
        #           p = p->next;
        #           continue;
        #       }
        #
        #       if (p->dupcnt == 1)
        #           /* print p->lineno + 1 and p->line */
        #       else /* i.e. > 1 */
        #           printf("...<repeats %d times>...", p->dupcnt);
        #
        #       p = p->next;
        #   }
        #
        l_out = []

        node_index = None
        if len(l_dll) > 0:
            node_index = 0

        while node_index is not None:  # <==> p != NULL
            curr_node = l_dll[node_index]

            msg = '%6d\t%s' % (curr_node['lineno'], curr_node['line'])
            l_out.append(msg)

            #
            # 1) If dupcnt is 0, it means subsequent lines don't repeat current
            #    line, just go to visit the next node
            # 2) If dupcnt >= 1, it means subsequent lines repeat the current line
            #    a) If dupcnt is 1, i.e. only one line repeats, just pick it up
            #    b) else save message like '...<repeats X times>...'
            #
            if curr_node['dupcnt'] == 0:
                node_index = curr_node['next']
                continue
            elif curr_node['dupcnt'] == 1:
                msg = '%6d\t%s' % (curr_node['lineno'] + 1, curr_node['line'])
            else:  # i.e. curr_node['dupcnt'] > 1
                msg = '%s\t...<repeats %d times>...' % (' ' * 6,
                                                        curr_node['dupcnt'])
            l_out.append(msg)

            node_index = curr_node['next']

        return l_out


    def print_refined_text(l_lines):
        l_dll = init_doubly_linked_list(l_lines)
        omit_doubly_linked_list(l_dll)
        l_out = traverse_doubly_linked_list(l_dll)
        for line in l_out:
            print(line)


    def print_raw_text(l_lines):
        lineno = 0
        for line in l_lines:
            lineno += 1
            line = line.strip()
            print('%6d\t%s' % (lineno, line))


    def usage(prog):
        sys.stderr.write('Usage: %s [-r] <logfile>\n' % prog)


    def main(argc, argv):
        shortargs = "r"
        longargs = ["raw"]
        try:
            options, rargv = getopt.getopt(argv[1:], shortargs, longargs)
        except getopt.GetoptError as err:
            sys.stderr.write("%s\n" % str(err))
            usage(argv[0])
            return 1

        for opt, arg in options:
            if opt in ('-r', '--raw'):
                global TC_LOG_OUTPUT_RAW
                TC_LOG_OUTPUT_RAW = True
            else:
                usage(argv[0])
                return 1

        rargc = len(rargv)
        if rargc < 1:
            usage(argv[0])
            return 1

        logfile = rargv[0]
        with open(logfile, 'r') as file_handle:
            if TC_LOG_OUTPUT_RAW:
                print_raw_text(file_handle.readlines())
            else:
                print_refined_text(file_handle.readlines())

        return 0


    if __name__ == '__main__':
        sys.exit(main(len(sys.argv), sys.argv))
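
    As a cross-check of the output format, the same collapsing can be sketched with itertools.groupby from the standard library. This is a different technique from the linked list built in this article, shown only for comparison; the function name collapse_runs is hypothetical, and it compares stripped lines directly instead of md5 digests.

```python
from itertools import groupby


def collapse_runs(lines):
    """Collapse runs of identical consecutive lines into the article's format."""
    out = []
    lineno = 1
    for text, run in groupby(line.strip() for line in lines):
        count = sum(1 for _ in run)       # length of this run
        out.append('%6d\t%s' % (lineno, text))
        if count == 2:                    # one repeat: print it in full
            out.append('%6d\t%s' % (lineno + 1, text))
        elif count > 2:                   # N repeats after the first line
            out.append('%s\t...<repeats %d times>...' % (' ' * 6, count - 1))
        lineno += count
    return out


for row in collapse_runs(['a\n', 'a\n', 'a\n', 'b\n']):
    print(row)
```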

    A test run looks like this:

    $ ./foo.py /tmp/a.log > /tmp/a && cat /tmp/a
         1    <<<test_start>>>
         2    tag=dio30 stime=1574695439
         3    cmdline="diotest6 -b 65536 -n 100 -i 100 -o 1024000"
         4    contacts=""
         5    analysis=exit
         6    <<<test_output>>>
         7    diotest06    1  TPASS  :  Read with Direct IO, Write without
         8    diotest06    2  TFAIL  :  diotest6.c:150: readv failed, ret = 1269760
         9    diotest06    3  TFAIL  :  diotest6.c:215: Write Direct-child 83 failed
        10    diotest06    1  TPASS  :  Read with Direct IO, Write without
        ...<repeats 7 times>...
        18    diotest06    2  TFAIL  :  diotest6.c:334: Write with Direct IO, Read without
        19    diotest06    1  TPASS  :  Read with Direct IO, Write without
        20    diotest06    1  TPASS  :  Read with Direct IO, Write without
        21    diotest06    2  TFAIL  :  diotest6.c:334: Write with Direct IO, Read without
        22    diotest06    1  TPASS  :  Read with Direct IO, Write without
        23    diotest06    2  TFAIL  :  diotest6.c:334: Write with Direct IO, Read without
        ...<repeats 2 times>...
        26    diotest06    3  TPASS  :  Read, Write with Direct IO
        27    diotest06    0  TINFO  :  1/3 testblocks failed
        28    incrementing stop
        29    <<<execution_status>>>
        30    initiation_status="ok"
        31    duration=697 termination_type=exited termination_id=1 corefile=no
        32    cutime=63573 cstime=6179
        33    <<<test_end>>>
    $ ./foo.py -r /tmp/a.log > /tmp/b && cat /tmp/b
         1    <<<test_start>>>
         2    tag=dio30 stime=1574695439
         3    cmdline="diotest6 -b 65536 -n 100 -i 100 -o 1024000"
         4    contacts=""
         5    analysis=exit
         6    <<<test_output>>>
         7    diotest06    1  TPASS  :  Read with Direct IO, Write without
         8    diotest06    2  TFAIL  :  diotest6.c:150: readv failed, ret = 1269760
         9    diotest06    3  TFAIL  :  diotest6.c:215: Write Direct-child 83 failed
        10    diotest06    1  TPASS  :  Read with Direct IO, Write without
        11    diotest06    1  TPASS  :  Read with Direct IO, Write without
        12    diotest06    1  TPASS  :  Read with Direct IO, Write without
        13    diotest06    1  TPASS  :  Read with Direct IO, Write without
        14    diotest06    1  TPASS  :  Read with Direct IO, Write without
        15    diotest06    1  TPASS  :  Read with Direct IO, Write without
        16    diotest06    1  TPASS  :  Read with Direct IO, Write without
        17    diotest06    1  TPASS  :  Read with Direct IO, Write without
        18    diotest06    2  TFAIL  :  diotest6.c:334: Write with Direct IO, Read without
        19    diotest06    1  TPASS  :  Read with Direct IO, Write without
        20    diotest06    1  TPASS  :  Read with Direct IO, Write without
        21    diotest06    2  TFAIL  :  diotest6.c:334: Write with Direct IO, Read without
        22    diotest06    1  TPASS  :  Read with Direct IO, Write without
        23    diotest06    2  TFAIL  :  diotest6.c:334: Write with Direct IO, Read without
        24    diotest06    2  TFAIL  :  diotest6.c:334: Write with Direct IO, Read without
        25    diotest06    2  TFAIL  :  diotest6.c:334: Write with Direct IO, Read without
        26    diotest06    3  TPASS  :  Read, Write with Direct IO
        27    diotest06    0  TINFO  :  1/3 testblocks failed
        28    incrementing stop
        29    <<<execution_status>>>
        30    initiation_status="ok"
        31    duration=697 termination_type=exited termination_id=1 corefile=no
        32    cutime=63573 cstime=6179
        33    <<<test_end>>>

    [Screenshot: /tmp/a and /tmp/b compared side by side in meld]

  • Original article: https://www.cnblogs.com/idorax/p/12023582.html