  • A simple look at HTMLParser in Python

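    Pick a web page, for example https://www.python.org/events/python-events/, view its source in a browser and copy it (the code below reads it from a file named s.html), then try parsing the HTML to print the date, name, and location of each conference listed on the Python site.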

     1 from html.parser import HTMLParser
     2 from html.entities import name2codepoint  # not used below
     3 
     4 class MyHTMLParser(HTMLParser):
     5 
     6   in_title = False  # inside an element with class="event-title"
     7   in_loca = False   # inside an element with class="event-location"
     8   in_time = False   # inside a <time> element
     9 
    10   def handle_starttag(self, tag, attrs):
    11     if ('class', 'event-title') in attrs:
    12       self.in_title = True
    13     elif ('class', 'event-location') in attrs:
    14       self.in_loca = True
    15     elif tag == 'time':
    16       self.in_time = True
    17       self.times = []  # collects every text chunk until </time>
    18 
    19   def handle_data(self, data):
    20     if self.in_title:
    21       print('-' * 50)
    22       print('Title:' + data.strip())
    23     if self.in_loca:
    24       print('Location:' + data.strip())
    25     if self.in_time:
    26       self.times.append(data)
    27   def handle_endtag(self, tag):
    28     if tag == 'h3': self.in_title = False
    29     if tag == 'span': self.in_loca = False
    30     if tag == 'time':
    31       self.in_time = False
    32       print('Time:' + '-'.join(self.times))
    33 parser = MyHTMLParser()
    34 with open('s.html') as html:
    35   parser.feed(html.read())

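    Pay particular attention to lines 15-17 and 30-32. Python's HTMLParser does not deliver the text of an element in one piece; it passes it to handle_data one string chunk at a time. Take these three fragments from the page: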

      <h3 class="event-title"><a href="/events/python-events/401/">PyOhio 2016</a></h3>

      <span class="event-location">The Ohio Union at The Ohio State University. 1739 N. High Street, Columbus, OH 43210, USA</span>

      <time datetime="2016-07-29T00:00:00+00:00">29 July &ndash; 01 Aug. <span class="say-no-more"> 2016</span></time>

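    When the parser meets a special sequence such as &ndash;, the text before it and the text after it arrive as two separate strings, and the reference itself is not passed to handle_data at all. This is the behaviour with convert_charrefs=False; since Python 3.5 the default is convert_charrefs=True, which decodes the reference in place instead. A nested tag such as <span> splits the text in the same way under either setting. Lines 15-17 and 30-32 work together to cope with this: the <time> start tag creates an empty list, every chunk is appended to it while in_time is set, and the </time> end tag joins the chunks, so the year 2016 inside the <span> is captured as well.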
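
    To see the chunking directly, here is a minimal sketch that feeds the <time> fragment above to a small recorder. ChunkRecorder and record are hypothetical names introduced only for this illustration; convert_charrefs is the real constructor argument of HTMLParser. It runs the same input once with convert_charrefs=False and once with the Python 3.5+ default of True.

```python
from html.parser import HTMLParser

# The <time> fragment quoted above, copied from the page source.
SAMPLE = ('<time datetime="2016-07-29T00:00:00+00:00">29 July &ndash; 01 Aug. '
          '<span class="say-no-more"> 2016</span></time>')


class ChunkRecorder(HTMLParser):
    """Hypothetical helper: records each string passed to handle_data."""

    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.chunks = []     # text chunks, in the order they arrive
        self.entities = []   # entity names reported via handle_entityref

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_entityref(self, name):
        # Only called when convert_charrefs is False.
        self.entities.append(name)


def record(convert_charrefs):
    """Parse SAMPLE and return (text chunks, entity names)."""
    parser = ChunkRecorder(convert_charrefs)
    parser.feed(SAMPLE)
    parser.close()
    return parser.chunks, parser.entities


legacy_chunks, legacy_entities = record(convert_charrefs=False)
default_chunks, default_entities = record(convert_charrefs=True)

print(legacy_chunks)             # ['29 July ', ' 01 Aug. ', ' 2016']
print(legacy_entities)           # ['ndash']
print(len(default_chunks))       # 2: the en dash (U+2013) stays inside the first chunk
print('-'.join(legacy_chunks))   # 29 July - 01 Aug. - 2016
```

    Either way the year arrives as its own chunk after the <span> start tag, which is why the listing buffers chunks in a list until </time> rather than printing inside handle_data.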

  • Original article: https://www.cnblogs.com/dongzhuangdian/p/5616948.html