  • A simple look at Python's HTMLParser

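    Pick a web page, for example https://www.python.org/events/python-events/, view its source in a browser and copy it into a file (the code below reads it from s.html). Then try parsing the HTML to print the time, name and location of each conference listed on the official Python site.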

     1 from html.parser import HTMLParser
     2 from html.entities import name2codepoint  # not used below
     3 
     4 class MyHTMLParser(HTMLParser):
     5 
     6   in_title = False  # inside an element with class "event-title"
     7   in_loca = False   # inside an element with class "event-location"
     8   in_time = False   # inside a <time> element
     9 
    10   def handle_starttag(self, tag, attrs):
    11     if ('class', 'event-title') in attrs:
    12       self.in_title = True
    13     elif ('class', 'event-location') in attrs:
    14       self.in_loca = True
    15     elif tag == 'time':
    16       self.in_time = True
    17       self.times = []  # start collecting the text pieces of this <time>
    18 
    19   def handle_data(self, data):
    20     if self.in_title:
    21       print('-' * 50)
    22       print('Title:' + data.strip())
    23     if self.in_loca:
    24       print('Location:' + data.strip())
    25     if self.in_time:
    26       self.times.append(data)
    27   def handle_endtag(self, tag):
    28     if tag == 'h3': self.in_title = False
    29     if tag == 'span': self.in_loca = False
    30     if tag == 'time':
    31       self.in_time = False
    32       print('Time:' + '-'.join(self.times))  # join the collected pieces
    33 parser = MyHTMLParser()
    34 with open('s.html', encoding='utf-8') as html:
    35   parser.feed(html.read())  # parse the saved page source

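    Pay particular attention to lines 15-17 and 30-32 (the time branches of handle_starttag and handle_endtag). Python's HTMLParser does not hand over the text of an element in one piece: it calls handle_data once for each separate string of text it finds. Take these three elements from the page: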

      <h3 class="event-title"><a href="/events/python-events/401/">PyOhio 2016</a></h3>

      <span class="event-location">The Ohio Union at The Ohio State University. 1739 N. High Street, Columbus, OH 43210, USA</span>

      <time datetime="2016-07-29T00:00:00+00:00">29 July &ndash; 01 Aug. <span class="say-no-more"> 2016</span></time>

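    With convert_charrefs=False (the default before Python 3.5), an entity reference such as &ndash; is reported through handle_entityref rather than handle_data. This parser does not override handle_entityref, so the entity is dropped and the text before and after it arrives as two separate strings. With convert_charrefs=True (the default since Python 3.5), the entity is decoded in place and stays inside a single string. In both cases the nested span splits the text again, so lines 15-17 and 30-32 work together: self.times is reset when the time tag opens, every string seen before the closing tag is appended, and the pieces are joined on output. That is how the year 2016 inside the span gets picked up.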
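
    To see the splitting directly, the sketch below feeds the time element shown above to a small recorder and lists every string that reaches handle_data. ChunkRecorder, record and the decode_entities flag are hypothetical names made up for this illustration; only HTMLParser, its convert_charrefs argument, handle_entityref and name2codepoint come from the standard library.

```python
from html.parser import HTMLParser
from html.entities import name2codepoint

# The <time> element quoted in the article.
SNIPPET = ('<time datetime="2016-07-29T00:00:00+00:00">29 July &ndash; 01 Aug. '
           '<span class="say-no-more"> 2016</span></time>')


class ChunkRecorder(HTMLParser):
    """Hypothetical helper: records every string passed to handle_data."""

    def __init__(self, convert_charrefs, decode_entities=False):
        super().__init__(convert_charrefs=convert_charrefs)
        self.decode_entities = decode_entities
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_entityref(self, name):
        # Only called when convert_charrefs is False.
        # Optionally turn a named entity such as "ndash" into its character.
        if self.decode_entities and name in name2codepoint:
            self.chunks.append(chr(name2codepoint[name]))


def record(convert_charrefs, decode_entities=False):
    """Parse SNIPPET and return the list of recorded strings."""
    parser = ChunkRecorder(convert_charrefs, decode_entities)
    parser.feed(SNIPPET)
    parser.close()
    return parser.chunks


split_chunks = record(convert_charrefs=False)    # entity dropped, text split around it
merged_chunks = record(convert_charrefs=True)    # entity decoded inside the text
decoded_chunks = record(convert_charrefs=False, decode_entities=True)

print(split_chunks)
print(merged_chunks)
print(decoded_chunks)
```

    With convert_charrefs=False the recorder sees three strings, and with convert_charrefs=True it sees two; in both cases the year arrives separately because of the span. The third call shows one possible use of name2codepoint, which the listing above imports but never uses.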

  • Original post: https://www.cnblogs.com/dongzhuangdian/p/5616948.html