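Find a web page, for example https://www.python.org/events/python-events/, view its source in a browser and save a copy (the code below reads it from s.html). Then try parsing the HTML to print the time, name, and location of each conference listed on the Python website.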
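If copying the source by hand is inconvenient, the page can probably be saved from a script instead. This is only a minimal sketch: `save_page` and `EVENTS_URL` are hypothetical names introduced here, and it assumes the site serves the page to a plain `urllib` request, which has not been verified.

```python
from pathlib import Path
from urllib.request import urlopen

# Assumption: the events page from the exercise is reachable without extra headers.
EVENTS_URL = "https://www.python.org/events/python-events/"


def save_page(url=EVENTS_URL, path="s.html"):
    """Download url, write the raw bytes to path, and return the byte count."""
    with urlopen(url, timeout=30) as response:
        body = response.read()
    Path(path).write_bytes(body)
    return len(body)


# Usage (performs a network request, so it is left commented out):
# save_page()  # writes s.html in the current directory
```

The bytes are written unchanged, so the file should be opened later with the page's own encoding (UTF-8 is assumed here).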
 1  from html.parser import HTMLParser
 2  from html.entities import name2codepoint  # not used below
 3
 4  class MyHTMLParser(HTMLParser):
 5
 6      in_title = False
 7      in_loca = False
 8      in_time = False
 9
10      def handle_starttag(self, tag, attrs):
11          if ('class', 'event-title') in attrs:
12              self.in_title = True
13          elif ('class', 'event-location') in attrs:
14              self.in_loca = True
15          elif tag == 'time':
16              self.in_time = True
17              self.times = []  # fresh list of text pieces for this <time>
18
19      def handle_data(self, data):
20          if self.in_title:
21              print('-' * 50)
22              print('Title:' + data.strip())
23          if self.in_loca:
24              print('Location:' + data.strip())
25          if self.in_time:
26              self.times.append(data)  # collect each piece, print nothing yet
27      def handle_endtag(self, tag):
28          if tag == 'h3': self.in_title = False
29          if tag == 'span': self.in_loca = False
30          if tag == 'time':
31              self.in_time = False
32              print('Time:' + '-'.join(self.times))  # join the pieces once </time> is reached
33  parser = MyHTMLParser()
34  with open('s.html', encoding='utf-8') as html:  # the page contains non-ASCII characters
35      parser.feed(html.read())
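Pay particular attention to lines 15-17 and 30-32 (the `time` branches of handle_starttag and handle_endtag). Python's HTMLParser does not hand over an element's text in one piece; it calls handle_data once for each run of text it finds. Take this markup from the events page: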
<h3 class="event-title"><a href="/events/python-events/401/">PyOhio 2016</a></h3>
<span class="event-location">The Ohio Union at The Ohio State University. 1739 N. High Street, Columbus, OH 43210, USA</span>
<time datetime="2016-07-29T00:00:00+00:00">29 July – 01 Aug. <span class="say-no-more"> 2016</span></time>
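The text inside <time> reaches handle_data in several pieces. The nested <span> always splits it, and a special sequence such as the "–" can split it again if the page source writes it as a character reference (for example &ndash;): with convert_charrefs=False the reference is reported through handle_entityref rather than handle_data, so the text before and after it arrives as two separate strings. Since Python 3.5 the default is convert_charrefs=True, which decodes such references into the surrounding text. Lines 15-17 and 30-32 work together to cope with this: the start tag resets a list, handle_data appends each piece, and the end tag joins them. That is how the year 2016 inside the span ends up in the output.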
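The splitting described above can be observed directly. The following is a small sketch, not part of the original exercise: `ChunkRecorder`, `chunks_for`, and `SAMPLE` are hypothetical names, and the sample assumes the page source writes the dash as `&ndash;`.

```python
from html.parser import HTMLParser

# Assumption: the dash in the page source is the entity &ndash;.
SAMPLE = ('<time datetime="2016-07-29T00:00:00+00:00">29 July &ndash; 01 Aug. '
          '<span class="say-no-more"> 2016</span></time>')


class ChunkRecorder(HTMLParser):
    """Hypothetical helper that records every string passed to handle_data."""

    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def chunks_for(convert_charrefs):
    """Parse SAMPLE and return the list of handle_data arguments."""
    recorder = ChunkRecorder(convert_charrefs)
    recorder.feed(SAMPLE)
    recorder.close()
    return recorder.chunks


# Entity reported separately: the text arrives in three pieces.
print(chunks_for(False))
# Entity decoded in place: two pieces, split only by the nested <span>.
print(chunks_for(True))
# Joining the stripped pieces rebuilds the full date text.
print(' '.join(chunk.strip() for chunk in chunks_for(True)))
```

Either way, no single handle_data call contains both the date range and the year, which is why the pieces are accumulated in a list and joined only when the closing </time> tag is seen.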