zoukankan html css js c++ java

python HTMLparser

1.概述

 1 如果我们要编写一个搜索引擎，第一步是用爬虫把目标网站的页面抓下来，
 2 第二步就是解析该HTML页面，看看里面的内容到底是新闻、图片还是视频。
 3 
 4 假设第一步已经完成了，第二步应该如何解析HTML呢？
 5 
 6 HTML本质上是XML的子集，但是HTML的语法没有XML那么严格，所以不能用标准的DOM或SAX来解析HTML。
 7 也可以用 re正则表达式
 8 scrapy 框架下的css选择器或者xpath
 9 
10 仁者见仁智者见智

2.HTMLparser

 1 # 使用时需要定义一个从类HTMLParser继承的类，重定义函数：
 2 # handle_starttag( tag, attrs)
 3 #handle_startendtag( tag, attrs)
 4 # handle_endtag( tag)
 5 
 6 # 来实现自己需要的功能。
 7 
 8 # tag是的html标签，attrs是 (属性，值)元组(tuple)的列表(list)。
 9 # HTMLParser自动将tag和attrs都转为小写
10 
11 
12 from html.parser import HTMLParser  
13 class MyHTMLParser(HTMLParser):   
14     def __init__(self):   
15         HTMLParser.__init__(self)   
16         self.links = []   
17     def handle_starttag(self, tag, attrs):   
18         #print "Encountered the beginning of a %s tag" % tag   
19         if tag == "a":   
20             if len(attrs) == 0:   
21                 pass   
22             else:   
23                 for (variable, value) in attrs:   
24                     if variable == "href":   
25                         self.links.append(value)   
26                      
27 if __name__ == "__main__":   
28     html_code = """ <a href="www.google.com"> google.com</a> <A Href="www.pythonclub.org"> PythonClub </a> <A HREF = "www.sina.com.cn"> Sina </a> """   
29     hp = MyHTMLParser()   
30     hp.feed(html_code)   
31     hp.close()   
32     print(hp.links)  
33 # 运行结果
34 # ['www.google.com', 'www.pythonclub.org', 'www.sina.com.cn']

3.总结

1 个人观点 2 如果是做搜索引擎建议还是用scrapy框架

参照：https://www.cnblogs.com/mfryf/p/3691563.html

查看全文

相关阅读:
【SpringBoot系列】邮件发送
 【问题】InteliJ IDEA生成可执行jar运行提示没有主清单属性
 【设计模式】单例设计模式
 【C3P0】C3P0
【JDBC】JDBC学习（一）
react hook 防抖
 主线程宏任务微任务
 vue 2.0 渲染dom过程
 源码阅读笔记，杂乱
 vue 3.0 响应式原理

原文地址：https://www.cnblogs.com/jum-bolg/p/11094743.html