  • Python: Parsing HTML with the sgmllib Module

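    """
    HTML parsing example: whenever a start tag is encountered, inspect its attrs and print the value of every href attribute.
    Dependency: pip install sgmllib3k
    Usage:
        1. Define a class that inherits from sgmllib's SGMLParser.
        2. Override SGMLParser's methods to add your own tag-handling functions.
        3. Pass the data to be parsed to the parser by calling .feed(data) on an instance of your class; the overridden methods are then invoked automatically.
    """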
    from urllib import request
    import sgmllib
    
    
    class HandleHtml(sgmllib.SGMLParser):
        """
        Custom HTML parser class
        """
    
        def unknown_starttag(self, tag, attrs):
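            """
            Called whenever the parser encounters the start of any tag
            :param tag: tag name
            :param attrs: the tag's attributes, as a list of (name, value) tuples
            :return:
            """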
            # Print every href attribute; no bare except, so real errors are not silently swallowed
            for name, value in attrs:
                if name == 'href':
                    print(f"{name}:{value}")
    
    
    if __name__ == '__main__':
        response = request.urlopen("http://freebuf.com/")
        page = response.read()
        page = page.decode('utf-8')
    
        # Create the HTML parser object
        handle_html = HandleHtml()
        # Feed the data to the parser
        handle_html.feed(page)

    Output:

    href:https://www.freebuf.com/buf/plugins/wp-favorite-posts/wpfp.css
    href:https://static.3001.net/css/recentcomments/wp-recentcomments.css?ver=2.2.3
    href:https://www.freebuf.com/buf/plugins/gold/assets/css/widget.css?ver=1.3.2.1
    href:https://static.3001.net/css/highslide/highslide.css
    href:https://www.freebuf.com/buf/plugins/cartpauj-pm/style/style.css
    href: https://www.freebuf.com/buf/plugins/simditor/highlight/styles/default.css
    href:https://static.freebuf.com/images/favicon.ico
    href:https://static.3001.net/css/new/header.css
    href:https://static.3001.net/css/new/bootstrap.min.css?ver=2016051701
    href:https://static.3001.net/css/new/swiper-3.4.2.min.css
    href:https://static.3001.net/css/new/model.css?ver=2017112156855
    href:https://static.3001.net/css/new/style.css?ver=2018112123749359438534
    href:http://www.freebuf.com
    href:http://www.freebuf.com
    href:http://job.freebuf.com
    href:#
    ......
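
sgmllib was removed from the standard library in Python 3, which is why the example above depends on the third-party sgmllib3k package. As a rough sketch of the same idea without that dependency, the standard-library html.parser.HTMLParser can be subclassed in the same way. Note that this swaps in a different parser from the one the article uses. The names HrefCollector, extract_hrefs and the sample string are made up for illustration, and the sketch collects the values in a list instead of printing them.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects href values from every start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for
        # attributes written without a value, so those are skipped.
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


def extract_hrefs(html_text):
    """Feed the text to a fresh parser and return the collected hrefs."""
    parser = HrefCollector()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs


# Illustrative input (an assumption, not taken from the freebuf.com page)
sample = '<link href="/style.css"><a href="http://example.com">x</a><a name="top">y</a>'
print(extract_hrefs(sample))
```

Collecting the values in a list rather than printing them inside the handler makes the result easier to filter or reuse later, for example to drop placeholder links such as "#".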
  • Original article: https://www.cnblogs.com/Jimc/p/10307684.html