  • Screen scraping 2

    Using HTMLParser

    Using HTMLParser simply means subclassing it, and overriding various event-handling methods such as handle_starttag or handle_data.

    handle_starttag(tag, attrs): called when a start tag is found. attrs is a sequence of (name, value) pairs.

    handle_startendtag(tag, attrs): called for empty tags; the default implementation handles the start and end separately by calling handle_starttag and then handle_endtag.

    handle_endtag(tag): called when an end tag is found.

    handle_data(data): called for textual data.

    handle_charref(ref): called for character references of the form &#ref;

    handle_entityref(name): called for entity references of the form &name;

    handle_decl(decl): called for declarations of the form <!...>

    handle_pi(data): called for processing instructions.
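To see when each handler fires, here is a minimal sketch that records every event in a list. It assumes Python 3, where the class lives in `html.parser` rather than the `HTMLParser` module. The `EventLogger` class and the sample markup are made up for illustration. Passing `convert_charrefs=False` is needed on Python 3.5 and later, because otherwise references inside text are decoded and delivered through handle_data instead of handle_charref and handle_entityref.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records each parser event as a tuple."""

    def __init__(self):
        # convert_charrefs=False keeps &name; and &#ref; as separate events
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('starttag', tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(('endtag', tag))

    def handle_data(self, data):
        self.events.append(('data', data))

    def handle_charref(self, name):
        self.events.append(('charref', name))

    def handle_entityref(self, name):
        self.events.append(('entityref', name))

    def handle_decl(self, decl):
        self.events.append(('decl', decl))

    def handle_pi(self, data):
        self.events.append(('pi', data))


logger = EventLogger()
# Illustrative input only; handle_startendtag is not overridden, so <br/>
# is reported through handle_starttag followed by handle_endtag.
logger.feed('<!DOCTYPE html><p class="x">A &amp; B &#62; C<br/></p>')
logger.close()

for event in logger.events:
    print(event)
```

Since handle_startendtag is left at its default, the empty `<br/>` tag shows up as a start tag event immediately followed by an end tag event.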

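    # Python 3 imports; under Python 2 these were
    # "from urllib import urlopen" and "from HTMLParser import HTMLParser".
    from urllib.request import urlopen
    from html.parser import HTMLParser
    
    class Scraper(HTMLParser):
        # Parser state: are we currently inside an <h2> or an <a href=...>?
        in_h2 = False
        in_link = False
        
        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == 'h2':
                self.in_h2 = True
            if tag == 'a' and 'href' in attrs:
                self.in_link = True
                self.chunks = []  # text pieces of the current link
                self.url = attrs['href']
                
        def handle_data(self, data):
            # Link text may arrive in several pieces, so collect them all.
            if self.in_link:
                self.chunks.append(data)
                
        def handle_endtag(self, tag):
            if tag == 'h2':
                self.in_h2 = False
            if tag == 'a':
                # Only report links that sit inside an <h2> heading.
                if self.in_h2 and self.in_link:
                    print('%s (%s)' % (''.join(self.chunks), self.url))
                self.in_link = False
    
    # feed() expects str in Python 3; UTF-8 is assumed for the page encoding.
    text = urlopen("http://www.python.org/community/jobs/").read().decode('utf-8')
    parser = Scraper()
    parser.feed(text)
    parser.close()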
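The scraper above depends on a live page whose markup may change, so it can be handy to try the same logic offline. The following is a sketch under a few assumptions: it targets Python 3 (`html.parser`), the `LinkCollector` name and the `SAMPLE` markup are invented for illustration, and results are stored in a list rather than printed so they can be inspected afterwards.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical variant of Scraper: collects (text, url) pairs for
    links that appear inside <h2> headings."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.in_link = False
        self.chunks = []   # text pieces of the link being read
        self.url = None
        self.links = []    # collected (text, url) results

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'h2':
            self.in_h2 = True
        if tag == 'a' and 'href' in attrs:
            self.in_link = True
            self.chunks = []
            self.url = attrs['href']

    def handle_data(self, data):
        if self.in_link:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'h2':
            self.in_h2 = False
        if tag == 'a':
            if self.in_h2 and self.in_link:
                self.links.append((''.join(self.chunks), self.url))
            self.in_link = False


# Made-up markup: two headings with links, one link outside a heading,
# and one heading without a link.
SAMPLE = """
<h2><a href="/jobs/1">Python <b>Developer</b></a></h2>
<p><a href="/about">About</a></p>
<h2>No link here</h2>
<h2><a href="/jobs/2">Data Engineer</a></h2>
"""

collector = LinkCollector()
collector.feed(SAMPLE)
collector.close()

for text, url in collector.links:
    print('%s (%s)' % (text, url))
```

Only the two links nested in `<h2>` elements are collected. The text of the first one is split across several handle_data calls because of the nested `<b>` tag, which is why the pieces are gathered in chunks and joined at the closing `</a>`.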
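    Author: Shane
    Source: http://bluescorpio.cnblogs.com
    Copyright of this article is shared by the author and cnblogs (博客园). Reposting is welcome, but unless the author agrees otherwise, this notice must be retained and a link to the original article must be displayed prominently on the page; otherwise the right to pursue legal liability is reserved.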
  • Original article: https://www.cnblogs.com/bluescorpio/p/2513950.html