  • My first small Python program: parsing a web page

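    I had wanted to learn Python for a long time but never found a suitable project to practice on. Python's syntax is interesting and concise to write, so today, with some free time, I put this small project together while looking things up along the way. Since not many libraries supported Python 3.x yet and most of the material I found was for Python 2.x, I used Python 2.7.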

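    I had heard that network access is easy in Python, and this time I experienced it firsthand. A few lines are enough, unlike Java, where even the simplest request has to be wrapped in several layers before it can be used.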
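
    For readers on Python 3, where urllib2 no longer exists (its functionality moved to urllib.request), the same kind of request might look like the sketch below. The helper name fetch_html, the User-Agent string, and the UTF-8 default are assumptions made for illustration and are not part of the original program.

```python
import urllib.request


def fetch_html(url, timeout=10.0, encoding="utf-8"):
    """Fetch a URL and return its body decoded as text.

    Hypothetical helper: the name, the default encoding, and the
    User-Agent value are illustrative assumptions.
    """
    request = urllib.request.Request(
        url,
        headers={"User-Agent": "Mozilla/5.0 (compatible; example-script)"},
    )
    # urlopen returns a response object that can be used as a context manager.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        raw = response.read()
    # Undecodable bytes are replaced rather than raising an error.
    return raw.decode(encoding, errors="replace")
```

    Calling fetch_html('http://www.qiushibaike.com/8hr') would then return the page text, provided the site is reachable and still serves that path.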

    This time I use the Qiushibaike (糗事百科) website: the program fetches the latest posts from it, organizes them, and prints them.

    Enough talk; here is the code:

    import urllib2
    import sgmllib
    
    class Entry:
        author=''
        content=''
        pic=''
        up = 0
        down = 0
        tag = ''
        comment = 0
        def to_string(self):
            return '[Entry: author=%s content=%s pic=%s tag=%s up=%d down=%d comment=%d]'\
                %(self.author,self.content,self.pic,self.tag,self.up,self.down,self.comment)
    
    class MyHTMLParser(sgmllib.SGMLParser):
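        # Parser state. SGMLParser.__init__ calls reset(), so initialising
        # here gives each parser instance its own lists instead of sharing
        # mutable class attributes between instances.
        def reset(self):
            sgmllib.SGMLParser.reset(self)
            self.datas = []             # all text chunks seen so far
            self.entries = []           # all completed entries
            self.entry = Entry()        # the entry currently being filled
            self.div_tag_unclosed = ''  # class of the last unclosed tracked div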
        
        def start_div(self,attrs):
            for name,value in attrs:
                if name =='class' and value == 'content':
                    self.div_tag_unclosed = 'content'
                elif name=='class' and value == 'tags' :
                    self.div_tag_unclosed = 'tags'
                elif name=='class' and value=='up':
                    self.div_tag_unclosed = 'up'
                elif name=='class' and value == 'down':
                    self.div_tag_unclosed = 'down'
                elif name=='class' and value=='comment':
                    self.div_tag_unclosed = 'comment'
                elif name=='class' and value=='author':
                    self.div_tag_unclosed = 'author'
                    self.entry = Entry()
                elif name=='class' and value=='thumb':
                    self.div_tag_unclosed = 'thumb'
                    
        def end_div(self):
            if self.div_tag_unclosed == 'content' :
                self.div_tag_unclosed =''
                self.entry.content =  self.datas.pop().strip()
        def start_a(self,attrs):pass
        def start_img(self,attrs):
            if self.div_tag_unclosed == 'thumb':
                for name,value in attrs:
                    if name=='src':
                        self.div_tag_unclosed =''
                        self.entry.pic = value.strip()  # Entry defines 'pic', not 'img'
        def end_img(self):pass
        def end_a(self):
            if self.div_tag_unclosed == 'author':
                self.div_tag_unclosed =''
                self.entry.author = self.datas.pop().strip()
            elif self.div_tag_unclosed == 'tags':
                self.div_tag_unclosed =''
                self.entry.tag = self.datas.pop().strip()
            elif self.div_tag_unclosed == 'up':
                self.div_tag_unclosed =''
                self.entry.up = int(self.datas.pop().strip())
            elif self.div_tag_unclosed == 'down':
                self.div_tag_unclosed =''
                self.entry.down = int(self.datas.pop().strip())
            elif self.div_tag_unclosed == 'comment':
                self.div_tag_unclosed =''
                self.entry.comment = int(self.datas.pop().strip())
                self.entries.append(self.entry)
        def handle_data(self, data):
    #        print 'data',data
            self.datas.append(data)
    
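    # Request the page
    response = urllib2.urlopen('http://www.qiushibaike.com/8hr')
    html = response.read()  # named 'html' to avoid shadowing the built-in all()
    
    # Parse the HTML
    parser = MyHTMLParser()
    parser.feed(html)
    parser.close()  # flush any data still buffered in the parser
    # Print all the entries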
    for entry in parser.entries:
        print entry.to_string()

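    The whole program is simple: it uses urllib2 to make the network request and sgmllib to parse the HTML. Since this was my first Python program, I wrote it slowly, mostly because I kept wanting to put parentheses after if =-=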
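
    sgmllib was removed in Python 3, so the same "remember the last tracked div, then collect its text" idea would have to be rebuilt on html.parser. The following is a minimal sketch of that idea, not a port of the program above: the class name EntryParser, the reduced set of fields (author, content, up), and the sample markup are all assumptions, and the real page structure may differ. It also assumes tracked divs contain no nested divs.

```python
from html.parser import HTMLParser


class Entry:
    """One scraped post; a reduced version of the article's Entry class."""

    def __init__(self):
        self.author = ""
        self.content = ""
        self.up = 0


class EntryParser(HTMLParser):
    """Collects entries from <div class="author|content|up"> blocks."""

    TRACKED = ("author", "content", "up")

    def __init__(self):
        super().__init__()
        self.entries = []   # per-instance list, not shared between parsers
        self.entry = None   # entry currently being filled
        self.current = ""   # class of the tracked div we are inside
        self.buffer = []    # text chunks seen inside that div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        css_class = dict(attrs).get("class")
        if css_class in self.TRACKED:
            self.current = css_class
            self.buffer = []
            if css_class == "author":
                self.entry = Entry()  # an author div starts a new entry

    def handle_data(self, data):
        if self.current:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "div" or not self.current:
            return
        if self.entry is not None:
            text = "".join(self.buffer).strip()
            if self.current == "up":
                # Assumes the text is a plain integer; int() raises otherwise.
                self.entry.up = int(text)
                # In this sketch the "up" div is the last field of an entry.
                self.entries.append(self.entry)
            else:
                setattr(self.entry, self.current, text)
        self.current = ""


# Made-up markup for illustration; not copied from the real site.
SAMPLE = """
<div class="author"><a href="#">alice</a></div>
<div class="content">first post</div>
<div class="up"><a href="#">12</a></div>
<div class="author"><a href="#">bob</a></div>
<div class="content">second post</div>
<div class="up"><a href="#">3</a></div>
"""

parser = EntryParser()
parser.feed(SAMPLE)
parser.close()
for entry in parser.entries:
    print(entry.author, entry.content, entry.up)
```

    Keeping the state on the instance (set in __init__) rather than on the class means two parsers never share the same entries list.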

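    This article comes from sheling's cnblogs blog: http://www.cnblogs.com/sheling
    Copyright belongs to the author. Reposting is welcome, but unless the author agrees otherwise, this notice must be kept
      and a link to the original must appear prominently on the page; otherwise the author reserves the right to pursue legal liability.
    My personal blog: http://blog.iyestin.com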
  • Original URL: https://www.cnblogs.com/sheling/p/2646761.html