  • spider - scraping page content

    # -*- coding: UTF-8 -*-
    from HTMLParser import HTMLParser
    import sys,urllib2,string,re,json
    
    reload(sys)
    sys.setdefaultencoding('utf-8')
    
    class hp(HTMLParser):
    
        def __init__(self):
            self.readingdata_a = False
            self.title = []
            self.usite = []
            HTMLParser.__init__(self)
        
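        def handle_starttag(self,tag,attrs):
            #print tag
            # Post-title links are <a> tags with an attribute value of 'entrylistItemTitle'.
            if tag == 'a':
                for h,v in attrs:
                    if v == 'entrylistItemTitle':
                        self.readingdata_a = True
                        # Takes the third attribute's value as the link URL;
                        # this relies on the attribute order in the page's markup.
                        self.usite.append(attrs[2][1])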
    
        def handle_data(self,data):
            if self.readingdata_a:
                self.title.append(data)
    
        def handle_endtag(self,tag):
            if tag == 'a':
                self.readingdata_a = False
    
        def getdata(self):
            #return zip(self.title,self.usite)  alternative: zip pairs titles and URLs one-to-one as tuples
    
            i=0
            listr = []
            while i<len(self.title):
                listr.append(self.title[i] +' : '+self.usite[i])
                i=i+1
            return listr
    
    
    url='http://www.cnblogs.com/dreamer-fish/archive/2016/03.html'
    request = urllib2.Request(url)
    response = urllib2.urlopen(request).read()
    
    yk=hp()
    yk.feed(response)
    dd = yk.getdata()
    
    
    for i in dd:
        print i
    
    yk.close()

     Result: one line per post, in the form "title : URL".
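
The script above targets Python 2 (HTMLParser, urllib2, print statements). As a rough sketch rather than the author's code, the same idea on Python 3 might look like the following. The class name TitleLinkParser, the fetch helper, and the SAMPLE_HTML snippet with its example.com URLs are all made up for illustration. The sketch also looks attributes up by name instead of by position, and pairs titles with URLs through zip, as the commented-out line in getdata suggests.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleLinkParser(HTMLParser):
    """Collect titles and URLs from <a class="entrylistItemTitle"> links."""

    def __init__(self):
        super().__init__()
        self.in_title_link = False
        self.titles = []
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)  # look attributes up by name, not by position
        if attr_map.get("class") == "entrylistItemTitle":
            self.in_title_link = True
            self.titles.append("")
            self.urls.append(attr_map.get("href") or "")

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self.in_title_link:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_title_link = False

    def get_data(self):
        # zip pairs each title with its URL as a tuple.
        return list(zip(self.titles, self.urls))


def fetch(url):
    """Download a page; decoding as UTF-8 is an assumption about the site."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


# Hypothetical sample markup, used so the sketch runs without network access.
SAMPLE_HTML = """
<div>
<a id="p1" class="entrylistItemTitle" href="http://example.com/post/1">First post</a>
<a href="http://example.com/other">Unrelated link</a>
<a id="p2" class="entrylistItemTitle" href="http://example.com/post/2">Second &amp; last</a>
</div>
"""

parser = TitleLinkParser()
parser.feed(SAMPLE_HTML)
parser.close()
pairs = parser.get_data()

for title, url in pairs:
    print(title + " : " + url)
```

To try it against a live page, feed the parser the string returned by fetch(url) instead of SAMPLE_HTML. The class check is an exact match, so a link carrying several class names would need a looser test.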

  • Original article: https://www.cnblogs.com/dreamer-fish/p/5377438.html