  • Crawl(1)

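    Scraping a novel from Baidu Tieba.

    The script fetches the first 10 pages of the original poster's posts in the thread (the URL is baseUrl in the code below) and saves them to a text file.

    Python 2.7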

    # -*- coding: UTF-8 -*-
    import urllib2
    import re
    
    class BDTB:
        baseUrl = 'http://tieba.baidu.com/p/4896490947?see_lz=&pn='
        def getPage(self, pageNum):
            try:
                url = self.baseUrl+str(pageNum)
                request = urllib2.Request(url)
                response = urllib2.urlopen(request).read()
                return response
            except Exception, e:
                print e    
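        def Title(self, pageNum):
            html = self.getPage(pageNum)
            # '【原创】' ("Original") is the tag that prefixes the thread title on the page.
            reg = re.compile(r'title="【原创】(.*?)"')
            items = re.findall(reg, html)
            # Write the title only for page 1: mode 'w' truncates text.txt,
            # so rewriting it on later pages would erase the text already saved.
            if pageNum == 1 and items:
                f = open('text.txt', 'w')
                f.write('Title'+'\t'+items[0])
                f.close()
            return items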
        def Text(self, pageNum):
            html = self.getPage(pageNum)
            reg = re.compile(r'd_post_content j_d_post_content ">            (.*?)</div><br>', re.S)
            req = re.findall(reg, html)
            if pageNum == 1:
                req = req[2:]
            for i in req:
                removeAddr = re.compile('<a.*?>|</a>')
                i = re.sub(removeAddr, "", i)
                removeAddr = re.compile('<img.*?>')
                i = re.sub(removeAddr, "", i)
                removeAddr = re.compile(r'http.*?\.html')
                i = re.sub(removeAddr, "", i)
                i = i.replace('<br>', '')
                f = open('text.txt', 'a')
                f.write('\n\n'+i)
                f.close()
    
                
    bdtb = BDTB()
    print 'Crawl is starting....'
    try:
        for i in range(1, 11):
            print 'Crawling Page %s...' % (i)
            bdtb.Title(i)
            bdtb.Text(i)
    except Exception, e:
        print e
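
The cleanup that Text() applies to each post can be tried offline, without fetching anything. The following is a minimal sketch, not part of the original script: the function name clean_post, the pattern constant names, and the sample string are hypothetical, and the patterns simply mirror the ones used in Text() (with the dot before "html" escaped).

```python
# -*- coding: utf-8 -*-
import re

# Patterns mirror the ones in Text(); the constant names are hypothetical.
ANCHOR_TAG = re.compile(r'<a.*?>|</a>')    # opening and closing <a> tags
IMG_TAG = re.compile(r'<img.*?>')          # inline images
HTML_LINK = re.compile(r'http.*?\.html')   # bare links ending in .html


def clean_post(raw):
    """Return the post body with links, images and <br> tags removed."""
    text = ANCHOR_TAG.sub('', raw)
    text = IMG_TAG.sub('', text)
    text = HTML_LINK.sub('', text)
    return text.replace('<br>', '')


# Made-up sample input, not taken from the real thread.
sample = ('Chapter 1<br><img src="a.png">'
          '<a href="http://example.com/x.html">http://example.com/x.html</a> end')
cleaned = clean_post(sample)
print(cleaned)
```

The order of the substitutions matters: the tags are stripped first, so that the link pattern only has to deal with URLs left in the visible text.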
  • Original post: https://www.cnblogs.com/dirge/p/6347564.html