  • Crawling Qidian (起点) with Python and the Scrapy framework

    First come my notes reviewing the details and the overall approach after finishing the project; the full code follows at the end.

    1. In MongoDB, create a database called QiDian and a collection called Novelclass (the novel-category table).
    Novelclass holds both first-level and second-level categories (e.g. 玄幻 is a first-level category, 东方玄幻 a second-level one).

    client = pymongo.MongoClient(host="127.0.0.1")
    db = client.QiDian
    collection = db.Novelclass
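
    For reference, once steps 2 and 3 below have run, the documents in Novelclass look roughly like this (a minimal sketch with example category values; a first-level category has pid None, while a second-level category's pid is the string form of its parent's _id):

    first_id = collection.insert({'classname': '玄幻', 'pid': None})                 # first-level category
    second_id = collection.insert({'classname': '东方玄幻', 'pid': str(first_id)})   # second-level, pid points at its parent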

    2. In the parse callback, grab the first-level categories and loop over them (watch the URL joining: the extracted hrefs are protocol-relative, so prepend "https:").
    Store each first-level category in MongoDB (its pid is None at this point). First-level links do not need to go into Redis; they are only used to reach the second-level pages.
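
    The manual "https:" + href joining works because Qidian's hrefs are protocol-relative. As a sketch (assuming a Scrapy version that provides Response.urljoin, 1.0 or later), the same join can be done more robustly with:

    c = response.urljoin(url[0])   # resolves //www.qidian.com/... against the current page's scheme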

    3. Next, grab the second-level categories (e.g. 东方玄幻) in another callback.
    Take each second-level category's name and link; the name goes into the same MongoDB collection (Novelclass) as the first-level names, while the link goes into Redis (classid = self.insertMongo(name[0], pid), then self.pushRedis(classid, url, pid)).

    def insertMongo(self, classname, pid):
        classid = collection.insert({'classname': classname, 'pid': pid})
        return classid

    def pushRedis(self, classid, url, pid):
        novelurl = '%s,%s,%s' % (classid, url, pid)
        r.lpush('novelurl', novelurl)

    That completes the first stage: both the first-level and second-level categories have been collected.
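
    The hand-off between spiders happens through Redis: pushRedis writes a comma-joined 'classid,url,pid' record, and the next spider's __init__ reads the whole list back. A minimal standalone sketch of that round trip (example values only, not the spider code):

    import redis
    r = redis.Redis(host='127.0.0.1', port=6379, db=0)

    r.lpush('novelurl', 'some-objectid,https://www.qidian.com/all,None')   # example record
    for item in r.lrange('novelurl', 0, -1):
        classid, url, pid = bytes.decode(item).split(',')                  # lrange returns bytes, hence the decode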

    4. With the second-level links (e.g. 东方玄幻) in hand, the next task is to collect the novel names and links under each of them.
    (As before, the novel names go into MongoDB, the Novelname collection, and the links go into Redis under novelnameurl.)
    Note that we define a dict here so that we can get at arr[0], i.e. the second-level category's id (the id of 东方玄幻),
    because we need to know which category each novel belongs to (东方玄幻, 西方玄幻, and so on).

    dict = {}
    novelurl = bytes.decode(item)
    arr = novelurl.split(',')                     # split the 'classid,url,pid' record
    qidianNovelSpider.start_urls.append(arr[1])
    pid = arr[0]    # the second-level category's own _id (e.g. 东方玄幻); do not confuse it with that category's pid, which is the first-level (玄幻) id
    url = arr[1]    # the second-level category's link
    self.dict[url] = {"pid": pid, "num": 0}

    Here num controls how many pages we crawl.
    The same parse callback also handles the next page, so the link we look up has to match the response:

    classInfo = self.dict[response.url]   # response.url is always written exactly like this
    pid = classInfo['pid']                # so pid == arr[0]
    num = classInfo['num']

    if num > 3:   # this is where num earns its keep: only the first 4 pages of each second-level link are crawled (the check happens after a page has already been fetched, so it is 4 pages, not 3)
        return None

    Again, watch the URL joining (otherwise the self.dict[response.url] lookup raises a KeyError).
    For every novel extracted, store the name in MongoDB and the link in Redis:

    classid = collection.insert({'novelname': name, 'pid': pid})   # pid here is the id of 东方玄幻, the unique second-level id, not the id of 玄幻
    print(name)
    self.pushRedis(classid, c, pid)   # classid is the new document's _id, c is the joined link, pid is the id of 东方玄幻

    The first page of each category can now be crawled; next comes the pagination:

    hxs = HtmlXPathSelector(response)
    hx = hxs.select('//li[@class="lbf-pagination-item"]/a[@class="lbf-pagination-next "]')
    urls = hx.select("@href").extract()
    d = "https:" + urls[0]
    classInfo['num'] += 1                       # every page crawled bumps num by one
    self.dict[d] = classInfo
    print(d)
    request = Request(d, callback=self.parse)   # re-enter the callback that extracts names and links
    yield request

    This block grabs the next-page link and feeds it back into the same callback, so the next page's names and links get collected too.
    After that it is the same storage step as above:
    MongoDB -> Novelname, Redis -> novelnameurl.
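
    As an aside, the KeyError risk comes from keying self.dict on the request URL and then looking it up with response.url, which only works if the two strings match exactly. A sketch of an alternative (not the author's code, and it assumes the initial requests are also created with the same meta, e.g. in start_requests) that carries pid and the page counter on the request itself:

    def parse(self, response):
        pid = response.meta['pid']                # set when the request was created
        num = response.meta.get('num', 0)
        if num > 3:
            return
        hxs = HtmlXPathSelector(response)
        # ... extract the novel names and links exactly as before ...
        urls = hxs.select('//li[@class="lbf-pagination-item"]/a[@class="lbf-pagination-next "]/@href').extract()
        yield Request("https:" + urls[0], callback=self.parse, meta={'pid': pid, 'num': num + 1})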

    5. The next task is to update the book information (i.e. update the Novelname collection).
    Right now the Novelname collection holds only book names and no details, so the author, signing status, serializing/completed state and free/VIP status all have to be written into it.
    Note: this is an update, not a new collection. The links still come from novelnameurl in Redis; the extracted information is simply written back into the existing Novelname documents instead of a fresh collection.

    client = pymongo.MongoClient(host="127.0.0.1")
    db = client.QiDian
    collection = db.Novelname      # the same collection as in the previous py file

    As before, I need the _id stored at arr[0]; this time it is the novel's own _id in Novelname (the classid pushed to Redis in step 4), since that is what the update will match on:

    pid = arr[0]
    url = arr[1]
    self.dict[url] = {"pid": pid}

    This time the dict exists to drive the MongoDB update:

    nameInfo = self.dict[response.url]
    pid1 = nameInfo['pid']
    pid = ObjectId(pid1)   # converted back to an ObjectId so that it matches the "_id" key during the update

    Then extract the information:

    hx = hxs.select('//div[@class="book-info "]/h1/span/a[@class="writer"]')
    hx1 = hxs.select('//p[@class="tag"]/span[@class="blue"]')
    hx2 = hxs.select('//p[@class="tag"]/a[@class="red"]')
    for secItem in hx:
        writer = secItem.select("text()").extract()
        print(writer)
    for secItem1 in hx1:
        state = secItem1.select("text()").extract()
        print(state)
    for secItem2 in hx2:
        classes = secItem2.select("text()").extract()
        print(classes)

    Extracted this way, the values print in order for each novel page, rather than all writers first and then all states.
    Update the MongoDB Novelname collection:

    db.Novelname.update({"_id": pid}, {"$set": {"writer": writer, "state": state, "classes": classes}})

    This is where the pid prepared above comes into play, and the update is complete.
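
    The key detail here is the ObjectId round trip: insert() returns an ObjectId, which is stringified when the record goes into Redis, and has to be rebuilt with ObjectId() before it can match "_id" in the update. A minimal standalone sketch (example values only):

    from bson.objectid import ObjectId

    oid = collection.insert({'novelname': ['example'], 'pid': 'category-id'})   # pymongo returns an ObjectId
    pid1 = str(oid)                        # the string form that ends up inside the Redis record
    pid = ObjectId(pid1)                   # rebuilt on the way back
    assert pid == oid                      # the round trip preserves identity
    db.Novelname.update({"_id": pid}, {"$set": {"writer": ['author']}})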

    6. Next, crawl each novel's chapter names and links. (Same pattern as collecting the novel names above, only simpler, because there is no pagination.)
    Again pay attention to the ids: chapter names and links have to be tied to each novel's own _id (not the second-level category's pid).
    They are inserted into MongoDB (Chaptername) and Redis (chapterurl).

    7. The last step: fetch the novel text from each chapter link. Since every chapter link of every novel has already been collected, there is no need to follow "next chapter" links.
    Once the content is fetched it also has to be written back into the chapter collection. The catch is that the chapter text comes out of the <p> tags as separate strings,
    so inserting them one by one into MongoDB would give each string its own _id, which is not what we want; one chapter should map to exactly one _id.
    The fix is string concatenation:

    ii = ""   # start with an empty string
    hx = hxs.select('//div[@class="read-content j_readContent"]/p')

    for secItem in hx:
        contents = secItem.select("text()").extract()
        content1 = contents[0]    # the text of one <p>
        # print(content1)
        ii = ii + content1        # append in page order so the paragraphs stay in sequence
    print(ii)                     # the result we want

    Finally, update Chaptername:

    db.Chaptername.update({"_id": pid}, {"$set": {"content": ii}})
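
    A slightly tidier way to build the chapter text (a sketch only, not the author's code) is to extract all the paragraph texts at once and join them, which keeps them in page order without manual accumulation:

    hx = hxs.select('//div[@class="read-content j_readContent"]/p')
    paragraphs = hx.select("text()").extract()               # one string per <p>, in page order
    db.Chaptername.update({"_id": pid}, {"$set": {"content": "\n".join(paragraphs)}})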


    MongoDB: Novelclass (first- and second-level categories, e.g. 玄幻 / 东方玄幻); Novelname (novel names, later updated with author, serialization state, etc.); Chaptername (chapter names, later updated with each chapter's content).
    Redis: novelurl (second-level links only, e.g. 东方玄幻; the first-level links are never needed); novelnameurl (novel links); chapterurl (chapter links).
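
    The five spiders depend on each other's output (each run fills the Redis list that the next spider reads in its __init__), so they are run one after another. Assuming the files below sit in an ordinary Scrapy project's spiders directory, the run order follows the name attributes in the code:

    scrapy crawl qidianClass2   # categories          -> Novelclass + novelurl
    scrapy crawl qidianClass3   # novel lists         -> Novelname  + novelnameurl
    scrapy crawl qidianClass4   # book-detail updates on Novelname
    scrapy crawl qidianClass5   # chapter lists       -> Chaptername + chapterurl
    scrapy crawl qidianClass6   # chapter-content updates on Chaptername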

    The first .py file

    # -*- coding: utf-8 -*-
    import re
    from urllib.request import urlopen
    from scrapy.http import Request
    # from urllib.request import Request
    from bs4 import BeautifulSoup
    from lxml import etree
    import pymongo
    import scrapy
    from scrapy.selector import HtmlXPathSelector
    client = pymongo.MongoClient(host="127.0.0.1")
    db = client.QiDian
    collection = db.Novelclass          # category collection
    
    
    import redis        # redis client
    r = redis.Redis(host='127.0.0.1', port=6379, db=0)
    
    class qidianClassSpider(scrapy.Spider):
        name = "qidianClass2"
        allowed_domains = ["qidian.com"]   # domains the spider may visit
        start_urls = [
            "https://www.qidian.com/all",
        ]
    
        # parse is called back for each page that finishes downloading
        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            hx = hxs.select('//div[@class="work-filter type-filter"]/ul[@type="category"]/li[@class=""]/a')
            for secItem in hx:
                url = secItem.select("@href").extract()
                c = "https:"+url[0]
                name = secItem.select("text()").extract()
                classid = self.insertMongo(name[0],None)
                print(c)
                # a = db.Novelclass.find()
                # for item in a:
                #     print(item.get('_id'))
                # b = item.get('_id')
                # novelurl = '%s,%s' % (item.get('_id'), c)
                # r.lpush('novelurl', novelurl)
                request = Request(c,callback=lambda response,pid=str(classid):self.parse_subclass(response,pid))
                yield request
        def parse_subclass(self, response,pid):
            hxs = HtmlXPathSelector(response)
            hx = hxs.select('//div[@class="sub-type"]/dl[@class=""]/dd[@class=""]/a')
            for secItem in hx:
                urls = secItem.select("@href").extract()
                url = "https:" + urls[0]
                name = secItem.select("text()").extract()
                classid = self.insertMongo(name[0],pid)
                self.pushRedis(classid,url,pid)
    
        def insertMongo(self,classname,pid):
            classid = collection.insert({'classname':classname,'pid':pid})
            return classid
        def pushRedis(self,classid,url,pid,):
            novelurl = '%s,%s,%s' %(classid,url,pid)
            r.lpush('novelurl',novelurl)
    

  The second .py file

    # -*- coding: utf-8 -*-
    import re
    from urllib.request import urlopen
    from scrapy.http import Request
    import pymongo
    import scrapy
    from time import sleep
    from scrapy.selector import HtmlXPathSelector
    
    client = pymongo.MongoClient(host="127.0.0.1")
    db = client.QiDian
    collection = db.Novelname
    
    import redis  # redis client
    
    r = redis.Redis(host='127.0.0.1', port=6379, db=0)
    
    ii = 0
    
    
    class qidianNovelSpider(scrapy.Spider):
        name = "qidianClass3"
        allowed_domains = ["qidian.com"]
        dict = {}
        start_urls = []
    
        def __init__(self):  # load the category URLs and ids from Redis
    
            a = r.lrange('novelurl', 0, -1)
            # ii = 0
            for item in a:
                novelurl = bytes.decode(item)
                arr = novelurl.split(',')  # split the 'classid,url,pid' record
                qidianNovelSpider.start_urls.append(arr[1])
                pid = arr[0]
                url = arr[1]
                self.dict[url] = {"pid":pid,"num":0}
                # ii +=1
                # if ii>3:
                #     break
                    # qidianNovelSpider.start_urls = start_urls
                # print(start_urls)
    
        def parse(self, response):
    
            classInfo = self.dict[response.url]
            pid = classInfo['pid']
            num = classInfo['num']
            # print(self.dict)
            if num>3:
                return None
            hxs = HtmlXPathSelector(response)
            hx = hxs.select('//div[@class="book-mid-info"]/h4/a')
            for secItem in hx:
                url = secItem.select("@href").extract()
                c = "https:" + url[0]
                name = secItem.select("text()").extract()
                classid = collection.insert({'novelname': name, 'pid': pid})
                print(name)
                self.pushRedis(classid, c, pid)
    
            print('----------- recursion: follow the next page --------------')
    
            hxs = HtmlXPathSelector(response)
            hx = hxs.select('//li[@class="lbf-pagination-item"]/a[@class="lbf-pagination-next "]')
            urls = hx.select("@href").extract()
            d = "https:" + urls[0]
            classInfo['num'] +=1
            self.dict[d] = classInfo
            print(d)
            request = Request(d, callback=self.parse)
            yield request
            print('--------end--------------')
    
        def pushRedis(self, classid, c, pid):
            novelnameurl = '%s,%s,%s' % (classid, c, pid)
            r.lpush('novelnameurl', novelnameurl)
    

  The third .py file

    # -*- coding: utf-8 -*-
    import re
    from urllib.request import urlopen
    from scrapy.http import Request
    import pymongo
    import scrapy
    from time import sleep
    from scrapy.selector import HtmlXPathSelector
    from bson.objectid import ObjectId
    
    client = pymongo.MongoClient(host="127.0.0.1")
    db = client.QiDian
    collection = db.Novelname
    
    import redis  # redis client
    
    r = redis.Redis(host='127.0.0.1', port=6379, db=0)
    
    # ii = 0
    
    
    class qidianNovelSpider1(scrapy.Spider):
        name = "qidianClass4"
        allowed_domains = ["qidian.com"]
        dict = {}
        start_urls = []
    
        def __init__(self):  # load the novel URLs and ids from Redis
    
            a = r.lrange('novelnameurl', 0, -1)
            # ii = 0
            for item in a:
                novelnameurl = bytes.decode(item)
                arr = novelnameurl.split(',')  # split the 'classid,url,pid' record
                qidianNovelSpider1.start_urls.append(arr[1])
                pid = arr[0]
                url = arr[1]
                self.dict[url] = {"pid":pid}
    
        def parse(self, response):
            nameInfo = self.dict[response.url]
            pid1 = nameInfo['pid']
            pid = ObjectId(pid1)
            print(pid)
            hxs = HtmlXPathSelector(response)
            hx = hxs.select('//div[@class="book-info "]/h1/span/a[@class="writer"]')
            hx1 =hxs.select('//p[@class="tag"]/span[@class="blue"]')
            hx2 =hxs.select('//p[@class="tag"]/a[@class="red"]')
            for secItem in hx:
                writer = secItem.select("text()").extract()
                print(writer)
            for secItem1 in hx1:
                state = secItem1.select("text()").extract()
                print(state)
            for secItem2 in hx2:
                classes = secItem2.select("text()").extract()
                print(classes)
                # for item in a:
                #     b = item.get('_id')
                #     print(b)
    
                db.Novelname.update({"_id": pid}, {"$set": {"writer": writer, "state": state, "classes": classes}})
                print('------------------------------------------')
    
    
                # classid = collection.insert({'novelname': name, 'pid': Pid})
                # print(name)
                # self.pushRedis(classid, c, Pid)
    

  The fourth .py file

    # -*- coding: utf-8 -*-
    import re
    from urllib.request import urlopen
    from scrapy.http import Request
    import pymongo
    import scrapy
    from time import sleep
    from scrapy.selector import HtmlXPathSelector
    from bson.objectid import ObjectId
    
    client = pymongo.MongoClient(host="127.0.0.1")
    db = client.QiDian
    collection = db.Chaptername
    
    import redis  # redis client
    
    r = redis.Redis(host='127.0.0.1', port=6379, db=0)
    
    
    class qidianNovelSpider1(scrapy.Spider):
        name = "qidianClass5"
        allowed_domains = ["qidian.com"]
        dict = {}
        start_urls = []
    
        def __init__(self):  # load the novel URLs and ids from Redis
    
            a = r.lrange('novelnameurl', 0, -1)
            # ii = 0
            for item in a:
                novelnameurl = bytes.decode(item)
                arr = novelnameurl.split(',')  # split the 'classid,url,pid' record
                qidianNovelSpider1.start_urls.append(arr[1])
                pid = arr[0]
                url = arr[1]
                self.dict[url] = {"pid":pid}
                print(url)
    
        def parse(self, response):
            nameInfo = self.dict[response.url]
            pid = nameInfo['pid']
            hxs = HtmlXPathSelector(response)
            hx = hxs.select('//div[@class="volume-wrap"]/div[@class="volume"]/ul[@class="cf"]/li/a[@target="_blank"]')
            for secItem in hx:
                urls = secItem.select("@href").extract()
                url = "https:"+urls[0]
                chapter = secItem.select("text()").extract()
    
                print(chapter)
                print(url)
                classid = collection.insert({'chaptername': chapter, 'pid': pid})
                self.pushRedis(classid,url, pid)
    
        def pushRedis(self, classid, url, pid):
            chapterurl = '%s,%s,%s' % (classid, url, pid)
            r.lpush('chapterurl', chapterurl)
    

  The fifth .py file

    # -*- coding: utf-8 -*-
    import re
    from urllib.request import urlopen
    from scrapy.http import Request
    import pymongo
    import scrapy
    from time import sleep
    from scrapy.selector import HtmlXPathSelector
    from bson.objectid import ObjectId
    
    client = pymongo.MongoClient(host="127.0.0.1")
    db = client.QiDian
    collection = db.Chaptername
    
    import redis  # redis client
    
    r = redis.Redis(host='127.0.0.1', port=6379, db=0)
    
    
    class qidianNovelSpider1(scrapy.Spider):
        name = "qidianClass6"
        allowed_domains = ["qidian.com"]
        dict = {}
        start_urls = []
    
        def __init__(self):  # load the chapter URLs and ids from Redis
    
            a = r.lrange('chapterurl', 0, -1)
            # ii = 0
            for item in a:
                chapterurl = bytes.decode(item)
                arr = chapterurl.split(',')  # split the 'classid,url,pid' record
                qidianNovelSpider1.start_urls.append(arr[1])
                pid = arr[0]
                url = arr[1]
                self.dict[url] = {"pid":pid}
                # print(url)
    
    
        def parse(self, response):
            nameInfo = self.dict[response.url]
            pid1 = nameInfo['pid']
            pid = ObjectId(pid1)
            hxs = HtmlXPathSelector(response)
            ii=""
            hx = hxs.select('//div[@class="read-content j_readContent"]/p')
            for secItem in hx:
                contents = secItem.select("text()").extract()
                content1 = contents[0]
                # print(content1)
                ii = ii + content1          # append in page order so the paragraphs stay in sequence
    
                # content = bytes(content1,'GBK')
            # classid = collection.insert({'content': ii, 'pid': pid1})
            db.Chaptername.update({"_id": pid}, {"$set": {"content": ii}})
                # print(content)
                # f = open('1.txt','wb')
                # f.write(content)
                # f.close()
    

    And that's it, all done.

  • Original post: https://www.cnblogs.com/wangyuhangboke/p/7954905.html