    Using MongoDB as storage for a Scrapy novel crawler

    I. Background: While learning MongoDB, I decided to modify my Scrapy novel crawler, which previously stored its data in MySQL, so that it uses MongoDB for storage instead.

    II. Process:

    1. Install MongoDB

    (1) Configure the yum repo

    (python) [root@DL ~]# vi /etc/yum.repos.d/mongodb-org-4.0.repo

    [mongodb-org]
    name=MongoDB Repository
    baseurl=http://mirrors.aliyun.com/mongodb/yum/redhat/7Server/mongodb-org/4.0/x86_64/
    gpgcheck=0
    enabled=1

    (2) Install with yum

    (python) [root@DL ~]# yum -y install mongodb-org

    (3) Start the mongod service

    (python) [root@DL ~]# systemctl start mongod

    (4) Enter the MongoDB shell

    (python) [root@DL ~]# mongo
    MongoDB shell version v4.0.20

    ...

    To enable free monitoring, run the following command: db.enableFreeMonitoring()
    To permanently disable this reminder, run the following command: db.disableFreeMonitoring()
    ---
    >

    (5) Install the pymongo module

    (python) [root@DL ~]# pip install pymongo
    Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
    Collecting pymongo
      Downloading https://pypi.tuna.tsinghua.edu.cn/packages/13/d0/819074b92295149e1c677836d72def88f90814d1efa02199370d8a70f7af/pymongo-3.11.0-cp38-cp38-manylinux2014_x86_64.whl (530kB)
         |████████████████████████████████| 532kB 833kB/s
    Installing collected packages: pymongo
    Successfully installed pymongo-3.11.0
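
    A quick connectivity check confirms that the mongod service and the driver work together before wiring pymongo into Scrapy. This is a minimal sketch; the collection name smoke_test and the test document are made up for illustration, and the "novels" database is the one the pipeline below uses.

    # quick pymongo smoke test (illustrative collection name: smoke_test)
    from pymongo import MongoClient

    conn = MongoClient('localhost', 27017)   # same connection parameters as in pipelines.py below
    db = conn.novels                         # databases and collections are created lazily on first insert
    result = db.smoke_test.insert_one({"hello": "mongodb"})
    print(result.inserted_id)                # ObjectId assigned by MongoDB
    print(db.smoke_test.find_one({"hello": "mongodb"}))
    db.smoke_test.drop()                     # drop the test collection afterwards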

    2. Modify pipelines.py

    (python) [root@localhost xbiquge_w]# vi xbiquge/pipelines.py

    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    import os
    import time
    from pymongo import MongoClient

    class XbiqugePipeline(object):
        conn = MongoClient('localhost', 27017)
        db = conn.novels    # connection object for the "novels" database

        # Empty the novel's collection before a fresh crawl
        def clearcollection(self, name_collection):
            myset = self.db[name_collection]
            myset.remove()

        def process_item(self, item, spider):
            self.name_novel = item['name']
            self.url_firstchapter = item['url_firstchapter']
            self.name_txt = item['name_txt']
            self.db[self.name_novel].insert_one(dict(item))    # store the chapter as one document in the novel's collection
            return item

        # Read the chapter contents back from the database and write them into a txt file
        def content2txt(self, dbname, firsturl, txtname):
            myset = self.db[dbname]
            record_num = myset.find().count()    # number of chapters stored
            print(record_num)
            counts = record_num
            url_c = firsturl
            start_time = time.time()    # start time of the txt-generation step
            f = open(txtname + ".txt", mode='w', encoding='utf-8')    # open "<novel name>.txt" for writing
            for i in range(counts):
                record_m = myset.find({"url": url_c}, {"content": 1, "by": 1, "_id": 0})
                record_content_c2a0 = ''
                for item_content in record_m:
                    record_content_c2a0 = item_content["content"]    # chapter content
                #record_content = record_content_c2a0.replace(u'\xa0', u'')    # strip the special character \xc2\xa0 if needed
                record_content = record_content_c2a0
                f.write('\n')
                f.write(record_content + '\n')
                f.write('\n\n')
                url_ct = myset.find({"url": url_c}, {"next_page": 1, "by": 1, "_id": 0})    # query for the next chapter's link
                for item_url in url_ct:
                    url_c = item_url["next_page"]    # the next chapter's URL becomes the lookup key for the next iteration
            f.close()
            print(time.time() - start_time)
            print(txtname + ".txt has been generated!")
            return

        # When the spider closes, call content2txt to generate the txt file
        def close_spider(self, spider):
            self.content2txt(self.name_novel, self.url_firstchapter, self.name_txt)
            return

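    As the template comment at the top notes, the pipeline must also be registered in the project's settings.py before Scrapy will call it. A minimal sketch, assuming the default layout generated by scrapy startproject xbiquge:

    # xbiquge/settings.py (excerpt)
    ITEM_PIPELINES = {
        'xbiquge.pipelines.XbiqugePipeline': 300,   # 300 is an arbitrary but conventional priority value
    }

    Note that content2txt reassembles the book by following each stored document's next_page field, starting from url_firstchapter, so the chapters are written to the txt file in reading order even though the crawl itself finishes in no particular order.
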
    3. Modify the spider

    (python) [root@localhost xbiquge_w]# vi xbiquge/spiders/sancun.py

    # -*- coding: utf-8 -*-
    import scrapy
    from xbiquge.items import XbiqugeItem
    from xbiquge.pipelines import XbiqugePipeline

    class SancunSpider(scrapy.Spider):
        name = 'sancun'
        allowed_domains = ['www.xbiquge.la']
        #start_urls = ['http://www.xbiquge.la/10/10489/']
        url_ori = "http://www.xbiquge.la"
        url_firstchapter = "http://www.xbiquge.la/10/10489/4534454.html"
        name_txt = "./novels/三寸人间"

        pipeline = XbiqugePipeline()
        pipeline.clearcollection(name)    # empty the novel's collection; a MongoDB collection corresponds to a MySQL table
        item = XbiqugeItem()
        item['id'] = 0    # new id field to make querying easier
        item['name'] = name
        item['url_firstchapter'] = url_firstchapter
        item['name_txt'] = name_txt

        def start_requests(self):
            start_urls = ['http://www.xbiquge.la/10/10489/']
            for url in start_urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            dl = response.css('#list dl dd')    # extract the chapter-link entries
            for dd in dl:
                self.url_c = self.url_ori + dd.css('a::attr(href)').extract()[0]    # build each chapter's full URL
                #yield scrapy.Request(self.url_c, callback=self.parse_c, dont_filter=True)
                yield scrapy.Request(self.url_c, callback=self.parse_c)    # parse_c extracts each chapter's URL, previous/next links and content

        def parse_c(self, response):
            self.item['id'] += 1
            self.item['url'] = response.url
            self.item['preview_page'] = self.url_ori + response.css('div .bottem1 a::attr(href)').extract()[1]
            self.item['next_page'] = self.url_ori + response.css('div .bottem1 a::attr(href)').extract()[3]
            title = response.css('.con_top::text').extract()[4]
            contents = response.css('#content::text').extract()
            text = ''
            for content in contents:
                text = text + content
            self.item['content'] = title + "\n" + text.replace('\15', '\n')    # join title and body; '\15' is the octal escape for ^M (carriage return), replaced with a newline
            yield self.item    # yield the Item to the pipeline

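    Each chapter therefore reaches the pipeline as one item, which is stored as one MongoDB document whose keys mirror the item fields defined in items.py below. A hypothetical example of what dict(item) might look like at insert time (the next-chapter URL and the content text are placeholders in the style of the URLs above):

    # illustrative shape of one stored chapter document (values are placeholders)
    chapter_doc = {
        "id": 1,
        "name": "sancun",
        "url_firstchapter": "http://www.xbiquge.la/10/10489/4534454.html",
        "name_txt": "./novels/三寸人间",
        "url": "http://www.xbiquge.la/10/10489/4534454.html",
        "preview_page": "http://www.xbiquge.la/10/10489/",
        "next_page": "http://www.xbiquge.la/10/10489/4534455.html",   # hypothetical next-chapter URL
        "content": "Chapter title\nChapter text...",
    }
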
    4. Modify items.py

    (python) [root@DL xbiquge_w]# vi xbiquge/items.py

    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://docs.scrapy.org/en/latest/topics/items.html

    import scrapy


    class XbiqugeItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        id = scrapy.Field()
        name = scrapy.Field()
        url_firstchapter = scrapy.Field()
        name_txt = scrapy.Field()
        url = scrapy.Field()
        preview_page = scrapy.Field()
        next_page = scrapy.Field()
        content = scrapy.Field()

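    With pipelines.py, sancun.py and items.py in place, the crawl is normally started from the project directory with "scrapy crawl sancun". If you prefer to launch it from Python instead, a small runner script also works; the file name run.py is just an illustration. Note that the ./novels/ directory referenced by name_txt has to exist beforehand, otherwise the open() call in content2txt will fail.

    # run.py - hypothetical launcher placed in the project root (next to scrapy.cfg)
    from scrapy.cmdline import execute

    execute(["scrapy", "crawl", "sancun"])   # equivalent to running "scrapy crawl sancun" in a shell
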
    III. Summary

    Compared with MySQL, MongoDB is the simpler storage backend for this crawler: there is no table schema or SQL to maintain, and each chapter item is inserted directly as a document with insert_one(dict(item)).
