一、Background: While learning MongoDB, I decided to convert my Scrapy novel crawler, which originally stored its data in MySQL, to use MongoDB as the storage backend.
二、Procedure:
1、Install MongoDB
(1) Configure the yum repo
(python) [root@DL ~]# vi /etc/yum.repos.d/mongodb-org-4.0.repo
[mongodb-org]
name=MongoDB Repository
baseurl=http://mirrors.aliyun.com/mongodb/yum/redhat/7Server/mongodb-org/4.0/x86_64/
gpgcheck=0
enabled=1
(2) Install with yum
(python) [root@DL ~]# yum -y install mongodb-org
(3) Start the mongod service
(python) [root@DL ~]# systemctl start mongod
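systemctl prints nothing on success, so it is worth confirming that the daemon is actually up before going further; a quick check (not part of the original session) is:
systemctl status mongod        # should report "active (running)"
systemctl enable mongod        # optional: start mongod automatically at boot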
(4) Enter the MongoDB shell
(python) [root@DL ~]# mongo
MongoDB shell version v4.0.20
...
To enable free monitoring, run the following command: db.enableFreeMonitoring()
To permanently disable this reminder, run the following command: db.disableFreeMonitoring()
---
>
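The shell is also handy for spot-checking what the crawler writes. Assuming the spider below has already run (database novels, collection sancun), a few standard shell commands are enough:
> show dbs
> use novels
> db.sancun.countDocuments({})   // number of stored chapters
> db.sancun.findOne()            // inspect one chapter document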
(5) Install the pymongo module
(python) [root@DL ~]# pip install pymongo
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting pymongo
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/13/d0/819074b92295149e1c677836d72def88f90814d1efa02199370d8a70f7af/pymongo-3.11.0-cp38-cp38-manylinux2014_x86_64.whl (530kB)
|████████████████████████████████| 532kB 833kB/s
Installing collected packages: pymongo
Successfully installed pymongo-3.11.0
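Before wiring pymongo into the Scrapy project, a short standalone check (my addition, not from the original notes) confirms the driver can reach the local mongod:

# quick connectivity check; assumes mongod listens on the default localhost:27017
from pymongo import MongoClient

conn = MongoClient('localhost', 27017)
print(conn.server_info()['version'])   # prints the server version (4.0.x here) if the connection works
db = conn.novels                       # the "novels" database is created lazily on first insert
print(db.list_collection_names())      # empty list until the crawler has stored something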
2、Modify pipelines.py
(python) [root@localhost xbiquge_w]# vi xbiquge/pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import os
import time
from twisted.enterprise import adbapi
from pymongo import MongoClient

class XbiqugePipeline(object):
    conn = MongoClient('localhost', 27017)
    db = conn.novels  #connection object for the "novels" database
    #name_novel = ''

    #class initialization
    #def __init__(self):

    #spider opened
    #def open_spider(self, spider):
    #    return

    def clearcollection(self, name_collection):
        myset = self.db[name_collection]
        myset.remove()

    def process_item(self, item, spider):
        #if self.name_novel == '':
        self.name_novel = item['name']
        self.url_firstchapter = item['url_firstchapter']
        self.name_txt = item['name_txt']

        exec('self.db.' + self.name_novel + '.insert_one(dict(item))')
        return item

    #read the novel's chapters from the database and write them to a txt file
    def content2txt(self, dbname, firsturl, txtname):
        myset = self.db[dbname]
        record_num = myset.find().count()  #number of chapters stored for the novel
        print(record_num)
        counts = record_num
        url_c = firsturl
        start_time = time.time()  #start time of the content-extraction run
        f = open(txtname + ".txt", mode='w', encoding='utf-8')  #open "<novel name>.txt" for writing
        for i in range(counts):  #iterate once per chapter
            record_m = myset.find({"url": url_c}, {"content": 1, "by": 1, "_id": 0})
            record_content_c2a0 = ''
            for item_content in record_m:
                record_content_c2a0 = item_content["content"]  #chapter content
            #record_content = record_content_c2a0.replace(u'\xa0', u'')  #strip the special character \xc2\xa0
            record_content = record_content_c2a0
            #print(record_content)
            f.write('\n')
            f.write(record_content + '\n')
            f.write('\n')
            url_ct = myset.find({"url": url_c}, {"next_page": 1, "by": 1, "_id": 0})  #query object holding the next chapter's link
            for item_url in url_ct:
                url_c = item_url["next_page"]  #the next chapter's URL becomes url_c for the next iteration
        f.close()
        print(time.time() - start_time)
        print(txtname + ".txt" + " has been generated!")
        return

    #when the spider closes, call content2txt to generate the txt file
    def close_spider(self, spider):
        self.content2txt(self.name_novel, self.url_firstchapter, self.name_txt)
        return
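As the header comment in the file reminds us, the pipeline is only invoked if it is registered in the project's settings.py. The original settings file is not shown here; a minimal entry would look like this (the priority value 300 is an arbitrary choice):

# xbiquge/settings.py (excerpt) -- register the pipeline so process_item is called
ITEM_PIPELINES = {
    'xbiquge.pipelines.XbiqugePipeline': 300,
}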
3、Modify the spider program
(python) [root@localhost xbiquge_w]# vi xbiquge/spiders/sancun.py
# -*- coding: utf-8 -*-
import scrapy
from xbiquge.items import XbiqugeItem
from xbiquge.pipelines import XbiqugePipeline

class SancunSpider(scrapy.Spider):
    name = 'sancun'
    allowed_domains = ['www.xbiquge.la']
    #start_urls = ['http://www.xbiquge.la/10/10489/']
    url_ori = "http://www.xbiquge.la"
    url_firstchapter = "http://www.xbiquge.la/10/10489/4534454.html"
    name_txt = "./novels/三寸人间"

    pipeline = XbiqugePipeline()
    pipeline.clearcollection(name)  #empty the novel's collection; a MongoDB collection is the counterpart of a MySQL table
    item = XbiqugeItem()
    item['id'] = 0  #new id field to make querying easier
    item['name'] = name
    item['url_firstchapter'] = url_firstchapter
    item['name_txt'] = name_txt

    def start_requests(self):
        start_urls = ['http://www.xbiquge.la/10/10489/']
        for url in start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        dl = response.css('#list dl dd')  #extract the chapter-link elements
        for dd in dl:
            self.url_c = self.url_ori + dd.css('a::attr(href)').extract()[0]  #build the full URL of each chapter
            #print(self.url_c)
            #yield scrapy.Request(self.url_c, callback=self.parse_c, dont_filter=True)
            yield scrapy.Request(self.url_c, callback=self.parse_c)  #call parse_c via yield to obtain each chapter's URL, previous/next page links and content
            #print(self.url_c)

    def parse_c(self, response):
        #item = XbiqugeItem()
        #item['name'] = self.name
        #item['url_firstchapter'] = self.url_firstchapter
        #item['name_txt'] = self.name_txt
        self.item['id'] += 1
        self.item['url'] = response.url
        self.item['preview_page'] = self.url_ori + response.css('div .bottem1 a::attr(href)').extract()[1]
        self.item['next_page'] = self.url_ori + response.css('div .bottem1 a::attr(href)').extract()[3]
        title = response.css('.con_top::text').extract()[4]
        contents = response.css('#content::text').extract()
        text = ''
        for content in contents:
            text = text + content
        #print(text)
        self.item['content'] = title + "\n" + text.replace('\015', '\n')  #combine the chapter title and body into content; \015 is the octal escape for ^M and is replaced with a newline
        yield self.item  #yield the populated item to the pipelines module
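With the pipeline registered, the crawl is launched from the project directory in the usual Scrapy way. Note that content2txt writes to ./novels/三寸人间.txt, so the novels directory must exist first (commands added here for completeness, not copied from the original session):
(python) [root@localhost xbiquge_w]# mkdir -p novels
(python) [root@localhost xbiquge_w]# scrapy crawl sancun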
4、Modify items.py
(python) [root@DL xbiquge_w]# vi xbiquge/items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class XbiqugeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    id = scrapy.Field()
    name = scrapy.Field()
    url_firstchapter = scrapy.Field()
    name_txt = scrapy.Field()
    url = scrapy.Field()
    preview_page = scrapy.Field()
    next_page = scrapy.Field()
    content = scrapy.Field()
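Since process_item calls insert_one(dict(item)), every field declared here simply becomes a key in the stored document, with no schema to migrate. A document in the novels.sancun collection therefore has roughly this shape (values below are illustrative, not real crawl output):

{
    "id": 1,
    "name": "sancun",
    "url_firstchapter": "http://www.xbiquge.la/10/10489/4534454.html",
    "name_txt": "./novels/三寸人间",
    "url": "...",
    "preview_page": "...",
    "next_page": "...",
    "content": "..."
}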
三、Summary
As a storage backend for the crawler, MongoDB is simpler to work with than MySQL: a collection needs no predefined schema, so each item can be inserted directly as a dict via insert_one(dict(item)), and there is no table-creation or column-mapping SQL to maintain.