Introduction
Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs, such as data mining, information processing, and archiving historical data.
It was originally designed for page scraping (more precisely, web scraping), but it can also be used to fetch data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler. Scrapy is broadly useful for data mining, monitoring, and automated testing.
Scrapy uses the Twisted asynchronous networking library to handle network communication. The overall architecture is outlined below.
Scrapy Components
- Engine (Scrapy Engine)
Handles the data flow of the whole system and triggers events (the core of the framework).
- Scheduler
Accepts requests from the engine, pushes them into a queue, and returns them when the engine asks again. Think of it as a priority queue of URLs (the addresses, or links, of pages to crawl); it decides which URL to crawl next and removes duplicate URLs.
- Downloader
Downloads page content and hands it back to the spiders (the Scrapy downloader is built on Twisted, an efficient asynchronous networking model).
- Spiders
Spiders do the main work: they extract the information you need, i.e. the items, from specific pages. They can also extract links from pages so that Scrapy goes on to crawl the next page.
- Item Pipeline
Processes the items extracted by spiders; its main jobs are persisting items, validating them, and dropping unwanted data. Once a page has been parsed by a spider, its items are sent to the item pipeline and pass through several components in a fixed order.
- Downloader Middlewares
Hooks that sit between the Scrapy engine and the downloader; they mainly process the requests and responses exchanged between the two (see the sketch after this list).
- Spider Middlewares
Hooks that sit between the Scrapy engine and the spiders; they mainly process the spiders' response input and request output.
- Scheduler Middlewares
Middleware between the Scrapy engine and the scheduler; it processes the requests and responses sent between the engine and the scheduler.
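As a hedged illustration of the downloader-middleware hook (a minimal sketch with a made-up class name, not part of the project built later), a custom middleware only needs process_request/process_response methods plus an entry in the DOWNLOADER_MIDDLEWARES setting:

    class CustomHeaderMiddleware:
        # Hypothetical downloader middleware: sits between the engine and the downloader
        # and can inspect or modify every request and response that passes through.

        def process_request(self, request, spider):
            # Called for each request before it is sent to the downloader.
            request.headers.setdefault('X-Crawled-By', spider.name)
            return None  # None means "continue normal processing"

        def process_response(self, request, response, spider):
            # Called for each response on its way back to the spider.
            spider.logger.debug("got %s for %s", response.status, request.url)
            return response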
Scrapy Workflow
1) The engine takes a URL from the scheduler for the next crawl.
2) The engine wraps the URL into a Request and passes it to the downloader.
3) The downloader fetches the resource and wraps it into a Response.
4) The spider parses the Response.
5) Parsed items are handed to the item pipeline for further processing.
6) Parsed links (URLs) are handed back to the scheduler to wait for crawling (see the sketch after this list).
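A minimal spider makes this loop concrete (a hedged sketch with made-up names, not the project built later): items yielded from parse() go to the item pipeline (step 5), while yielded Requests go back to the scheduler (step 6).

    import scrapy

    class FlowDemoSpider(scrapy.Spider):
        # Hypothetical spider used only to illustrate the request/item flow.
        name = 'flow_demo'
        start_urls = ['http://example.com/']

        def parse(self, response):
            # Step 5: a parsed item is handed to the item pipeline.
            yield {'url': response.url, 'title': response.css('title::text').get()}
            # Step 6: extracted links become new Requests and are queued by the scheduler.
            for href in response.css('a::attr(href)').getall():
                yield response.follow(href, callback=self.parse)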
Installing Scrapy
(1) Dependencies
lxml, an efficient XML and HTML parser
parsel, an HTML/XML data extraction library written on top of lxml
w3lib, a multi-purpose helper for dealing with URLs and web page encodings
twisted, an asynchronous networking framework
cryptography and pyOpenSSL, to deal with various network-level security needs
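On most platforms these do not need to be installed one by one: pip install scrapy pulls them in automatically, and the step-by-step wheel downloads below are mainly a workaround for Windows environments where building lxml or Twisted from source fails. If you do want to install the dependencies explicitly, something like the following should work:

pip install lxml parsel w3lib Twisted cryptography pyOpenSSL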
(2) Installation
1. Install wheel
pip install wheel
2. Install lxml
https://pypi.python.org/pypi/lxml/4.1.0
3. Install pyOpenSSL
https://pypi.python.org/pypi/pyOpenSSL/17.5.0
4. Install Twisted
https://www.lfd.uci.edu/~gohlke/pythonlibs/
5. Install pywin32
https://sourceforge.net/projects/pywin32/files/
6. Install Scrapy
pip install scrapy
To verify that the installation succeeded, run:
scrapy version
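The same check can also be done from a Python shell (a minimal sketch; the printed version depends on what was installed):

    import scrapy
    print(scrapy.__version__)   # e.g. 1.x or 2.x depending on the install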
Creating the First Project
Create the scrapytest project:
scrapy startproject scrapytest
scrapytest/
    scrapy.cfg            # deploy configuration file
    scrapytest/           # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file (see the sketch below)
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
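items.py is where structured Item classes can be declared. The spider in this tutorial yields plain dicts, so defining one is optional, but a hedged sketch matching the fields used later might look like this (the class name is made up):

    import scrapy

    class AnnouncementItem(scrapy.Item):
        # Hypothetical Item mirroring the dict fields yielded by the spider below.
        name = scrapy.Field()
        announcementTitle = scrapy.Field()
        url = scrapy.Field()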
Go into the scrapytest directory and create a basic spider with:
cd scrapytest
scrapy genspider Mycninfo cninfo.com.cn
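genspider creates a skeleton spider at scrapytest/scrapytest/spiders/Mycninfo.py, roughly like the following (the exact template varies slightly between Scrapy versions):

    import scrapy

    class MycninfoSpider(scrapy.Spider):
        name = 'Mycninfo'
        allowed_domains = ['cninfo.com.cn']
        start_urls = ['http://cninfo.com.cn/']

        def parse(self, response):
            pass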
Edit the following files.
【scrapytest/scrapytest/spiders/Mycninfo.py】
# -*- coding: utf-8 -*-
import scrapy
import json
import time


# Run with: scrapy crawl Mycninfo
class MycninfoSpider(scrapy.Spider):
    name = 'Mycninfo'
    allowed_domains = ['cninfo.com.cn']
    start_urls = ['http://www.cninfo.com.cn/new/index']

    def parse(self, response):
        # Search for the stock code to find its orgId on cninfo.
        keyWord = '002304'
        url = 'http://www.cninfo.com.cn/new/information/topSearch/query'
        yield scrapy.FormRequest(
            url=url,
            formdata={"keyWord": keyWord, "maxNum": "10"},
            callback=self.parse_url_info
        )

    def parse_url_info(self, response):
        try:
            # The endpoint returns a JSON array; parse it with json.loads instead of eval().
            data = json.loads(response.text)
            orgId = data[0].get("orgId")
            code = data[0].get("code")
            url = "http://www.cninfo.com.cn/new/disclosure/stock?stockCode={}&orgId={}".format(code, orgId)
            yield scrapy.Request(url=url, meta={"code": code, "orgId": orgId},
                                 callback=self.parse_page, dont_filter=True)
        except Exception as e:
            print(e)

    def parse_page(self, response):
        try:
            # Query the historical announcement list (annual reports) for this stock.
            code = response.meta.get("code")
            orgId = response.meta.get("orgId")
            url = "http://www.cninfo.com.cn/new/hisAnnouncement/query"
            plate = "sh" if code.startswith("60") else "sz"
            parm = {"sh": {"plate": "sh", "column": "sse"}, "sz": {"plate": "sz", "column": "szse"}}
            query = {
                'stock': '{},{}'.format(code, orgId),
                'tabName': 'fulltext',
                'pageSize': '30',
                'pageNum': '1',
                'column': parm[plate]["column"],
                'category': 'category_ndbg_szsh',
                'plate': parm[plate]["plate"],
                'seDate': '',
                'searchkey': '',
                'secid': '',
                'sortName': '',
                'sortType': '',
                'isHLtitle': 'true'
            }
            yield scrapy.FormRequest(
                url=url,
                formdata=query,
                callback=self.parse_ndbg
            )
        except Exception as e:
            print(e)

    def parse_ndbg(self, response):
        try:
            # Build a download URL for each announcement and yield it as an item.
            data_json = json.loads(response.text)
            for data in data_json.get('announcements', []):
                name = data.get("secName")
                announcementId = data.get("announcementId")
                announcementTitle = data.get("announcementTitle")
                announcementTime = data.get("announcementTime")
                t = time.localtime(int(announcementTime) // 1000)
                date = time.strftime("%Y-%m-%d", t)
                url = "http://www.cninfo.com.cn/new/announcement/download?bulletinId={}&announceTime={}".format(
                    announcementId, date)
                yield {"name": name, "announcementTitle": announcementTitle, "url": url}
        except Exception as e:
            print(e)
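To sanity-check the spider before the download pipeline is wired up, the yielded dicts can be dumped with Scrapy's built-in feed export, for example:

scrapy crawl Mycninfo -o announcements.json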
【scrapytest/scrapytest/pipelines.py】
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import scrapy
from scrapy.pipelines.files import FilesPipeline
from scrapy.exceptions import DropItem


class DownloadPipeline(FilesPipeline):
    # File-download pipeline: fetches each item's PDF via FilesPipeline.

    def get_media_requests(self, item, info):
        # The item fields come from the spider; meta passes them on to file_path below.
        yield scrapy.Request(url=item['url'],
                             meta={'text': item['name'], 'name': item['announcementTitle']})

    def item_completed(self, results, item, info):
        # results is a list of (success, file_info) tuples; the boolean says whether the download succeeded.
        if not results or not results[0][0]:
            raise DropItem('download failed')
        return item

    def file_path(self, request, response=None, info=None, *, item=None):
        # Override to rename the file; without this, files are stored under their hash.
        # (item is accepted for newer Scrapy versions and unused here.)
        name = request.meta['name']
        text = request.meta['text']
        # Store one sub-directory per security: <security name>/<announcement title>.pdf
        filename = u'{0}/{1}.pdf'.format(text, name)
        print(filename)
        return filename
【scrapytest/scrapytest/settings.py】
# -*- coding: utf-8 -*-

BOT_NAME = 'scrapytest'

SPIDER_MODULES = ['scrapytest.spiders']
NEWSPIDER_MODULE = 'scrapytest.spiders'
LOG_LEVEL = 'ERROR'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'scrapytest (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

DOWNLOAD_DELAY = 1

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
    'Accept-Language': 'zh-CN,zh;q=0.9',
}

# Root directory where downloaded files are stored
FILES_STORE = 'data'

ITEM_PIPELINES = {
    'scrapytest.pipelines.DownloadPipeline': 300,
}
Run the spider
cd scrapytest
scrapy crawl Mycninfo