Scrapy is a comprehensive, all-in-one crawling framework.

Installation:
- Win:
    Download the Twisted wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
    pip3 install wheel
    pip3 install Twisted-18.4.0-cp36-cp36m-win_amd64.whl
    pip3 install pywin32
    pip3 install scrapy
- Linux:
    pip3 install scrapy
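On either platform, a quick way to confirm the install succeeded is to import Scrapy and print its version (the `scrapy version` command prints the same string):

# check_install.py - sanity check that Scrapy is importable
import scrapy
print(scrapy.__version__)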
Usage

# Create a project
scrapy startproject xdb
cd xdb

# Create spiders
scrapy genspider chouti chouti.com
scrapy genspider cnblogs cnblogs.com

# Run a spider
scrapy crawl chouti
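For reference, `scrapy genspider chouti chouti.com` creates a skeleton spider in spiders/chouti.py; it looks roughly like the following (the exact template varies by Scrapy version):

# spiders/chouti.py - skeleton produced by genspider (approximate)
import scrapy

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'                       # name used by `scrapy crawl chouti`
    allowed_domains = ['chouti.com']
    start_urls = ['http://chouti.com/']

    def parse(self, response):
        pass                              # fill in the parsing logic here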
Workflow

1. Create a project
       scrapy startproject project_name

   Generated layout:
       project_name/                # outer folder created by startproject
           project_name/            # the project package
               spiders/             # spider files
                   chouti.py
                   cnblogs.py
                   ...
               items.py             # persistence: item definitions (see the sketch after this list)
               pipelines.py         # persistence: item pipelines
               middlewares.py       # middlewares
               settings.py          # configuration (for the crawler)
           scrapy.cfg               # configuration (for deployment)

2. Create spiders
       cd project_name
       scrapy genspider chouti chouti.com
       scrapy genspider cnblogs cnblogs.com

3. Run a spider
       scrapy crawl chouti
       scrapy crawl chouti --nolog    # suppress log output
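In the layout above, items.py and pipelines.py together implement persistence: spiders yield items, and a registered pipeline writes them out. A minimal sketch for the xdb project (class and field names here are illustrative, not taken from the original code):

# items.py - declare the fields a scraped item carries
import scrapy

class XdbItem(scrapy.Item):
    title = scrapy.Field()
    href = scrapy.Field()

# pipelines.py - persist every item the spiders yield
class XdbPipeline(object):
    def open_spider(self, spider):
        # called once when the spider starts
        self.f = open('news.log', mode='a+', encoding='utf-8')

    def process_item(self, item, spider):
        # called for every yielded item; must return it so later pipelines run
        self.f.write(item['href'] + '\n')
        return item

    def close_spider(self, spider):
        # called once when the spider finishes
        self.f.close()

# settings.py - register the pipeline (lower number = runs earlier)
ITEM_PIPELINES = {
    'xdb.pipelines.XdbPipeline': 300,
}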
A concrete example: crawling chouti

# -*- coding: utf-8 -*-
import io
import sys

import scrapy
from scrapy.http import Request

# Re-wrap stdout so Chinese text prints correctly on a Windows console
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']     # targeted crawl: stay on this one site
    start_urls = ['http://chouti.com/']  # starting URL

    def parse(self, response):
        # Callback: extract every news item on the page
        item_list = response.xpath('//div[@id="content-list"]/div[@class="item"]')
        with open('news.log', mode='a+', encoding='utf-8') as f:
            for item in item_list:
                text = item.xpath('.//a/text()').extract_first()
                href = item.xpath('.//a/@href').extract_first()
                if href is None:
                    continue             # guard against items without a link
                print(href.strip())
                if text:
                    print(text.strip())
                f.write(href.strip() + '\n')  # one link per line

        # Follow pagination links, e.g. https://dig.chouti.com/all/hot/recent/2
        page_list = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        for page in page_list:
            page = "https://dig.chouti.com" + page
            yield Request(url=page, callback=self.parse)
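Besides the `scrapy crawl chouti` command line, a spider can also be launched from a plain Python script. A minimal sketch using Scrapy's CrawlerProcess (the file name run.py is arbitrary; run it from the project root so get_project_settings() can find settings.py):

# run.py - start the chouti spider programmatically
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('chouti')   # spider name, as set by the `name` attribute
process.start()           # blocks until the crawl finishes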