Scrapy Crawler Example 01: Paginated Crawling

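Until now I wrote my Python crawlers by hand: the requests library to fetch pages, and BeautifulSoup (or pyquery, lxml, etc.) to parse them. I had never used a full framework. I had heard of Scrapy for a long time and always wanted to look into it. This series records the code and notes from learning to use Scrapy.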

Installation

Installing Scrapy is simple, and the official documentation covers it in detail: http://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/install.html . I won't go into it here.

Creating the project

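I develop in PyCharm. Open PyCharm and, in the "Terminal" panel at the bottom, run the command "scrapy startproject freebuf". This creates a Scrapy project named "freebuf" in your workspace, as the screenshot in the original post shows.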

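In the screenshot, the first attempt failed because a "freebuf" project already existed in my workspace, so I used the name "freebuf2" instead, which succeeded. The directory layout of freebuf2, with notes, is as follows: [screenshot of the freebuf2 directory tree]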

Writing the spider

freebuf2Spider.py

Select the "spiders" folder, right-click "New" -> "Python File", enter the file name "freebuf2Spider", and add the following code.

# coding: utf-8
import scrapy

from freebuf2.items import Freebuf2Item


class freebuf2Spider(scrapy.Spider):
    name = 'freebuf2'
    allowed_domains = []
    start_urls = ["http://www.freebuf.com/"]

    def parse(self, response):
        # Listing page: request every article linked from the news list.
        for link in response.xpath("//div[contains(@class, 'news_inner news-list')]/div/a/@href").extract():
            # If this is hard to follow, read up on yield first. I think of it
            # as a coroutine (execution pauses here and resumes later), which
            # I find easier to understand.
            # urljoin leaves absolute URLs unchanged and resolves relative ones.
            yield scrapy.Request(response.urljoin(link), callback=self.parse_next)

        # Find the link to the next listing page, i.e. the pagination link.
        next_url = response.xpath("//div[@class='news-more']/a/@href").extract()
        if next_url:
            yield scrapy.Request(response.urljoin(next_url[0]), callback=self.parse)

    def parse_next(self, response):
        # Article page: collect the fields declared in Freebuf2Item.
        item = Freebuf2Item()
        item['title'] = response.xpath("//h2/text()").extract()
        item['url'] = response.url
        item['date'] = response.xpath("//div[@class='property']/span[@class='time']/text()").extract()
        item['tags'] = response.xpath("//span[@class='tags']/a/text()").extract()
        yield item

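Since the spider leans heavily on yield, here is a minimal sketch of the same control flow using only the standard library. It is not how Scrapy is implemented; the PAGES dictionary, the crawl function and the tuple format are all made up for illustration. It only shows the idea that a callback is a generator, and that something outside it consumes the yielded requests and items one at a time.

```python
from collections import deque

# Hypothetical in-memory "site" (not real freebuf.com data): each listing
# page has some article links and an optional link to the next listing page.
PAGES = {
    "/page/1": {"articles": ["/a/1", "/a/2"], "next": "/page/2"},
    "/page/2": {"articles": ["/a/3"], "next": None},
}


def parse(url):
    """Listing callback: yield one request per article, then the next page."""
    page = PAGES[url]
    for link in page["articles"]:
        yield ("request", link, parse_next)
    if page["next"]:
        yield ("request", page["next"], parse)


def parse_next(url):
    """Article callback: yield one scraped item."""
    yield ("item", {"url": url}, None)


def crawl(start_url):
    """Toy scheduler: run each callback and queue the requests it yields."""
    queue = deque([(start_url, parse)])
    items = []
    while queue:
        url, callback = queue.popleft()
        for kind, payload, next_callback in callback(url):
            if kind == "request":
                queue.append((payload, next_callback))
            else:
                items.append(payload)
    return items


scraped = crawl("/page/1")
print([item["url"] for item in scraped])
```

The callbacks never call each other directly. They only describe what should happen next, which is why parse can name itself as the callback for the next page without recursing.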
items.py

An Item object is a simple container; you can think of it as a dict that holds the scraped data. The code is as follows:

import scrapy


class Freebuf2Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    date = scrapy.Field()
    url = scrapy.Field()
    tags = scrapy.Field()

Being a polite crawler

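The code above is already a working paginated crawler, but a polite crawler should also space out its requests. Add "DOWNLOAD_DELAY = 3" to settings.py so that Scrapy waits about 3 seconds between consecutive requests to the same site (by default the actual wait is randomized between 0.5 and 1.5 times this value).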
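As a rough sketch, the relevant part of settings.py might look like the fragment below. Only DOWNLOAD_DELAY comes from this post; the other settings are real Scrapy options that I am adding as assumptions about what else a polite crawler may want, and the values are illustrative rather than recommendations.

```python
# settings.py (fragment)

# From this post: base delay in seconds between requests to the same site.
DOWNLOAD_DELAY = 3

# Scrapy's default. With it on, each wait is a random value between
# 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY.
RANDOMIZE_DOWNLOAD_DELAY = True

# Assumptions, not part of the original post:
ROBOTSTXT_OBEY = True                # respect the site's robots.txt
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # one request at a time per domain

# The resulting range of waits, just to make the arithmetic explicit.
min_wait = 0.5 * DOWNLOAD_DELAY
max_wait = 1.5 * DOWNLOAD_DELAY
print(min_wait, max_wait)
```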

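That's it. In PyCharm's "Terminal" (cmd works too), change to the freebuf2 project directory (the outer freebuf2 folder, the one containing scrapy.cfg) and run "scrapy crawl freebuf2 -o freebuf2.csv". To stop the crawl, press Ctrl+C. Finally, let's look at the data.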

Data: [screenshot of the scraped data]
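To inspect the export without opening a spreadsheet, the file can be read back with the standard csv module. This is a sketch under assumptions: the sample row below is invented (real rows will differ), load_items is a hypothetical helper, and I am assuming that list-valued fields such as tags were joined with commas, which as far as I know is the default behaviour of Scrapy's CSV exporter.

```python
import csv
import io

# Invented sample standing in for freebuf2.csv; not real scraped data.
SAMPLE_CSV = (
    "title,date,url,tags\n"
    'Example article,2016-12-20,http://example.com/article-1,"web,security"\n'
)


def load_items(fileobj):
    """Read exported rows and split the comma-joined tags back into a list."""
    rows = []
    for row in csv.DictReader(fileobj):
        row["tags"] = [tag for tag in row["tags"].split(",") if tag]
        rows.append(row)
    return rows


items = load_items(io.StringIO(SAMPLE_CSV))
print(items[0]["title"], items[0]["tags"])
```

For the real file, pass an open file object instead of the in-memory sample, for example open("freebuf2.csv", newline="", encoding="utf-8").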

　　

Original post: https://www.cnblogs.com/bluesky-ivy/p/6203603.html
