
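Scrapy

Installation
Scrapy architecture
Configuration file and directory layout
Crawling and parsing data
Persistence
Passing parameters between Scrapy requests
Improving crawler efficiency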

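Scrapy

Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs, including data mining, information processing, and archiving historical data.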

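Installation

macOS, Linux: pip3 install scrapy

Windows: pip3 install scrapy

If that fails on Windows:

1 pip3 install wheel # once installed, packages can be installed from wheel files; wheel files are available at https://www.lfd.uci.edu/~gohlke/pythonlibs

2 pip3 install lxml

3 pip3 install pyopenssl

4 Download and install pywin32

5 Download the Twisted wheel file that matches your Python version and architecture

6 Run pip3 install <download directory>\Twisted-17.9.0-cp36-cp36m-win_amd64.whl

7 pip3 install scrapy

After installation, the scrapy.exe executable is in Python's Scripts folder.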

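Create a Scrapy project: scrapy startproject <project name> (comparable to creating a project in Django)

Create a spider: scrapy genspider <spider name> <domain to crawl> # a project can contain several spiders

scrapy genspider SP www.xxx.com

Start a spider

scrapy crawl <spider name>

scrapy crawl <spider name> --nolog

Running a spider without the command line

Create a main.py in the project root (next to scrapy.cfg) and run it from the IDE:
from scrapy.cmdline import execute
# execute() takes the same argument list as the scrapy command line
execute(['scrapy', 'crawl', 'chouti'])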

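Scrapy architecture

# Engine (ENGINE) (the coordinator)
The engine controls the data flow between all components of the system and triggers events when certain actions occur. For details, see the data flow section of the Scrapy architecture documentation.
# Scheduler (SCHEDULER)
Accepts requests sent by the engine, pushes them onto a queue, and hands them back when the engine asks for the next one. Think of it as a priority queue of URLs: it decides which URL is crawled next and filters out duplicate URLs.
# Downloader (DOWNLOADER)
Downloads page content and returns it to the ENGINE. The downloader is built on Twisted, an efficient asynchronous networking framework.
# Spiders (SPIDERS)
Spiders are classes written by the developer to parse responses and either extract items or issue new requests.
# Item pipelines (ITEM PIPELINES)
Process items after they have been extracted; typical tasks are cleaning, validation, and persistence (for example, writing to a database).


# Two kinds of middleware
- Spider middleware
- Downloader middleware (used most often: adding headers, proxies, and cookies, or integrating selenium)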
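
To make the division of labour concrete, here is a toy, standard-library-only simulation of the loop described above. It is not Scrapy code: FAKE_WEB, Scheduler, download, spider_parse and run_engine are all hypothetical names invented for this sketch, and the ordering rule of the toy queue is a simplification.

```python
import heapq
from itertools import count

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
FAKE_WEB = {
    "https://example.com/": ("home", ["https://example.com/a", "https://example.com/b"]),
    "https://example.com/a": ("page a", ["https://example.com/b"]),
    "https://example.com/b": ("page b", ["https://example.com/"]),
}


class Scheduler:
    """Toy priority queue of URLs that drops duplicates."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._order = count()  # tie-breaker keeps insertion order stable

    def enqueue(self, url, priority=0):
        if url in self._seen:
            return False  # duplicate URL: filtered out
        self._seen.add(url)
        # In this toy, lower numbers pop first; Scrapy's own rules differ.
        heapq.heappush(self._heap, (priority, next(self._order), url))
        return True

    def next_url(self):
        return heapq.heappop(self._heap)[2] if self._heap else None


def download(url):
    """Stand-in for the downloader: a dictionary lookup instead of a network call."""
    return FAKE_WEB[url]


def spider_parse(url, page):
    """Stand-in for a spider callback: yields one item, then URLs to follow."""
    text, links = page
    yield {"url": url, "text": text}
    for link in links:
        yield link


def run_engine(start_urls):
    """The engine loop: moves data between scheduler, downloader, spider and pipeline."""
    scheduler, stored = Scheduler(), []
    for url in start_urls:
        scheduler.enqueue(url)
    while (url := scheduler.next_url()) is not None:
        for result in spider_parse(url, download(url)):
            if isinstance(result, dict):
                stored.append(result)      # item goes to the "pipeline"
            else:
                scheduler.enqueue(result)  # new request goes back to the scheduler
    return stored


items = run_engine(["https://example.com/"])
print([item["url"] for item in items])
```

Each of the three pages is fetched once even though they link to each other in a cycle, because the scheduler remembers every URL it has already accepted.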

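Configuration file and directory layout

Directory layout

-crawl_chouti   # project name
  -crawl_chouti # package folder with the same name as the project
    -spiders    # holds the spiders; every spider created by genspider goes here
      -__init__.py
      -chouti.py # chouti spider
      -cnblogs.py # cnblogs spider
    -items.py     # comparable to models.py in Django; define the model classes here
    -middlewares.py  # middleware (spider middleware and downloader middleware) goes here
    -pipelines.py   # persistence code (to a file, MySQL, Redis, MongoDB)
    -settings.py    # configuration file
  -scrapy.cfg       # deployment-related; can usually be ignored

Configuration file

ROBOTSTXT_OBEY = False   # whether to obey robots.txt; False crawls regardless of it
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'    # User-Agent request header
LOG_LEVEL = 'ERROR' # with this setting only error messages are printed,
  # so a plain `scrapy crawl <spider name>` produces no routine log output;
  # unlike `scrapy crawl <spider name> --nolog`, errors are still shown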

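Spider file

import scrapy


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'   # spider name
    allowed_domains = ['dig.chouti.com']  # domains the spider may crawl (domain names only, no scheme or path)
    start_urls = ['https://dig.chouti.com/']   # starting point; requests are sent here as soon as the spider starts

    def parse(self, response):  # called automatically with each downloaded response; do the parsing here
        print('---------------------------',response)

Printed result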

<200 https://dig.chouti.com/>

Crawling and parsing data

# Everything selected with css or xpath comes back as a list
# First match: extract_first()
# All matches: extract()

# Built-in selectors
# response.css
# response.xpath

# css selectors for text and attributes
# .link-title::text
# .link-title::attr(href)

# xpath selectors for text and attributes
# .//a[contains(@class,"link-title")]/text()
# .//a[contains(@class,"link-title")]/@href

chouti.py

# -*- coding: utf-8 -*-
import scrapy
from ..items import TttItem

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'  # spider name
    # allowed_domains = ['dig.chouti.com']
    start_urls = ['https://dig.chouti.com/']

    # def parse(self, response):  # called automatically with each downloaded response; do the parsing here
    #     print(response)

        # css selector version
        # div_list = response.css('div.link-item')
        # for div in div_list:
        #     title=div.css('.link-title::text').extract_first()
        #     url=div.css('.link-title::attr(href)').extract_first()
        #     img_url=div.css('.matching::attr(src)').extract_first()
        #     print('''
        #     Title: %s
        #     Link: %s
        #     Image: %s
        #     '''%(title,url,img_url))

    def parse(self, response):  # called automatically with each downloaded response; do the parsing here
        # xpath selector version
        div_list = response.xpath('//div[contains(@class,"link-item")]')
        for div in div_list:
            title=div.xpath('.//a[contains(@class,"link-title")]/text()').extract_first()
            url=div.xpath('.//a[contains(@class,"link-title")]/@href').extract_first()
            img_url=div.xpath('.//*[contains(@class,"matching")]/@src').extract_first()
            id=div.xpath('.//a[contains(@class,"link-title")]/@data-id').extract_first()
            print('''
            Title: %s
            Link: %s
            Image: %s
            '''%(title,url,img_url))

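Persistence

Approach 1 (for reference)

1 Have the parse function return a list of dicts
2 scrapy crawl chouti -o aa.json   (supported formats: 'json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle')

Approach 2 (pipelines)

1 Create a model class in items.py

2 In the spider chouti.py, import it and put the parsed data into an item object (use square brackets, not attribute access)
from ..items import TttItem

3 yield the item object

4 Register the pipelines in the configuration file
       ITEM_PIPELINES = {
        # the number is the priority (the lower the number, the earlier the pipeline runs)
        # each key is the dotted path to a class defined in pipelines.py
       'crawl_chouti.pipelines.CrawlChoutiPipeline': 300,
       'crawl_chouti.pipelines.CrawlChoutiRedisPipeline': 301,
       }

5 Write the persistence classes in pipelines.py
       -open_spider
       -close_spider
       -process_item (the actual saving happens here)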
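
The ordering and the three hooks listed above can be illustrated with a small stand-alone sketch. It does not use Scrapy: run_pipelines is a hypothetical driver written for this example, and CleanPipeline and CollectPipeline are invented classes. The point is only that the pipeline with the lower number sees each item first and that whatever process_item returns is what the next pipeline receives.

```python
class CleanPipeline:
    """Registered with the lower number, so it runs first: trims the title."""

    def open_spider(self, spider):
        print("CleanPipeline opened")

    def process_item(self, item, spider):
        item["title"] = item["title"].strip()
        return item  # returning the item hands it to the next pipeline

    def close_spider(self, spider):
        print("CleanPipeline closed")


class CollectPipeline:
    """Runs second: stores whatever the previous pipeline returned."""

    def open_spider(self, spider):
        self.rows = []

    def process_item(self, item, spider):
        self.rows.append(item["title"])
        return item

    def close_spider(self, spider):
        print("collected", len(self.rows), "items")


def run_pipelines(pipeline_order, items, spider=None):
    """Toy driver: create the pipelines and call their hooks in ascending order."""
    ordered = sorted(pipeline_order.items(), key=lambda pair: pair[1])
    pipelines = [cls() for cls, _ in ordered]
    for pipeline in pipelines:
        pipeline.open_spider(spider)
    for item in items:
        for pipeline in pipelines:
            item = pipeline.process_item(item, spider)
    for pipeline in pipelines:
        pipeline.close_spider(spider)
    return pipelines


# Same shape as ITEM_PIPELINES, but keyed by class instead of dotted path.
ORDER = {CollectPipeline: 301, CleanPipeline: 300}
_, collector = run_pipelines(ORDER, [{"title": "  first  "}, {"title": "second "}])
print(collector.rows)
```

Because CleanPipeline is registered with 300 and CollectPipeline with 301, the collected titles are already stripped, regardless of the order in which the dictionary lists them.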

chouti.py

Saving to a file

# -*- coding: utf-8 -*-
import scrapy
from ..items import TttItem

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'  # spider name
    # allowed_domains = ['dig.chouti.com']
    start_urls = ['https://dig.chouti.com/']

    # def parse(self, response):  # called automatically with each downloaded response; do the parsing here
    #     print(response)

        # css selector version
        # div_list = response.css('div.link-item')
        # for div in div_list:
        #     title=div.css('.link-title::text').extract_first()
        #     url=div.css('.link-title::attr(href)').extract_first()
        #     img_url=div.css('.matching::attr(src)').extract_first()
        #     print('''
        #     Title: %s
        #     Link: %s
        #     Image: %s
        #     '''%(title,url,img_url))

    def parse(self, response):  # called automatically with each downloaded response; do the parsing here
        # xpath selector version
        div_list = response.xpath('//div[contains(@class,"link-item")]')
        for div in div_list:
            title=div.xpath('.//a[contains(@class,"link-title")]/text()').extract_first()
            url=div.xpath('.//a[contains(@class,"link-title")]/@href').extract_first()
            img_url=div.xpath('.//*[contains(@class,"matching")]/@src').extract_first()
            id=div.xpath('.//a[contains(@class,"link-title")]/@data-id').extract_first()

            item = TttItem()
            item['id'] = id
            item['title'] = title
            item['url'] = url
            item['img_url'] = img_url

            yield item

items.py (remember to define the model class)

import scrapy


class TttItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    id=scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field()
    img_url = scrapy.Field()

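pipelines.py

# Save to a file
class TttPipeline(object):
    def open_spider(self, spider):
        # open the file once, when the spider starts
        self.f = open('a.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # reuse the handle from open_spider; reopening the file here in 'w' mode
        # would truncate it for every item
        # extract_first() returns None when nothing matched, hence the `or ''`
        self.f.write(item['title'] or '')
        self.f.write(item['url'] or '')
        self.f.write(item['img_url'] or '')
        self.f.write('\n')
        return item

    def close_spider(self, spider):
        # close the file once, when the spider finishes
        self.f.close()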


        
# Save to redis
from redis import Redis
import json
class ChoutiRedisPipeline(object):
    def open_spider(self, spider):
        self.conn = Redis(password='redis123')

    def close_spider(self, spider):
        pass

    def process_item(self, item, spider):
        s = json.dumps({'title': item['title'],
                        'url': item['url'],
                        'img_url': item['img_url']})
        self.conn.hset('chouti', item['id'], s)
        return item
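
Pipeline classes only need the three methods shown above, so they can be exercised without starting a crawl. As a hedged variation on the file pipeline, the sketch below writes one JSON object per line, which keeps the fields separable when reading the file back. JsonLinesPipeline and the default file name items.jl are assumptions made for this example, and a plain dict stands in for the item.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Writes each item as one JSON object per line."""

    def __init__(self, path="items.jl"):  # hypothetical default file name
        self.path = path

    def open_spider(self, spider):
        self.f = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # dict(item) works for plain dicts and for scrapy.Item objects alike
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.f.close()


# Offline check: call the hooks by hand, in the order a crawl would call them.
path = os.path.join(tempfile.mkdtemp(), "items.jl")
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item(
    {"id": "1", "title": "标题", "url": "https://example.com/1", "img_url": None},
    spider=None,
)
pipeline.close_spider(spider=None)

with open(path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
print(rows)
```

A missing image URL is stored as null here instead of needing special handling, and ensure_ascii=False keeps non-ASCII titles readable in the file.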

Passing parameters between Scrapy requests

1 Set: yield Request(url, callback=self.parser_detail, meta={'item': item})
2 Get: response.meta.get('item')

Example

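Improving crawler efficiency

All of this is done in the configuration file (Scrapy also ships a set of built-in default settings, which the project settings override):

1 Increase concurrency:
By default Scrapy performs at most 16 concurrent requests (these are asynchronous requests, not threads); the limit can be raised where appropriate. In the settings file, set CONCURRENT_REQUESTS = 100 to allow up to 100 concurrent requests.

2 Raise the log level:
Scrapy prints a large amount of log output while running. To reduce CPU usage, restrict the log output to INFO or ERROR. In the configuration file: LOG_LEVEL = 'INFO'

3 Disable cookies:
Unless cookies are really needed, disabling them while crawling reduces CPU usage and speeds up the crawl. In the configuration file: COOKIES_ENABLED = False

4 Disable retries:
Re-requesting failed HTTP requests (retrying) slows the crawl down, so retries can be disabled. In the configuration file: RETRY_ENABLED = False

5 Lower the download timeout:
When crawling very slow links, a shorter download timeout lets stuck requests be abandoned quickly, which improves throughput. In the configuration file: DOWNLOAD_TIMEOUT = 10 sets the timeout to 10 s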
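
Put together, the five settings above form the following settings.py fragment. The values are the ones quoted in this section, not recommendations; suitable numbers depend on the target site and on how much load it can reasonably take.

```python
# settings.py (fragment): the five efficiency settings from this section

CONCURRENT_REQUESTS = 100   # maximum number of concurrent requests
LOG_LEVEL = "INFO"          # or "ERROR" for even less log output
COOKIES_ENABLED = False     # skip cookie handling when the site does not need it
RETRY_ENABLED = False       # do not re-request failed pages
DOWNLOAD_TIMEOUT = 10       # seconds before a slow download is abandoned
```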


Original article: https://www.cnblogs.com/kai-/p/12674512.html