zoukankan      html  css  js  c++  java
  • 爬虫之scrapy框架

    概述

    scrapy是为了爬取网站数据提取数据而写的框架,内置了多功能,通用性强,容易学习的一个爬虫框架

    安装scrapy

    pip install scrapy -i https://pypi.douban.com/simple

    在window中安装scrapy需要安装twisted和pywin32,安装twisted,在http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted下载,然后通过cmd进入目录输入命令pip install Twisted-19.2.1-cp36-cp36m-win_amd64.whl安装,安装pywin32之间pip install pywin32 -i https://pypi.couban.com/simple

    使用

    创建项目:通过命令行输入scrapy startproject 项目名

     创建普通爬虫文件:cd到项目中scrapy genspider 爬虫文件名  初始url(随便设置,后面可以自己在爬虫文件中设置)

     启动普通爬虫文件项目:scrapy crawl 文件名

     启动项目是有可选参数  --nolog是取消查看日志信息, -o 文件名  是将结果输出到自定义文件名中,一般情况下不用

    项目创建有以下目录所示

    └─myproject
        │  items.py                # 定义提交到管道的属性
        │  middlewares.py      # 中间件文件
        │  pipelines.py           # 管道文件
        │  settings.py            # 配置文件__init__.py
        │
        ├─spiders                  # 爬虫文件夹
        │  │  project.py         # 自定义创建的爬虫文件
        │  │  __init__.py

    各个文件的使用

    settings.py(之挑选了经常使用的配置)

    # UA伪装字段
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'
    
    # 日志显示配置
    LOG_LEVEL = 'INFO'
    
    # robots协议配置
    ROBOTSTXT_OBEY = False
    
    #线程数
    CONCURRENT_REQUESTS = 32
    
    #开启中间件
    DOWNLOADER_MIDDLEWARES = {
       'wangyi.middlewares.WangyiDownloaderMiddleware': 543,
    }
    
    #管道开启
    ITEM_PIPELINES = {
       'wangyi.pipelines.WangyiPipeline': 300,
    }

    爬虫文件

    # -*- coding: utf-8 -*-
    import scrapy
    
    
    class ProjectSpider(scrapy.Spider):
        name = 'project'
        # 允许爬虫的网站,一般情况下注释
        # allowed_domains = ['www.xxx.com']
        # 看是爬虫的初始url
        start_urls = ['http://www.xxx.com/']
    
        def parse(self, response):
            '''
            访问url后的回调函数
            :param response: 响应对象
            :return:
            '''
            pass

    items.py

    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://docs.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    class MyprojectItem(scrapy.Item):
        '''
        提交管道字段,用scrapy.Field()定义即可
        '''
        # define the fields for your item here like:
        # name = scrapy.Field()
        pass

    middlewares.py

    由于在中间件中有自动生成的两个类,这里只介绍其中一个常用的类,且只介绍里面常用的方法常用的

    # -*- coding: utf-8 -*-
    
    # Define here the models for your spider middleware
    #
    # See documentation in:
    # https://docs.scrapy.org/en/latest/topics/spider-middleware.html
    
    from scrapy import signals
    
    # 一般情况下不适用该各类
    class MyprojectSpiderMiddleware(object):
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the spider middleware does not modify the
        # passed objects.
    
        @classmethod
        def from_crawler(cls, crawler):
            # This method is used by Scrapy to create your spiders.
            s = cls()
            crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
            return s
    
        def process_spider_input(self, response, spider):
            # Called for each response that goes through the spider
            # middleware and into the spider.
    
            # Should return None or raise an exception.
            return None
    
        def process_spider_output(self, response, result, spider):
            # Called with the results returned from the Spider, after
            # it has processed the response.
    
            # Must return an iterable of Request, dict or Item objects.
            for i in result:
                yield i
    
        def process_spider_exception(self, response, exception, spider):
            # Called when a spider or process_spider_input() method
            # (from other spider middleware) raises an exception.
    
            # Should return either None or an iterable of Request, dict
            # or Item objects.
            pass
    
        def process_start_requests(self, start_requests, spider):
            # Called with the start requests of the spider, and works
            # similarly to the process_spider_output() method, except
            # that it doesn’t have a response associated.
    
            # Must return only requests (not items).
            for r in start_requests:
                yield r
    
        def spider_opened(self, spider):
            spider.logger.info('Spider opened: %s' % spider.name)
    
    
    class MyprojectDownloaderMiddleware(object):
    
        def process_request(self, request, spider):
            '''
            拦截所用能够正常访问的请求,即请求前经过治理
            :param request: 请求对象
            :param spider: 爬虫文件中的类实例化的对象,可以调用其中的属性和方法
            :return:
            '''
            return None
    
        def process_response(self, request, response, spider):
            '''
            拦截所有响应,可以在这类对响应对象进行处理
            :param request: 请求对象
            :param response: 响应对象
            :param spider: 爬虫文件中的类实例化的对象,可以调用其中的属性和方法
            :return:
            '''
            return response
    
        def process_exception(self, request, exception, spider):
            '''
            拦截所用能够异常访问的请求,即请求前经过治理
            :param request: 请求对象
            :param spider: 爬虫文件中的类实例化的对象,可以调用其中的属性和方法
            :return:
            '''
            pass
    View Code

    pipelines.py

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    
    
    class MyprojectPipeline(object):
        def process_item(self, item, spider):
            # 计较过来的数据通过item接受,可以理解为字典,字典的键是items.py种定义的字段,值是爬虫文件里提交过来的数据
            return item

     

     

  • 相关阅读:
    小组最终答辩
    机器学习的安全隐私
    关于Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks的理解
    第十六讲-对抗样本与对抗训练3
    对抗样本机器学习_Note1_机器学习
    对抗样本机器学习_cleverhans_FGSM/JSMA
    实验四:Tensorflow实现了四个对抗图像制作算法--readme
    实验一拓展文献阅读—反向传播计算图上的微积分
    tf.placeholder 与 tf.Variable
    Robust Adversarial Examples_鲁棒的对抗样本
  • 原文地址:https://www.cnblogs.com/mark--ping/p/11773864.html
Copyright © 2011-2022 走看看