  • Scrapy full-site data crawling

    Scrapy Installation

    • Linux
    1. pip install scrapy
    • Windows
    1. pip install wheel
    2. Download the Twisted wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
    3. cd into the directory of the file downloaded in step 2, then: pip install <downloaded filename>
    4. pip install pywin32
    5. pip install scrapy
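
    To confirm the installation, a quick check from Python (a minimal sketch; Scrapy exposes its version string as scrapy.__version__):

    import scrapy
    print(scrapy.__version__)   # prints the installed Scrapy version, e.g. 1.6.0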

    Creating the Project and the Spider File

    • Create a new project

      scrapy startproject crawPro

    • Create a new spider file

      Change into the project directory: cd crawPro

      scrapy genspider -t crawl craw5i5j www.xxx.com   # www.xxx.com is the start URL; it gets replaced/commented out in the spider file later (the generated skeleton is sketched below)
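
      The crawl template generates a spider skeleton roughly like the following (exact contents may differ slightly between Scrapy versions); the next section fills it in:

    # crawPro/spiders/craw5i5j.py (as generated, before editing)
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class Craw5i5jSpider(CrawlSpider):
        name = 'craw5i5j'
        allowed_domains = ['www.xxx.com']
        start_urls = ['http://www.xxx.com/']

        rules = (
            Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            item = {}
            return item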

     Writing the Spider File

    • Comment out the placeholder start URL pieces: in the generated file, allowed_domains is commented out and start_urls is pointed at the real site.
    • Because the URL of the first listing page differs from that of page 2 onward, a second rule (link extractor) is added; the link extractors use regular expressions (see the quick regex check after the spider code).
    • Parameter follow=True means "keep following": links extracted from every crawled page are fed back through the rules, so all page numbers are reached.
    • Parameter callback='parse_item': the response of every matched URL is handed to the parse_item method for parsing.
    • The response XPath calls return Selector objects; use extract_first() to take the first value and extract() to take all values (returned as a list).
    • items.py must define the item fields, see the code below.
    • Import the item: from crawPro.items import CrawproItem; instantiate item = CrawproItem() and load the parsed values into it.
    • Finally, yield item to pass the item on (into the item pipeline).
    • craw5i5j.py code:
    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor   # link extractor
    from scrapy.spiders import CrawlSpider, Rule    # Rule: pairs a link extractor with a callback
    from crawPro.items import CrawproItem
    
    class Craw5i5jSpider(CrawlSpider):
        name = 'craw5i5j'
        # allowed_domains = ['www.xxx.com']
        start_urls = ['https://nj.5i5j.com/xiaoqu/pukouqu/']
        # Link extractor: with follow=False it only extracts the links matching the pattern
        # from the pages at the start URLs.
        # The allow parameter is a regular expression.
        link = LinkExtractor(allow=r'^https://nj.5i5j.com/xiaoqu/pukouqu/n\d+/$')
        link1 = LinkExtractor(allow=r'^https://nj.5i5j.com/xiaoqu/pukouqu/$')
    
        rules = (
            # Rule object: the LinkExtractor instance plus the callback that parses each matched page
            Rule(link, callback='parse_item', follow=True),
            Rule(link1, callback='parse_item', follow=True),
        )
    
        def parse_item(self, response):
    
            for li in response.xpath('//div[@class="list-con-box"]/ul/li'):
                xq_name = li.xpath(".//h3[@class='listTit']/a/text()").extract_first().strip()
                xq_chengjiao = li.xpath(".//div[@class='listX']/p/span[1]/a/text()").extract_first().strip()
                xq_danjia = li.xpath(".//div[@class='listX']/div[@class='jia']/p[@class='redC']//text()").extract_first().strip()
                xq_zongjia =li.xpath(".//div[@class='listX']/div[@class='jia']/p[2]/text()").extract_first().strip()
    
                item = CrawproItem()
                item['xq_name'] = xq_name
                item['xq_chengjiao'] = xq_chengjiao
                item['xq_danjia'] = xq_danjia
                item['xq_zongjia'] = xq_zongjia
    
                yield item
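    As a quick sanity check on the two allow patterns, a minimal sketch using Python's re module (the pagination URL with n2 is an assumption based on the site's URL scheme):

    import re

    first_page = 'https://nj.5i5j.com/xiaoqu/pukouqu/'
    later_page = 'https://nj.5i5j.com/xiaoqu/pukouqu/n2/'

    print(bool(re.match(r'^https://nj.5i5j.com/xiaoqu/pukouqu/$', first_page)))       # True  -> handled by link1
    print(bool(re.match(r'^https://nj.5i5j.com/xiaoqu/pukouqu/n\d+/$', later_page)))  # True  -> handled by link
    print(bool(re.match(r'^https://nj.5i5j.com/xiaoqu/pukouqu/n\d+/$', first_page)))  # False -> why a second rule is needed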
    • items.py code:
    import scrapy
    
    class CrawproItem(scrapy.Item):
        # define the fields for your item here like:
        xq_name = scrapy.Field()
        xq_chengjiao = scrapy.Field()
        xq_danjia = scrapy.Field()
        xq_zongjia = scrapy.Field()

      Writing the Pipeline File

    • Override the parent method def open_spider(self, spider): it opens the output file exactly once instead of reopening it for every item (for a database, this is where the connection would be opened).
    • Override the parent method def close_spider(self, spider): it closes the file opened in open_spider (for a database, close the connection here).
    • Method process_item: formats each item and writes it to the file (or performs the database insert).
    • Code (a database variant is sketched after this block):
    class CrawproPipeline(object):
        fp = None

        # Overridden from the parent class: called only once, opens the output file.
        def open_spider(self, spider):
            self.fp = open("1.txt", 'w', encoding='utf-8')

        def process_item(self, item, spider):
            self.fp.write(item["xq_name"] + "\t" + item["xq_chengjiao"] + "\t" +
                          item["xq_danjia"] + "\t" + item["xq_zongjia"] + "\n")
            return item

        def close_spider(self, spider):
            self.fp.close()
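    For the database case mentioned above, the same pattern applies: open the connection in open_spider, insert in process_item, close in close_spider. A minimal sketch using pymysql (the connection parameters and the xiaoqu table/columns are hypothetical and would have to exist in your database):

    import pymysql


    class MysqlPipeline(object):
        conn = None
        cursor = None

        def open_spider(self, spider):
            # Open the database connection once (hypothetical credentials and database name).
            self.conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                                        password='123456', database='spider', charset='utf8mb4')
            self.cursor = self.conn.cursor()

        def process_item(self, item, spider):
            # Hypothetical table `xiaoqu` with four varchar columns.
            sql = 'insert into xiaoqu (name, chengjiao, danjia, zongjia) values (%s, %s, %s, %s)'
            try:
                self.cursor.execute(sql, (item['xq_name'], item['xq_chengjiao'],
                                          item['xq_danjia'], item['xq_zongjia']))
                self.conn.commit()
            except Exception:
                self.conn.rollback()
            return item

        def close_spider(self, spider):
            self.cursor.close()
            self.conn.close()

    If used, register it in ITEM_PIPELINES alongside (or instead of) CrawproPipeline.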

     Setting Up the Middleware

    • Set up a user-agent pool.
    • In middlewares.py, find the method def process_request(self, request, spider): and pick a random User-Agent there (this requires import random at the top of middlewares.py).
        def process_request(self, request, spider):
            # Called for each request that goes through the downloader middleware.
            # Must either:
            # - return None: continue processing this request
            # - or return a Response object
            # - or return a Request object
            # - or raise IgnoreRequest: process_exception() methods of
            #   installed downloader middleware will be called
            user_agents = [
                'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
                'Opera/8.0 (Windows NT 5.1; U; en)',
                'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
                'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50',
                'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
                'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
                'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
                'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
                'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
                'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
                'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
                'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
                'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11',
                'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
                'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
                'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
                'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)',
            ]
            # Pick a random User-Agent for each request (requires: import random).
            request.headers['User-Agent'] = random.choice(user_agents)
            # print(request.headers)
            return None

     Configuring settings.py

    • ROBOTSTXT_OBEY = False: do not obey the robots.txt protocol.
    • Set the user agent: USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
    • Set a download delay suited to the target site so the crawl is not too fast: DOWNLOAD_DELAY = 3 (a consolidated settings.py excerpt is sketched at the end of this section).
    • Enable the downloader middleware:
    DOWNLOADER_MIDDLEWARES = {
       'crawPro.middlewares.CrawproDownloaderMiddleware': 543,
    }
    • Enable the item pipeline:
    ITEM_PIPELINES = {
       'crawPro.pipelines.CrawproPipeline': 300,
    }
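    Putting it together, the relevant part of settings.py looks roughly like this (only the lines discussed above; everything else keeps its generated default):

    # settings.py (relevant excerpt)
    USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
    ROBOTSTXT_OBEY = False
    DOWNLOAD_DELAY = 3

    DOWNLOADER_MIDDLEWARES = {
       'crawPro.middlewares.CrawproDownloaderMiddleware': 543,
    }

    ITEM_PIPELINES = {
       'crawPro.pipelines.CrawproPipeline': 300,
    }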

     Running the Spider

    • Run from the command line: scrapy crawl craw5i5j --nolog (suppresses the log output)
    • Run from the command line: scrapy crawl craw5i5j (with log output)
    • After the run, check that the output file was created (or that rows showed up in the database).
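
    The spider can also be launched from a small Python script instead of the shell, via scrapy.cmdline (a convenience sketch; the file name start.py is arbitrary):

    # start.py, placed in the project root next to scrapy.cfg
    from scrapy import cmdline

    cmdline.execute('scrapy crawl craw5i5j'.split())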