  • Scrapy crawler framework (1): multiple ways to persist data and crawling across multiple pages

    Linux: pip3 install scrapy

    Windows:

      a: pip3 install wheel

      b: download the Twisted high-performance asynchronous module from https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted

      c: cd into the download directory and run pip3 install Twisted-17.1---.whl (substitute the exact filename of the .whl you downloaded)

      d: pip3 install pywin32

      e: pip3 install scrapy

    Create a project: scrapy startproject <project name>

    Create a spider file: cd <project directory>

           scrapy genspider <spider name> www.baidu.com

    Run the spider: scrapy crawl <spider name> --nolog (a concrete example follows)
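
    For example, Case 1 below could be set up and run like this (the project name boosPro and the domain passed to genspider are assumptions; the spider name boos comes from the Case 1 code):

      scrapy startproject boosPro
      cd boosPro
      scrapy genspider boos www.zhipin.com
      scrapy crawl boos --nolog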

    The framework in brief: the spiders directory holds the spider files

           the items file defines the fields to be persisted, and is used together with the pipelines

           middlewares holds the downloader middlewares and spider middlewares

           the pipelines handle persistent storage; you can write several pipeline classes

           the settings file holds the project configuration
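
    For orientation, scrapy startproject generates roughly the following layout (shown here for an assumed project name boosPro):

      boosPro/
          scrapy.cfg            # project entry / deployment configuration
          boosPro/
              __init__.py
              items.py          # field definitions for the items to persist
              middlewares.py    # downloader and spider middlewares
              pipelines.py      # persistence pipelines
              settings.py       # project configuration
              spiders/          # spider files live here
                  __init__.py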

    settings configuration:

      add a User-Agent: USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'

      disable robots.txt compliance: ROBOTSTXT_OBEY = False

      for persistent storage, uncomment ITEM_PIPELINES; if there are several pipeline classes, register each one and give it a priority (see the sketch after this list)

      to use middlewares, uncomment the corresponding middleware settings
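
      A minimal sketch of the relevant settings.py entries, assuming the Case 1 project below is named boosPro (the module path is an assumption; the three pipeline classes are defined in the Case 1 pipeline code, and a lower number means that pipeline's process_item runs earlier):

      # settings.py (sketch)
      USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
      ROBOTSTXT_OBEY = False

      ITEM_PIPELINES = {
          'boosPro.pipelines.BoosproPipeline': 300,  # write to a local text file
          'boosPro.pipelines.MysqlPipeline': 301,    # write to MySQL
          'boosPro.pipelines.RedisPipeline': 302,    # write to Redis
      }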

    Case 1: scraping job postings from Boss Zhipin (zhipin.com). Key point: pipeline configuration for several kinds of persistent storage

    import scrapy
    from boosPro.items import BoosproItem  # adjust the module path to your project name


    class BoosSpider(scrapy.Spider):
        name = 'boos'
        # allowed_domains = ['www.baidu.com']
        start_urls = [
            'https://www.zhipin.com/job_detail/?query=python%E7%88%AC%E8%99%AB&scity=101010100&industry=&position=']

        def parse(self, response):
            # one <li> per job posting in the result list
            li_list = response.xpath('//div[@class="job-list"]/ul/li')
            for li in li_list:
                title = li.xpath('.//div[@class="info-primary"]/h3/a/div[@class="job-title"]/text()').extract_first()
                price = li.xpath('.//div[@class="info-primary"]/h3/a/span/text()').extract_first()
                company = li.xpath('.//div[@class="info-company"]/div/h3/a/text()').extract_first()

                # pack the fields into an Item and hand it to the pipelines
                item = BoosproItem()
                item['title'] = title
                item['price'] = price
                item['company'] = company
                yield item

      items.py configuration

    import scrapy


    class BoosproItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        title = scrapy.Field()
        price = scrapy.Field()
        company = scrapy.Field()
    

      pipelines.py configuration

    import json

    import pymysql
    from redis import Redis


    # Pipeline 1: write each item to a local text file
    class BoosproPipeline(object):
        fp = None

        def open_spider(self, spider):
            print('spider started')
            self.fp = open('./job.txt', 'w', encoding='utf-8')

        def process_item(self, item, spider):
            self.fp.write(item['title'] + '\t' + item['price'] + '\t' + item['company'] + '\n')
            return item  # pass the item on to the next pipeline

        def close_spider(self, spider):
            print('spider finished!!!')
            self.fp.close()
    
    
    # Pipeline 2: write each item to MySQL
    class MysqlPipeline(object):
        conn = None
        cursor = None

        def open_spider(self, spider):
            print('spider started')
            self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root', password='321', db='pa')

        def process_item(self, item, spider):
            self.cursor = self.conn.cursor()
            sql = 'insert into job values ("%s","%s","%s")' % (item['title'], item['price'], item['company'])
            try:
                self.cursor.execute(sql)
                self.conn.commit()
            except Exception as e:
                print(e)
                self.conn.rollback()
            return item

        def close_spider(self, spider):
            print('spider finished!!!')
            self.cursor.close()
            self.conn.close()
    
    
    # Pipeline 3: push each item onto a Redis list as JSON
    class RedisPipeline(object):
        conn = None

        def open_spider(self, spider):
            print('spider started')
            self.conn = Redis(host='127.0.0.1', port=6379, db=14)

        def process_item(self, item, spider):
            dic = {
                'title': item['title'],
                'price': item['price'],
                'company': item['company']
            }
            dic = json.dumps(dic, ensure_ascii=False)
            self.conn.lpush('jobinfo', dic)
            return item
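
    Note that every process_item above ends with return item; that return is what hands the item to the next pipeline in priority order, so dropping it would starve the later pipelines. MysqlPipeline also assumes that a job table already exists in the pa database; a minimal one-off sketch for creating it (column names follow the item fields, column types and lengths are assumptions):

    import pymysql

    # one-off script: create the table that MysqlPipeline inserts into
    conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root', password='321', db='pa')
    cursor = conn.cursor()
    cursor.execute(
        'CREATE TABLE IF NOT EXISTS job ('
        'title VARCHAR(255), '
        'price VARCHAR(64), '
        'company VARCHAR(255)'
        ') DEFAULT CHARSET=utf8mb4'
    )
    conn.commit()
    cursor.close()
    conn.close()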

    Case 2: crawling site data across multiple pages

    Spider file

    import scrapy
    from chouti.items import ChoutiItem


    class CtSpider(scrapy.Spider):
        name = 'ct'
        # allowed_domains = ['www.baidu.com']
        url = 'https://dig.chouti.com/r/scoff/hot/%d'  # page URL template
        page_num = 1
        start_urls = ['https://dig.chouti.com/r/scoff/hot/1']

        def parse(self, response):
            div_list = response.xpath('//div[@id="content-list"]/div')
            for div in div_list:
                head = div.xpath('./div[3]/div[1]/a/text()').extract_first()
                author = div.xpath('./div[3]/div[2]/a[4]/b/text()').extract_first()

                item = ChoutiItem()
                item['head'] = head
                item['author'] = author

                yield item

            # manual pagination: request the next page and reuse parse as the callback
            if self.page_num < 5:
                self.page_num += 1
                new_url = self.url % self.page_num
                yield scrapy.Request(url=new_url, callback=self.parse)

    items.py configuration

    import scrapy
    
    
    class ChoutiItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        head = scrapy.Field()
        author = scrapy.Field()

    pipelines.py configuration

    # Simply print each item to verify that all pages were crawled
    class ChoutiproPipeline(object):
        def process_item(self, item, spider):
            print(item['head'], item['author'])
            return item
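
    To run Case 2, the pipeline still has to be registered in settings.py as in Case 1 (the module path below assumes it matches the chouti package used in the spider's import), after which scrapy crawl ct --nolog prints the headline and author from the first five pages:

    ITEM_PIPELINES = {
        'chouti.pipelines.ChoutiproPipeline': 300,
    }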
    

      
