  • Crawler Part 3: A Comprehensive Case Study

    (1) Introduction

    (2) Project setup and a first version of the spider

    NetEase News homepage: https://news.163.com/

    The focus is on scraping the text of five sections: Domestic (国内), International (国际), Military (军事), Aviation (航空), and Drones (无人机).

    Requirement: scrape the text-based news data.

    Three steps in total.

    Step 1:

    Create the project:
    scrapy startproject wangyiPro
    cd wangyiPro/

    Create the spider file:
    scrapy genspider wangyi www.xxxx.com
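
    For reference, the project layout generated by these two commands should look roughly like this (the chromedriver binary used later is added by hand):

    wangyiPro/
        scrapy.cfg
        wangyiPro/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py
                wangyi.py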

    Step 2: organize the data structure and write the spider file

    wangyi.py

    import scrapy
    
    
    class WangyiSpider(scrapy.Spider):
        name = 'wangyi'
        # allowed_domains = ['www.xxx.com']
        start_urls = ['https://news.163.com/']
    
        def parse(self, response):
            lis=response.xpath('//div[@class="ns_area list"]/ul/li')
            indexs=[3,4,6,7,8]
            li_list=[]  # the li tags corresponding to the five sections: Domestic, International, Military, Aviation, Drones
            for index in indexs:
                li_list.append(lis[index])
            # get the link and text title of each of the five sections
            for li in li_list:
                url=li.xpath('./a/@href').extract_first()
                title=li.xpath('./a/text()').extract_first()
                print(url+":"+title)

    Add the UA and robots settings in settings.py.

    Note: when only a small amount of data is being scraped, these settings can be omitted on some sites, and this example happens to be one of them.
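
    For reference, these are the two entries involved in settings.py (the UA string is just one example of a common browser UA; it is the same one used in the later listings):

    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
    ROBOTSTXT_OBEY = False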

    Step 3: run

    scrapy crawl wangyi --nolog

    Result:

    (3) Most of the time goes into parsing the data

    # -*- coding: utf-8 -*-
    import scrapy
    
    
    class WangyiSpider(scrapy.Spider):
        name = 'wangyi'
        # allowed_domains = ['www.xxx.com']
        start_urls = ['https://news.163.com/']
    
        def parse(self, response):
            lis=response.xpath('//div[@class="ns_area list"]/ul/li')
            indexs=[3,4,6,7,8]
            li_list=[]  # the li tags corresponding to the five sections: Domestic, International, Military, Aviation, Drones
            for index in indexs:
                li_list.append(lis[index])
            # get the link and text title of each of the five sections
            for li in li_list:
                url=li.xpath('./a/@href').extract_first()
                title=li.xpath('./a/text()').extract_first()
                # print(url+":"+title)
    
                # request each section's url to get its page data (title, thumbnail, keywords, publish time, url)
                yield scrapy.Request(url=url,callback=self.parseSecond)
        def parseSecond(self,response):
            div_list=response.xpath('//div[@class="data_row news_article clearfix"]')
            print(len(div_list))
            for div in div_list:
                head=div.xpath('.//div[@class="news_title"]/h3/a/text()').extract_first()
                url=div.xpath('.//div[@class="news_title"]/h3/a/@href').extract_first()
                imgUrl=div.xpath('./a/img/@src').extract_first()
                publish_t=div.xpath('.//div[@class="news_tag"]/span/text()').extract_first()
                tag=div.xpath('.//div[@class="keywords"]/a/text()').extract()
                tag="".join(tag)

    The dynamically loaded data cannot be fetched this way, so Selenium is used.
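
    A quick way to confirm that the section pages are rendered by JavaScript is to fetch the raw HTML outside a browser and check whether the article markup is there at all. A minimal sketch, assuming the requests library is available and reusing the "data_row news_article" class from the XPath above:

    import requests

    raw = requests.get('http://news.163.com/domestic/',
                       headers={'User-Agent': 'Mozilla/5.0'}).text
    # If the article list is loaded dynamically, the class used in the XPath above
    # is absent from the static HTML, which is why div_list comes back empty.
    print('data_row news_article' in raw)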

    (4) First use of Selenium

    # -*- coding: utf-8 -*-
    import scrapy
    from selenium import webdriver  # step 1: import the package
    
    # step 2: instantiate the browser, and make sure it is instantiated only once
    
    class WangyiSpider(scrapy.Spider):
        name = 'wangyi'
        # allowed_domains = ['www.xxx.com']
        start_urls = ['https://news.163.com/']
        def __init__(self):
            # step 2: instantiate a browser object (only once)
            self.bro=webdriver.Chrome(executable_path='./chromedriver')
        def closed(self,spider):
            # final step: the browser must only be closed after the whole crawl has finished
            print('crawl finished')
            self.bro.quit()
    
        def parse(self, response):
            lis=response.xpath('//div[@class="ns_area list"]/ul/li')
            indexs=[3,4,6,7,8]
            li_list=[]  # the li tags corresponding to the five sections: Domestic, International, Military, Aviation, Drones
            for index in indexs:
                li_list.append(lis[index])
            # get the link and text title of each of the five sections
            for li in li_list:
                url=li.xpath('./a/@href').extract_first()
                title=li.xpath('./a/text()').extract_first()
                # print(url+":"+title)
    
                # request each section's url to get its page data (title, thumbnail, keywords, publish time, url)
                yield scrapy.Request(url=url,callback=self.parseSecond)
        def parseSecond(self,response):
            div_list=response.xpath('//div[@class="data_row news_article clearfix"]')
            print(len(div_list))
            for div in div_list:
                head=div.xpath('.//div[@class="news_title"]/h3/a/text()').extract_first()
                url=div.xpath('.//div[@class="news_title"]/h3/a/@href').extract_first()
                imgUrl=div.xpath('./a/img/@src').extract_first()
                publish_t=div.xpath('.//div[@class="news_tag"]/span/text()').extract_first()
                tag=div.xpath('.//div[@class="keywords"]/a/text()').extract()
                tag="".join(tag)

    (5) Configuring Selenium in the downloader middleware

    Spider file wangyi.py

    # -*- coding: utf-8 -*-
    import scrapy
    from selenium import webdriver  # step 1: import the package
    
    # step 2: instantiate the browser, and make sure it is instantiated only once
    
    class WangyiSpider(scrapy.Spider):
        name = 'wangyi'
        # allowed_domains = ['www.xxx.com']
        start_urls = ['https://news.163.com/']
        def __init__(self):
            # step 2: instantiate a browser object (only once)
            self.bro=webdriver.Chrome(executable_path='./wangyiPro/chromedriver.exe')
        def closed(self,spider):
            # final step: the browser must only be closed after the whole crawl has finished
            print('crawl finished')
            self.bro.quit()
    
        def parse(self, response):
            lis=response.xpath('//div[@class="ns_area list"]/ul/li')
            indexs=[3,4,6,7,8]
            li_list=[]  # the li tags corresponding to the five sections: Domestic, International, Military, Aviation, Drones
            for index in indexs:
                li_list.append(lis[index])
            # get the link and text title of each of the five sections
            for li in li_list:
                url=li.xpath('./a/@href').extract_first()
                title=li.xpath('./a/text()').extract_first()
                # print(url+":"+title)
    
                # request each section's url to get its page data (title, thumbnail, keywords, publish time, url)
                yield scrapy.Request(url=url,callback=self.parseSecond)
        def parseSecond(self,response):
            div_list=response.xpath('//div[@class="data_row news_article clearfix"]')
            # print(len(div_list))
            for div in div_list:
                head=div.xpath('.//div[@class="news_title"]/h3/a/text()').extract_first()
                url=div.xpath('.//div[@class="news_title"]/h3/a/@href').extract_first()
                imgUrl=div.xpath('./a/img/@src').extract_first()
                publish_t=div.xpath('.//div[@class="news_tag"]/span/text()').extract_first()
                tag=div.xpath('.//div[@class="keywords"]/a/text()').extract()
                tag="".join(tag)

    middlewares.py

    from scrapy import signals
    
    class WangyiproDownloaderMiddleware(object):
    
        def process_request(self, request, spider):
    
            return None
        # intercept response objects (the responses the downloader passes back to the Spider)
        # request: the request object that produced this response
        # response: the intercepted response object
        # spider: the instance of the spider class defined in the spider file
        def process_response(self, request, response, spider):
            print(request.url + " this is the downloader middleware")
    
            return response

    Enable the UA, robots, and downloader-middleware settings

    settings.py

    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    
    DOWNLOADER_MIDDLEWARES = {
       'wangyiPro.middlewares.WangyiproDownloaderMiddleware': 543,
    }

    (6) The NetEase spider wangyi.py

    # -*- coding: utf-8 -*-
    import scrapy
    from selenium import webdriver  # step 1: import the package
    
    # step 2: instantiate the browser, and make sure it is instantiated only once
    
    class WangyiSpider(scrapy.Spider):
        name = 'wangyi'
        # allowed_domains = ['www.xxx.com']
        start_urls = ['https://news.163.com/']
    
        def __init__(self):
            # step 2: instantiate a browser object (only once)
            self.bro=webdriver.Chrome(executable_path='./wangyiPro/chromedriver.exe')
            # urls = []
        def closed(self,spider):
            # final step: the browser must only be closed after the whole crawl has finished
            print('crawl finished')
            self.bro.quit()
    
        def parse(self, response):
            lis=response.xpath('//div[@class="ns_area list"]/ul/li')
            indexs=[3,4,6,7,]
            li_list=[]  # the li tags for the four sections kept here: Domestic, International, Military, Aviation (the Drones section is laid out differently, see the middleware note below)
            # global urls
            for index in indexs:
                li_list.append(lis[index])
            # get the link and text title of each of the four sections
            for li in li_list:
                url=li.xpath('./a/@href').extract_first()
                # urls.append(url)
                title=li.xpath('./a/text()').extract_first()
                # print(url+":"+title)
                print(url)
                # request each section's url to get its page data (title, thumbnail, keywords, publish time, url)
                yield scrapy.Request(url=url,callback=self.parseSecond)
        def parseSecond(self,response):
            div_list=response.xpath('/html/body/div/div[3]/div[4]/div[1]/div/div/ul/li/div/div')
            print(len(div_list))
            for div in div_list:
                head=div.xpath('.//div[@class="news_title"]/h3/a/text()').extract_first()
                url=div.xpath('.//div[@class="news_title"]/h3/a/@href').extract_first()
                imgUrl=div.xpath('./a/img/@src').extract_first()
                publish_t=div.xpath('.//div[@class="news_tag"]/span/text()').extract_first()
                tag=div.xpath('.//div[@class="keywords"]/a/text()').extract()
                tag="".join(tag)

    settings.py

    DOWNLOADER_MIDDLEWARES = {
       'wangyiPro.middlewares.WangyiproDownloaderMiddleware': 543,
    }
    LOG_LEVEL = 'ERROR'
    
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False

    middlewares.py

    from scrapy import signals
    from time import sleep
    from scrapy.http import HtmlResponse
    class WangyiproDownloaderMiddleware(object):
    
        def process_request(self, request, spider):
            return None
        # intercept response objects (the responses the downloader passes back to the Spider)
        # request: the request object that produced this response
        # response: the intercepted response object
        # spider: the instance of the spider class defined in the spider file
        def process_response(self, request, response, spider):
            # print(request.url + " this is the downloader middleware")
            # replace the page data stored in the response object
            # spider.urls could be collected up front; the remaining issues are the http vs https mismatch and the fact that the Drones section needs its own parsing
            if request.url in['http://news.163.com/domestic/','http://news.163.com/world/','http://war.163.com/','http://news.163.com/air/']:
            # if request.url in spider.urls:
            #     print('this is process_response!!!!!!!!!!!!!!!!!!!!!!!!!1')
                spider.bro.get(url=request.url)
                sleep(2)
                # this page source now includes the dynamically loaded news data
                page_text=spider.bro.page_source
    
                return HtmlResponse(url=spider.bro.current_url,body=page_text,encoding='utf-8',request=request)
                # a new response object is instantiated and returned in place of the original
            else:
                return response
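
    The commented-out lines above (urls = [] in the spider and request.url in spider.urls here) hint at collecting the section URLs on the spider instead of hard-coding them. A minimal sketch of that idea, assuming the collected URLs match request.url exactly (the note above warns that http vs https mismatches can break the comparison): in the spider, set self.urls = [] in __init__ and call self.urls.append(url) in parse() before yielding each section request; the middleware check then becomes:

    def process_response(self, request, response, spider):
        if request.url in spider.urls:  # parse() runs before these responses arrive, so the list is already filled
            spider.bro.get(url=request.url)
            sleep(2)
            page_text = spider.bro.page_source
            return HtmlResponse(url=spider.bro.current_url, body=page_text,
                                encoding='utf-8', request=request)
        return response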

    (7) Scrolling down and printing the parsed content (this still has an issue and needs further testing)

    Spider file wangyi.py

    # -*- coding: utf-8 -*-
    import scrapy
    from selenium import webdriver  # step 1: import the package
    
    # step 2: instantiate the browser, and make sure it is instantiated only once
    
    class WangyiSpider(scrapy.Spider):
        name = 'wangyi'
        # allowed_domains = ['www.xxx.com']
        start_urls = ['https://news.163.com/']
    
        def __init__(self):
            # step 2: instantiate a browser object (only once)
            self.bro=webdriver.Chrome(executable_path='./wangyiPro/chromedriver.exe')
            # urls = []
        def closed(self,spider):
            # final step: the browser must only be closed after the whole crawl has finished
            print('crawl finished')
            self.bro.quit()
    
        def parse(self, response):
            lis=response.xpath('//div[@class="ns_area list"]/ul/li')
            indexs=[3,4,6,7,]
            li_list=[]  # the li tags for the four sections kept here: Domestic, International, Military, Aviation (the Drones section is laid out differently)
            # global urls
            for index in indexs:
                li_list.append(lis[index])
            # get the link and text title of each of the four sections
            for li in li_list:
                url=li.xpath('./a/@href').extract_first()
                # urls.append(url)
                title=li.xpath('./a/text()').extract_first()
                # print(url+":"+title)
                print(url)
                # request each section's url to get its page data (title, thumbnail, keywords, publish time, url)
                yield scrapy.Request(url=url,callback=self.parseSecond)
        def parseSecond(self,response):
            div_list=response.xpath('/html/body/div/div[3]/div[4]/div[1]/div/div/ul/li/div/div')
            print(len(div_list))
            for div in div_list:
                head=div.xpath('.//div[@class="news_title"]/h3/a/text()').extract_first()
                url=div.xpath('.//div[@class="news_title"]/h3/a/@href').extract_first()
                imgUrl=div.xpath('./a/img/@src').extract_first()
                publish_t=div.xpath('.//div[@class="news_tag"]/span/text()').extract_first()
                tag=div.xpath('.//div[@class="keywords"]/a/text()').extract()
                tags=[]
                for t in tag:
                    t=t.strip('\n \t')  # strip newlines, spaces and tabs
                    tags.append(t)
                tag="".join(tags)
                # print(head+":"+url+":"+imgUrl+":"+tag)

    settings.py

    # -*- coding: utf-8 -*-
    
    # Scrapy settings for wangyiPro project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://doc.scrapy.org/en/latest/topics/settings.html
    #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'wangyiPro'
    
    SPIDER_MODULES = ['wangyiPro.spiders']
    NEWSPIDER_MODULE = 'wangyiPro.spiders'
    
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'wangyiPro (+http://www.yourdomain.com)'
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}
    
    # Enable or disable spider middlewares
    # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'wangyiPro.middlewares.WangyiproSpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    DOWNLOADER_MIDDLEWARES = {
       'wangyiPro.middlewares.WangyiproDownloaderMiddleware': 543,
    }
    LOG_LEVEL = 'ERROR'
    # Enable or disable extensions
    # See https://doc.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    #ITEM_PIPELINES = {
    #    'wangyiPro.pipelines.WangyiproPipeline': 300,
    #}
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

    middlewares.py

    # -*- coding: utf-8 -*-
    
    # Define here the models for your spider middleware
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    
    from scrapy import signals
    from time import sleep
    from scrapy.http import HtmlResponse
    class WangyiproDownloaderMiddleware(object):
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the downloader middleware does not modify the
        # passed objects.
    
    
        def process_request(self, request, spider):
            # Called for each request that goes through the downloader
            # middleware.
    
            # Must either:
            # - return None: continue processing this request
            # - or return a Response object
            # - or return a Request object
            # - or raise IgnoreRequest: process_exception() methods of
            #   installed downloader middleware will be called
            return None
        # intercept response objects (the responses the downloader passes back to the Spider)
        # request: the request object that produced this response
        # response: the intercepted response object
        # spider: the instance of the spider class defined in the spider file
        def process_response(self, request, response, spider):
            # print(request.url + " this is the downloader middleware")
            # replace the page data stored in the response object; spider.urls could be collected up front
            if request.url in['http://news.163.com/domestic/','http://news.163.com/world/','http://war.163.com/','http://news.163.com/air/']:
            # if request.url in spider.urls:
            #     print('this is process_response!!!!!!!!!!!!!!!!!!!!!!!!!1')
                spider.bro.get(url=request.url)
                sleep(2)
    
                # js='window.scrollTo(0,document.body.scrollHeight)'
                # spider.bro.execute_script(js)
                # sleep(2)  # buffer time: the browser needs a moment to load the extra data
                # this page source now includes the dynamically loaded news data
    
                page_text=spider.bro.page_source
    
                return HtmlResponse(url=spider.bro.current_url,body=page_text,encoding='utf-8',request=request)
                # a new response object is instantiated and returned in place of the original
            else:
                return response
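
    If the scroll step sketched in the commented-out lines above is needed, it would be enabled roughly like this, right before page_source is read (the 2-second sleep is an arbitrary buffer):

    js = 'window.scrollTo(0, document.body.scrollHeight)'  # scroll to the bottom to trigger lazy loading
    spider.bro.execute_script(js)
    sleep(2)  # give the browser time to render the extra news items
    page_text = spider.bro.page_source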

    items.py

    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    class WangyiproItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        pass

    pipelines.py

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    
    
    class WangyiproPipeline(object):
        def process_item(self, item, spider):
            # print(item['title']+':'+item['content'])
            return item

    Run:

    scrapy crawl wangyi --nolog

    (8) Adding the item pipeline, passing items along, passing parameters via meta, and extracting the article content

    Spider file wangyi.py

    # -*- coding: utf-8 -*-
    import scrapy
    from selenium import webdriver  # step 1: import the package
    from wangyiPro.items import WangyiproItem
    # step 2: instantiate the browser, and make sure it is instantiated only once
    
    class WangyiSpider(scrapy.Spider):
        name = 'wangyi'
        # allowed_domains = ['www.xxx.com']
        start_urls = ['https://news.163.com/']
    
        def __init__(self):
            # step 2: instantiate a browser object (only once)
            self.bro=webdriver.Chrome(executable_path='./wangyiPro/chromedriver.exe')
            # urls = []
        def closed(self,spider):
            # final step: the browser must only be closed after the whole crawl has finished
            print('crawl finished')
            self.bro.quit()
    
        def parse(self, response):
            lis=response.xpath('//div[@class="ns_area list"]/ul/li')
            indexs=[3,4,6,7,]
            li_list=[]  # the li tags for the four sections kept here: Domestic, International, Military, Aviation (the Drones section is laid out differently)
            # global urls
            for index in indexs:
                li_list.append(lis[index])
            # get the link and text title of each of the four sections
            for li in li_list:
                url=li.xpath('./a/@href').extract_first()
                # urls.append(url)
                title=li.xpath('./a/text()').extract_first()
                # print(url+":"+title)
                print(url)
                # request each section's url to get its page data (title, thumbnail, keywords, publish time, url)
                yield scrapy.Request(url=url,callback=self.parseSecond,meta={'title':title})  # pass parameters along with the request
        def parseSecond(self,response):
            div_list=response.xpath('/html/body/div/div[3]/div[4]/div[1]/div/div/ul/li/div/div')
            print(len(div_list))
            for div in div_list:
                head=div.xpath('.//div[@class="news_title"]/h3/a/text()').extract_first()
                url=div.xpath('.//div[@class="news_title"]/h3/a/@href').extract_first()
                imgUrl=div.xpath('./a/img/@src').extract_first()
                publish_t=div.xpath('.//div[@class="news_tag"]/span/text()').extract_first()
                tag=div.xpath('.//div[@class="keywords"]/a/text()').extract()
                tags=[]
                for t in tag:
                    t=t.strip('\n \t')  # strip newlines, spaces and tabs
                    tags.append(t)
                tag="".join(tags)
    
                # get the title value passed along via meta
                title=response.meta['title']
                # instantiate an item object and store the parsed values in it
                item =WangyiproItem()
                item['head']=head
                item['url']=url
                item['imgUrl']=imgUrl
                item['publish_t']=publish_t
                item['tag']=tag
                item['title']=title
    
                # request the article url to get the news content stored on that page
                yield scrapy.Request(url=url, callback=self.getContent, meta={'item': item})
                # print(head+":"+url+":"+imgUrl+":"+tag)
        def getContent(self,response):
            # get the item passed along via meta
            item=response.meta['item']
    
            # parse the news content stored on the current page
            content_list=response.xpath('//div[@class="post_text"]/p/text()').extract()
            content="".join(content_list)
            item['content']=content
            yield item

    items.py

    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    class WangyiproItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        # pass
        head = scrapy.Field()
        url = scrapy.Field()
        imgUrl = scrapy.Field()
        publish_t = scrapy.Field()
        tag = scrapy.Field()
        title = scrapy.Field()
        content = scrapy.Field()

    pipelines.py

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    
    
    class WangyiproPipeline(object):
        def process_item(self, item, spider):
            print(item['title']+':'+item['content'])
            return item

    settings.py

    # -*- coding: utf-8 -*-
    
    # Scrapy settings for wangyiPro project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://doc.scrapy.org/en/latest/topics/settings.html
    #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'wangyiPro'
    
    SPIDER_MODULES = ['wangyiPro.spiders']
    NEWSPIDER_MODULE = 'wangyiPro.spiders'
    
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'wangyiPro (+http://www.yourdomain.com)'
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}
    
    # Enable or disable spider middlewares
    # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'wangyiPro.middlewares.WangyiproSpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    DOWNLOADER_MIDDLEWARES = {
       'wangyiPro.middlewares.WangyiproDownloaderMiddleware': 543,
    }
    LOG_LEVEL = 'ERROR'
    # Enable or disable extensions
    # See https://doc.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
       'wangyiPro.pipelines.WangyiproPipeline': 300,
    }
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

    middlewares.py

    # -*- coding: utf-8 -*-
    
    # Define here the models for your spider middleware
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    
    from scrapy import signals
    from time import sleep
    from scrapy.http import HtmlResponse
    class WangyiproDownloaderMiddleware(object):
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the downloader middleware does not modify the
        # passed objects.
    
    
        def process_request(self, request, spider):
            # Called for each request that goes through the downloader
            # middleware.
    
            # Must either:
            # - return None: continue processing this request
            # - or return a Response object
            # - or return a Request object
            # - or raise IgnoreRequest: process_exception() methods of
            #   installed downloader middleware will be called
            return None
        # intercept response objects (the responses the downloader passes back to the Spider)
        # request: the request object that produced this response
        # response: the intercepted response object
        # spider: the instance of the spider class defined in the spider file
        def process_response(self, request, response, spider):
            # print(request.url + " this is the downloader middleware")
            # replace the page data stored in the response object; spider.urls could be collected up front
            if request.url in['http://news.163.com/domestic/','http://news.163.com/world/','http://war.163.com/','http://news.163.com/air/']:
            # if request.url in spider.urls:
            #     print('this is process_response!!!!!!!!!!!!!!!!!!!!!!!!!1')
                spider.bro.get(url=request.url)
                sleep(2)
    
                # js='window.scrollTo(0,document.body.scrollHeight)'
                # spider.bro.execute_script(js)
                # sleep(2)  # buffer time: the browser needs a moment to load the extra data
                # this page source now includes the dynamically loaded news data
    
                page_text=spider.bro.page_source
    
                return HtmlResponse(url=spider.bro.current_url,body=page_text,encoding='utf-8',request=request)
                # a new response object is instantiated and returned in place of the original
            else:
                return response

    The work mainly comes down to writing five files:

    wangyi.py (the spider file)

    middlewares.py

    settings.py

    items.py

    pipelines.py

