  • [scrapy] Example: scraping jobbole pages

    Project overview:

    Create the project

    scrapy startproject ArticleSpider
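
    For orientation (this tree is the standard scaffold that startproject
    generates; it did not appear in the original post):

    ArticleSpider/
    ├── scrapy.cfg            # deploy configuration
    └── ArticleSpider/        # the project's Python package
        ├── __init__.py
        ├── items.py          # item definitions
        ├── middlewares.py    # spider/downloader middlewares
        ├── pipelines.py      # item pipelines
        ├── settings.py       # project settings
        └── spiders/
            └── __init__.py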

    Create the spider

    cd ArticleSpider/ArticleSpider/spiders/
    Create a new file named jobbole.py:
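
    As an aside (not in the original post), Scrapy can also generate this
    spider skeleton for you; genspider takes the spider name and the domain:

    scrapy genspider jobbole blog.jobbole.com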
    
    
    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.http import Request
    from urllib import parse
    import re
    
    from ArticleSpider.items import ArticlespiderItem
    
    
    class JobboleSpider(scrapy.Spider):
        name = 'jobbole'
        allowed_domains = ['blog.jobbole.com']
        start_urls = ['http://blog.jobbole.com/all-posts/']  # download http://blog.jobbole.com/all-posts/ first, then hand the response to parse()
    
        def parse(self, response):
    
            # 1. Scrapy downloads the page in start_urls and passes the response to
            #    parse(). post_urls collects every article URL on that listing page,
            #    and each one is downloaded via Request with callback=self.parse_detail.
            # 2. Once every article URL on the current page has been yielded, next_url
            #    picks up the next listing page and yields a Request with
            #    callback=self.parse, so the whole cycle repeats on page two, and so on.

            # Collect all article URLs on the current listing page and hand each
            # downloaded article page to parse_detail.
            post_urls = response.css("#archive .floated-thumb .post-thumb a::attr(href)").extract()
            for post_url in post_urls:
                yield Request(url=parse.urljoin(response.url, post_url), callback=self.parse_detail)
    
            # Extract the next listing page's URL, download it, and feed it back to parse().
            next_url = response.css(".next.page-numbers::attr(href)").extract_first("")
            if next_url:
                yield Request(url=next_url, callback=self.parse)
    
        def parse_detail(self, response):
            title = response.css('.entry-header h1::text').extract()[0]
            create_date = response.css("p.entry-meta-hide-on-mobile::text").extract()[0]
            praise_nums = response.css(".vote-post-up h10::text").extract()[0]
            fav_nums = response.css(".bookmark-btn::text").extract()[0]
            # fav_nums is text such as " 2 收藏"; pull the number out of it
            match_re = re.match(r".*?(\d+).*", fav_nums)
            if match_re:
                fav_nums = int(match_re.group(1))
            else:
                fav_nums = 0
            comment_nums = response.css("a[href='#article-comment'] span::text").extract()[0]
            match_re = re.match(r".*?(\d+).*", comment_nums)
            if match_re:
                comment_nums = int(match_re.group(1))
            else:
                comment_nums = 0
            item = ArticlespiderItem()  # instantiate the item
            item['name'] = title        # the item's name field holds this title
            yield item                  # hand the item over to the item pipeline

            print(title, create_date, praise_nums, fav_nums, comment_nums)
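
    To run the spider, execute "scrapy crawl jobbole" from the project root. A
    common convenience (my addition, not part of the original post) is a small
    main.py next to scrapy.cfg, so the spider can be started and debugged from
    an IDE:

    # main.py -- lives at the project root, next to scrapy.cfg
    import os
    import sys

    from scrapy.cmdline import execute

    # Make the ArticleSpider package importable when launched from an IDE.
    sys.path.append(os.path.dirname(os.path.abspath(__file__)))

    # Equivalent to running "scrapy crawl jobbole" on the command line.
    execute(["scrapy", "crawl", "jobbole"])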

    items.py

    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # http://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    class ArticlespiderItem(scrapy.Item):
        # define the fields for your item here like:
        name = scrapy.Field()
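
    The spider also extracts the date, vote, bookmark, and comment counts, but
    the item above only defines a name field. A fuller item covering everything
    parse_detail collects could look like this sketch (the extra field names
    are my own, not from the original post):

    import scrapy


    class ArticlespiderItem(scrapy.Item):
        name = scrapy.Field()          # article title
        create_date = scrapy.Field()   # publication date (hypothetical field)
        praise_nums = scrapy.Field()   # up-vote count (hypothetical field)
        fav_nums = scrapy.Field()      # bookmark count (hypothetical field)
        comment_nums = scrapy.Field()  # comment count (hypothetical field)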
    
        

    pipelines.py

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
    
    
    
    class ArticlespiderPipeline(object):
        def process_item(self, item, spider):
            # Append each article title to a text file, one title per line.
            with open("my_meiju.txt", 'a') as fp:
                fp.write(item['name'] + '\n')
            return item
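
    As the generated comment notes, the pipeline only runs once it is
    registered in ITEM_PIPELINES. A minimal settings.py entry looks like this
    (300 is just a priority; lower values run earlier):

    ITEM_PIPELINES = {
        'ArticleSpider.pipelines.ArticlespiderPipeline': 300,
    }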
    
  • Original post: https://www.cnblogs.com/chadiandianwenrou/p/8038391.html