zoukankan      html  css  js  c++  java
  • 爬取报刊名称及地址

    目标:爬取全国报刊名称及地址

    链接:http://news.xinhuanet.com/zgjx/2007-09/13/content_6714741.htm

    目的:练习scrapy爬取数据

    学习过scrapy的基本使用方法后,我们开始写一个最简单的爬虫吧。

    目标截图:

      1、创建爬虫工程

    1
    2
    $ cd ~/code/crawler/scrapyProject
    $ scrapy startproject newSpapers

      2、创建爬虫程序

    1
    2
    $ cd newSpapers/
    $ scrapy genspider nationalNewspaper news.xinhuanet.com 

      3、配置数据爬取项 

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    $ cat items.py
    # -*- coding: utf-8 -*-
     
    # Define here the models for your scraped items
    #
    # See documentation in:
    # http://doc.scrapy.org/en/latest/topics/items.html
     
    import scrapy
     
     
    class NewspapersItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        name = scrapy.Field()
        addr = scrapy.Field()

     4、 配置爬虫程序

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    $ cat spiders/nationalNewspaper.py
    # -*- coding: utf-8 -*-
    import scrapy
    from newSpapers.items import NewspapersItem
     
    class NationalnewspaperSpider(scrapy.Spider):
        name = "nationalNewspaper"
        allowed_domains = ["news.xinhuanet.com"]
        start_urls = ['http://news.xinhuanet.com/zgjx/2007-09/13/content_6714741.htm']
     
        def parse(self, response):
            sub_country = response.xpath('//*[@id="Zoom"]/div/table/tbody/tr[2]')
            sub2_local = response.xpath('//*[@id="Zoom"]/div/table/tbody/tr[4]')
            tags_a_country = sub_country.xpath('./td/table/tbody/tr/td/p/a')
            items = []
            for each in tags_a_country:
                item = NewspapersItem()
                item['name'] = each.xpath('./strong/text()').extract()
                item['addr'] = each.xpath('./@href').extract()
                items.append(item)
            return items

      5、配置谁去处理爬取结果

    1
    2
    3
    4
    $ cat settings.py
    ……
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
    ITEM_PIPELINES = {'newSpapers.pipelines.NewspapersPipeline':100}

      6、配置数据处理程序

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    $ cat pipelines.py
    # -*- coding: utf-8 -*-
     
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
     
    import time
    class NewspapersPipeline(object):
        def process_item(self, item, spider):
            now = time.strftime('%Y-%m-%d',time.localtime())
            filename = 'newspaper.txt'
            print '================='
            print item
            print '================'
            with open(filename,'a'as fp:
                fp.write(item['name'][0].encode("utf8")+ ' ' +item['addr'][0].encode("utf8") + ' ')
            return item

      7、查看结果

    1
    2
    3
    4
    5
    6
    7
    $ cat spiders/newspaper.txt
    人民日报    http://paper.people.com.cn/rmrb/html/2007-09/20/node_17.htm
    海外版 http://paper.people.com.cn/rmrbhwb/html/2007-09/20/node_34.htm
    光明日报    http://www.gmw.cn/01gmrb/2007-09/20/default.htm
    经济日报    http://www.economicdaily.com.cn/no1/
    解放军报    http://www.gmw.cn/01gmrb/2007-09/20/default.htm
    中国日报    http://pub1.chinadaily.com.cn/cdpdf/cndy/
  • 相关阅读:
    java的构造方法 this 重载
    容器,组件,面板
    首先定义一个5X8的二维数组,然后使用随机数填充满。借助Arrays的方法对二维数组进行排序。
    首先创建一个长度是5的数组,并填充随机数。首先用选择法正排序,然后再对其使用冒泡法倒排序
    创建一个长度是5的数组,并填充随机数。使用for循环或者while循环,对这个数组实现反转效果
    寻找某两个数相除,其结果 离黄金分割点 0.618最近,分母和分子不能同时为偶数 * 分母和分子 取值范围在[1-20]
    密码的自动生成器:密码由大写字母/小写字母/数字组成,生成12位随机密码
    vue倒计时:天时分秒
    tbody设置超出固定的高度出现滚动条,没超出不显示。
    获取某个日期的当前周一的时间
  • 原文地址:https://www.cnblogs.com/HomeG/p/10527123.html
Copyright © 2011-2022 走看看