Scrapy crawler diary: creating a project, extracting data, and saving the data as JSON

      After installing scrapy, I am sure you are itching to build a crawler of your own. I am no exception, so below I record in detail the steps needed to set up a scrapy project. If you have not installed scrapy yet, or the installation left you with a headache and no idea where to start, see the earlier article 安装python爬虫scrapy踩过的那些坑和编程外的思考 (the pitfalls of installing the scrapy crawler, and some thoughts beyond programming). I will use cnblogs (博客园) as the example: crawl the blog post list and save it to a JSON file.

    Environment: CentOS 6.0 virtual machine

      scrapy (if it is not installed yet, see 安装python爬虫scrapy踩过的那些坑和编程外的思考)

    1. Create the project cnblogs

    [root@bogon share]# scrapy startproject cnblogs
    2015-06-10 15:45:03 [scrapy] INFO: Scrapy 1.0.0rc2 started (bot: scrapybot)
    2015-06-10 15:45:03 [scrapy] INFO: Optional features available: ssl, http11
    2015-06-10 15:45:03 [scrapy] INFO: Overridden settings: {}
    New Scrapy project 'cnblogs' created in:
        /mnt/hgfs/share/cnblogs
    
    You can start your first spider with:
        cd cnblogs
        scrapy genspider example example.com
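
    (As the startproject output hints, scrapy can also scaffold a spider for you. Writing the file by hand, as in step 4 below, works just as well; but if you prefer, something like the following would generate a CrawlSpider skeleton to start from. The spider name here is only a suggestion, and the generated file would still need the edits shown in step 4.)

    [root@bogon share]# cd cnblogs
    [root@bogon cnblogs]# scrapy genspider -t crawl cnblogs_spider cnblogs.com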

    2. Look at the project structure

    [root@bogon share]# tree cnblogs/
    cnblogs/
    ├── cnblogs
    │   ├── __init__.py
    │   ├── items.py #defines the structure of the data to extract
    │   ├── pipelines.py #processes the extracted data
    │   ├── settings.py #crawler settings file
    │   └── spiders
    │       └── __init__.py
    └── scrapy.cfg #project configuration file

    3. Define the structure of the data to extract from cnblogs: edit items.py

    Here we extract four pieces of content:

    • post title
    • post URL
    • URL of the list page the post appears on
    • summary (description)
    [root@bogon cnblogs]# vi cnblogs/items.py
    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # http://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    class CnblogsItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        title = scrapy.Field()
        link = scrapy.Field()
        desc = scrapy.Field()
        listUrl = scrapy.Field()
        pass
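
    A CnblogsItem works much like a Python dict: fields are set and read with the usual key syntax. The following snippet is purely illustrative (it is not part of the project files) and just shows how the item will be used inside the spider later on:

    # illustrative only: an Item is assigned and read like a dict
    from cnblogs.items import CnblogsItem

    item = CnblogsItem()
    item['title'] = u'a post title'
    item['link'] = u'http://www.cnblogs.com/rwxwsblog/'
    print(dict(item))  # -> {'title': u'a post title', 'link': u'http://www.cnblogs.com/rwxwsblog/'}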

    4. Create the spider

    [root@bogon cnblogs]# vi cnblogs/spiders/cnblogs_spider.py
    
    #coding=utf-8
    import re
    import json
    from scrapy.selector import Selector
    try:
        from scrapy.spider import Spider
    except:
        from scrapy.spider import BaseSpider as Spider
    from scrapy.utils.response import get_base_url
    from scrapy.utils.url import urljoin_rfc
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor as sle
    from cnblogs.items import *
    
    class CnblogsSpider(CrawlSpider):
        #name of the crawler
        name = "CnblogsSpider"
        #domains the crawler is allowed to visit; URLs outside this list are dropped
        allowed_domains = ["cnblogs.com"]
        #entry URLs for the crawl
        start_urls = [
            "http://www.cnblogs.com/rwxwsblog/default.html?page=1"
        ]
        # rules for which URLs to crawl, with parse_item as the callback
        rules = [
            Rule(sle(allow=("/rwxwsblog/default.html\?page=\d{1,}")), #note the "?" here: when copying the URL into the regex, the "?" must be escaped
                             follow=True,
                             callback='parse_item')
        ]
        #print "**********CnblogsSpider**********"
        #the callback function
        #extract the data into Items, mainly using XPath and CSS selectors
        def parse_item(self, response):
            #print "-----------------"
            items = []
            sel = Selector(response)
            base_url = get_base_url(response)
            postTitle = sel.css('div.day div.postTitle')
            #print "=============length======="
            postCon = sel.css('div.postCon div.c_b_p_desc')
            #the title, url and description markup is only loosely structured; this could be improved later
            for index in range(len(postTitle)):
                item = CnblogsItem()
                item['title'] = postTitle[index].css("a").xpath('text()').extract()[0]
                #print item['title'] + "***************\n"
                item['link'] = postTitle[index].css('a').xpath('@href').extract()[0]
                item['listUrl'] = base_url
                item['desc'] = postCon[index].xpath('text()').extract()[0]
                #print base_url + "********\n"
                items.append(item)
                #print repr(item).decode("unicode-escape") + '\n'
            return items
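
    Before running the full crawl, the CSS/XPath expressions used in parse_item can be tried out interactively with scrapy shell. This is an illustrative session, not a required step:

    [root@bogon cnblogs]# scrapy shell "http://www.cnblogs.com/rwxwsblog/default.html?page=1"
    >>> response.css('div.day div.postTitle a::text').extract()        # post titles
    >>> response.css('div.day div.postTitle a::attr(href)').extract()  # post links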

    Note:

      The first line must be set to #coding=utf-8 or # -*- coding: utf-8 -*-, otherwise you will get an error like:

    SyntaxError: Non-ASCII character '\xe5' in file /mnt/hgfs/share/cnblogs/cnblogs/spiders/cnblogs_spider.py on line 15, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

      The spider's name is CnblogsSpider; we will need it later.
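
      A further note for readers on newer Scrapy releases (this example uses 1.0.0rc2): the scrapy.contrib modules and the Sgml link extractor imported at the top of the spider were later deprecated and removed. On a recent Scrapy, the rough equivalents would be:

    # rough equivalents of the deprecated imports, assuming a newer Scrapy (>= 1.1)
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor as sle  # replaces SgmlLinkExtractor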

    5. Edit the pipelines.py file

    [root@bogon cnblogs]# vi cnblogs/pipelines.py
    
    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
    
    from scrapy import signals
    import json
    import codecs
    class JsonWithEncodingCnblogsPipeline(object):
        def __init__(self):
            self.file = codecs.open('cnblogs.json', 'w', encoding='utf-8')
        def process_item(self, item, spider):
            line = json.dumps(dict(item), ensure_ascii=False) + "\n"
            self.file.write(line)
            return item
        def close_spider(self, spider): #Scrapy calls close_spider() on pipeline components when the spider finishes
            self.file.close()

    Note that the class name is JsonWithEncodingCnblogsPipeline; it will be referenced in settings.py.
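
    As an aside (not part of the original workflow): Scrapy also has built-in feed exports, so a command like the one below would write a JSON file without any custom pipeline. In this Scrapy version, though, the built-in JSON exporter escapes non-ASCII characters as \uXXXX sequences, which is exactly what ensure_ascii=False in the pipeline above avoids, keeping the Chinese text readable.

    [root@bogon cnblogs]# scrapy crawl CnblogsSpider -o export.json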

    6. Edit settings.py and add the following two settings (300 is the pipeline's order value; lower numbers run earlier)

    ITEM_PIPELINES = {
        'cnblogs.pipelines.JsonWithEncodingCnblogsPipeline': 300,
    }
    LOG_LEVEL = 'INFO'

    7. Run the spider: scrapy crawl <spider name> (the name defined in cnblogs_spider.py)

    [root@bogon cnblogs]# scrapy crawl CnblogsSpider

    8. Check the result with more cnblogs.json (the file name defined in pipelines.py)

    more cnblogs.json 

    9. If needed, the result can be converted to plain text; see another article, python将json格式的数据转换成文本格式的数据或sql文件 (converting JSON data into text or SQL files). A minimal conversion sketch follows below.
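
    Since the pipeline writes one JSON object per line, a minimal conversion sketch could look like the following. The output file name, the tab separator and the field order are my own assumptions for illustration, not taken from that article:

    # -*- coding: utf-8 -*-
    # Sketch: read the JSON-lines file written by the pipeline and dump a
    # tab-separated text file. Adjust file names and fields as needed.
    import json
    import codecs

    src = codecs.open('cnblogs.json', 'r', encoding='utf-8')
    dst = codecs.open('cnblogs.txt', 'w', encoding='utf-8')
    for line in src:
        line = line.strip()
        if not line:
            continue
        post = json.loads(line)
        dst.write(u'\t'.join([post['title'], post['link'], post['desc']]) + u'\n')
    src.close()
    dst.close()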

    The source code can be downloaded here: https://github.com/jackgitgz/CnblogsSpider

    10. You may still have a question: can we save the data directly into a database? The answer is yes; the following articles will cover this one step at a time, so stay tuned.

    References:

      http://doc.scrapy.org/en/master/

      http://blog.csdn.net/HanTangSongMing/article/details/24454453

Original article: https://www.cnblogs.com/rwxwsblog/p/4567052.html