  • A Scrapy example

    1. Goal:

      Scrapy is a web-crawling framework. This post walks through the steps of using it with one simple example.

    2. Create a Scrapy project:

      Create a project named firstSpider with the following command:

    scrapy startproject firstSpider 
    [jianglexing@cstudio ~]$ scrapy startproject firstSpider 
    New Scrapy project 'firstSpider', using template directory '/usr/local/python-3.6.2/lib/python3.6/site-packages/scrapy/templates/project', created in:
        /home/jianglexing/firstSpider
    
    You can start your first spider with:
        cd firstSpider
        scrapy genspider example example.com

      

    3. What the scrapy command does when it creates a project:

      It creates a directory named after the project and populates it with a handful of files:

    [jianglexing@cstudio ~]$ tree firstSpider/
    firstSpider/
    ├── firstSpider
    │   ├── __init__.py
    │   ├── items.py
    │   ├── middlewares.py
    │   ├── pipelines.py
    │   ├── __pycache__
    │   ├── settings.py
    │   └── spiders
    │       ├── __init__.py
    │       └── __pycache__
    └── scrapy.cfg
    
    4 directories, 7 files
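
      Of the generated files, settings.py is the one most worth a first look. A minimal sketch of its active lines, assuming they match the "Overridden settings" that Scrapy logs in step 7 (the real file also carries many commented-out options, omitted here):

```python
# firstSpider/settings.py (sketch; values taken from the "Overridden settings" log line in step 7)

# Name of the bot; it also appears in the "started (bot: ...)" log line
BOT_NAME = 'firstSpider'

# Packages where Scrapy looks for spider classes
SPIDER_MODULES = ['firstSpider.spiders']

# Package where `scrapy genspider` writes new spider files
NEWSPIDER_MODULE = 'firstSpider.spiders'

# Fetch and respect robots.txt before crawling a site
ROBOTSTXT_OBEY = True
```

      With ROBOTSTXT_OBEY = True, Scrapy requests robots.txt before the first page, which is why the run log in step 7 shows a robots.txt request ahead of the home page.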

    4. Change into the project directory and create a spider:

    [jianglexing@cstudio ~]$ cd firstSpider/
    [jianglexing@cstudio firstSpider]$ scrapy genspider financeSpider www.financedatas.com
    Created spider 'financeSpider' using template 'basic' in module:
      firstSpider.spiders.financeSpider

    5. Each spider in a Scrapy project corresponds to one file:

    [jianglexing@cstudio firstSpider]$ tree ./
    ./
    ├── firstSpider
    │   ├── __init__.py
    │   ├── items.py
    │   ├── middlewares.py
    │   ├── pipelines.py
    │   ├── __pycache__
    │   │   ├── __init__.cpython-36.pyc
    │   │   └── settings.cpython-36.pyc
    │   ├── settings.py
    │   └── spiders
    │       ├── financeSpider.py    # the spider file that was just created
    │       ├── __init__.py
    │       └── __pycache__
    │           └── __init__.cpython-36.pyc
    └── scrapy.cfg

    6. Write the spider's processing logic:

      As an example, scrape the title of the home page of http://www.financedatas.com

    # -*- coding: utf-8 -*-
    import scrapy
    
    
    class FinancespiderSpider(scrapy.Spider):
        name = 'financeSpider'
        allowed_domains = ['www.financedatas.com']
        start_urls = ['http://www.financedatas.com/']
    
        def parse(self, response):
            """Write the processing logic in the parse method."""
            print('*'*64)
            title = response.xpath('//title/text()').extract()  # extract data with an XPath expression
            print(title)
            print('*'*64)

    7. Run the spider and check the result:

    [jianglexing@cstudio spiders]$ scrapy crawl financeSpider
    2017-08-10 16:11:38 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: firstSpider)
    2017-08-10 16:11:38 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'firstSpider', 'NEWSPIDER_MODULE': 'firstSpider.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['firstSpider.spiders']}
    .... ....
    2017-08-10 16:11:39 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://www.financedatas.com/robots.txt> (referer: None)
    2017-08-10 16:11:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.financedatas.com/> (referer: None)
    ****************************************************************
    ['欢迎来到 www.financedatas.com']   # this is the extracted data
    ****************************************************************
    2017-08-10 16:11:39 [scrapy.core.engine] INFO: Spider closed (finished)

  • Original article: https://www.cnblogs.com/JiangLe/p/7339902.html