  • First experience with scrapy

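    Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python. It crawls websites and extracts structured data from their pages, and it is widely used for data mining, monitoring, and automated testing. Installing scrapy is a little fiddly, but if you follow the steps below you should be able to install and use it without much trouble.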

    Install Python 2.7

    scrapy 1.0.3 currently supports only Python 2.7

    # wget https://www.python.org/ftp/python/2.7.6/Python-2.7.6.tgz

    [root@rocket software]# tar -zxvf Python-2.7.6.tgz    # extract

    [root@rocket software]# cd Python-2.7.6

    [root@rocket Python-2.7.6]# mkdir /usr/local/python27   # create the install directory

    [root@rocket Python-2.7.6]# ./configure --prefix=/usr/local/python27

    [root@rocket Python-2.7.6]# make

    [root@rocket Python-2.7.6]# make install

    # The system currently runs 2.6; replace it with 2.7

    [root@rocket Python-2.7.6]# mv /usr/bin/python /usr/bin/python2.6.6

    [root@rocket Python-2.7.6]# ln -s /usr/local/python27/bin/python /usr/bin/python

    Note that the system originally shipped with Python 2.6, so once /usr/bin/python points at 2.7, yum stops working


    yum has to be changed so that it keeps using the Python 2.6 interpreter
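The screenshots of this step are not reproduced here, so the exact edit is an assumption. The usual way to do it is to change the interpreter line at the top of /usr/bin/yum so that it names the old binary (/usr/bin/python2.6.6 after the mv above). A minimal sketch of that edit, applied to a made-up string rather than the real file:

```python
def repoint_shebang(script_text, interpreter):
    """Return script_text with its first-line shebang pointing at interpreter.

    Text without a shebang on the first line is returned unchanged.
    """
    lines = script_text.split("\n")
    if lines and lines[0].startswith("#!"):
        lines[0] = "#!" + interpreter
    return "\n".join(lines)


# Hypothetical stand-in for the first lines of /usr/bin/yum
sample = "#!/usr/bin/python\nimport sys\n"
print(repoint_shebang(sample, "/usr/bin/python2.6.6"))
```

To apply it for real you would read /usr/bin/yum, pass its text through the function, and write the result back as root. Editing the first line by hand in an editor does the same job.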


    Install setuptools

    Download the package from the official site and extract it

    https://pypi.python.org/pypi/setuptools#downloads

    [root@rocket software]# cd setuptools-18.1

    [root@rocket setuptools-18.1]# python setup.py install

    Install pip

    Download the package from the official site and extract it

    https://pypi.python.org/pypi/pip#downloads

    [root@rocket software]# cd pip-7.1.2

    [root@rocket pip-7.1.2]# python setup.py install

    Install Twisted

    Download the package and extract it

    wget https://pypi.python.org/packages/source/T/Twisted/Twisted-15.4.0.tar.bz2

    [root@rocket software]# cd Twisted-15.4.0

    [root@rocket Twisted-15.4.0]# python setup.py install

    Install scrapy

    pip install scrapy

    I ran into the following problems along the way

     

    1 pip prints a warning when installing modules: InsecurePlatformWarning: A true SSLContext object is not available.

    yum install python-devel libffi-devel openssl-devel

    pip install pyopenssl ndg-httpsclient pyasn1

    After that, running pip again no longer produces the warning
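The warning comes from the interpreter's ssl module lacking an SSLContext class, which is the case for Python 2 releases before 2.7.9 (including the 2.7.6 built here). A quick check, using only the standard library:

```python
import ssl
import sys


def has_true_sslcontext():
    """Report whether this interpreter's ssl module provides SSLContext."""
    return hasattr(ssl, "SSLContext")


print("Python", sys.version.split()[0], "SSLContext available:", has_true_sslcontext())
```

If it reports False, the pyopenssl, ndg-httpsclient, and pyasn1 packages installed above are what let pip fall back to a proper TLS implementation.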

     

    2 Installing lxml fails

    The fix is to install the libxslt development package first:

    yum install libxslt-devel

    Confirm the installation succeeded

    [root@rocket software]# rpm -qa | grep libxml

    libxml2-devel-2.7.6-20.el6.x86_64

    libxml2-python-2.7.6-20.el6.x86_64

    libxml2-2.7.6-20.el6.x86_64

     

    3 Installing cffi fails

    [root@rocket software]# yum -y install libffi-devel

    [root@rocket software]# rpm -qa | grep libffi

    libffi-3.0.5-3.2.el6.x86_64

    libffi-devel-3.0.5-3.2.el6.x86_64

     

    4 Installing openssl fails

    [root@rocket software]# yum -y install openssl-devel

    [root@rocket software]# rpm -qa | grep openssl

    openssl-devel-1.0.1e-42.el6.x86_64

    openssl-1.0.1e-42.el6.x86_64

     

    After fixing the problems above, run the install again

    pip install scrapy

    This time the installation completes successfully.

    Confirm the installation succeeded

    [root@rocket Twisted-15.4.0]# python

    Python 2.7.6 (default, Oct 27 2015, 01:21:45)

    [GCC 4.4.7 20120313 (Red Hat 4.4.7-16)] on linux2

    Type "help", "copyright", "credits" or "license" for more information.

    >>> import scrapy

    No error, so the installation succeeded.
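The same check can be run for scrapy and the dependencies that caused trouble above in one go. This is a small sketch, not part of scrapy. The module names are the usual import names for these packages (pyOpenSSL is imported as OpenSSL).

```python
import importlib


def importable(names):
    """Map each module name to True if it imports cleanly, else False."""
    result = {}
    for name in names:
        try:
            importlib.import_module(name)
            result[name] = True
        except Exception:
            # Any failure during import means the module is not usable here
            result[name] = False
    return result


status = importable(["scrapy", "twisted", "lxml", "cffi", "OpenSSL"])
for name in sorted(status):
    print("%-8s %s" % (name, "ok" if status[name] else "MISSING"))
```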

    Starting the first scrapy task

    For a detailed introduction, see

    http://scrapy-chs.readthedocs.org/zh_CN/latest/intro/overview.html

     

    [root@rocket scrapy]# scrapy startproject mininova

    Running scrapy commands at this point reported an error. Note that they must be run from inside the mininova project directory, otherwise they fail
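The reason, as far as I understand it, is that scrapy recognises a project by the scrapy.cfg file that startproject creates, looking in the current directory and then in its parents. The sketch below imitates that lookup with the standard library only. find_scrapy_cfg is a hypothetical helper, not a scrapy API, and the directory layout is a temporary stand-in for a real project.

```python
import os
import tempfile


def find_scrapy_cfg(start="."):
    """Walk upward from start and return the first scrapy.cfg found, else None."""
    path = os.path.abspath(start)
    while True:
        candidate = os.path.join(path, "scrapy.cfg")
        if os.path.isfile(candidate):
            return candidate
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root
            return None
        path = parent


# Build a throwaway layout: <root>/scrapy.cfg and <root>/mininova/spiders/
root = os.path.abspath(tempfile.mkdtemp())
spiders_dir = os.path.join(root, "mininova", "spiders")
os.makedirs(spiders_dir)
open(os.path.join(root, "scrapy.cfg"), "w").close()

found = find_scrapy_cfg(spiders_dir)
print(found)
```

Calling find_scrapy_cfg() with no argument from your shell's working directory tells you whether scrapy crawl would be able to see the project from there.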

     

    Write items.py

    import scrapy
    
    class MininovaItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        url = scrapy.Field()
        name = scrapy.Field()
        description = scrapy.Field()
        size = scrapy.Field()

    Write spiders/mininova_spiders.py

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    from mininova.items import MininovaItem
    
    class MininovaSpider(CrawlSpider):
        name = 'mininova'
        allowed_domains = ['mininova.org']
        start_urls = ['http://www.mininova.org/today']
        # Follow links such as /tor/12345 and hand each page to parse_torrent
        rules = [Rule(LinkExtractor(allow=[r'/tor/\d+']), 'parse_torrent')]
    
        def parse_torrent(self, response):
            # extract() returns a list of matching strings for each XPath
            torrent = MininovaItem()
            torrent['url'] = response.url
            torrent['name'] = response.xpath("//h1/text()").extract()
            torrent['description'] = response.xpath("//div[@id='description']").extract()
            torrent['size'] = response.xpath("//div[@id='info-left']/p[2]/text()[2]").extract()
            return torrent
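Each entry in allow is a regular expression tested against the link URL, so the digit class needs its backslash: r'/tor/\d+'. Without it, '/tor/d+' only matches a literal letter d. The sketch below shows the difference using the re module directly. The URLs are made-up examples, and is_torrent_link is a hypothetical helper rather than anything scrapy provides.

```python
import re

ALLOW = r"/tor/\d+"    # one or more digits after /tor/
BROKEN = r"/tor/d+"    # backslash lost: one or more letter "d"


def is_torrent_link(url, pattern=ALLOW):
    """Return True if pattern is found anywhere in url."""
    return re.search(pattern, url) is not None


# Hypothetical URLs for illustration
urls = [
    "http://www.mininova.org/tor/2676093",
    "http://www.mininova.org/today",
]
for url in urls:
    print(url, is_torrent_link(url), is_torrent_link(url, BROKEN))
```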

    Run

    [root@rocket mininova]# pwd

    /home/demo/scrapy/mininova

    [root@rocket mininova]# scrapy crawl mininova -o scraped_data.json

    The crawl fails because this Python build has no sqlite3 support. Install the sqlite-devel library, then recompile and reinstall Python

    [root@rocket software]# yum install sqlite-devel

    [root@rocket Python-2.7.6]# ./configure --prefix=/usr/local/python27

    [root@rocket Python-2.7.6]# make

    [root@rocket Python-2.7.6]# make install

    Now the sqlite3 library can be found

    [root@rocket software]# cd /usr/local/python27/lib/python2.7/lib-dynload/

    [root@rocket lib-dynload]# ll|grep sql

    -rwxr-xr-x. 1 root root 240971 Oct 28 01:17 _sqlite3.so
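Besides looking for _sqlite3.so on disk, you can confirm from Python itself that the rebuilt interpreter has a working sqlite3 module. A minimal check against an in-memory database:

```python
import sqlite3


def sqlite_library_version():
    """Open an in-memory database and return the SQLite library version string."""
    conn = sqlite3.connect(":memory:")
    try:
        row = conn.execute("SELECT sqlite_version()").fetchone()
    finally:
        conn.close()
    return row[0]


print("sqlite3 module OK, SQLite library version", sqlite_library_version())
```

If the import on the first line raises ImportError, Python was built before sqlite-devel was installed and needs to be recompiled as shown above.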

     

    [root@rocket mininova]# scrapy crawl mininova -o scraped_data.json

    It finally runs.
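Because the spider stores the result of extract() directly, every field except url ends up in scraped_data.json as a list of strings. The sketch below shows one way to tidy that up after the crawl. The sample record is invented to match the shape the spider would produce, and flatten_item is a hypothetical helper.

```python
import json


def flatten_item(item):
    """Replace each list value with its first element (stripped), or None if empty."""
    flat = {}
    for key, value in item.items():
        if isinstance(value, list):
            value = value[0].strip() if value else None
        flat[key] = value
    return flat


# Made-up sample in the shape of one scraped record
sample = json.loads(
    '[{"url": "http://www.mininova.org/tor/1",'
    ' "name": ["Example torrent"],'
    ' "description": [],'
    ' "size": [" 150 MB "]}]'
)
cleaned = [flatten_item(item) for item in sample]
print(cleaned)
```

With a real crawl you would load the file instead, for example with json.load(open("scraped_data.json")), and pass each record through the same function.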

     

    Next, we will look more closely at how scrapy works and give more practical examples.

     

  • Original article: https://www.cnblogs.com/linuxbug/p/4923970.html