  • Installing Scrapy on CentOS

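    Scrapy is an open-source, single-machine crawler written in Python on top of the Twisted framework. It bundles most of the tools needed for web scraping, covering both the download side and the extraction side of a crawler.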

    Environment:

    CentOS 5.4
    Python 2.7.3

    Installation steps:

    1. Download and build Python 2.7: http://www.python.org/ftp/python/2.7.3/Python-2.7.3.tgz

    [root@zxy-websgs ~]# wget http://www.python.org/ftp/python/2.7.3/Python-2.7.3.tgz -P /opt
    
    [root@zxy-websgs opt]# tar xvf Python-2.7.3.tgz 
    
    [root@zxy-websgs Python-2.7.3]# ./configure 
    
    [root@zxy-websgs Python-2.7.3]# make && make install

    Verify the Python 2.7 installation:

    [root@zxy-websgs Python-2.7.3]# python2.7
    Python 2.7.3 (default, Feb 28 2013, 03:08:43) 
    [GCC 4.1.2 20080704 (Red Hat 4.1.2-50)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> exit()
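Building from source leaves the system `python` untouched, so it is worth confirming from inside the interpreter which version is actually running. The following is a minimal sketch, not part of the original procedure; the helper name `is_python27` is made up for illustration.

```python
import sys


def is_python27(version_info):
    """Return True when the given version tuple describes a 2.7.x interpreter."""
    return tuple(version_info[:2]) == (2, 7)


if __name__ == "__main__":
    # sys.version_info describes the interpreter that is running this script.
    current = tuple(sys.version_info[:3])
    if is_python27(current):
        print("OK: running Python %d.%d.%d" % current)
    else:
        print("Not 2.7: running Python %d.%d.%d from %s"
              % (current + (sys.executable,)))
```

Saved under a name of your choice and run with `python2.7`, it should report the 2.7.3 build shown above; run with plain `python`, it shows whether the default interpreter is still the older system one.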

    2. Install setuptools: http://pypi.python.org/packages/source/s/setuptools/setuptools-0.6c11.tar.gz

    [root@zxy-websgs ~]# wget http://pypi.python.org/packages/source/s/setuptools/setuptools-0.6c11.tar.gz -P /opt/
    [root@zxy-websgs opt]# tar zxvf setuptools-0.6c11.tar.gz 
    [root@zxy-websgs setuptools-0.6c11]# python2.7 setup.py  install

    3. Install Twisted

    [root@zxy-websgs setuptools-0.6c11]# easy_install Twisted
    ......
    Installed /usr/local/lib/python2.7/site-packages/Twisted-12.3.0-py2.7-linux-x86_64.egg
    ......
    Installed /usr/local/lib/python2.7/site-packages/zope.interface-4.0.4-py2.7-linux-x86_64.egg

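    Twisted requires zope.interface; easy_install pulls it in automatically, as the output above shows. Both packages can also be downloaded manually from the following addresses: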

    zope.interface:http://pypi.python.org/packages/source/z/zope.interface/zope.interface-4.0.1.tar.gz

    twisted:http://twistedmatrix.com/Releases/Twisted/12.1/Twisted-12.1.0.tar.bz2

    5. Install w3lib

    [root@zxy-websgs setuptools-0.6c11]# easy_install -U w3lib
    Searching for w3lib
    Reading http://pypi.python.org/simple/w3lib/
    Reading http://github.com/scrapy/w3lib
    Best match: w3lib 1.2
    Downloading http://pypi.python.org/packages/source/w/w3lib/w3lib-1.2.tar.gz#md5=f929d5973a9fda59587b09a72f185a9e
    Processing w3lib-1.2.tar.gz
    Running w3lib-1.2/setup.py -q bdist_egg --dist-dir /tmp/easy_install-wm_1BB/w3lib-1.2/egg-dist-tmp-2DQHY_
    zip_safe flag not set; analyzing archive contents...
    Adding w3lib 1.2 to easy-install.pth file
    
    Installed /usr/local/lib/python2.7/site-packages/w3lib-1.2-py2.7.egg
    Processing dependencies for w3lib
    Finished processing dependencies for w3lib

    w3lib:http://pypi.python.org/packages/source/w/w3lib/w3lib-1.2.tar.gz

    6. Install libxml2, or install lxml with easy_install

      If the installation fails, see: http://www.coder4.com/archives/3660

    [root@zxy-websgs lxml-3.1.0]# easy_install lxml

    Verify the lxml installation:

    [root@zxy-websgs lxml-3.1.0]# python2.7
    Python 2.7.3 (default, Feb 28 2013, 03:08:43) 
    [GCC 4.1.2 20080704 (Red Hat 4.1.2-50)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import lxml
    >>> exit()
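A bare `import lxml` only loads the package itself; the compiled parser lives in `lxml.etree`, so a slightly stronger check is to import that module and parse something. The sketch below assumes nothing beyond lxml itself; the function name `lxml_smoke_test` is hypothetical.

```python
def lxml_smoke_test():
    """Return version info and a parsed title, or None if lxml.etree is unavailable."""
    try:
        from lxml import etree  # the compiled extension module
    except ImportError:
        return None
    root = etree.HTML("<html><head><title>ok</title></head><body></body></html>")
    return {
        "lxml": etree.LXML_VERSION,       # version tuple of lxml itself
        "libxml2": etree.LIBXML_VERSION,  # libxml2 version in use at runtime
        "title": root.findtext(".//title"),
    }


if __name__ == "__main__":
    info = lxml_smoke_test()
    if info is None:
        print("lxml.etree could not be imported")
    else:
        print("lxml %s, libxml2 %s, parsed title: %s"
              % (info["lxml"], info["libxml2"], info["title"]))
```

If the function returns `None`, the C extension was most likely not built, which is the situation the troubleshooting link above addresses.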

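    Alternatively, install libxml2. The official site recommends version 2.6.28 or later, but I could not find that version there. I first installed version 2.6.9, and running scrapy then failed with the following error: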

    Traceback (most recent call last):
      File "/usr/local/bin/scrapy", line 5, in <module>
        pkg_resources.run_script('Scrapy==0.14.4', 'scrapy')
      File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 489, in run_script
      File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 1207, in run_script
      File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
        execute()
      File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/cmdline.py", line 112, in execute
        cmds = _get_commands_dict(inproject)
      File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/cmdline.py", line 37, in _get_commands_dict
        cmds = _get_commands_from_module('scrapy.commands', inproject)
      File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/cmdline.py", line 30, in _get_commands_from_module
        for cmd in _iter_command_classes(module):
      File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/cmdline.py", line 21, in _iter_command_classes
        for module in walk_modules(module_name):
      File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/utils/misc.py", line 65, in walk_modules
        submod = __import__(fullpath, {}, {}, [''])
      File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/commands/shell.py", line 8, in <module>
        from scrapy.shell import Shell
      File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/shell.py", line 14, in <module>
        from scrapy.selector import XPathSelector, XmlXPathSelector, HtmlXPathSelector
      File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/selector/__init__.py", line 30, in <module>
        from scrapy.selector.libxml2sel import *
      File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/selector/libxml2sel.py", line 12, in <module>
        from .factories import xmlDoc_from_html, xmlDoc_from_xml
      File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/selector/factories.py", line 14, in <module>
        libxml2.HTML_PARSE_NOERROR + 
    AttributeError: 'module' object has no attribute 'HTML_PARSE_RECOVER'

    Upgrading to version 2.6.21 resolved the problem.

    libxml2-python 2.6.21:ftp://xmlsoft.org/libxml2/python/libxml2-python-2.6.21.tar.gz
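The AttributeError above means the installed libxml2 bindings do not define a constant that Scrapy's selector code references. A quick way to check this before running scrapy is sketched below. The helper `missing_constants` is hypothetical, and the list contains only the two constant names visible in the traceback, so it may not be everything Scrapy needs.

```python
def missing_constants(module, names):
    """Return the names from `names` that `module` does not define."""
    return [name for name in names if not hasattr(module, name)]


# Constant names taken from the traceback above (not an exhaustive list).
REQUIRED = ["HTML_PARSE_RECOVER", "HTML_PARSE_NOERROR"]

if __name__ == "__main__":
    try:
        import libxml2  # provided by the libxml2 Python bindings
    except ImportError:
        print("libxml2 bindings are not installed")
    else:
        missing = missing_constants(libxml2, REQUIRED)
        if missing:
            print("libxml2 bindings are missing: %s" % ", ".join(missing))
        else:
            print("libxml2 bindings define all checked constants")
```

With the 2.6.9 bindings this would be expected to report `HTML_PARSE_RECOVER` as missing, matching the error; after the upgrade to 2.6.21 the list should come back empty.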

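    7. Install pyOpenSSL (optional; it is mainly needed so that Scrapy can fetch HTTPS pages)

    Running easy_install pyOpenSSL picks pyOpenSSL 0.13, which failed to install, so I downloaded version 0.11 manually and installed that instead.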

    [root@zxy-websgs opt]# wget http://launchpadlibrarian.net/58498441/pyOpenSSL-0.11.tar.gz -P /opt
    [root@zxy-websgs opt]# tar zxvf pyOpenSSL-0.11.tar.gz 
    [root@zxy-websgs pyOpenSSL-0.11]# python2.7 setup.py install

    pyOpenSSL:http://launchpadlibrarian.net/58498441/pyOpenSSL-0.11.tar.gz

    8. Install Scrapy

    [root@zxy-websgs pyOpenSSL-0.11]# easy_install -U Scrapy

    Verify the installation:

    [root@zxy-websgs pyOpenSSL-0.11]# scrapy
    Scrapy 0.16.4 - no active project
    
    Usage:
      scrapy <command> [options] [args]
    
    Available commands:
      fetch         Fetch a URL using the Scrapy downloader
      runspider     Run a self-contained spider (without creating a project)
      settings      Get settings values
      shell         Interactive scraping console
      startproject  Create new project
      version       Print Scrapy version
      view          Open URL in browser, as seen by Scrapy
    
      [ more ]      More commands available when run from project directory
    
    Use "scrapy <command> -h" to see more info about a command

    scrapy:http://pypi.python.org/packages/source/S/Scrapy/Scrapy-0.14.4.tar.gz
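The first line of the scrapy banner carries the installed version, which is handy to check because `easy_install -U` may fetch a newer release than the tarball linked above (the banner here reports 0.16.4). The sketch below parses that line. The function name is hypothetical, and the regular expression assumes the banner format shown in the output above.

```python
import re
import subprocess


def parse_scrapy_banner(first_line):
    """Extract the version from a line such as 'Scrapy 0.16.4 - no active project'.

    Returns None when the line does not look like a Scrapy banner.
    """
    match = re.match(r"Scrapy\s+(\d+(?:\.\d+)*)", first_line.strip())
    return match.group(1) if match else None


if __name__ == "__main__":
    try:
        # Run the bare `scrapy` command, as in the verification step above.
        proc = subprocess.Popen(["scrapy"], stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT)
        output = proc.communicate()[0].decode("utf-8", "replace")
    except OSError:
        print("scrapy command not found on PATH")
    else:
        lines = output.splitlines()
        version = parse_scrapy_banner(lines[0]) if lines else None
        print("Installed Scrapy version: %s" % version)
```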

    Summary:

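    Installing pyOpenSSL on its own through easy_install did not succeed. As a workaround, download and install pyOpenSSL 0.11 manually first, then run easy_install -U Scrapy to handle the rest of the installation in one pass.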
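To confirm in one pass that everything installed in the steps above is importable, a small script along the following lines can be run with `python2.7`. This is a sketch: `check_imports` is a hypothetical helper, and the module list reflects my reading of the steps (pyOpenSSL is imported under the name `OpenSSL`).

```python
import importlib

# Import names for the packages installed in the steps above.
DEPENDENCIES = ["twisted", "zope.interface", "w3lib", "lxml", "OpenSSL", "scrapy"]


def check_imports(names):
    """Map each module name to None on success, or to an error description."""
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
            results[name] = None
        except Exception as exc:  # a broken install can raise more than ImportError
            results[name] = "%s: %s" % (type(exc).__name__, exc)
    return results


if __name__ == "__main__":
    report = check_imports(DEPENDENCIES)
    for name in DEPENDENCIES:
        status = "ok" if report[name] is None else "FAILED (%s)" % report[name]
        print("%-16s %s" % (name, status))
```

Any line marked FAILED points at the step to repeat; for `OpenSSL`, that is the manual pyOpenSSL 0.11 installation described in step 7.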

    Original article: http://www.cnblogs.com/xiaoruoen/archive/2013/02/27/2933854.html

  • Source: https://www.cnblogs.com/lixiuran/p/3960599.html