  • How to run a simple Scrapy spider

    1. Create a Scrapy project

    scrapy startproject python123demo

    2. Generate a spider file inside the project

    cd python123demo

    scrapy genspider demo python123.io

    3. Configure the spider by editing the generated spider file

    4. Run the spider

    scrapy crawl demo

    Running it raised a few small problems, all caused by dependent packages that were not installed along with Scrapy.

    ModuleNotFoundError: No module named 'win32api'

    Fixing this requires the package pywin32-221-cp36-cp36m-win_amd64.whl.
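
The file name of a wheel encodes which interpreter it targets: cp36 is the Python tag (CPython 3.6), cp36m the ABI tag, and win_amd64 the platform tag (64-bit Windows). Below is a small sketch for checking the Python tag against the running interpreter before installing. The helper names are hypothetical, and the check is simplified: it ignores compound tags such as py2.py3.

```python
import sys


def wheel_tags(filename):
    """Split a wheel file name into its (python, abi, platform) tags."""
    stem = filename[:-len(".whl")] if filename.endswith(".whl") else filename
    # The last three dash-separated fields are the compatibility tags.
    python_tag, abi_tag, platform_tag = stem.split("-")[-3:]
    return python_tag, abi_tag, platform_tag


def matches_this_python(filename):
    """Return True if the wheel's Python tag matches the running CPython."""
    current = "cp%d%d" % sys.version_info[:2]
    return wheel_tags(filename)[0] == current


print(wheel_tags("pywin32-221-cp36-cp36m-win_amd64.whl"))
```

If matches_this_python returns False for a downloaded file, a build for the installed Python version is needed instead.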

    ImportError: DLL load failed: 找不到指定的模块。 (The specified module could not be found.)

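    This error means that the pywin32-221-cp36-cp36m-win_amd64.whl package was not installed successfully.

    Re-running the pywin32_postinstall.py script created by the installation fixes it:

    python.exe Scripts\pywin32_postinstall.py -install

    However, this can still produce errors: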

    F:\Python36>python.exe Scripts\pywin32_postinstall.py -install
    Copied pythoncom36.dll to F:\Python36\pythoncom36.dll
    Copied pywintypes36.dll to F:\Python36\pywintypes36.dll
    You do not have the permissions to install COM objects.
    The sample COM objects were not registered.
    -> Software\Python\PythonCore\3.6\Help[None]=None
    -> Software\Python\PythonCore\3.6\Help\Pythonwin Reference[None]='F:\Python36\Lib\site-packages\PyWin32.chm'
    Pythonwin has been registered in context menu
    Creating directory F:\Python36\Lib\site-packages\win32com\gen_py
    Can't install shortcuts - 'C:\Users\asus\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\Python 3.6' is not a folder
    The pywin32 extensions were successfully installed.

    Clearly, the command above has to be run with administrator privileges.
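
A minimal sketch of guarding the post-install step with a privilege check, assuming it is run with the same interpreter that pywin32 was installed into. ctypes.windll.shell32.IsUserAnAdmin is a Windows shell call; is_admin, postinstall_command and run_postinstall are hypothetical helper names introduced for illustration.

```python
import ctypes
import os
import subprocess
import sys


def is_admin():
    """Return True when the current process has administrator (or root) rights."""
    if os.name == "nt":
        # IsUserAnAdmin returns a non-zero value for an elevated process.
        return bool(ctypes.windll.shell32.IsUserAnAdmin())
    # Non-Windows fallback so the sketch can also be imported elsewhere.
    return os.geteuid() == 0


def postinstall_command(prefix=sys.prefix):
    """Build the command line for pywin32's post-install script."""
    script = os.path.join(prefix, "Scripts", "pywin32_postinstall.py")
    return [sys.executable, script, "-install"]


def run_postinstall():
    """Run the post-install script, refusing to start without admin rights."""
    if not is_admin():
        raise PermissionError("Open an administrator prompt and try again.")
    subprocess.run(postinstall_command(), check=True)
```

Calling run_postinstall() from an ordinary prompt fails early with a clear message instead of producing the partial installation shown above.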


    On success, the following is printed (long lines are cut off at the right edge):


    PS C:\WINDOWS\system32> f:
    PS F:\> cd .\Python36
    PS F:\Python36> python.exe Scripts\pywin32_postinstal
    Copied pythoncom36.dll to C:\WINDOWS\system32\pythonc
    Copied pywintypes36.dll to C:\WINDOWS\system32\pywint
    Registered: Python.Interpreter
    Registered: Python.Dictionary
    Registered: Python
    -> Software\Python\PythonCore\3.6\Help[None]=None
    -> Software\Python\PythonCore\3.6\Help\Pythonwin Refe
    Pythonwin has been registered in context menu
    Shortcut for Pythonwin created
    Shortcut to documentation created
    The pywin32 extensions were successfully installed.

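    Running the spider again finally works: Scrapy starts and fetches the page. The one remaining ERROR in the log is a NameError from my own parse method in demo.py (it refers to a variable called name that is not defined), not an installation problem.

    F:\pyProject\python123demo>scrapy crawl demo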
    2017-10-29 09:16:43 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: python123demo)
    2017-10-29 09:16:43 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'python123demo', 'NEWSPIDER_MODULE': 'python123demo.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['python123demo.spiders']}
    2017-10-29 09:16:43 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.corestats.CoreStats',
    'scrapy.extensions.telnet.TelnetConsole',
    'scrapy.extensions.logstats.LogStats']
    2017-10-29 09:16:43 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
    'scrapy.downloadermiddlewares.retry.RetryMiddleware',
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
    'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2017-10-29 09:16:43 [scrapy.middleware] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
    'scrapy.spidermiddlewares.referer.RefererMiddleware',
    'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
    'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2017-10-29 09:16:43 [scrapy.middleware] INFO: Enabled item pipelines:
    []
    2017-10-29 09:16:43 [scrapy.core.engine] INFO: Spider opened
    2017-10-29 09:16:44 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2017-10-29 09:16:44 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
    2017-10-29 09:16:44 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://python123.io/robots.txt> from <GET http://python123.io/robots.txt>
    2017-10-29 09:16:44 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://python123.io/robots.txt> (referer: None)
    2017-10-29 09:16:44 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://python123.io/ws/demo.html> from <GET http://python123.io/ws/demo.html>
    2017-10-29 09:16:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://python123.io/ws/demo.html> (referer: None)
    2017-10-29 09:16:44 [scrapy.core.scraper] ERROR: Spider error processing <GET https://python123.io/ws/demo.html> (referer: None)
    Traceback (most recent call last):
      File "f:\python36\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "F:\pyProject\python123demo\python123demo\spiders\demo.py", line 14, in parse
        self.log('Save file %s.' % name)
    NameError: name 'name' is not defined
    2017-10-29 09:16:44 [scrapy.core.engine] INFO: Closing spider (finished)
    2017-10-29 09:16:44 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 884,
    'downloader/request_count': 4,
    'downloader/request_method_count/GET': 4,
    'downloader/response_bytes': 1595,
    'downloader/response_count': 4,
    'downloader/response_status_count/200': 1,
    'downloader/response_status_count/301': 2,
    'downloader/response_status_count/404': 1,
    'finish_reason': 'finished',
    'finish_time': datetime.datetime(2017, 10, 29, 1, 16, 44, 929393),
    'log_count/DEBUG': 5,
    'log_count/ERROR': 1,
    'log_count/INFO': 7,
    'response_received_count': 2,
    'scheduler/dequeued': 2,
    'scheduler/dequeued/memory': 2,
    'scheduler/enqueued': 2,
    'scheduler/enqueued/memory': 2,
    'spider_exceptions/NameError': 1,
    'start_time': datetime.datetime(2017, 10, 29, 1, 16, 44, 121136)}
    2017-10-29 09:16:44 [scrapy.core.engine] INFO: Spider closed (finished)

    Summary: to guard against failed installs of Scrapy's dependencies, you can download each dependency yourself from:

    https://www.lfd.uci.edu/~gohlke/pythonlibs/

    lxml

    pywin32

    Twisted

    OpenSSL

    Put the downloaded files in the Scripts folder.

    Then install each one by running pip install on the corresponding .whl file.
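
One way to script this step, sketched under the assumption that the downloaded .whl files sit together in a single folder. It calls pip through the running interpreter (python -m pip install followed by the file path). The function names are placeholders introduced for illustration.

```python
import subprocess
import sys
from pathlib import Path


def find_wheels(folder):
    """Return the .whl files in a folder, sorted by name."""
    return sorted(Path(folder).glob("*.whl"))


def pip_install_command(wheel):
    """Build the pip command that installs one local wheel file."""
    return [sys.executable, "-m", "pip", "install", str(wheel)]


def install_wheels(folder):
    """Install every wheel in the folder, stopping at the first failure."""
    for wheel in find_wheels(folder):
        subprocess.run(pip_install_command(wheel), check=True)
```

Calling install_wheels with the folder that holds the downloads installs them one after another. pip may still fetch any remaining dependencies from the package index if they are not already installed.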

    Finally, import each module to check that it installed successfully.
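
The import check can be scripted. The sketch below tries to import each dependency and reports any failure, including the DLL load error seen earlier, since that also surfaces as an ImportError. The import name can differ from the package name: pywin32 provides win32api and pyOpenSSL provides OpenSSL. The module list is an assumption based on the dependencies named above.

```python
import importlib

# Import names (not package names) of the dependencies to verify.
REQUIRED_MODULES = ["lxml", "win32api", "twisted", "OpenSSL", "scrapy"]


def check_imports(names):
    """Try to import each module and map its name to 'ok' or the error text."""
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
            results[name] = "ok"
        except ImportError as exc:
            # Covers both missing modules and DLL load failures.
            results[name] = "failed: %s" % exc
    return results


for module, status in check_imports(REQUIRED_MODULES).items():
    print("%-10s %s" % (module, status))
```

Any line that does not read ok points to the package that needs to be reinstalled.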

    After that, Scrapy can be upgraded from the command line.

  • Original post: https://www.cnblogs.com/anqiang1995/p/7749362.html