  • Getting Started with Pyspider

    Install pyspider: pip3 install pyspider

    Starting the service

    1. Open cmd and run pyspider --help to see the available options; pyspider all starts all of pyspider's components in one process.

    2. Once you see the message that the service is listening on 0.0.0.0:5000, it is up. Open 127.0.0.1:5000 or http://localhost:5000/ in a browser to reach pyspider's web UI.

    3. Click Create to make a new project; any name will do.

    4. The editor on the right shows the generated template:

    #!/usr/bin/env python
    # -*- encoding: utf-8 -*-
    # Created on 2018-08-22 23:16:23
    # Project: TripAdvisor

    from pyspider.libs.base_handler import *


    class Handler(BaseHandler):
        crawl_config = {
        }

        @every(minutes=24 * 60)
        def on_start(self):
            self.crawl('__START_URL__', callback=self.index_page)

        @config(age=10 * 24 * 60 * 60)
        def index_page(self, response):
            for each in response.doc('a[href^="http"]').items():
                self.crawl(each.attr.href, callback=self.detail_page)

        @config(priority=2)
        def detail_page(self, response):
            return {
                "url": response.url,
                "title": response.doc('title').text(),
            }

    Replace __START_URL__ with the address of the site you want to crawl, save, then click the run button on the left. In the left-hand panel, click follow and use the > arrow buttons to step through the queued requests.
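In the template, index_page's selector a[href^="http"] matches every anchor whose href starts with "http", i.e. absolute links. pyspider itself parses pages with PyQuery; as a rough stdlib-only illustration of that filtering (the class name here is made up for the example):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect hrefs that start with "http", like a[href^="http"]."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if href.startswith("http"):
                self.links.append(href)

page = '<a href="https://example.com/a">A</a> <a href="/relative">B</a>'
collector = LinkCollector()
collector.feed(page)
print(collector.links)  # ['https://example.com/a'] - relative link filtered out
```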

    My first pyspider run died before it even got going: HTTP 599. A quick search for "PySpider HTTP 599: SSL certificate problem" showed plenty of people hitting the same error; this write-up was helpful: https://blog.csdn.net/asmcvc/article/details/51016485

    The full error output (the paths will differ depending on where Python is installed):

    [E 180822 23:51:45 base_handler:203] HTTP 599: SSL certificate problem: self signed certificate in certificate chain
        Traceback (most recent call last):
          File "e:\programs\python\python36\lib\site-packages\pyspider\libs\base_handler.py", line 196, in run_task
            result = self._run_task(task, response)
          File "e:\programs\python\python36\lib\site-packages\pyspider\libs\base_handler.py", line 175, in _run_task
            response.raise_for_status()
          File "e:\programs\python\python36\lib\site-packages\pyspider\libs\response.py", line 172, in raise_for_status
            six.reraise(Exception, Exception(self.error), Traceback.from_string(self.traceback).as_traceback())
          File "e:\programs\python\python36\lib\site-packages\six.py", line 692, in reraise
            raise value.with_traceback(tb)
          File "e:\programs\python\python36\lib\site-packages\pyspider\fetcher\tornado_fetcher.py", line 378, in http_fetch
            response = yield gen.maybe_future(self.http_client.fetch(request))
          File "e:\programs\python\python36\lib\site-packages\tornado\httpclient.py", line 102, in fetch
            self._async_client.fetch, request, **kwargs))
          File "e:\programs\python\python36\lib\site-packages\tornado\ioloop.py", line 458, in run_sync
            return future_cell[0].result()
          File "e:\programs\python\python36\lib\site-packages\tornado\concurrent.py", line 238, in result
            raise_exc_info(self._exc_info)
          File "<string>", line 4, in raise_exc_info
        Exception: HTTP 599: SSL certificate problem: self signed certificate in certificate chain

    Cause:

    The error occurs when requesting an https:// URL whose SSL certificate fails validation, in this case a self-signed certificate in the chain.
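For context, what pyspider's validate_cert=False asks for is skipping certificate verification at the SSL layer. The standard library's ssl module shows the two modes side by side (a sketch for illustration; pyspider's fetcher does this internally via Tornado):

```python
import ssl

# A default context verifies the server's certificate chain against
# trusted CAs, so a self-signed certificate fails the handshake.
strict_ctx = ssl.create_default_context()
assert strict_ctx.verify_mode == ssl.CERT_REQUIRED

# Disabling verification (the effect of validate_cert=False) means
# accepting any certificate: fine for testing, insecure in general.
insecure_ctx = ssl.create_default_context()
insecure_ctx.check_hostname = False
insecure_ctx.verify_mode = ssl.CERT_NONE
```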

    Fix:

    Use self.crawl(url, callback=self.index_page, validate_cert=False). Note that validate_cert=False has to be added to every self.crawl call in every callback, otherwise the child pages will still fail with 599. Painful.
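According to the pyspider docs, parameters placed in crawl_config act as project-level defaults merged into every self.crawl call, so the flag could also be set once instead of per callback. A sketch, assuming that behavior (not something the original post tested):

```python
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    # Project-wide defaults; pyspider merges these into each self.crawl,
    # so validate_cert=False no longer needs repeating in every callback.
    crawl_config = {
        'validate_cert': False,
    }
```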

    The working code:

    #!/usr/bin/env python
    # -*- encoding: utf-8 -*-
    # Created on 2018-08-23 23:06:13
    # Project: v2ex

    from pyspider.libs.base_handler import *


    class Handler(BaseHandler):
        crawl_config = {
        }

        @every(minutes=24 * 60)
        def on_start(self):
            self.crawl('https://www.v2ex.com/?tab=tech', callback=self.index_page, validate_cert=False)

        @config(age=10 * 24 * 60 * 60)
        def index_page(self, response):
            for each in response.doc('a[href^="https://www.v2ex.com/?tab="]').items():
                self.crawl(each.attr.href, callback=self.tab_page, validate_cert=False)

        @config(priority=2)
        def tab_page(self, response):
            for each in response.doc('a[href^="https://www.v2ex.com/go/"]').items():
                self.crawl(each.attr.href, callback=self.board_page, validate_cert=False)

        @config(priority=2)
        def board_page(self, response):
            for each in response.doc('a[href^="https://www.v2ex.com/t/"]').items():
                url = each.attr.href
                # Drop the "#reply" fragment so the same topic is queued once
                if url.find('#reply') > 0:
                    url = url[0:url.find('#')]
                self.crawl(url, callback=self.detail_page, validate_cert=False)

        @config(priority=2)
        def detail_page(self, response):
            title = response.doc('h1').text()
            content = response.doc('div.topic_content')
            return {
                "url": response.url,
                "title": title,
                "content": content.text(),
            }
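    board_page trims the #reply anchor so the same topic is not queued twice under different fragments. That normalization is plain string handling and can be checked on its own (the function name here is just for the demo):

```python
def strip_reply_fragment(url):
    # Drop a trailing "#reply<N>" fragment so e.g. .../t/123#reply4 and
    # .../t/123 dedupe to the same task URL, mirroring board_page above.
    if url.find('#reply') > 0:
        url = url[0:url.find('#')]
    return url

print(strip_reply_fragment('https://www.v2ex.com/t/123456#reply7'))
# https://www.v2ex.com/t/123456
```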

    That basically solved the problem. (The results pane needed a manual browser refresh in 360 Secure Browser, possibly down to my settings; Chrome and Firefox didn't show the issue.)

    For Linux and macOS, see:

    https://blog.csdn.net/WebStudy8/article/details/51610953

  • Original post: https://www.cnblogs.com/jackzz/p/9521253.html