  • Using the Scrapy shell

    Note: requests made from the shell often get a 403 error that does not occur during an actual crawl.
    response - a Response object containing the last fetched page
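    >>> response.xpath('//title/text()').extract()
    response.xpath() returns a list of selectors; calling .extract() on it returns the matched data as a list of strings. In the loop below, links is assumed to be such a selector list of <a> elements from an earlier response.xpath() call.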
    >>> for index, link in enumerate(links):
    ...     args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
    ...     print('Link number %d points to url %s and image %s' % args)
    ...
    Link number 0 points to url ['image1.html'] and image ['image1_thumb.jpg']
    Link number 1 points to url ['image2.html'] and image ['image2_thumb.jpg']
    Link number 2 points to url ['image3.html'] and image ['image3_thumb.jpg']
    Link number 3 points to url ['image4.html'] and image ['image4_thumb.jpg']
    Link number 4 points to url ['image5.html'] and image ['image5_thumb.jpg']
    The enumerate() function is typically used in a for loop.
    A plain for loop with a manual counter:
    >>> i = 0
    >>> seq = ['one', 'two', 'three']
    >>> for element in seq:
    ...     print(i, seq[i])
    ...     i += 1
    ...
    0 one
    1 two
    2 three
    The same loop using enumerate:
    >>> seq = ['one', 'two', 'three']
    >>> for i, element in enumerate(seq):
    ...     print(i, element)
    ...
    0 one
    1 two
    2 three
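As a small sketch of the same idea in Python 3: enumerate also accepts a start argument, which avoids adding 1 by hand when a 1-based count is wanted. The list contents and the variable name numbered are illustrative only.

```python
seq = ['one', 'two', 'three']

# enumerate yields (index, element) pairs; start=1 makes the count 1-based.
numbered = [(i, element) for i, element in enumerate(seq, start=1)]

for i, element in numbered:
    print(i, element)
```

The index and the element arrive together, so there is no counter to initialise or increment.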
    Suppose you want to extract all <p> elements inside <div> elements. First, you would get all <div> elements:
    >>> divs = response.xpath('//div')
    Then query relative to them (note the dot prefixing the .//p XPath; without it, //p would match every <p> in the document, not only those inside the <div> elements):
    >>> for p in divs.xpath('.//p'):  # extracts all <p> inside
    ...     print(p.extract())
    Another common case would be to extract all direct <p> children:
    >>> for p in divs.xpath('p'):
    ...     print(p.extract())
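The difference between .//p and p can be tried without Scrapy or a network connection. The sketch below uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath and serves here as a stand-in, not as the Scrapy selector API. The markup and variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup: one <p> is a direct child of the <div>,
# another is nested one level deeper, and a third sits outside the <div>.
doc = ET.fromstring(
    "<body>"
    "<div><p>direct</p><section><p>nested</p></section></div>"
    "<p>outside</p>"
    "</body>"
)

div = doc.find('div')

# './/p' is relative to the div and matches <p> at any depth below it.
descendants = [p.text for p in div.findall('.//p')]

# 'p' matches only <p> elements that are direct children of the div.
children = [p.text for p in div.findall('p')]

print(descendants)
print(children)
```

Neither query returns the <p> outside the <div>, because both paths are evaluated relative to the div element.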
    Using the shell from within your code (for example, inside a spider callback):
    from scrapy.shell import inspect_response
    inspect_response(response, self)
    Press Ctrl-D (or Ctrl-Z on Windows) to exit the shell and resume crawling.
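    Prefer single quotes as the outermost quotes of an XPath expression, which leaves double quotes free for string values inside it!
    The shell can also open a local HTML file, which is convenient for debugging (but do not name the file index.html):
    scrapy shell ./path/to/file.html
    scrapy shell ../other/path/to/file.html
    scrapy shell /absolute/path/to/file.html
    Even for a file in the current directory, the ./ prefix is required; scrapy shell file.html does not work.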
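For a quick experiment, a throwaway local page can be generated from Python. The sketch below writes a small HTML file into a temporary directory under a made-up name (page.html, not index.html) and prints a scrapy shell command that uses its absolute path. The file name and markup are assumptions for illustration.

```python
import tempfile
from pathlib import Path

# Made-up page content for illustration.
html = "<html><head><title>Local test page</title></head><body><p>hello</p></body></html>"

# Write the page into a fresh temporary directory under a name other than index.html.
page = Path(tempfile.mkdtemp()) / "page.html"
page.write_text(html, encoding="utf-8")

# An absolute path needs no ./ prefix.
command = f"scrapy shell {page.resolve()}"
print(command)
```

Running the printed command should open the shell on the generated file, where the XPath queries shown earlier can be tried offline.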
  • Original article: https://www.cnblogs.com/elesos/p/7885474.html