  • Using the Scrapy shell

    Note: the shell is prone to 403 errors, which do not occur during an actual crawl.
    response - a Response object containing the last fetched page
    >>> response.xpath('//title/text()').extract()
    response.xpath() returns a list of selectors; calling .extract() on it returns the matched data as a list of strings.
    >>> links = response.xpath('//a[contains(@href, "image")]')  # assumed selection; the original omits how links is defined
    >>> for index, link in enumerate(links):
    ...     args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
    ...     print('Link number %d points to url %s and image %s' % args)
    ...
    Link number 0 points to url ['image1.html'] and image ['image1_thumb.jpg']
    Link number 1 points to url ['image2.html'] and image ['image2_thumb.jpg']
    Link number 2 points to url ['image3.html'] and image ['image3_thumb.jpg']
    Link number 3 points to url ['image4.html'] and image ['image4_thumb.jpg']
    Link number 4 points to url ['image5.html'] and image ['image5_thumb.jpg']
    The enumerate() function is typically used in a for loop.
    An ordinary for loop:
    >>> i = 0
    >>> seq = ['one', 'two', 'three']
    >>> for element in seq:
    ...     print(i, seq[i])
    ...     i += 1
    ...
    0 one
    1 two
    2 three
    A for loop using enumerate:
    >>> seq = ['one', 'two', 'three']
    >>> for i, element in enumerate(seq):
    ...     print(i, element)
    ...
    0 one
    1 two
    2 three
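As a small sketch of what enumerate() actually produces, the snippet below materializes the pairs and shows the optional start argument. The names pairs and numbered are illustrative only and do not come from the original post.

```python
seq = ['one', 'two', 'three']

# enumerate() yields (index, item) tuples; list() makes them visible.
pairs = list(enumerate(seq))
print(pairs)  # [(0, 'one'), (1, 'two'), (2, 'three')]

# The optional start argument changes the first index.
numbered = ['%d. %s' % (i, item) for i, item in enumerate(seq, start=1)]
print(numbered)  # ['1. one', '2. two', '3. three']
```

Because the index comes from enumerate() itself, there is no separate counter to initialize or increment.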
    Suppose you want to extract all <p> elements inside <div> elements. First, you would get all <div> elements:
    >>> divs = response.xpath('//div')
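    Then select the <p> elements relative to those <div> elements (note the dot prefixing the .//p XPath; without it, //p would match every <p> in the document):
    >>> for p in divs.xpath('.//p'):  # extracts all <p> inside
    ...     print(p.extract())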
    Another common case would be to extract all direct <p> children:
    >>> for p in divs.xpath('p'):
    ...     print(p.extract())
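The difference between the two relative expressions can be tried without fetching a page. The sketch below is not Scrapy code: it uses the standard library's xml.etree.ElementTree, which supports a small XPath subset, as a stand-in for Scrapy selectors, and the sample markup is hypothetical. In the Scrapy shell the same expressions would be passed to divs.xpath(...).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup, not taken from the original page.
HTML = """
<html><body>
  <div id="a"><p>direct child</p><span><p>nested deeper</p></span></div>
  <p>outside any div</p>
</body></html>
"""

root = ET.fromstring(HTML)
divs = root.findall(".//div")

# './/p' is relative to each div and matches <p> at any depth below it.
descendants = [p.text for div in divs for p in div.findall(".//p")]

# 'p' matches only <p> elements that are direct children of the div.
children = [p.text for div in divs for p in div.findall("p")]

print(descendants)  # ['direct child', 'nested deeper']
print(children)     # ['direct child']
```

Neither list contains the <p> that sits outside the <div>, which is what the leading dot is meant to guarantee.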
    Invoking the shell from within a spider
    from scrapy.shell import inspect_response
    inspect_response(response, self)  # call this inside a spider callback, where self is the spider
    Press Ctrl-D (or Ctrl-Z on Windows) to exit the shell and resume the crawl.
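    It is best to wrap an XPath expression in single quotes on the outside, so that double quotes stay available for string literals inside it.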
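A minimal Python sketch of why the outer single quotes help. The XPath expression is a hypothetical example and the variable names are illustrative.

```python
# With single quotes outside, double quotes inside need no escaping.
single_outside = '//a[contains(@href, "image")]'

# With double quotes outside, every inner double quote must be escaped.
double_outside = "//a[contains(@href, \"image\")]"

# Both literals denote exactly the same XPath string.
print(single_outside == double_outside)  # True
```

The two forms are equivalent to Python; the single-quoted one is simply easier to read and to type.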
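    The shell can also open a local HTML file, which is convenient for debugging (but do not name the file index.html):
    scrapy shell ./path/to/file.html
    scrapy shell ../other/path/to/file.html
    scrapy shell /absolute/path/to/file.html
    Even for a file in the current directory, the ./ prefix is required; scrapy shell file.html does not work.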
  • Original article: https://www.cnblogs.com/elesos/p/7885474.html