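  • Using Selector in Scrapy

    Scrapy's Selector can also be used on its own as an HTML parser. This post summarizes how to use it with CSS and XPath expressions; personally, I prefer CSS.

    The example is a page from Song Tian's (嵩天) MOOC tutorial: python123.io/ws/demo.html

    Parsing is a way of extracting information. The things to extract are mainly tag nodes, attributes, and text; the three sections below cover each in turn.

    1. Extracting tag nodes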

    response = """<html><head><title>This is a python demo page</title></head>
    <body>
    <p class="title"><b>The demo python introduces several python courses.</b></p>
    <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
    </body></html>"""

    That is the HTML of the page. Suppose I want to extract the <p> tags.

    Using CSS selectors

    from scrapy.selector import Selector

    selector = Selector(text=response)
    p = selector.css('p').extract()
    print(p)
    #['<p class="title"><b>The demo python introduces several python courses.</b></p>', '<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses: <a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>']

    This returns every p node as a list of strings. If you only want the first one, use extract_first() instead of extract().

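    For simple lookups this is enough. But when the page contains many nodes of the same kind and the one I want is not the first, it no longer works. The fix is to add a constraint such as class or id.

    For example, to extract the p node with class=course, use p[class="course"]. Any other attribute the node has can serve as the constraint in the same way.

    selector = Selector(text=response)
    p = selector.css('p[class="course"]').extract_first()
    print(p)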

    #<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>

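    Using XPath

    The overall approach with XPath is the same; only the syntax differs a little.

    The first example above, written with XPath:

    selector = Selector(text=response)
    p = selector.xpath('//p').extract_first()
    print(p)

    The second example above, written with XPath:

    selector = Selector(text=response)
    p = selector.xpath('//p[@class="course"]').extract_first()
    print(p)

    Attentive readers may notice that, for selecting tag nodes, XPath differs from CSS only by the extra // and @. A leading // selects matching nodes anywhere in the document, at any depth (it is not relative to the current node), and @ is followed by an attribute name.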

    2. Extracting attributes

    Sometimes we need the value of an attribute, such as src or href.

    response = """<html><head><title>This is a python demo page</title></head>
    <body>
    <p class="title"><b>The demo python introduces several python courses.</b></p>
    <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
    </body></html>"""

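    This is the same example as before, copied here for easier reading.

    Suppose I now want the href of the first a tag.

    Using CSS

    Append ::attr(href) directly to the tag selector: attr means an attribute is being extracted, and the href in parentheses names which attribute. (::attr() is an extension provided by Scrapy's selectors, not standard CSS.)

    selector = Selector(text=response)
    href = selector.css('a::attr(href)').extract_first()
    print(href)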
    #http://www.icourse163.org/course/BIT-268001

    To extract the href of a specific a tag, for example the second one, add a constraint in the same way.

    selector = Selector(text=response)
    href = selector.css('a[class="py2"]::attr(href)').extract_first()
    print(href)
    #http://www.icourse163.org/course/BIT-1001870001

    Using XPath

    The first example above:

    selector = Selector(text=response)
    href = selector.xpath('//a/@href').extract_first()
    print(href)

    The second example above:

    selector = Selector(text=response)
    href = selector.xpath('//a[@class="py2"]/@href').extract_first()
    print(href)

    3. Extracting text

    response = """<html><head><title>This is a python demo page</title></head>
    <body>
    <p class="title"><b>The demo python introduces several python courses.</b></p>
    <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
    </body></html>"""

    Extract the text of the first a tag.

    Using CSS selectors

    Just append ::text to the tag selector; choosing the tag works as described above.

    selector = Selector(text=response)
    text = selector.css('a::text').extract_first()
    print(text)
    #Basic Python

    To get the text of a specific tag, for example the second a tag, again just add a constraint.

    selector = Selector(text=response)
    text = selector.css('a[class="py2"]::text').extract_first()
    print(text)
    #Advanced Python

    Doing the same with XPath

    For the first example, //a selects the a nodes and /text() then selects their text.

    selector = Selector(text=response)
    text = selector.xpath('//a/text()').extract_first()
    print(text)

    For the second example: when adding a condition in XPath, don't forget the @ before the attribute name, and text must be followed by ().

    selector = Selector(text=response)
    text = selector.xpath('//a[@class="py2"]/text()').extract_first()
    print(text)

    To sum up: for extraction, XPath needs the extra / and @ symbols, and even a condition needs an @ before the attribute it tests. That is why I prefer CSS: I'm lazy.

  • Original article: https://www.cnblogs.com/sjfeng1987/p/9930285.html