  • Four selectors for parsing scraped data in Python: XPath, Beautiful Soup, PyQuery, re

    This is a consolidated reference for the data-parsing step that follows scraping, meant for quick lookup and to keep the tools from blurring together.

    The techniques covered are XPath, BeautifulSoup, PyQuery, and re (regular expressions).

    First, here are two sample HTML snippets for the examples that follow.

    Before parsing, the HTML source must be converted into each library's own object; the conversions are:

    Xpath:

    In [7]: from lxml import etree
    
    In [8]: text = etree.HTML(html)

    BeautifulSoup:

    In [2]: from bs4 import BeautifulSoup
    
    In [3]: soup = BeautifulSoup(html, 'lxml')

     PyQuery:

    In [10]: from pyquery import PyQuery as pq
    
    In [11]: doc = pq(html)

    re: no conversion step is needed; regular expressions match directly against the raw string.
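    Since re skips the parse step entirely, a minimal sketch (standard library only) of matching straight against the raw string:

```python
import re

# re works on the raw HTML string directly -- no parse tree is built
html = "<title>The Dormouse's story</title>"
match = re.search(r'<title>(.*?)</title>', html)
if match:  # re.search returns None when the pattern does not match
    print(match.group(1))
```

    The trade-off: regexes are quick to write for rigid, known markup, but they break as soon as attribute order or whitespace changes, which is why the parser-based options are usually safer.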

    Example 1

    html = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    
    <p class="story">...</p>
    </body>
    </html>
    '''

      Next, let's parse the sample HTML with each of the four methods.

    Match the title text:

    Xpath:

    In [16]: text.xpath('//title/text()')[0]
    Out[16]: "The Dormouse's story"

    BeautifulSoup:

    In [18]: soup.title.string
    Out[18]: "The Dormouse's story"

    PyQuery:

    In [20]: doc('title').text()
    Out[20]: "The Dormouse's story"

    re:

    In [11]: re.findall(r'<title>(.*?)</title></head>', html)[0]
    Out[11]: "The Dormouse's story"

    Match the href attribute of the third a tag:

    Xpath: # recommended

    In [36]: text.xpath('//a[@id="link3"]/@href')[0]
    Out[36]: 'http://example.com/tillie'

    BeautifulSoup:

    In [27]: soup.find_all(attrs={'id':'link3'})
    Out[27]: [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]


    In [33]: soup.find_all(attrs={'id':'link3'})[0].attrs['href']
    Out[33]: 'http://example.com/tillie'

     PyQuery: # recommended

    In [45]: doc("#link3").attr.href
    Out[45]: 'http://example.com/tillie'

     re:

    In [46]: re.findall(r'<a href="(.*?)" class="sister" id="link3">Tillie</a>;', html)[0]
    Out[46]: 'http://example.com/tillie'
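    As a side note, BeautifulSoup can reach the same attribute more tersely with the `id` keyword of `find()` plus dictionary-style attribute access (a sketch, assuming bs4 is installed; the bundled html.parser is used so no extra parser is required):

```python
from bs4 import BeautifulSoup

html = '<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>'
soup = BeautifulSoup(html, 'html.parser')
link = soup.find(id='link3')   # find() accepts common attributes as keyword args
print(link['href'])            # a Tag supports dict-style attribute lookup
```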

    Extract all of the text inside the p tag:

    Xpath:

    In [48]: text.xpath('string(//p[@class="story"])').strip()
    Out[48]: 'Once upon a time there were three little sisters; and their names were\nElsie,\nLacie and\nTillie;\nand they lived at the bottom of a well.'
    
    In [51]: ' '.join(text.xpath('string(//p[@class="story"])').split('\n'))
    Out[51]: 'Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.'

    BeautifulSoup:

    In [89]: ' '.join(list(soup.body.stripped_strings)).replace('\n', '')
    Out[89]: "The Dormouse's story Once upon a time there were three little sisters; and their names were Elsie,Lacie and Tillie; and they lived at the bottom of a well. ..."

    PyQuery:

    In [99]: doc('.story').text()
    Out[99]: 'Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well. ...'

    re: not recommended here; the pattern gets far too unwieldy:

    In [101]: re.findall(r'<p class="story">(.*?)<a href="http://example.com/elsie" class="sister" id="link1">(.*?)</a>(.*?)<a href="http://example.com/lacie" class="siste
         ...: r" id="link2">(.*?)</a>(.*?)<a href="http://example.com/tillie" class="sister" id="link3">(.*?)</a>;(.*?)</p>', html, re.S)[0]
    Out[101]:
    ('Once upon a time there were three little sisters; and their names were\n',
     'Elsie',
     ',\n',
     'Lacie',
     ' and\n',
     'Tillie',
     '\nand they lived at the bottom of a well.')
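    When only the visible text of a simple, trusted fragment is needed, a blunt regex alternative is to strip the tags and collapse the whitespace (a sketch; this is not a general HTML parser and will mishandle comments, scripts, or angle brackets inside attributes):

```python
import re

fragment = ('<p class="story">Once upon a time there were three little sisters;\n'
            'and their names were <a href="#">Elsie</a>,\n'
            '<a href="#">Lacie</a> and <a href="#">Tillie</a>;\n'
            'and they lived at the bottom of a well.</p>')

text = re.sub(r'<[^>]+>', '', fragment)   # drop every tag
text = re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace
print(text)
```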

    Example 2

    html = '''
    <div>
    <ul>
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
    </div>
    '''

     Match "second item":

    Xpath:

    In [14]: text.xpath('//li[2]/a/text()')[0]
    Out[14]: 'second item'

    BeautifulSoup:

    In [23]: soup.find_all(attrs={'class': 'item-1'})[0].string
    Out[23]: 'second item'

    PyQuery:

    In [34]: doc('.item-1>a')[0].text
    Out[34]: 'second item'

    re:

    In [35]: re.findall(r'<li class="item-1"><a href="link2.html">(.*?)</a></li>', html)[0]
    Out[35]: 'second item'

    Match the href attribute of the fifth li tag:

    Xpath:

    In [36]: text.xpath('//li[@class="item-0"]/a/@href')[0]
    Out[36]: 'link5.html'

    BeautifulSoup:

    In [52]:  soup.find_all(attrs={'class': 'item-0'})
    Out[52]:
    [<li class="item-0">first item</li>,
     <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>,
     <li class="item-0"><a href="link5.html">fifth item</a></li>]
    
    In [53]: soup.find_all(attrs={'class': 'item-0'})[-1].a.attrs['href']
    Out[53]: 'link5.html'

    PyQuery:

    In [75]: [i.attr.href for i in doc('.item-0 a').items()][1]
    Out[75]: 'link5.html'

    re:

    In [95]: re.findall(r'<li class="item-0"><a href="(.*?)">fifth item</a></li>', html)[0]
    Out[95]: 'link5.html'
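    All of the selections above lean on the exact class strings; when the goal is simply "the fifth li", position-based selection is less brittle (a sketch using lxml, mirroring the XPath approach from earlier):

```python
from lxml import etree

html = '''
<div><ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul></div>
'''
tree = etree.HTML(html)
# XPath positions are 1-based: li[5] is the fifth li whatever its class
print(tree.xpath('//li[5]/a/@href')[0])
```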

    Example 3

    <li><span class="label">房屋用途</span>普通住宅</li>

    Extract 房屋用途 ("housing use") and 普通住宅 ("ordinary residence") separately:

    Xpath:

    In [47]: text.xpath('//li/span/text()')[0]
    Out[47]: '房屋用途'
    
    In [49]: text.xpath('//li/text()')[0]
    Out[49]: '普通住宅'

    BeautifulSoup:

    In [65]: soup.span.string
    Out[65]: '房屋用途'
    
    In [69]: soup.li.contents[1] # contents returns the direct children
    Out[69]: '普通住宅'

    PyQuery:

    In [70]: doc('li span').text()
    Out[70]: '房屋用途'
    
    In [75]: doc('li .label')[0].tail
    Out[75]: '普通住宅'

    re: omitted.
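    The `.tail` trick used for PyQuery comes from lxml's element model, which PyQuery wraps: `.text` holds the text before an element's first child, while `.tail` holds the text that follows the element's closing tag but still belongs to the parent. A minimal sketch:

```python
from lxml import etree

li = etree.HTML('<li><span class="label">房屋用途</span>普通住宅</li>').xpath('//li')[0]
span = li[0]      # the first (and only) child element of <li>
print(span.text)  # text inside <span>
print(span.tail)  # text after </span>, i.e. the rest of the <li>'s content
```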


    Example 4

    <div class="unitPrice">
        <span class="unitPriceValue">26667<i>元/平米</i></span>
    </div>

    Extract 26667 and 元/平米 separately:

    Xpath:

    In [81]: text.xpath('//div[@class="unitPrice"]/span/text()')[0]
    Out[81]: '26667'
    
    In [82]: text.xpath('//div[@class="unitPrice"]/span/i/text()')[0]
    Out[82]: '元/平米'

    BeautifulSoup:

    In [97]: [i for i in soup.find('div', class_="unitPrice").strings]
    Out[97]: ['\n', '26667', '元/平米', '\n']
    
    In [98]: [i for i in soup.find('div', class_="unitPrice").strings][1]
    Out[98]: '26667'
    
    In [99]: [i for i in soup.find('div', class_="unitPrice").strings][2]
    Out[99]: '元/平米'

    PyQuery:

    In [109]: doc('.unitPrice .unitPriceValue')[0].text
    Out[109]: '26667'
    
    In [110]: doc('.unitPrice .unitPriceValue i')[0].text
    Out[110]: '元/平米'
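    In practice the price usually needs to be a number rather than a string; a sketch converting the two extracted pieces with re (the variable names are illustrative):

```python
import re

html = ('<div class="unitPrice">'
        '<span class="unitPriceValue">26667<i>元/平米</i></span>'
        '</div>')

m = re.search(r'<span class="unitPriceValue">(\d+)<i>(.*?)</i></span>', html)
price = int(m.group(1))  # '26667' -> 26667
unit = m.group(2)        # '元/平米'
print(price, unit)
```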
  • Original post: https://www.cnblogs.com/pywjh/p/9971241.html