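  • Using Beautiful Soup

    Using Beautiful Soup

    1. Introduction

      Put simply, Beautiful Soup is a Python library for parsing HTML and XML that makes it easy to extract data from web pages. It provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to scrape.

      Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one, in which case you only need to state the original encoding.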
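
      As a minimal sketch of the "state the original encoding" case: the bytes below are a made-up GBK-encoded fragment with no charset declaration, and from_encoding tells Beautiful Soup how to decode them. The html.parser backend is used only because it ships with Python; that choice is an assumption of this example, not something the article prescribes.

```python
from bs4 import BeautifulSoup

# Hypothetical input: GBK-encoded bytes with no <meta charset> declaration.
raw = "<p>你好</p>".encode("gbk")

# from_encoding states the original encoding explicitly.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string               # already a Unicode str
utf8_bytes = soup.encode("utf-8")  # explicit UTF-8 bytes on the way out

print(text)
print(utf8_bytes)
```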

    2. Preparation

    Installing Beautiful Soup

    a. Related links

      Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

      Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh

      PyPI: https://pypi.python.org/pypi/beautifulsoup4

    b. Installing with pip3

      pip3 install beautifulsoup4

    c. Installing from a wheel

      Download the wheel file from PyPI

      Then install the wheel file with pip
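
      One way to check that the installation worked is a quick import test like the sketch below. It uses Python's built-in html.parser so that it does not depend on any other package; note that the lxml parser used in the rest of this article comes from a separate package (pip3 install lxml).

```python
import bs4
from bs4 import BeautifulSoup

# Print the installed version, then parse a trivial document.
print(bs4.__version__)

soup = BeautifulSoup("<p>ok</p>", "html.parser")
print(soup.p.string)
```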

    3. Using Beautiful Soup

    1. Basic usage

    from bs4 import BeautifulSoup
    
    html = """
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <title>The Beautiful Suop</title>
    </head>
    <body>
    <p class="title" name="dromouse"><b>The story</b></p>
    <p class="story" >once upon a time there were three title sisters;and their name were
    <a href="http://example.com/elsie" class="sister" id="link1">Elise</a>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    """
    soup = BeautifulSoup(html, 'lxml')
    print(soup.prettify())
    print(soup.title.string)

    The output is as follows:

    <html lang="en">
     <head>
      <meta charset="utf-8"/>
      <title>
       The Beautiful Suop
      </title>
     </head>
     <body>
      <p class="title" name="dromouse">
       <b>
        The story
       </b>
      </p>
      <p class="story">
       once upon a time there were three title sisters;and their name were
       <a class="sister" href="http://example.com/elsie" id="link1">
        Elise
       </a>
       <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
       </a>
       <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
       </a>
       ;
        and they lived at the bottom of a well.
      </p>
      <p class="story">
       ...
      </p>
     </body>
    </html>
    The Beautiful Suop

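      We first declare a variable, html, that holds an HTML string. Note that it is not a complete HTML document: the body and html nodes are never closed. We pass it as the first argument to the BeautifulSoup constructor; the second argument is the parser type (lxml here). That completes the initialization of the BeautifulSoup object, which we assign to the variable soup. From there we can call soup's methods and attributes to parse the HTML.

      First we call prettify(), which outputs the parsed document with standard indentation. Note that the output includes the closing body and html tags: Beautiful Soup automatically corrects non-standard HTML. This correction is not done by prettify(); it has already happened during initialization.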
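
      The point that the repair happens at construction time, not in prettify(), can be checked with a small sketch. The fragment is made up, and html.parser is used here as an assumption so the example runs without lxml; unlike lxml, html.parser closes the open tags but does not add html or body wrappers.

```python
from bs4 import BeautifulSoup

# Hypothetical broken fragment: neither <p> nor <b> is closed.
broken = "<p>unclosed <b>bold"

soup = BeautifulSoup(broken, "html.parser")

# No prettify() call: the tree is already repaired.
repaired = str(soup)
print(repaired)
```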

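      Then we call soup.title.string, which prints the text of the title node. So soup.title selects the title node from the HTML, and its string attribute returns the text inside it.

    2. Node selectors

    Referring to a node by name as an attribute selects that node element, and calling string on it then returns the node's text. This kind of selection is very fast; it is a good choice when the node you want sits at a clear, unambiguous position in the tree.

      ♦ Selecting elements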

    from bs4 import BeautifulSoup
    
    html = """
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <title>The Beautiful Suop</title>
    </head>
    <body>
    <p class="title" name="dromouse"><b>The story</b></p>
    <p class="story" >once upon a time there were three title sisters;and their name were
    <a href="http://example.com/elsie" class="sister" id="link1">Elise</a>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    """
    soup = BeautifulSoup(html, 'lxml')
    print(soup.title)
    print(type(soup.title))
    print(soup.title.string)
    print(soup.head)
    print(soup.p)

    The output is as follows:

    <title>The Beautiful Suop</title>
    <class 'bs4.element.Tag'>
    The Beautiful Suop
    <head>
    <meta charset="utf-8"/>
    <title>The Beautiful Suop</title>
    </head>
    <p class="title" name="dromouse"><b>The story</b></p>

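      We reuse the sample HTML from before. First we print the result of selecting the title node, which is the node together with its text. Next comes its type, <class 'bs4.element.Tag'>, an important data structure in Beautiful Soup. Calling string on the Tag then prints the node's text.

      We then try the head node and the p node. Selecting p prints only the first p node: when several nodes share a name, this style of selection matches the first one and ignores the rest.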
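
      A short sketch of the first-match behavior, using a made-up two-paragraph document and html.parser (both are assumptions of this example). It also shows that selecting a tag name that does not occur returns None, which is worth checking before chaining further attribute access.

```python
from bs4 import BeautifulSoup

html = '<p class="title">first</p><p class="story">second</p>'
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p             # only the first <p>
all_p = soup.find_all("p")   # every <p>
missing = soup.table         # no <table> in the document, so None

print(first_p)
print(len(all_p))
print(missing)
```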

      ♦ Extracting information

        How do we get a node's attribute values, or its name?

      (1) Getting the name

      Use the name attribute to get a node's name:

    print(soup.title.name)
    
    Output:
    
    title

      (2) Getting attributes

      A node can have several attributes, such as id and class. After selecting the node, call attrs to get all of them:

    print(soup.p.attrs)
    Output:
    {'class': ['title'], 'name': 'dromouse'}

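      As you can see, attrs returns a dictionary that maps every attribute name to its value. To get the name attribute, index the dictionary by key: attrs['name']. There is also a shorter form: index the node element directly with the attribute name: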

    print(soup.p['name'])
    print(soup.p['class'])
    
    Output:
    dromouse
    ['title']

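      Note that some results are strings and others are lists. The name attribute has a single value, so a single string is returned; class can hold several values, so a list is returned. Check which one you have in real code.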
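
      The string-versus-list distinction can be sketched as follows. The tag and its attribute values are invented for illustration, and html.parser is an assumption of the example. Tag.get() is shown as a way to read an attribute that may be absent without raising KeyError.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="story lead" id="intro">text</p>', "html.parser")
p = soup.p

classes = p["class"]             # multi-valued attribute: a list
node_id = p["id"]                # single-valued attribute: a str
title = p.get("title")           # missing attribute: None instead of KeyError
class_text = " ".join(classes)   # back to the original attribute text

print(classes, node_id, title, class_text)
```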

    (3) Getting content

      Use string to get the content:

    print(soup.p.string)
    
    Output:
    The story

    The p node here is the first p node.
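
      One caveat worth a sketch: string works above because the first p has a single child. When a node has several children, string is None, and get_text() (covered later in this article) is the usual fallback. The HTML below is a made-up fragment and html.parser is an assumption.

```python
from bs4 import BeautifulSoup

html = "<p><b>The story</b></p><p>one <a>two</a> three</p>"
soup = BeautifulSoup(html, "html.parser")
paragraphs = soup.find_all("p")

single = paragraphs[0].string     # one child chain, so the text is returned
mixed = paragraphs[1].string      # several children, so None
joined = paragraphs[1].get_text() # concatenates all text inside the node

print(single)
print(mixed)
print(joined)
```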

      ♦ Nested selection

      In the examples above, each step returns a bs4.element.Tag, so we can keep selecting nodes from the result:

    print(soup.head.title)
    print(type(soup.head.title))
    print(soup.head.title.string)

      Output:

    <title>The Beautiful Suop</title>
    <class 'bs4.element.Tag'>
    The Beautiful Suop
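
      ♦ Related-node selection

      Select a node element first, then use it as the starting point for selecting its parent, children, siblings, and so on.

    (1) Children and descendants

      Use the contents attribute to get the direct children: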

    from bs4 import BeautifulSoup
    
    html = """
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <title>The Beautiful Suop</title>
    </head>
    <body>
    <p class="story" >once upon a time there were three title sisters;and their name were
    <a href="http://example.com/elsie" class="sister" id="link1">
        <span>Elise</span>
    </a>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    """
    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.contents)

    Output:

    ['once upon a time there were three title sisters;and their name were
    ', 
    <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elise</span> </a>, ' ',
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' ',
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, ';
    and they lived at the bottom of a well.
    ']

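      The p node contains both text and nodes, so the result is returned as a list.

      The children attribute yields the same nodes, but it returns an iterator (a list_iterator in the run below) instead of a list.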

    print(soup.p.children)
    for i, child in enumerate(soup.p.children):
        print(i, child)

    Output:

    <list_iterator object at 0x0000016B477884A8>
    0 once upon a time there were three title sisters;and their name were
    
    1 <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elise</span>
    </a>
    2 
    
    3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    4 
    
    5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    6 ;
        and they lived at the bottom of a well.

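      Use the descendants attribute to get all descendants. It returns a generator, and its output includes the span node: descendants walks the children recursively and yields every descendant. Iterating over soup.p.descendants with enumerate, in the same way as the children loop above, prints: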

    <generator object descendants at 0x0000029DA472D9E8>
    0 once upon a time there were three title sisters;and their name were
    
    1 <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elise</span>
    </a>
    2 
    
    3 <span>Elise</span>
    4 Elise
    5 
    
    6 
    
    7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    8 Lacie
    9 
    
    10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    11 Tillie
    12 ;
        and they lived at the bottom of a well.
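
      The difference between children and descendants can be sketched on a smaller, made-up fragment (html.parser is an assumption here). Filtering on Tag separates element nodes from text nodes.

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<p>a<a><span>b</span></a>c</p>", "html.parser")
p = soup.p

children = list(p.children)        # direct children only
descendants = list(p.descendants)  # children, grandchildren, and so on

# Keep only element nodes, dropping the text nodes.
child_tags = [n.name for n in children if isinstance(n, Tag)]
descendant_tags = [n.name for n in descendants if isinstance(n, Tag)]

print(len(children), child_tags)
print(len(descendants), descendant_tags)
```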

    (2) Parent and ancestor nodes

      Call parent to get a node's direct parent:

    print(soup.a.parent)

      Output:

    <p class="story">once upon a time there were three title sisters;and their name were
    <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elise</span>
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
    </p>

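      Clearly, the direct parent of a is the p node, so the p node's content is printed here.

      Call parents to reach the ancestors. The result is a generator; here we print its indexes and contents as a list, and each element in the list is an ancestor of the a node.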

    print(type(soup.a.parents))
    print(list(enumerate(soup.a.parents)))

      Output:

    <class 'generator'>
    [(0, <p class="story">once upon a time there were three title sisters;and their name were
    <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elise</span>
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
    </p>), 

    (1, <body> <p class="story">once upon a time there were three title sisters;and their name were <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elise</span> </a> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well. </p> <p class="story">...</p> </body>),

    (2, <html lang="en"> <head> <meta charset="utf-8"/> <title>The Beautiful Suop</title> </head> <body> <p class="story">once upon a time there were three title sisters;and their name were <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elise</span> </a> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well. </p> <p class="story">...</p> </body></html>),

    (3, <html lang="en"> <head> <meta charset="utf-8"/> <title>The Beautiful Suop</title> </head> <body> <p class="story">once upon a time there were three title sisters;and their name were <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elise</span> </a> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well. </p> <p class="story">...</p> </body></html>)]
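
      In the output above the html content appears twice. Printing the name of each ancestor, as in the sketch below, suggests why: the last ancestor is the BeautifulSoup object itself, whose name is [document] and which prints the same markup as the html node. The document here is a made-up minimal one and html.parser is an assumption.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><a>x</a></p></body></html>", "html.parser")

# Walk from the nearest ancestor up to the document root.
names = [parent.name for parent in soup.a.parents]
print(names)
```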

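    (3) Sibling nodes

      For nodes at the same level, next_sibling and previous_sibling return the node's next and previous sibling, while next_siblings and previous_siblings return all following and all preceding siblings, respectively.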

    from bs4 import BeautifulSoup
    
    html = """
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <title>The Beautiful Suop</title>
    </head>
    <body>
    <p class="story" >once upon a time there were three title sisters;and their name were
    
    <a href="http://example.com/elsie" class="sister" id="link1">
        <span>Elise</span>
    </a>
    hello
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
    </p>
    """
    soup = BeautifulSoup(html, 'lxml')
    print("Next Sibling:", soup.a.next_sibling)
    print("Prev Sibling:", soup.a.previous_sibling)
    print("Next Siblings:", list(soup.a.next_siblings))
    print("Prev Siblings:", list(soup.a.previous_siblings))

      Output:

    Next Sibling: 
    hello
    
    Prev Sibling: once upon a time there were three title sisters;and their name were
    
    
    Next Siblings: ['
    hello
    ', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, '
    ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, ';
        and they lived at the bottom of a well.
    ']
    Prev Siblings: ['once upon a time there were three title sisters;and their name were
    
    ']
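
      As the output shows, the text between tags (including bare line breaks) counts as a sibling. When only element siblings are wanted, find_next_sibling() with a tag name skips the text nodes. A minimal sketch on a made-up fragment, assuming html.parser:

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Elsie</a>\n<a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

raw_next = first.next_sibling          # the newline text node
next_a = first.find_next_sibling("a")  # the next <a> element

print(repr(raw_next))
print(next_a["id"])
```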

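    (4) Extracting information

      On a single node you can call string, attrs, and similar attributes directly to get its text and attributes. For a generator of several nodes, convert it to a list, pick out a node, and then call string, attrs, and so on to get that node's text and attributes.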

    from bs4 import BeautifulSoup
    
    html = """
    <html lang="en">
    <body>
    <p class="story" >once upon a time there were three title sisters;and their name were
        <a href="http://example.com/elsie" class="sister" id="link1">
            <span>Elise</span>
        </a>
    </p>
    """
    soup = BeautifulSoup(html, 'lxml')
    
    print(soup.a.next_sibling.string)
    print(list(soup.a.parents)[0])
    print(list(soup.a.parents)[0].attrs['class'])

      Output:

    <p class="story">once upon a time there were three title sisters;and their name were
        <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elise</span>
    </a>
    </p>
    ['story']

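    3. Method selectors

      ♦ find_all()

      Finds all elements that match the given conditions. Pass it attributes or text and it returns the matching elements; it is very powerful.

      find_all(name, attrs, recursive, text, **kwargs)

      (1) name

      Query elements by node name: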

    from bs4 import BeautifulSoup
    
    html = """
    <div class="panel">
        <div class="panel-heading">
            <h4>hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
            
        </div>
    </div>
    """
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(name='ul'))
    print(type(soup.find_all(name='ul')[0]))

    Output:

    [
    <ul class="list" id="list-1"> <li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li> </ul>, <ul class="list list-small" id="list-2">
    <li class="element">Foo</li> <li class="element">Bar</li> </ul>
    ]
    <class 'bs4.element.Tag'>

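      Calling find_all() with name set to ul finds all ul nodes and returns a list in which every element is a bs4.element.Tag. You can continue with a nested query to find the li nodes inside each one: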

    for ul in soup.find_all(name='ul'):
        print(ul.find_all(name='li'))

      Output:

    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>]

      Iterate over each li and get its text:

    for ul in soup.find_all(name='ul'):
        print(ul.find_all(name='li'))
        for li in ul.find_all(name='li'):
            print(li.string)

      Output:

    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
    Foo
    Bar
    Jay
    [<li class="element">Foo</li>, <li class="element">Bar</li>]
    Foo
    Bar

    (2) attrs

      Query by the attributes passed in:

    print(soup.find_all(attrs={'id': 'list-1'}))

      Output:

    [<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>]

      Common attributes such as id and class can be passed directly as keyword arguments, without attrs. Because class is a Python keyword, append an underscore: class_='element'

    print(soup.find_all(id='list-1'))
    print(soup.find_all(class_='element'))

      Output:

    [<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>]
    [<li class="element">Foo</li>, 
    <li class="element">Bar</li>,
    <li class="element">Jay</li>,
    <li class="element">Foo</li>,
    <li class="element">Bar</li>]
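
      Because class is multi-valued, a class_ filter matches any element that carries the given class among its classes. The sketch below uses a trimmed-down version of the two ul nodes from the example (emptied for brevity) and assumes html.parser.

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1"></ul>
<ul class="list list-small" id="list-2"></ul>
"""
soup = BeautifulSoup(html, "html.parser")

# class_="list" matches both nodes, since each has "list" among its classes.
by_class = [ul["id"] for ul in soup.find_all(class_="list")]
small = [ul["id"] for ul in soup.find_all(class_="list-small")]

# The attrs dictionary form is equivalent to the keyword form.
by_attrs = [ul["id"] for ul in soup.find_all(attrs={"class": "list"})]

print(by_class, small, by_attrs)
```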

      (3) text

      The text argument matches against the text of nodes. It accepts either a string or a compiled regular expression:

    import re
    
    print(soup.find_all(text=re.compile('F')))

      Output:

    ['Foo', 'Foo']
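
      In Beautiful Soup 4.4.0 and later the same filter is also available under the name string; text is the older spelling. The sketch below assumes such a version is installed and uses a made-up list with html.parser. Used alone, the filter returns the matching text nodes; combined with a tag name, it returns the tags whose string matches.

```python
import re

from bs4 import BeautifulSoup

html = "<ul><li>Foo</li><li>Bar</li><li>Foo</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# Text filter alone: returns text nodes.
strings = soup.find_all(string=re.compile("F"))

# Tag name plus text filter: returns the tags themselves.
tags = soup.find_all("li", string="Bar")

print(strings)
print(tags)
```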

      ♦ The find() method

      find() returns a single element: the first match.

    print(soup.find(name='ul'))
    print(soup.find(class_='list'))

      Output:

    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>

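      There are many similar methods:

      find_parent(): returns the direct parent

      find_parents(): returns all ancestors

      find_next_sibling(): returns the first following sibling

      find_next_siblings(): returns all following siblings

      find_previous_sibling(): returns the first preceding sibling

      find_previous_siblings(): returns all preceding siblings

      find_next(): returns the first matching node after the current node

      find_all_next(): returns all matching nodes after the current node

      find_previous(): returns the first matching node before the current node

      find_all_previous(): returns all matching nodes before the current node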
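
      A few of these methods can be sketched on a simplified version of the panel document (class attributes on ul and li are omitted for brevity, and html.parser is an assumption). Each call starts from the "Bar" item of the first list.

```python
from bs4 import BeautifulSoup

html = """
<div class="panel">
  <div class="panel-body">
    <ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>
    <ul id="list-2"><li>Foo</li><li>Bar</li></ul>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
bar = soup.find("li", string="Bar")  # first "Bar" item

parent_id = bar.find_parent("ul")["id"]                        # enclosing list
ancestors = [d["class"][0] for d in bar.find_parents("div")]   # nearest first
next_li = bar.find_next_sibling("li").string                   # following sibling
prev_li = bar.find_previous_sibling("li").string               # preceding sibling
later = [li.string for li in bar.find_all_next("li")]          # all later <li>
next_ul = bar.find_next("ul")["id"]                            # first later <ul>

print(parent_id, ancestors, next_li, prev_li, later, next_ul)
```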

    4. CSS selectors

      To use CSS selectors, call the select() method and pass in the corresponding CSS selector:

    print(soup.select('.panel .panel-heading'))
    print(soup.select('ul li'))
    print(soup.select('#list-2 .element'))

      Output:

    [<div class="panel-heading">
    <h4>hello</h4>
    </div>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>]
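
      select() always returns a list. When only the first match is needed, select_one() returns a single element or None. The sketch below also combines two class selectors with a child combinator; the document is a made-up variant of the lists above and html.parser is an assumption.

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Baz</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.select_one("li.element")            # first match only
small = soup.select("ul.list.list-small > li")   # direct li children of the small list
nothing = soup.select_one("table")               # no match, so None

print(first.string)
print([li.string for li in small])
print(nothing)
```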

      ♦ Nested selection

      Iterate over each ul node and select the li nodes inside it:

    for ul in soup.select('ul'):
        print(ul.select('li'))

      Output:

    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>]

      ♦ Getting attributes

    for ul in soup.select('ul'):
        print(ul['id'])
        print(ul.attrs['id'])

      Output:

    list-1
    list-1
    list-2
    list-2

      ♦ Getting text

      Besides string, you can also use get_text() to get the text:

    # Get the text
    for li in soup.select('li'):
        print(li.get_text())
        print(li.string)

      Output:

    Foo
    Foo
    Bar
    Bar
    Jay
    Jay
    Foo
    Foo
    Bar
    Bar
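
      For a node with nested tags and surrounding whitespace, get_text() accepts a separator and a strip flag. A minimal sketch on a made-up list item, assuming html.parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<li> Foo <b>Bar</b> </li>", "html.parser")
li = soup.li

raw = li.get_text()                   # all text, whitespace kept
compact = li.get_text(strip=True)     # each piece stripped, joined with ""
spaced = li.get_text(" ", strip=True) # each piece stripped, joined with " "

print(repr(raw))
print(repr(compact))
print(repr(spaced))
```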

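      The lxml parser is recommended.

      Node selectors are limited in functionality but fast.

      Use find() and find_all() to match a single element or several.

      If you are familiar with CSS, use select() for matching.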

  • Original article: https://www.cnblogs.com/zivli/p/10845856.html