  • BeautifulSoup Basic Usage

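    BeautifulSoup is a Python library for parsing HTML and XML that makes it easy to extract data from web pages. (These are study notes on Cui Qingcai's web-scraping book.)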

    I. Installation

    # Install beautifulsoup4
    pip install beautifulsoup4
    
    # Install lxml
    pip install lxml

    II. Basic Syntax

    1. Node Selectors: Basic Usage

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """

    Suppose we want to get the title node and its text from the HTML above. The syntax is as follows:

    Import and initialize BeautifulSoup:

    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html, 'lxml')

    During initialization, BeautifulSoup automatically corrects non-standard HTML, for example by completing missing tags.
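
    The repair step can be observed on a tiny input. This is a minimal sketch, not the article's setup: it assumes the built-in html.parser (chosen only so that it runs without lxml), and the markup and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Deliberately malformed markup: neither <b> nor <p> is closed.
broken = "<p class='intro'>Hello <b>world"

# html.parser ships with Python; it is an assumption made for this sketch,
# the article itself uses the lxml parser.
soup = BeautifulSoup(broken, "html.parser")

repaired = str(soup)          # serialize the parsed tree back to markup
print(repaired)               # the closing tags are now present
print(soup.b.string)          # text inside the repaired <b> node
```

    Different parsers may repair the same broken input in slightly different ways, so it is worth checking the result with the parser you actually use.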

    Get the title node and check its type:

    print(soup.title)
    print(type(soup.title))
    
    
    # Output
    <title>The Dormouse's story</title>
    <class 'bs4.element.Tag'>

    The title node we get back is the node together with its text content.

    Get the text content of the title node:

    print(soup.title.string)
    
    
    # Output
    The Dormouse's story

    Other information, such as the node's name and its attributes, is just as easy to get.

    Get the name of the title node:

    print(soup.title.name)
    
    
    # Output
    title

    Get all attributes, or a single attribute, of the p node

    The p node has several attributes, such as class and name. Call attrs to get all of them:

    # Get all attributes
    print(soup.p.attrs)
    
    # Output:
    {'class': ['title'], 'name': 'dromouse'}
    
    
    # Get a single attribute: method 1
    print(soup.p.attrs['name'])
    
    # Output:
    dromouse
    
    
    # Get a single attribute: method 2
    print(soup.p['name'])
    
    # Output:
    dromouse
    
    
    # A caveat when getting a single attribute
    print(soup.p['class'])
    
    # Output:
    ['title']

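    Note that some attributes return a string while others return a list of strings. The value of the name attribute is unique, so the result is a single string. A node can carry several classes, so class returns a list. Also note that soup.p selects only the first p node.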
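
    The string-versus-list distinction above can be sketched as follows. The markup and variable names are hypothetical, and html.parser is used only so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

snippet = '<p class="title lead" name="dromouse" id="intro">Hi</p>'
soup = BeautifulSoup(snippet, "html.parser")  # parser choice is an assumption
p = soup.p

classes = p["class"]                 # multi-valued attribute: list of strings
name = p["name"]                     # ordinary attribute: plain string
missing = p.get("data-x")            # .get() returns None instead of raising KeyError
class_string = " ".join(p.get("class", []))  # rebuild the original attribute text

print(classes, name, missing, class_string)
```

    Using get() with a default is a convenient way to avoid a KeyError when an attribute may be absent.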

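    Nested (Hierarchical) Selection

    When nodes are nested, you can select them step by step along the hierarchy. For example, to select the title node and its contents we previously used soup.title; we can also write soup.head.title: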

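    from bs4 import BeautifulSoup

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    """
    # Re-parse so that soup reflects the html defined above
    soup = BeautifulSoup(html, 'lxml')
    print(soup.head.title)
    print(type(soup.head.title))
    print(soup.head.title.string)
    
    
    # Output: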
    <title>The Dormouse's story</title>
    <class 'bs4.element.Tag'>
    The Dormouse's story
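
    Chaining works because every step returns another Tag. The sketch below also shows what happens when a step finds nothing; the document string and variable names are assumptions made for illustration, and html.parser stands in for lxml.

```python
from bs4 import BeautifulSoup

doc = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
soup = BeautifulSoup(doc, "html.parser")  # parser choice is an assumption

head = soup.head            # a Tag
title = head.title          # attribute access on a Tag returns a Tag again
missing = soup.head.h1      # there is no <h1> inside <head>, so this is None

# Guard before chaining further: None has no .string attribute.
heading_text = missing.string if missing is not None else None

print(type(head).__name__, title.string, heading_text)
```

    Checking for None at each uncertain step keeps a long chain from failing with an AttributeError.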

    2. Node Selectors: Advanced Usage

    Parent and Ancestor Nodes

    To get the parent of a node element, use the parent attribute:

    html = """
    <html>
    <head>
    <title>The Dormouse's story</title>
    </head>
    <body>
    <p class="story">
                Once upon a time there were three little sisters; and their names were 
    <a href="http://example.com/elsie" class="sister" id="link1">
    <span>Elsie</span>
    </a>
    </p>
    <p class="story">...</p>
    """
    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html, 'lxml')
    print(soup.a.parent)
    
    
    # Output:
    <p class="story">
                Once upon a time there were three little sisters; and their names were 
    <a href="http://example.com/elsie" class="sister" id="link1">
    <span>Elsie</span>
    </a>
    </p>

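    Here we selected the parent of the first a node. Its parent is clearly the p node, so the output is the p node and everything inside it.

    To get all ancestor elements, use the parents attribute: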

    html = """
    <html>
    <body>
    <p class="story">
    <a href="http://example.com/elsie" class="sister" id="link1">
    <span>Elsie</span>
    </a>
    </p>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(type(soup.a.parents))
    print(list(enumerate(soup.a.parents)))
    
    
    # Output:
    <class 'generator'>
    [(0, <p class="story">
    <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    </p>), (1, <body>
    <p class="story">
    <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    </p>
    </body>), (2, <html>
    <body>
    <p class="story">
    <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    </p>
    </body></html>), (3, <html>
    <body>
    <p class="story">
    <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    </p>
    </body></html>)]

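    Why do two blocks starting with html appear here? Because parents is traversed in the order p, body, html, [document]. The last item is the BeautifulSoup object that represents the whole document, and it prints the same markup as the html node.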
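
    Printing only the node names makes the traversal order easier to see. A small sketch under the same assumptions as before (hypothetical markup, html.parser instead of lxml):

```python
from bs4 import BeautifulSoup

doc = '<html><body><p class="story"><a id="link1"><span>Elsie</span></a></p></body></html>'
soup = BeautifulSoup(doc, "html.parser")  # parser choice is an assumption

# .parents is a generator; collect the name of every ancestor of the first <a>.
ancestor_names = [node.name for node in soup.a.parents]
print(ancestor_names)

# The last ancestor is the BeautifulSoup object itself.
top = list(soup.a.parents)[-1]
print(top is soup)
```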

    Child and Descendant Nodes

    After selecting a node element, use the contents attribute to get its direct children:

    html = """
    <html>
    <head>
    <title>The Dormouse's story</title>
    </head>
    <body>
    <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elise" class="sister" id="link1">
    <span>Elise</span>
    </a>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
    and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    """

    The result, shown in the output below, is a list. The p node contains both text and element nodes.

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.contents)
    
    
    # Output:
    ['
        Once upon a time there were three little sisters; and their names were
        ', <a class="sister" href="http://example.com/elise" id="link1">
    <span>Elise</span>
    </a>, '
    ', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, '
    and
    ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '
    and they lived at the bottom of a well.
    ']

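    The span node is a grandchild of the p node, so it is not listed on its own; it appears inside the a node. This shows that contents returns a list of direct children only.

    The children attribute gives the same result: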

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.children)
    for i, child in enumerate(soup.p.children):
        print(i, child)
    
    
    # Output:
    <list_iterator object at 0x000000000303F7B8>
    0 
        Once upon a time there were three little sisters; and their names were
        
    1 <a class="sister" href="http://example.com/elise" id="link1">
    <span>Elise</span>
    </a>
    2 
    
    3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    4 
    and
    
    5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    6 
    and they lived at the bottom of a well.

    To get all descendant nodes as well, use the descendants attribute:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.descendants)
    for i, child in enumerate(soup.p.descendants):
        print(i, child)
    
    
    # Output:
    <generator object Tag.descendants at 0x000000000301F228>
    0 
        Once upon a time there were three little sisters; and their names were
        
    1 <a class="sister" href="http://example.com/elise" id="link1">
    <span>Elise</span>
    </a>
    2 
    
    3 <span>Elise</span>
    4 Elise
    5 
    
    6 
    
    7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    8 Lacie
    9 
    and
    
    10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    11 Tillie
    12 
    and they lived at the bottom of a well.

    The iteration shows that this time the output includes the span node: descendants walks the children recursively and yields every descendant.
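
    In practice the whitespace-only text nodes are often noise. One way to keep only element nodes is to filter on the Tag type, as in this sketch (the markup is a shortened, hypothetical version of the example, parsed with html.parser as an assumption):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

doc = '<p>Once <a id="link1"><span>Elsie</span></a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(doc, "html.parser")  # parser choice is an assumption
p = soup.p

direct = p.contents                                          # text and element children
child_tags = [c.name for c in p.children if isinstance(c, Tag)]
descendant_tags = [d.name for d in p.descendants if isinstance(d, Tag)]

print(len(direct), child_tags, descendant_tags)
```

    The span only shows up in descendant_tags, which mirrors the difference between children and descendants described above.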

    Sibling Nodes

    What about getting sibling nodes?

    html = """
    <html>
    <body>
    <p class="story">
                Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">
    <span>Elsie</span>
    </a>
                Hello
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
                and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
                and they lived at the bottom of a well.
    </p>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print('Next Sibling', soup.a.next_sibling)
    print('Prev Sibling', soup.a.previous_sibling)
    print('Next Siblings', list(enumerate(soup.a.next_siblings)))
    print('Prev Siblings', list(enumerate(soup.a.previous_siblings)))
    
    
    # Output:
    Next Sibling 
                Hello
    
    Prev Sibling 
                Once upon a time there were three little sisters; and their names were
    
    Next Siblings [(0, '
                Hello
    '), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, '
                and
    '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '
                and they lived at the bottom of a well.
    ')]
    Prev Siblings [(0, '
                Once upon a time there were three little sisters; and their names were
    ')]

    next_sibling and previous_sibling return the next and the previous sibling of a node, while next_siblings and previous_siblings return all following and all preceding siblings respectively.
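
    As the output above shows, the nearest sibling is frequently a text node rather than an element. The sketch below, with hypothetical markup and html.parser assumed, keeps only the element siblings:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

doc = '<p><a id="link1">Elsie</a> Hello <a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(doc, "html.parser")  # parser choice is an assumption
first = soup.a

next_node = first.next_sibling            # the text node between the first two links
previous_node = first.previous_sibling    # nothing precedes the first child

# Skip text nodes and keep only the following sibling elements.
tag_sibling_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]

print(repr(next_node), previous_node, tag_sibling_ids)
```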

    3. Method Selectors

    find_all(): returns all elements that match the given conditions

    html = '''
    <div class="panel">
    <div class="panel-heading">
    <h4>Hello</h4>
    </div>
    <div class="panel-body">
    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>
    </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(name='ul'))
    print(type(soup.find_all(name='ul')[0]))
    
    
    
    # Output:
    [<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>, <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>]
    <class 'bs4.element.Tag'>

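    Here find_all queries the ul nodes. The result is a list of length 2, and each element is of type bs4.element.Tag.

    Queries can also be nested, for example to get the text of each li node: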

    for ul in soup.find_all(name='ul'):
        print(ul.find_all(name='li'))
        for li in ul.find_all(name='li'):
            print(li.string)
    
    
    # Output:
    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
    Foo
    Bar
    Jay
    [<li class="element">Foo</li>, <li class="element">Bar</li>]
    Foo
    Bar

    Besides querying by node name, you can also query by attributes:

    html = '''
    <div class="panel">
    <div class="panel-heading">
    <h4>Hello</h4>
    </div>
    <div class="panel-body">
    <ul class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>
    </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(attrs={'id': 'list-1'}))
    print(soup.find_all(attrs={'name': 'elements'}))
    
    
    # Output:
    [<ul class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>]
    [<ul class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>]

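    For common attributes such as id and class, there is no need to pass attrs. For example, to find the node whose id is list-1, pass the id argument directly. Because class is a reserved word in Python, the keyword argument is written class_. Using the same markup as above, the query becomes: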

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(id='list-1'))
    print(soup.find_all(class_='element'))
    
    
    # Output:
    [<ul class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
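
    Two details of find_all are easy to miss: class_ matches a node if any one of its classes equals the given value, and limit caps the number of results. A minimal sketch with hypothetical markup, parsed with html.parser as an assumption:

```python
from bs4 import BeautifulSoup

doc = """
<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Jay</li></ul>
"""
soup = BeautifulSoup(doc, "html.parser")  # parser choice is an assumption

# "list" is one of the classes on both <ul> nodes, so both match.
list_ids = [ul["id"] for ul in soup.find_all(class_="list")]

# Name and class filters can be combined.
small_ids = [ul["id"] for ul in soup.find_all("ul", class_="list-small")]

# limit stops the search after the given number of matches.
first_two = [li.string for li in soup.find_all("li", limit=2)]

print(list_ids, small_ids, first_two)
```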

    The text parameter matches the text of nodes. It accepts either a string or a compiled regular expression object:

    html = '''
    <div class="panel">
    <div class="panel-body">
    <a>Hello, this is a link</a>
    <a>Hello, this is a link, too</a>
    </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    import re
    
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(text=re.compile('link')))
    
    
    # Output:
    ['Hello, this is a link', 'Hello, this is a link, too']
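
    In current Beautiful Soup releases the same filter is spelled string=, and to my knowledge text= is kept only as an older alias. The sketch below uses string= and also combines it with a tag name, in which case tags rather than strings are returned. The markup is hypothetical and html.parser is an assumption.

```python
import re
from bs4 import BeautifulSoup

doc = "<div><a>Hello, this is a link</a><a>Hello, this is a link, too</a><a>Bye</a></div>"
soup = BeautifulSoup(doc, "html.parser")  # parser choice is an assumption

texts = soup.find_all(string=re.compile("link"))       # matching text nodes
exact = soup.find_all(string="Bye")                    # a plain string must match exactly
tags = soup.find_all("a", string=re.compile("too$"))   # <a> tags whose text matches

print(texts, exact, tags)
```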

    find(): returns a single element, namely the first match

    html = '''
    <div class="panel">
    <div class="panel-heading">
    <h4>Hello</h4>
    </div>
    <div class="panel-body">
    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>
    </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find(name='ul'))
    print(type(soup.find(name='ul')))
    print(soup.find(class_='list'))
    
    
    # Output:
    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <class 'bs4.element.Tag'>
    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
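
    When nothing matches, find() returns None while find_all() returns an empty list, so the two need different checks. A short sketch with hypothetical markup and names, using html.parser as an assumption:

```python
from bs4 import BeautifulSoup

doc = '<ul class="list" id="list-1"><li class="element">Foo</li></ul>'
soup = BeautifulSoup(doc, "html.parser")  # parser choice is an assumption

found = soup.find("ul")              # first matching Tag
absent = soup.find("table")          # no match: None
all_absent = soup.find_all("table")  # no match: empty list

# Guard the None case before indexing into the result.
table_id = absent["id"] if absent is not None else "no table"

print(found["id"], absent, all_absent, table_id)
```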

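    Other Query Methods

    find_parents() and find_parent(): the former returns all ancestor nodes, the latter returns the direct parent node.

    find_next_siblings() and find_next_sibling(): the former returns all following siblings, the latter returns the first following sibling.

    find_previous_siblings() and find_previous_sibling(): the former returns all preceding siblings, the latter returns the first preceding sibling.

    find_all_next() and find_next(): the former returns all matching nodes after the current node, the latter returns the first matching node.

    find_all_previous() and find_previous(): the former returns all matching nodes before the current node, the latter returns the first matching node.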
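
    The methods listed above take the same filters as find_all(). The following sketch exercises several of them from one starting node; the markup and variable names are made up for illustration, and html.parser is an assumption.

```python
from bs4 import BeautifulSoup

doc = """
<div class="panel">
<ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>
<ul id="list-2"><li>Foo</li><li>Bar</li></ul>
</div>
"""
soup = BeautifulSoup(doc, "html.parser")  # parser choice is an assumption
bar = soup.find("li", string="Bar")       # first <li> whose text is "Bar"

parent_id = bar.find_parent("ul")["id"]                      # nearest matching ancestor
ancestor_divs = [d["class"] for d in bar.find_parents("div")]
next_text = bar.find_next_sibling("li").string
prev_text = bar.find_previous_sibling("li").string
later_items = [li.string for li in bar.find_all_next("li")]        # document order, after bar
earlier_items = [li.string for li in bar.find_all_previous("li")]  # before bar

print(parent_id, ancestor_divs, next_text, prev_text, later_items, earlier_items)
```

    Note that find_all_next() is not limited to siblings: it continues through the rest of the document, which is why items from the second list appear in later_items.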

    4. CSS Selectors

    To use CSS selectors, call the select() method and pass in the CSS selector:

    html = '''
    <div class="panel">
    <div class="panel-heading">
    <h4>Hello</h4>
    </div>
    <div class="panel-body">
    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>
    </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.select('.panel .panel-heading'))
    print(soup.select('ul li'))
    print(soup.select('#list-2 .element'))
    print(type(soup.select('ul')[0]))
    
    
    # Output:
    [<div class="panel-heading">
    <h4>Hello</h4>
    </div>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>]
    <class 'bs4.element.Tag'>
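
    A few more selector forms can be sketched in the same way. select_one() returns only the first match (or None). The markup below is hypothetical, html.parser is an assumption, and CSS support relies on the Soup Sieve package that is installed together with recent beautifulsoup4 releases.

```python
from bs4 import BeautifulSoup

doc = """
<div class="panel-body">
<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Jay</li></ul>
</div>
"""
soup = BeautifulSoup(doc, "html.parser")  # parser choice is an assumption

first_ul = soup.select_one("ul")                                   # first match only
small_items = [li.get_text() for li in soup.select("ul.list-small > li")]  # child combinator
by_attr = [ul["id"] for ul in soup.select('ul[id^="list-"]')]      # attribute prefix selector
nothing = soup.select_one("table")                                 # no match: None

print(first_ul["id"], small_items, by_attr, nothing)
```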

    Nested Selection

    The select() method also supports nested selection. For example, first select all ul nodes, then iterate over each of them:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    for ul in soup.select('ul'):
        print(ul.select('li'))
    
    
    # Output:
    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>]

    As you can see, this prints, for each ul node, the list of all li nodes under it.

    Getting Attributes

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    for ul in soup.select('ul'):
        print(ul['id'])
        print(ul.attrs['id'])
    
    
    # Output:
    list-1
    list-1
    list-2
    list-2

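    As you can see, both indexing with square brackets and the attribute name, and going through the attrs attribute, return the attribute value.

    Getting Text

    To get the text, use the string attribute described earlier or the get_text() method: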

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    for li in soup.select('li'):
        print('Get Text:', li.get_text())
        print('String:', li.string)
    
    
    # Output:
    Get Text: Foo
    String: Foo
    Get Text: Bar
    String: Bar
    Get Text: Jay
    String: Jay
    Get Text: Foo
    String: Foo
    Get Text: Bar
    String: Bar
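
    The two give the same result above only because each li holds a single text node. When a node has several children, string is None while get_text() still concatenates all the text. A small sketch with hypothetical markup, using html.parser as an assumption:

```python
from bs4 import BeautifulSoup

doc = '<ul><li class="element">Foo</li><li class="element">Bar <b>bold</b> </li></ul>'
soup = BeautifulSoup(doc, "html.parser")  # parser choice is an assumption
simple, mixed = soup.select("li")

simple_string = simple.string                 # single text child: its text
mixed_string = mixed.string                   # several children: None
mixed_text = mixed.get_text()                 # all text, concatenated as-is
mixed_clean = mixed.get_text(" ", strip=True) # strip each piece, join with a space

print(simple_string, mixed_string, repr(mixed_text), mixed_clean)
```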
  • Original article: https://www.cnblogs.com/my_captain/p/11068532.html