  • Python Web Scraping: The BeautifulSoup Library

    1. BeautifulSoup

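    1.1 Parsers

    1) Python standard library

    # Usage
    BeautifulSoup(markup, "html.parser")
    
    # Advantages
    Built into Python, moderate speed, tolerant of malformed documents
    
    # Disadvantages
    Poor tolerance of malformed documents in versions before Python 2.7.3 or Python 3.2.2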

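    2) lxml HTML parser

    • Recommended for the vast majority of scenarios
    # Usage
    BeautifulSoup(markup, "lxml")
    
    # Advantages
    Fast, tolerant of malformed documents
    
    # Disadvantages
    Requires the lxml package, which depends on C libraries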

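    3) lxml XML parser

    # Usage
    BeautifulSoup(markup, "xml")
    
    # Advantages
    Fast; the only parser listed here that supports XML
    
    # Disadvantages
    Requires the lxml package, which depends on C libraries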

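    4) html5lib

    # Usage
    BeautifulSoup(markup, "html5lib")
    
    # Advantages
    Best tolerance of malformed documents; parses the way a browser does and produces valid HTML5; pure Python, so no C extension is needed
    
    # Disadvantages
    Slow; the html5lib package must be installed separately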
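The parser comparison above can be turned into a small run-time fallback. This is a minimal sketch, assuming lxml may or may not be installed; the helper name make_soup and the preference order are illustrative and not part of Beautiful Soup. It relies on bs4 raising FeatureNotFound when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, parsers=("lxml", "html.parser")):
    """Hypothetical helper: try each parser in order and return the first soup that builds."""
    for name in parsers:
        try:
            return BeautifulSoup(markup, name)
        except FeatureNotFound:
            continue  # this parser is not installed, try the next one
    raise RuntimeError("no usable parser found")

soup = make_soup("<p>Hello<b>world")  # the unclosed tags are repaired by the parser
print(soup.p.get_text())
```

Because html.parser ships with Python, the last entry in the tuple should always be available.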

    1.2 Basic usage

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml') # use the lxml parser
    print(soup.prettify())    # pretty-print the tree; the parser has already added missing tags and repaired malformed markup
    print(soup.title.string)  # select the title tag and print its text

    2. Tag selectors

    2.1 Selecting elements

    A tag can be selected directly with  .tag_name

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.title)        # select the title tag; prints: <title>The Dormouse's story</title>
    print(type(soup.title))  # type: <class 'bs4.element.Tag'>
    print(soup.head)
    print(soup.p) # when several tags match, only the first one is returned

    2.2 Getting the tag name

    Get the name of a tag, for example whether it is a p tag or an a tag

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.title.name) # get the tag name

    2.3 Getting attributes

    The value of a tag's name attribute can be read with either tag.attrs["name"] or tag["name"]

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.attrs['name'])   # value of the name attribute of the first p tag
    print(soup.p['name'])         # shorthand for the same lookup
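Both lookups above raise KeyError when the attribute is absent. The following sketch, which uses a made-up one-line document and the built-in html.parser, shows the non-raising tag.get() alternative and the fact that class comes back as a list because it is a multi-valued attribute.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title big" name="dromouse">text</p>', "html.parser")
p = soup.p

name_value = p["name"]   # raises KeyError if the attribute is missing
missing = p.get("id")    # returns None instead of raising
classes = p["class"]     # multi-valued attribute, returned as a list

print(name_value, missing, classes)
```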

    2.4 Getting the content

    The text inside a tag can be read with  tag.string

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.string)  # text content of the first p tag (the string only): The Dormouse's story
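One caveat worth illustrating: .string only yields a value when the tag has a single child (or a single chain of children) ending in one string. The sketch below uses a small made-up document to show when .string is None and how get_text() behaves in that case.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b></p><p>two <b>three</b></p>", "html.parser")
first, second = soup.find_all("p")

single = first.string       # 'one': p -> b -> one string
mixed = second.string       # None: p has two children, so there is no single string
joined = second.get_text()  # 'two three': concatenates every string inside the tag

print(single, mixed, joined)
```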

    2.5 Nested selection

    Selections can be chained with the dot  .  operator

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.head.title.string)  # text of the title tag inside head

    2.6 Children and descendants

    1) Children

    • tag.contents returns all direct children of a tag as a list
    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">
                    <span>Elsie</span>
                </a>
                <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
                and
                <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
            <p class="story">...</p>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.contents)  # all direct children of the first p tag, as a list
    • tag.children returns the same direct children as an iterator
    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">
                    <span>Elsie</span>
                </a>
                <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
                and
                <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
            <p class="story">...</p>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.children)  # an iterator over the direct children of the first p tag
    for i, child in enumerate(soup.p.children):
        print(i, child)
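Children include text nodes (often just whitespace) as well as tags. A minimal sketch, on a shortened made-up paragraph, of keeping only the Tag children:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<p>Once <a>Elsie</a> and <a>Lacie</a></p>", "html.parser")

all_children = soup.p.contents  # two text nodes and two <a> tags
tag_children = [c for c in soup.p.children if isinstance(c, Tag)]
names = [t.get_text() for t in tag_children]

print(len(all_children), names)
```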

    2) Descendants

    • tag.descendants returns all descendants of a tag (children, grandchildren, and so on) as an iterator
    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">
                    <span>Elsie</span>
                </a>
                <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
                and
                <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
            <p class="story">...</p>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.descendants)   # an iterator over all descendants of the first p tag
    for i, child in enumerate(soup.p.descendants):
        print(i, child)
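The difference between children and descendants is easiest to see on a deeply nested fragment. The sketch below uses a stripped-down version of the Elsie link; the fragment is illustrative only.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><a><span>Elsie</span></a></p>", "html.parser")

children = list(soup.p.children)        # only the direct child: <a>
descendants = list(soup.p.descendants)  # <a>, then <span>, then the text 'Elsie'

print(len(children), len(descendants))
```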

    2.7 Parent and ancestors

    1) Parent

    • tag.parent returns the parent node of a tag
    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">
                    <span>Elsie</span>
                </a>
                <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
                and
                <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
            <p class="story">...</p>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.a.parent)  # parent node of the first a tag

    2) Ancestors

    • tag.parents returns all ancestors of a tag, from the nearest outward
    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">
                    <span>Elsie</span>
                </a>
                <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
                and
                <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
            <p class="story">...</p>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(list(enumerate(soup.a.parents)))   # all ancestors of the first a tag
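Printing every ancestor dumps most of the document several times. A shorter way to see the chain is to print only the tag names, as in this sketch on a compact made-up document:

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a>Elsie</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

parent_name = soup.a.parent.name
ancestor_names = [node.name for node in soup.a.parents]

print(parent_name)     # p
print(ancestor_names)  # the last entry is the BeautifulSoup object itself
```

The final ancestor is the BeautifulSoup object, whose name is the special string '[document]'.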

    2.8 Siblings

    • tag.next_siblings returns all siblings that come after a tag
    • tag.previous_siblings returns all siblings that come before a tag
    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">
                    <span>Elsie</span>
                </a>
                <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
                and
                <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
            <p class="story">...</p>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(list(enumerate(soup.a.next_siblings)))     # all siblings after the first a tag
    print(list(enumerate(soup.a.previous_siblings))) # all siblings before the first a tag
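Siblings include text nodes, so the node right after a tag is often a string rather than the next tag. A small sketch with a made-up paragraph:

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

nxt = first.next_sibling                    # the text node ' and ', not the second <a>
following = list(first.next_siblings)       # [' and ', <a id="link2">...]
preceding = list(first.previous_siblings)   # empty: nothing comes before the first <a>

print(repr(nxt), len(following), preceding)
```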

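    3. Standard selectors

    3.1 find_all()

    • Syntax: find_all(name, attrs, recursive, text, **kwargs); Beautiful Soup 4.4.0 and later also accept the text argument under the name string

    1) name

    • Select tags by tag name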
    html='''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup1 = BeautifulSoup(html, 'lxml')
    print(soup1.find_all('ul'))  # find every match and return the results as a list
    print(type(soup1.find_all('ul')[0]))
    
    soup2 = BeautifulSoup(html, 'lxml')
    for ul in soup2.find_all('ul'):
        print(ul.find_all('li'))  # each result is a Tag, so it can be searched again

    2) attrs

    • Select tags by their attributes
    html='''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1" name="elements">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(attrs={'id': 'list-1'}))    # every tag whose id attribute is list-1
    print(soup.find_all(attrs={'name': 'elements'}))
    
    soup2 = BeautifulSoup(html, 'lxml')
    print(soup2.find_all(id='list-1'))      # same as the attrs form, but passed as a keyword argument instead of a dict
    print(soup2.find_all(class_='element')) # when the attribute name clashes with a Python keyword, append an underscore, e.g. class_
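Two details of attribute filtering are easy to miss: class_ matches any one of a tag's classes, and attribute names that are not valid Python identifiers have to go through the attrs dict. The sketch below assumes a shortened document; the data-kind attribute is made up for illustration.

```python
from bs4 import BeautifulSoup

html = (
    '<ul class="list" id="list-1" data-kind="main"></ul>'
    '<ul class="list list-small" id="list-2"></ul>'
)
soup = BeautifulSoup(html, "html.parser")

by_class = [ul["id"] for ul in soup.find_all(class_="list")]              # both lists match
by_small = [ul["id"] for ul in soup.find_all("ul", class_="list-small")]  # only the second
by_data = [ul["id"] for ul in soup.find_all(attrs={"data-kind": "main"})] # hyphenated name needs attrs

print(by_class, by_small, by_data)
```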

    3) text

    • Select by text content
    html='''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(text='Foo'))   # match by text: returns the text nodes equal to 'Foo', not the tags that contain them
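A sketch of the text filter under its newer name string (available since Beautiful Soup 4.4.0), assuming a shortened list. It shows an exact match, a regular-expression match, and how to get the enclosing tags back by combining a tag name with the filter.

```python
import re
from bs4 import BeautifulSoup

html = "<ul><li>Foo</li><li>Bar</li><li>Food</li></ul>"
soup = BeautifulSoup(html, "html.parser")

exact = [str(s) for s in soup.find_all(string="Foo")]                # exact match only
partial = [str(s) for s in soup.find_all(string=re.compile("Foo"))]  # regex search
tags = soup.find_all("li", string="Foo")                             # the <li> tags themselves

print(exact, partial, tags)
```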

    3.2 find()

    • find() returns a single element (the first match); find_all() returns all matching elements
    html='''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find('ul'))   # the first ul tag
    print(type(soup.find('ul')))
    print(soup.find('page')) # no such tag, so this prints None
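Since find() returns None when nothing matches, chaining a call directly onto the result raises AttributeError. A minimal guard, sketched on a tiny made-up document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>Foo</li></ul>", "html.parser")

first_li = soup.find("li")
missing = soup.find("page")  # None: the document has no <page> tag

# Check for None before using the result.
text = missing.get_text() if missing is not None else "n/a"

print(first_li.get_text(), text)
```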

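    3.3 find_parents() and find_parent()

    find_parents() returns all matching ancestors; find_parent() returns the nearest matching ancestor (the direct parent when no filter is given).

    3.4 find_next_siblings() and find_next_sibling()

    find_next_siblings() returns all matching siblings after the node; find_next_sibling() returns the first one.

    3.5 find_previous_siblings() and find_previous_sibling()

    find_previous_siblings() returns all matching siblings before the node; find_previous_sibling() returns the first one.

    3.6 find_all_next() and find_next()

    find_all_next() returns all matching nodes after the node; find_next() returns the first one.

    3.7 find_all_previous() and find_previous()

    find_all_previous() returns all matching nodes before the node; find_previous() returns the first one.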
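The methods in 3.3 to 3.7 can be sketched together on one small document. The markup and the id value box are illustrative; the search starts from the middle list item.

```python
from bs4 import BeautifulSoup

html = '<div id="box"><ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul></div>'
soup = BeautifulSoup(html, "html.parser")
bar = soup.find("li", string="Bar")  # start from the middle <li>

parent_id = bar.find_parent("div")["id"]                       # nearest <div> ancestor
ancestors = [t.name for t in bar.find_parents(["ul", "div"])]  # nearest first
next_li = bar.find_next_sibling("li").string
prev_li = bar.find_previous_sibling("li").string
after = [li.string for li in bar.find_all_next("li")]          # matches later in the document
before = [li.string for li in bar.find_all_previous("li")]     # matches earlier in the document

print(parent_id, ancestors, next_li, prev_li, after, before)
```

The sibling methods only look at nodes with the same parent, while find_all_next() and find_all_previous() walk the whole document in parse order.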

    4. CSS selectors

    4.1 Basic use of CSS selectors

    Pass a CSS selector straight to select() to make a selection


    html='''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.select('.panel .panel-heading'))  # class selector (.xxx); the space also makes it a descendant selector
    print(soup.select('ul li'))                  # tag selector: li tags inside ul tags
    print(soup.select('#list-2 .element'))       # id selector (#xxx)
    print(type(soup.select('ul')[0]))
    
    soup2 = BeautifulSoup(html, 'lxml')
    for ul in soup2.select('ul'):
        print(ul.select('li'))  # each result is a Tag, so select() can be called on it again
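A few more selector forms, sketched on a shortened version of the lists above. select_one() returns the first match or None, in the same way find() does; the selectors are standard CSS and the document is made up for illustration.

```python
from bs4 import BeautifulSoup

html = (
    '<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>'
    '<ul class="list list-small" id="list-2"><li class="element">Jay</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")

first_list = soup.select_one("ul.list")["id"]                       # first match only
small_items = [li.string for li in soup.select("ul.list-small > li")]  # child combinator
by_attr = [li.string for li in soup.select('ul[id="list-1"] li')]   # attribute selector
nothing = soup.select_one("table")                                  # no match

print(first_list, small_items, by_attr, nothing)
```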

    4.2 Getting attributes

    • TAG['id']
    • TAG.attrs['id']
    html='''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    for ul in soup.select('ul'):
        print(ul['id'])         # value of the id attribute of the ul tag
        print(ul.attrs['id'])   # the two forms are equivalent

    4.3 Getting the content

    • TAG.get_text()
    html='''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    for li in soup.select('li'):
        print(li.get_text())   # text inside the tag

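    5. Summary

    1. Prefer the lxml parser; fall back to html.parser when necessary
    2. Tag selectors (.tag_name) are fast but offer only weak filtering
    3. Use find() and find_all() to match a single result or multiple results
    4. If you are familiar with CSS selectors, use select()
    5. Remember the common ways of getting attribute values and text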

  • Original article: https://www.cnblogs.com/hgzero/p/14132992.html