  • A Detailed Guide to the BeautifulSoup Parsing Library

    BeautifulSoup is a flexible and convenient web page parsing library. It is efficient and supports multiple parsers.

    With it you can extract information from web pages conveniently, without writing regular expressions.

    Installation: pip3 install beautifulsoup4

    Usage in detail:

    Parsers supported by BeautifulSoup:

    Parser | Usage | Advantages | Disadvantages
    Python standard library | BeautifulSoup(markup, "html.parser") | built into Python, moderate speed, good error tolerance | poor error tolerance in versions before Python 2.7.3 / 3.2.2
    lxml HTML parser | BeautifulSoup(markup, "lxml") | very fast, good error tolerance | requires the lxml C extension
    lxml XML parser | BeautifulSoup(markup, "xml") | very fast, the only parser that supports XML | requires the lxml C extension
    html5lib | BeautifulSoup(markup, "html5lib") | best error tolerance, parses documents the way a browser does, produces valid HTML5 | very slow, external Python dependency
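
    A minimal sketch of switching parsers, assuming lxml and html5lib have been installed separately (pip3 install lxml html5lib):

    from bs4 import BeautifulSoup

    broken = "<html><p>Hello<b>world"

    # the same broken markup parsed with different parsers;
    # each parser repairs the missing tags in its own way
    print(BeautifulSoup(broken, "html.parser").prettify())
    print(BeautifulSoup(broken, "lxml").prettify())
    print(BeautifulSoup(broken, "html5lib").prettify())
    print(BeautifulSoup("<doc><item>1</item></doc>", "xml").prettify())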

    Basic usage:

    import bs4
    from bs4 import BeautifulSoup
    
    # a deliberately incomplete HTML snippet
    html = '''
    <html><head><title>The Demouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Domouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters,and their name were
    <a href="http://examlpe.com/elele" class="sister" ld="link1"><!--Elsle--></a>
    <a href="http://examlpe.com/lacie" class="sister" ld="link2"><!--Elsle--></a>
    <a href="http://examlpe.com/title" class="sister" ld="link3"><title></a>
    and they lived the bottom of a wall</p>
    <p clas="stuy">..</p>
    '''
    
    soup = BeautifulSoup(html,'lxml')
    
    # print the parsed document; the parser has already completed the broken markup (error tolerance)
    print(soup.prettify())
    
    # select the <title> tag and print its text
    print(soup.title.string)
    Output:
    <html>
     <head>
      <title>
       The Demouse's story
      </title>
     </head>
     <body>
      <p class="title" name="dromouse">
       <b>
        The Domouse's story
       </b>
      </p>
      <p class="story">
       Once upon a time there were three little sisters,and their name were
       <a class="sister" href="http://examlpe.com/elele" ld="link1">
        <!--Elsle-->
       </a>
       <a class="sister" href="http://examlpe.com/lacie" ld="link2">
        <!--Elsle-->
       </a>
       <a class="sister" href="http://examlpe.com/title" ld="link3">
        <title>
        </title>
       </a>
       and they lived the bottom of a wall
      </p>
      <p clas="stuy">
       ..
      </p>
     </body>
    </html>
    The Demouse's story

    Tag selectors

    In the example above, soup.title.string selects the title tag and reads its text.

    Selecting elements:

    import bs4
    from bs4 import BeautifulSoup
    
    # a deliberately incomplete HTML snippet
    html = '''
    <html><head><title>The Demouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Domouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters,and their name were
    <a href="http://examlpe.com/elele" class="sister" ld="link1"><!--Elsle--></a>
    <a href="http://examlpe.com/lacie" class="sister" ld="link2"><!--Elsle--></a>
    <a href="http://examlpe.com/title" class="sister" ld="link3"><title></a>
    and they lived the bottom of a wall</p>
    <p clas="stuy">..</p>
    '''
    
    soup = BeautifulSoup(html,'lxml')
    print(soup.title)
    print(type(soup.title))
    print(soup.head)
    print(soup.p)
    Output:
    <title>The Demouse's story</title>
    <class 'bs4.element.Tag'>
    <head><title>The Demouse's story</title></head>
    <p class="title" name="dromouse"><b>The Domouse's story</b></p>
    # only the first match is returned

    Getting the tag name:

    import bs4
    from bs4 import BeautifulSoup
    
    # a deliberately incomplete HTML snippet
    html = '''
    <html><head><title>The Demouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Domouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters,and their name were
    <a href="http://examlpe.com/elele" class="sister" ld="link1"><!--Elsle--></a>
    <a href="http://examlpe.com/lacie" class="sister" ld="link2"><!--Elsle--></a>
    <a href="http://examlpe.com/title" class="sister" ld="link3"><title></a>
    and they lived the bottom of a wall</p>
    <p clas="stuy">..</p>
    '''
    
    soup = BeautifulSoup(html,'lxml')
    print(soup.title.name)
    Output: title

    Getting attributes:

    import bs4
    from bs4 import BeautifulSoup
    
    # a deliberately incomplete HTML snippet
    html = '''
    <html><head><title>The Demouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Domouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters,and their name were
    <a href="http://examlpe.com/elele" class="sister" ld="link1"><!--Elsle--></a>
    <a href="http://examlpe.com/lacie" class="sister" ld="link2"><!--Elsle--></a>
    <a href="http://examlpe.com/title" class="sister" ld="link3"><title></a>
    and they lived the bottom of a wall</p>
    <p clas="stuy">..</p>
    '''
    
    soup = BeautifulSoup(html,'lxml')
    print(soup.p.attrs['name'])
    print(soup.p['name'])
    # both soup.p.attrs['name'] and soup.p['name'] work for reading an attribute
    # note the square brackets!

    Getting text content:

    As shown in the example, use the string attribute, e.g. soup.title.string, to get a tag's text.

    Nested selection:

    e.g. print(soup.head.title.string)
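
    A minimal runnable sketch of .string and nested selection, assuming a small snippet rather than the demo document above:

    from bs4 import BeautifulSoup

    html = "<html><head><title>The Demouse's story</title></head><body></body></html>"
    soup = BeautifulSoup(html, 'lxml')

    print(soup.title.string)       # text of the <title> tag
    print(soup.head.title.string)  # same result, selected step by step through <head>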

    Child and descendant nodes:

    e.g. print(soup.p.contents): the contents attribute returns all child nodes of the p tag as a list.

    You can also use children. Unlike contents, children is an iterator over the child nodes, so you need a loop to read them, e.g.:

    print(soup.p.children)
    for i, child in enumerate(soup.p.children):
        print(i, child)

    There is also a descendants attribute, which yields all descendant nodes; it is likewise an iterator:

    print(soup.p.descendants)
    for i, child in enumerate(soup.p.descendants):
        print(i, child)

    Note: the soup.p style of access used here for child/descendant nodes, and below for parent/ancestor nodes, selects the first matching p tag, so all of these nodes belong to that first match.
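
    A minimal runnable sketch of contents, children and descendants on a small hypothetical snippet:

    from bs4 import BeautifulSoup

    html = "<p>Hello <b>little <i>world</i></b></p>"
    soup = BeautifulSoup(html, 'lxml')

    print(soup.p.contents)                         # list of direct children: ['Hello ', <b>...</b>]
    for i, child in enumerate(soup.p.children):    # iterator over direct children
        print(i, child)
    for i, node in enumerate(soup.p.descendants):  # iterator over all descendants,
        print(i, node)                             # including <b>, 'little ', <i> and 'world'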

    Parent and ancestor nodes:

    parent attribute: gets the direct parent node

    parents attribute: gets all ancestor nodes (an iterator)
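
    A short sketch of parent and parents on a small hypothetical snippet:

    from bs4 import BeautifulSoup

    html = "<div><p><b>text</b></p></div>"
    soup = BeautifulSoup(html, 'lxml')

    print(soup.b.parent.name)                      # direct parent: p
    for i, ancestor in enumerate(soup.b.parents):  # all ancestors, innermost first
        print(i, ancestor.name)                    # p, div, body, html, [document]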

    Sibling nodes:

    next_siblings attribute (an iterator over the siblings that follow)

    previous_siblings attribute (an iterator over the siblings that precede)
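
    A short sketch of the sibling iterators on a small hypothetical snippet:

    from bs4 import BeautifulSoup

    html = "<ul><li>a</li><li>b</li><li>c</li></ul>"
    soup = BeautifulSoup(html, 'lxml')

    first = soup.li                       # the first <li>
    print(list(first.next_siblings))      # [<li>b</li>, <li>c</li>]
    print(list(first.previous_siblings))  # [] (nothing before the first <li>)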

    --------------------------------------------------------------------------------------------------------------------

    Standard selectors

    The tag selectors above are fast, but they are not flexible enough for most HTML parsing needs.

    The find_all method:

    find_all(name, attrs, recursive, text, **kwargs)

    Searches the document by tag name, attributes, or text content.

    Searching by name:

    import bs4
    from bs4 import BeautifulSoup
    
    html = '''
    <div class="panel">
        <div class="panel-heading">
            <h4>hello</h4>
        </div>
        <div class="panel-body">
            <url class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">jay</li>
            </url>
            <url class="list list-small" id="list-2">
                <li lass="element">Foo</li>
                <li lass="element">Bar</li>
            </url>
        </div>
        </div>
    '''
    
    soup = BeautifulSoup(html,'lxml')
    print(soup.find_all('url'))
    Output:
    [<url class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">jay</li>
    </url>, <url class="list list-small" id="list-2">
    <li lass="element">Foo</li>
    <li lass="element">Bar</li>
    </url>]
    

    The result, as you can see, is a list. You can loop over it and run another search on each element, e.g.:

    import bs4
    from bs4 import BeautifulSoup
    
    html = '''
    <div class="panel">
        <div class="panel-heading">
            <h4>hello</h4>
        </div>
        <div class="panel-body">
            <url class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">jay</li>
            </url>
            <url class="list list-small" id="list-2">
                <li lass="element">Foo</li>
                <li lass="element">Bar</li>
            </url>
        </div>
        </div>
    '''
    
    soup = BeautifulSoup(html,'lxml')
    for url in soup.find_all('url'):
        print(url.find_all('li'))
    
    Output:
    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">jay</li>]
    [<li lass="element">Foo</li>, <li lass="element">Bar</li>]  

    Searching by attrs:

    attrs takes a dictionary of attribute/value pairs, e.g.:

    import bs4
    from bs4 import BeautifulSoup
    
    html = '''
    <div class="panel">
        <div class="panel-heading">
            <h4>hello</h4>
        </div>
        <div class="panel-body">
            <url class="list" id="list-1" name='elements'>
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">jay</li>
            </url>
            <url class="list list-small" id="list-2">
                <li lass="element">Foo</li>
                <li lass="element">Bar</li>
            </url>
        </div>
        </div>
    '''
    
    soup = BeautifulSoup(html,'lxml')
    
    print(soup.find_all(attrs={'id':'list-1'}))  # soup.find_all(id='list-1') works the same way
    print(soup.find_all(attrs={'name':'elements'}))
    Output:
    [<url class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">jay</li>
    </url>]
    [<url class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">jay</li>
    </url>]
    

    Note: you can search with soup.find_all(id='list-1'); for the class attribute, however, you must write class_='...' because class is a Python keyword, so when it is used as a keyword argument it has to be spelled class_.
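
    A short sketch of searching by class on a small hypothetical snippet; note the trailing underscore:

    from bs4 import BeautifulSoup

    html = '<ul><li class="element">Foo</li><li class="element">Bar</li></ul>'
    soup = BeautifulSoup(html, 'lxml')

    print(soup.find_all(class_='element'))            # class_ because class is a keyword
    print(soup.find_all('li', {'class': 'element'}))  # equivalent, via the attrs dict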

    Searching by text:

    import bs4
    from bs4 import BeautifulSoup
    
    html = '''
    <div class="panel">
        <div class="panel-heading">
            <h4>hello</h4>
        </div>
        <div class="panel-body">
            <url class="list" id="list-1" name='elements'>
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">jay</li>
            </url>
            <url class="list list-small" id="list-2">
                <li lass="element">Foo</li>
                <li lass="element">Bar</li>
            </url>
        </div>
        </div>
    '''
    
    soup = BeautifulSoup(html,'lxml')
    
    print(soup.find_all(text='Foo'))
    Output:
    ['Foo', 'Foo'] 

    The find method takes exactly the same arguments as find_all; the difference is that find_all returns all matching elements as a list, while find returns a single element, the first match.

    find(name, attrs, recursive, text, **kwargs)
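
    A minimal sketch of find() next to find_all() on a small hypothetical snippet:

    from bs4 import BeautifulSoup

    html = '<ul><li class="element">Foo</li><li class="element">Bar</li></ul>'
    soup = BeautifulSoup(html, 'lxml')

    print(soup.find_all('li'))   # every match, as a list
    print(soup.find('li'))       # only the first match (None if nothing matches)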

    find_parents() and find_parent(): search the ancestors of the current element; find_parents() returns all matching ancestors, find_parent() only the closest one.

    find_next_siblings() and find_next_sibling(): search the siblings that come after the current element; all matches vs. only the first.

    find_previous_siblings() and find_previous_sibling(): search the siblings that come before the current element; all matches vs. only the first.

    find_all_next() and find_next(): search the nodes that come after the current element in the document; all matches vs. only the first.

    find_all_previous() and find_previous(): search the nodes that come before the current element; all matches vs. only the first.

    All of these take the same arguments as find_all() / find(); they differ only in which part of the tree they search.
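
    A short sketch of a few of these variants on a small hypothetical snippet:

    from bs4 import BeautifulSoup

    html = "<div><p>one</p><p>two</p><p>three</p></div>"
    soup = BeautifulSoup(html, 'lxml')

    first = soup.p
    print(first.find_parent('div'))       # closest matching ancestor
    print(first.find_next_sibling('p'))   # the next <p> sibling: two
    print(first.find_all_next('p'))       # every <p> after it: two, three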

    CSS selectors

    Pass a CSS selector directly to select() to make a selection:

    import bs4
    from bs4 import BeautifulSoup
    
    html = '''
    <div class="panel">
        <div class="panel-heading">
            <h4>hello</h4>
        </div>
        <div class="panel-body">
            <url class="list" id="list-1" name='elements'>
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">jay</li>
            </url>
            <url class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </url>
        </div>
        </div>
    '''
    
    soup = BeautifulSoup(html,'lxml')
    
    # to select by class, prefix it with a dot: .panel .panel-heading
    print(soup.select('.panel .panel-heading'))
    # select tags directly by name
    print(soup.select('url li'))
    # select by id with #
    print(soup.select('#list-2 .element'))
    Output:
    [<div class="panel-heading">
    <h4>hello</h4>
    </div>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>]
    

    Nested selection, level by level:

    import bs4
    from bs4 import BeautifulSoup
    
    html = '''
    <div class="panel">
        <div class="panel-heading">
            <h4>hello</h4>
        </div>
        <div class="panel-body">
            <url class="list" id="list-1" name='elements'>
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">jay</li>
            </url>
            <url class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </url>
        </div>
        </div>
    '''
    
    soup = BeautifulSoup(html,'lxml')
    
    for url in soup.select('url'):
        print(url.select('li'))
    Output:
    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">jay</li>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>]
    

    Getting attributes:

     

    import bs4
    from bs4 import BeautifulSoup
    
    html = '''
    <div class="panel">
        <div class="panel-heading">
            <h4>hello</h4>
        </div>
        <div class="panel-body">
            <url class="list" id="list-1" name='elements'>
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">jay</li>
            </url>
            <url class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </url>
        </div>
        </div>
    '''
    
    soup = BeautifulSoup(html,'lxml')
    
    for url in soup.select('url'):
        print(url['id'])
        # you can also use print(url.attrs['id'])
    Output:
    list-1
    list-2

    Getting text content:

    import bs4
    from bs4 import BeautifulSoup
    
    html = '''
    <div class="panel">
        <div class="panel-heading">
            <h4>hello</h4>
        </div>
        <div class="panel-body">
            <url class="list" id="list-1" name='elements'>
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">jay</li>
            </url>
            <url class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </url>
        </div>
        </div>
    '''
    
    soup = BeautifulSoup(html,'lxml')
    
    for l in soup.select('li'):
        print(l.get_text())
    Output:
    Foo
    Bar
    jay
    Foo
    Bar
    

      

    Summary:

    The lxml parser is recommended; use html.parser when necessary.

    Tag selectors are fast but offer only weak filtering.

    Use find() / find_all() to match a single result or multiple results.

    If you are familiar with CSS selectors, select() is recommended.

    Remember the common ways of getting attributes and text values.
