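  • The BeautifulSoup module

    1. The BeautifulSoup module takes an HTML or XML string and parses it into a tree of objects. You can then use the methods it provides to locate specific elements quickly, which makes searching HTML or XML documents straightforward.
    2. Install the BeautifulSoup module
    pip3 install beautifulsoup4
    3. Usage
    Create the HTML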

    html_doc ="""
                <html>
                    <head>
                        <title>BeautifulSoup示例</title>
                    </head>
                <body>
                    <div>
                        <a href='http://www.dongdong.com'>东东<p>东东内容</p></a>
                    </div>
                    <a id='xixi'>西西</a>
                    <div>
                        <p>南南内容</p>
                    </div>
                    <p>北北内容</p>
                </body>
                </html>
            """

    Create the BeautifulSoup object

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, features="html.parser")    # soup represents the whole HTML document
    print(soup.prettify())                                    # print the contents of soup, pretty-printed

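    name: the tag name
    (1) Find the first a tag through the soup object

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, features="html.parser")    # soup represents the whole HTML document
    tag = soup.find('a')                                      # find() returns only the first matching a tag
    print(tag)

    Output: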
    <a href="http://www.dongdong.com">东东<p>东东内容</p></a>
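Because find returns only the first match, it may help to compare it with find_all. The following is a minimal sketch using a shortened copy of the sample document (the compact html_doc string here is an assumption made for brevity, not the article's exact text):

```python
from bs4 import BeautifulSoup

# Shortened, whitespace-free copy of the sample document (illustrative only).
html_doc = (
    "<html><body>"
    "<div><a href='http://www.dongdong.com'>东东<p>东东内容</p></a></div>"
    "<a id='xixi'>西西</a>"
    "</body></html>"
)

soup = BeautifulSoup(html_doc, features="html.parser")

first_link = soup.find('a')       # a single Tag, or None when nothing matches
all_links = soup.find_all('a')    # a list-like result holding every match

print(first_link['href'])
print(len(all_links))
print([link.get_text() for link in all_links])
```

Use find when one element is enough and find_all when every match is needed.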
    (2) Get the tag name from the a tag

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, features="html.parser")    # soup represents the whole HTML document
    tag = soup.find('a')                                      # find the first a tag
    name = tag.name                                           # get the name of the a tag
    print(name)

    Output:
    a
    (3) Rename the a tag by assigning to its name

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, features="html.parser")    # soup represents the whole HTML document
    tag = soup.find('a')                                      # find the first a tag
    name = tag.name                                           # get the name of the a tag
    tag.name = 'span'                                         # rename the a tag to span
    print(tag)

    Output:
    <span href="http://www.dongdong.com">东东<p>东东内容</p></span>
    attrs: tag attributes
    (1) Get the attributes of the a tag through attrs

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, features="html.parser")
    tag = soup.find('a')
    attrs = tag.attrs              # get the attributes as a dict
    print(attrs)

    Output:
    {'href': 'http://www.dongdong.com'}
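Besides reading the whole attrs dict, a single attribute can be read directly from the tag. A small sketch, assuming a one-tag document for brevity (the default string 'no-id' is just an illustrative value):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a href='http://www.dongdong.com'>东东</a>", features="html.parser")
tag = soup.find('a')

href = tag['href']                  # dictionary-style access; raises KeyError if missing
missing = tag.get('id')             # get() returns None when the attribute is missing
fallback = tag.get('id', 'no-id')   # get() also accepts a default value
has_href = tag.has_attr('href')     # True when the attribute is present

print(href, missing, fallback, has_href)
```

Using get avoids an exception when it is not known in advance whether the attribute exists.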
    (2) Modify the attributes of the a tag through attrs

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, features="html.parser")
    tag = soup.find('a')
    attrs = tag.attrs                                   # get the current attributes
    tag.attrs = {'href':'http://www.nannan.com'}        # replace the whole attribute dict
    print(tag)

    Output:
    <a href="http://www.nannan.com">东东<p>东东内容</p></a>
    (3) Add the attribute love="石头" to the tag through attrs

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, features="html.parser")
    tag = soup.find('a')
    tag.attrs['love'] = '石头'
    print(tag)

    Output:
    <a href="http://www.dongdong.com" love="石头">东东<p>东东内容</p></a>
    (4) Delete the href attribute from the a tag through attrs

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, features="html.parser")
    tag = soup.find('a')
    attrs = tag.attrs                                   # get the current attributes
    del tag.attrs['href']                               # remove the href attribute
    print(tag)

    Output:
    <a>东东<p>东东内容</p></a>
    Tags and content
    (1) Use children to get all direct children of body (both tags and text nodes)

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, features="html.parser")
    tags = soup.find('body').children
    print(list(tags))

    Output:
    [' ', <div>
    <a href="http://www.dongdong.com">东东<p>东东内容</p></a>
    </div>, ' ', <a id="xixi">西西</a>, ' ', <div>
    <p>南南内容</p>
    </div>, ' ', <p>北北内容</p>, ' ']
    (2) Use children to iterate over the direct children of body, then check each child's type to tell tags apart from text

    from bs4 import BeautifulSoup
    from bs4.element import Tag
    soup = BeautifulSoup(html_doc, features="html.parser")
    tags = soup.find('body').children      # iterator over the direct children of body
    for tag in tags:
        if isinstance(tag, Tag):           # a Tag instance is an element
            print('Tag:', tag, type(tag))
        else:                              # anything else is a text node
            print('Text....')

    Output:
    Text....
    Tag: <div>
    <a href="http://www.dongdong.com">东东<p>东东内容</p></a>
    </div> <class 'bs4.element.Tag'>
    Text....
    Tag: <a id="xixi">西西</a> <class 'bs4.element.Tag'>
    Text....
    Tag: <div>
    <p>南南内容</p>
    </div> <class 'bs4.element.Tag'>
    Text....
    Tag: <p>北北内容</p> <class 'bs4.element.Tag'>
    Text....
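When only the tags are wanted, the children can be filtered by type, or find_all can be limited to direct children. A minimal sketch, assuming a reduced body fragment rather than the full sample document:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Reduced fragment; the newlines between tags become text nodes.
html_doc = "<body>\n<div><p>南南内容</p></div>\n<p>北北内容</p>\n</body>"
soup = BeautifulSoup(html_doc, features="html.parser")
body = soup.find('body')

# Keep only element children.
child_tags = [child for child in body.children if isinstance(child, Tag)]
# Keep only text children (here: the whitespace between tags).
text_nodes = [child for child in body.children if isinstance(child, NavigableString)]
# find_all(True, recursive=False) matches every tag among the direct children.
same_tags = body.find_all(True, recursive=False)

print([t.name for t in child_tags], len(text_nodes), [t.name for t in same_tags])
```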
    (3) Use descendants to get every node below body at any depth (walked recursively, one node at a time)

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, features="html.parser")
    tags = soup.find('body').descendants
    print(list(tags))

    Output:
    [' ', <div>
    <a href="http://www.dongdong.com">东东<p>东东内容</p></a>
    </div>, ' ', <a href="http://www.dongdong.com">东东<p>东东内容</p></a>, '东东', <p>东东内容</p>, '东东内容', ' ', ' ', <a id="xixi">西西</a>, '西西', ' ', <div>
    <p>南南内容</p>
    </div>, ' ', <p>南南内容</p>, '南南内容', ' ', ' ', <p>北北内容</p>, '北北内容', ' ']
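The difference between children and descendants is easier to see on a small document without whitespace. This sketch assumes a compact fragment written for illustration:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = (
    "<body>"
    "<div><a href='http://www.dongdong.com'>东东<p>东东内容</p></a></div>"
    "<p>北北内容</p>"
    "</body>"
)
soup = BeautifulSoup(html_doc, features="html.parser")
body = soup.find('body')

children = list(body.children)         # direct children only: div, p
descendants = list(body.descendants)   # every nested tag and text node, in document order
descendant_tags = [node.name for node in descendants if isinstance(node, Tag)]

print(len(children), len(descendants), descendant_tags)
```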
    (4) Use clear to remove everything inside the body tag (the tag itself is kept)

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, features="html.parser")
    tag = soup.find('body')
    tag.clear()
    print(soup)

    Output:
    <html>
    <head>
    <title>BeautifulSoup示例</title>
    </head>
    <body></body>
    </html>
    (5) decompose removes the tag and everything inside it from the tree

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, features="html.parser")
    body = soup.find('body')
    body.decompose()
    print(soup)

    Output:
    <html>
    <head>
    <title>BeautifulSoup示例</title>
    </head>
    </html>
    (6) extract removes the tag and everything inside it from the tree, and returns the removed tag

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, features="html.parser")
    body = soup.find('body')
    v = body.extract()
    print(v)

    Output:
    <body>
    <div>
    <a href="http://www.dongdong.com">东东<p>东东内容</p></a>
    </div>
    <a id="xixi">西西</a>
    <div>
    <p>南南内容</p>
    </div>
    <p>北北内容</p>
    </body>
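Since extract hands back the removed tag, that tag can be reused, for example by attaching it somewhere else. decompose, by contrast, is meant for content that is no longer needed. A minimal sketch on a reduced document (the fragment is an assumption for brevity):

```python
from bs4 import BeautifulSoup

html_doc = "<html><body><div><p>南南内容</p></div><p>北北内容</p></body></html>"
soup = BeautifulSoup(html_doc, features="html.parser")

div = soup.find('div')
removed = div.extract()                  # detach the div and keep a reference to it
detached = soup.find('div') is None      # the tree no longer contains a div

soup.body.append(removed)                # re-attach the same tag at the end of body
print(soup)
```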
    (7) decode converts the object to a string (including the current tag)

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, features="html.parser")
    body = soup.find('body')
    v = body.decode()
    print(v)

    Output:
    <body>
    <div>
    <a href="http://www.dongdong.com">东东<p>东东内容</p></a>
    </div>
    <a id="xixi">西西</a>
    <div>
    <p>南南内容</p>
    </div>
    <p>北北内容</p>
    </body>
    (8) decode_contents converts the object to a string (excluding the current tag)

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, features="html.parser")
    body = soup.find('body')
    v = body.decode_contents()
    print(v)

    Output:
    <div>
    <a href="http://www.dongdong.com">东东<p>东东内容</p></a>
    </div>
    <a id="xixi">西西</a>
    <div>
    <p>南南内容</p>
    </div>
    <p>北北内容</p>
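The relationship between decode and decode_contents can be checked directly. This is a small sketch on a one-paragraph body, chosen only to keep the strings short:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><p>北北内容</p></body>", features="html.parser")
body = soup.find('body')

outer = body.decode()            # markup including the <body> tag itself
inner = body.decode_contents()   # markup of the children only

print(outer)
print(inner)
```

For a tag without attributes, as here, the outer string is simply the inner string wrapped in the opening and closing tag.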
    (10) find returns the first matching tag

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, features="html.parser")
    tag = soup.find('body').find('p',recursive=False)    # recursive=False searches direct children only; the default (True) searches all descendants
    print(tag)

    Output:
    <p>北北内容</p>
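To see what recursive changes, the same search can be run both ways. A minimal sketch on a reduced body in which the first p in document order is nested inside a div:

```python
from bs4 import BeautifulSoup

html_doc = "<body><div><p>南南内容</p></div><p>北北内容</p></body>"
soup = BeautifulSoup(html_doc, features="html.parser")
body = soup.find('body')

deep = body.find('p')                        # default: search all descendants
shallow = body.find('p', recursive=False)    # search direct children of body only

print(deep.get_text())
print(shallow.get_text())
```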
    (11) get_text returns the text content inside a tag

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, features="html.parser")
    tag = soup.find('a')
    print(tag)
    v = tag.get_text()
    print(v)

    Output:
    <a href="http://www.dongdong.com">东东<p>东东内容</p></a>
    东东东东内容
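In the output above the two text pieces run together. get_text accepts a separator and a strip flag that can keep them apart; the sketch below assumes just the single a tag from the sample document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<a href='http://www.dongdong.com'>东东<p>东东内容</p></a>",
    features="html.parser",
)
tag = soup.find('a')

plain = tag.get_text()                             # all text pieces concatenated
spaced = tag.get_text(separator=' ', strip=True)   # pieces stripped and joined with a space
pieces = list(tag.stripped_strings)                # the individual text pieces

print(plain)
print(spaced)
print(pieces)
```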
    (12) index returns the position of a child within its parent tag

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, features="html.parser")
    tag = soup.find('body')
    v = tag.index(tag.find('div'))
    print(v)

    Output:
    1
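The result is 1 rather than 0 because the whitespace before the first div is itself a child (a text node). A minimal sketch that makes this visible, assuming a small fragment with newlines between tags:

```python
from bs4 import BeautifulSoup

html_doc = "<body>\n<div>南南</div>\n<p>北北</p></body>"
soup = BeautifulSoup(html_doc, features="html.parser")
body = soup.find('body')

div_index = body.index(body.find('div'))   # the leading newline occupies position 0
p_index = body.index(body.find('p'))       # another newline sits between div and p
same = body.contents[div_index] is body.find('div')

print(div_index, p_index, same)
```

The index lines up with positions in body.contents, so it can be passed to insert as well.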
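    (13) Use enumerate to list every child of a tag together with its index position

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, features="html.parser")
    tag = soup.find('body')
    for i,v in enumerate(tag):     # iterating over a tag yields its direct children
        print(i,v)

    Output: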
    0
    1 <div>
    <a href="http://www.dongdong.com">东东<p>东东内容</p></a>
    </div>
    2
    3 <a id="xixi">西西</a>
    4
    5 <div>
    <p>南南内容</p>
    </div>
    6
    7 <p>北北内容</p>
    8
    (14) append adds a tag at the end of the current tag's contents

    from bs4 import BeautifulSoup
    from bs4.element import Tag
    soup = BeautifulSoup(html_doc, features="html.parser")
    obj = Tag(name='i',attrs={'id': 'it'})     # build a new <i id="it"> tag
    obj.string = '我是一个新来的'
    tag = soup.find('body')
    tag.append(obj)
    print(soup)

    Output:
    <html>
    <head>
    <title>BeautifulSoup示例</title>
    </head>
    <body>
    <div>
    <a href="http://www.dongdong.com">东东<p>东东内容</p></a>
    </div>
    <a id="xixi">西西</a>
    <div>
    <p>南南内容</p>
    </div>
    <p>北北内容</p>
    <i id="it">我是一个新来的</i></body>
    </html>
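The example above builds the new tag with the Tag class directly. An alternative is the new_tag factory on the soup object; the following is a sketch of that variant on a reduced document, not a change to the method shown above:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><p>北北内容</p></body>", features="html.parser")

new_tag = soup.new_tag('i', id='it')   # keyword arguments become attributes
new_tag.string = '我是一个新来的'
soup.body.append(new_tag)              # append as the last child of body

print(soup)
```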
    (15) insert adds a tag at a given position inside the current tag

    from bs4 import BeautifulSoup
    from bs4.element import Tag
    soup = BeautifulSoup(html_doc, features="html.parser")
    obj = Tag(name='i', attrs={'id': 'it'})    # build a new <i id="it"> tag
    obj.string = '我是一个新来的'
    tag = soup.find('body')
    tag.insert(2, obj)                         # position 2 counts text nodes as well as tags
    print(soup)

    Output:
    <html>
    <head>
    <title>BeautifulSoup示例</title>
    </head>
    <body>
    <div>
    <a href="http://www.dongdong.com">东东<p>东东内容</p></a>
    </div><i id="it">我是一个新来的</i>
    <a id="xixi">西西</a>
    <div>
    <p>南南内容</p>
    </div>
    <p>北北内容</p>
    </body>
    </html>
    (16) replace_with replaces the current tag with the given tag

    from bs4 import BeautifulSoup
    from bs4.element import Tag
    soup = BeautifulSoup(html_doc, features="html.parser")
    obj = Tag(name='i', attrs={'id': 'it'})    # build a new <i id="it"> tag
    obj.string = '我是一个新来的'
    tag = soup.find('div')                     # the first div in the document
    tag.replace_with(obj)
    print(soup)

    Output:
    <html>
    <head>
    <title>BeautifulSoup示例</title>
    </head>
    <body>
    <i id="it">我是一个新来的</i>
    <a id="xixi">西西</a>
    <div>
    <p>南南内容</p>
    </div>
    <p>北北内容</p>
    </body>
    </html>
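replace_with also returns the element that was replaced, so it is still available afterwards if needed. A small sketch on a reduced document, using soup.new_tag to build the replacement (an assumption made to keep the example short):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<body><div><p>南南内容</p></div><p>北北内容</p></body>",
    features="html.parser",
)

new_tag = soup.new_tag('i', id='it')
new_tag.string = '我是一个新来的'

old_div = soup.find('div').replace_with(new_tag)   # returns the replaced div

print(soup)
print(old_div)
```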

  • Original article: https://www.cnblogs.com/xixi18/p/10966221.html