zoukankan      html  css  js  c++  java
  • Beautiful Soup的简介(1)

    解析器使用方法优势劣势
    Python标准库 BeautifulSoup(markup, “html.parser”)
    • Python的内置标准库
    • 执行速度适中
    • 文档容错能力强
    • Python 2.7.3 or 3.2.2)前 的版本中文档容错能力差
    lxml HTML 解析器 BeautifulSoup(markup, “lxml”)
    • 速度快
    • 文档容错能力强
    • 需要安装C语言库
    lxml XML 解析器 BeautifulSoup(markup, [“lxml”, “xml”])BeautifulSoup(markup, “xml”)
    • 速度快
    • 唯一支持XML的解析器
    • 需要安装C语言库
    html5lib BeautifulSoup(markup, “html5lib”)
    • 最好的容错性
    • 以浏览器的方式解析文档
    • 生成HTML5格式的文档
    • 速度慢
    • 不依赖外部扩展

    基本使用

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')        # 使用lxml解析器创建对本地html文件创建对象
    print(soup.prettify())            # 补全文件中未闭合的标签
    print(soup.title.string)           # 打印出The Dormouse's story

    标签选择器

    选择元素

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'html5lib')        # 使用html5lib解析器创建对本地html文件创建对象
    print(soup.title)
    print(type(soup.title))
    print(soup.head)
    print(soup.p)

    <title>The Dormouse's story</title>
    <class 'bs4.element.Tag'>
    <head><title>The Dormouse's story</title></head>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    以p标签为例,这种选择的方法只会选择第一个标签,其他的标签不会显示出来

    获取名称

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'html5lib')        # 使用html5lib解析器创建对本地html文件创建对象
    print(soup.title.name)                # 这里只显示最外层的标签名称

    title

    获取属性

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'html5lib')        # 使用html5lib解析器创建对本地html文件创建对象
    print(soup.p.attrs['name'])
    print(soup.p['name'])

    dromouse

    dromouse


    获取内容

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'html5lib')        # 使用html5lib解析器创建对本地html文件创建对象
    print(soup.p.string)
    The Dormouse's story

    嵌套选择

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'html5lib')        # 使用html5lib解析器创建对本地html文件创建对象
    print(soup.head.title.string)
    The Dormouse's story

    子节点和子孙节点

    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'html5lib')        # 使用lxml解析器创建对本地html文件创建对象
    print(soup.p.contents)                # contents来选择子节点,输出列表的方式,都放在一个列表内

    [' Once upon a time there were three little sisters; and their names were ', <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>, ' ', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' and ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '; and they lived at the bottom of a well. ']

    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'html5lib')        # 使用lxml解析器创建对本地html文件创建对象
    print(soup.p.children)        # 同样都是获取子节点.children是迭代器的形式,不同于contens(放置在一个列表内),只能用循环的方式才能列出节点内容
        for i, child in enumerate(soup.p.children):
        print(i, child)

    <list_iterator object at 0x104858a90>
    0
    Once upon a time there were three little sisters; and their names were

    1 <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    2

    3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    4
    and

    5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    6 ;
    and they lived at the bottom of a well.

    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">
                    <span>Elsie</span>
                </a>
                <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
                and
                <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
            <p class="story">...</p>
        </body>
    </html>
    """
    
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'html5lib')
    print(soup.p.descendants)
    for i, child in enumerate(soup.p.descendants):    #.descendants类似child的方式,同样都是一个迭代的方式,不同的是将子孙节点也分类得更清楚
        print(i, child)
    
    <generator object descendants at 0x10744f0a0>
    0 
                Once upon a time there were three little sisters; and their names were
                
    1 <a class="sister" href="http://example.com/elsie" id="link1">
                    <span>Elsie</span>
                </a>
    2 
                    
    3 <span>Elsie</span>
    4 Elsie
    5 
                
    6 
                
    7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    8 Lacie
    9 
                and
                
    10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    11 Tillie
    12 
                and they lived at the bottom of a well.

    父节点和祖先节点
    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">
                    <span>Elsie</span>
                </a>
                <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
                and
                <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
            <p class="story">...</p>
        </body>
    </html>
    """
    
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'html5lib')
    print(soup.a.parent)        #使用parent获取父节点的内容
    
    <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a class="sister" href="http://example.com/elsie" id="link1">
                    <span>Elsie</span>
                </a>
                <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
                and
                <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">
                    <span>Elsie</span>
                </a>
                <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
                and
                <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
            <p class="story">...</p>
        </body>
    </html>
    """
    
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'html5lib')
    print(list(enumerate(soup.a.parents)))    #使用枚举的方式列出所有的,一层一层地向外包围
    
    [(0, <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a class="sister" href="http://example.com/elsie" id="link1">
                    <span>Elsie</span>
                </a>
                <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
                and
                <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>), (1, <body>
            <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a class="sister" href="http://example.com/elsie" id="link1">
                    <span>Elsie</span>
                </a>
                <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
                and
                <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
            <p class="story">...</p>
        
    
    </body>), (2, <html><head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a class="sister" href="http://example.com/elsie" id="link1">
                    <span>Elsie</span>
                </a>
                <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
                and
                <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
            <p class="story">...</p>
        
    
    </body></html>), (3, <html><head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a class="sister" href="http://example.com/elsie" id="link1">
                    <span>Elsie</span>
                </a>
                <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
                and
                <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
            <p class="story">...</p>
        
    
    </body></html>)]

    兄弟节点与其并列的标签
    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">
                    <span>Elsie</span>
                </a>
                <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
                and
                <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
            <p class="story">...</p>
        </body>
    </html>
    """
    
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'html5lib')
    print(list(enumerate(soup.a.next_siblings))) #后面的兄弟节点
    print(list(enumerate(soup.a.previous_siblings)))  #前面的兄弟节点
    
    [(0, '
                '), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, '
                and
                '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '
                and they lived at the bottom of a well.
            ')]
    [(0, '
                Once upon a time there were three little sisters; and their names were
                ')]

     

     
  • 相关阅读:
    docker创建tomcat容器
    【转载】张一鸣:为什么 BAT 挖不走我们的人才?
    Elastic认证考试,请先看这一篇
    vs code 初始化vue项目框架
    Idea集成git常用命令
    pxc搭建mysql集群
    mysql无限级分类
    Java面试题大全
    SpringMVC和Spring
    Redis高级特性
  • 原文地址:https://www.cnblogs.com/ecwork/p/7456350.html
Copyright © 2011-2022 走看看