  • Web scraping: the BeautifulSoup4 parser

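    Beautiful Soup makes parsing HTML fairly simple. Its API is very user-friendly, and it supports CSS selectors, the HTML parser in the Python standard library, and lxml's HTML and XML parsers.

    Compared with regular expressions, it is simpler to use.

    Example:

    First, import the bs4 library: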

    #!/usr/bin/python3
    # -*- coding:utf-8 -*-
    __author__ = 'mayi'
    
    
    from bs4 import BeautifulSoup
    
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    # Create the BeautifulSoup object, specifying the lxml parser
    soup = BeautifulSoup(html, "lxml")
    
    # Pretty-print the contents of the soup object
    print(soup.prettify())
    

    Output:

    <html>
     <head>
      <title>
       The Dormouse's story
      </title>
     </head>
     <body>
      <p class="title" name="dromouse">
       <b>
        The Dormouse's story
       </b>
      </p>
      <p class="story">
       Once upon a time there were three little sisters; and their names were
       <a class="sister" href="http://example.com/elsie" id="link1">
        <!-- Elsie -->
       </a>
       ,
       <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
       </a>
       and
       <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
       </a>
       ;
    and they lived at the bottom of a well.
      </p>
      <p class="story">
       ...
      </p>
     </body>
    </html>
    

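    The four kinds of objects

    Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds: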

    • Tag
    • NavigableString
    • BeautifulSoup
    • Comment

    1. Tag

    Put simply, a Tag corresponds to an individual tag in the HTML, for example:

    <head><title>The Dormouse's story</title></head>
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    

    HTML tags such as title, head, a, and p, together with the content they contain, are Tags. Let's use Beautiful Soup to get some Tags:

    #!/usr/bin/python3
    # -*- coding:utf-8 -*-
    __author__ = 'mayi'
    
    from bs4 import BeautifulSoup
    
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    # Create the BeautifulSoup object, specifying the lxml parser
    soup = BeautifulSoup(html, "lxml")
    
    # Print the title tag
    print(soup.title)
    
    # Print the head tag
    print(soup.head)
    
    # Print the first a tag
    print(soup.a)
    
    # Print the first p tag
    print(soup.p)
    
    # Print the type of soup.p
    print(type(soup.p))
    

    Output:

    <title>The Dormouse's story</title>
    <head><title>The Dormouse's story</title></head>
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <class 'bs4.element.Tag'>
    

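    Accessing a tag name as an attribute of soup is an easy way to get these tags, and the resulting objects are of type bs4.element.Tag. Note, however, that this returns only the first matching tag in the whole document. Querying for all matching tags is covered later, in the section on searching the document tree.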
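
    The first-match behaviour can be checked directly. The following is a minimal sketch on a made-up two-link fragment (not the article's document); it assumes the standard-library "html.parser" so that lxml is not required:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, used only for this illustration.
html = '<p><a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>'

# Assumption: "html.parser" instead of the "lxml" parser used elsewhere.
soup = BeautifulSoup(html, "html.parser")

first = soup.a                   # attribute access: first <a> only
same = soup.find("a")            # find() also returns the first match
all_links = soup.find_all("a")   # find_all() returns every match
missing = soup.table             # no <table> in the fragment

print(first["id"])                    # link1
print(first is same)                  # True
print([a["id"] for a in all_links])   # ['link1', 'link2']
print(missing)                        # None
```

    A tag name that does not occur in the document yields None rather than raising an error, so it is worth checking the result before using it.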

    A Tag has two important attributes: name and attrs.

    #!/usr/bin/python3
    # -*- coding:utf-8 -*-
    __author__ = 'mayi'
    
    from bs4 import BeautifulSoup
    
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
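    # Create the BeautifulSoup object, specifying the lxml parser
    soup = BeautifulSoup(html, "lxml")
    
    # The soup object is special: its name is [document]
    print(soup.name)
    
    # For any other tag, name is the name of the tag itself
    print(soup.head.name)
    
    # Print all attributes of the p tag; the result is a dict
    print(soup.p.attrs)
    
    # Print the class attribute of the p tag
    print(soup.p['class'])
    # get() also works; unlike [], it returns None instead of raising KeyError when the attribute is missing
    print(soup.p.get('class'))
    
    print(soup.p)
    
    # Modify an attribute
    soup.p['class'] = "newClass"
    print(soup.p)
    
    # Delete an attribute
    del soup.p['class']
    print(soup.p)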
    

    Output:

    [document]
    head
    {'class': ['title'], 'name': 'dromouse'}
    ['title']
    ['title']
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>
    <p name="dromouse"><b>The Dormouse's story</b></p>
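
    The two ways of reading an attribute differ when the attribute is missing. A minimal sketch, assuming a made-up one-tag fragment and the standard-library "html.parser":

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with a two-valued class attribute and no id.
html = '<p class="title intro" name="dromouse">text</p>'
soup = BeautifulSoup(html, "html.parser")  # assumption: stdlib parser
p = soup.p

classes = p["class"]            # class is multi-valued, so this is a list
name_attr = p["name"]           # ordinary attributes are plain strings
missing = p.get("id")           # get() returns None for a missing attribute
fallback = p.get("id", "n/a")   # or a default of our choosing

try:
    p["id"]                     # [] raises KeyError for a missing attribute
    raised = False
except KeyError:
    raised = True

print(classes)              # ['title', 'intro']
print(name_attr)            # dromouse
print(missing, fallback)    # None n/a
print(raised)               # True
print(p.has_attr("name"))   # True
```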
    

    2. NavigableString

    Now that we have the tag, how do we get the text inside it? Simply use .string, for example:

    #!/usr/bin/python3
    # -*- coding:utf-8 -*-
    __author__ = 'mayi'
    
    from bs4 import BeautifulSoup
    
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    # Create the BeautifulSoup object, specifying the lxml parser
    soup = BeautifulSoup(html, "lxml")
    
    # Print the text inside the p tag
    print(soup.p.string)
    
    # Print the type of soup.p.string
    print(type(soup.p.string))
    

    Output:

    The Dormouse's story
    <class 'bs4.element.NavigableString'>
    

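    3. BeautifulSoup

    A BeautifulSoup object represents the document as a whole. Most of the time it can be treated as a special kind of Tag, so we can look at its name, the type of that name, and its attributes: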

    #!/usr/bin/python3
    # -*- coding:utf-8 -*-
    __author__ = 'mayi'
    
    from bs4 import BeautifulSoup
    
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    # Create the BeautifulSoup object, specifying the lxml parser
    soup = BeautifulSoup(html, "lxml")
    
    # Type of the name attribute
    print(type(soup.name))
    
    # Name
    print(soup.name)
    
    # Attributes
    print(soup.attrs)
    

    Output:

    <class 'str'>
    [document]
    {}
    

    4. Comment

    A Comment object is a special type of NavigableString; when printed, its content does not include the comment markers.

    #!/usr/bin/python3
    # -*- coding:utf-8 -*-
    __author__ = 'mayi'
    
    from bs4 import BeautifulSoup
    
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    # Create the BeautifulSoup object, specifying the lxml parser
    soup = BeautifulSoup(html, "lxml")
    
    print(soup.a)
    
    print(soup.a.string)
    
    print(type(soup.a.string))
    

    Output:

    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
     Elsie 
    <class 'bs4.element.Comment'>
    

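    The content of the a tag is actually a comment, but when we print it via .string, the comment markers have already been stripped. Check the type of the result when you need to tell a comment apart from ordinary text.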
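
    Because a stripped comment looks like ordinary text, a type check is one way to filter it out. The sketch below uses a hypothetical helper, visible_text, and a made-up fragment; it assumes the standard-library "html.parser":

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical fragment: one link holds a comment, the other real text.
html = '<p><a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")  # assumption: stdlib parser


def visible_text(tag):
    """Return tag.string as str, or None if it is missing or a comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


results = {a["id"]: visible_text(a) for a in soup.find_all("a")}
print(results)  # {'link1': None, 'link2': 'Lacie'}
```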

    Traversing the document tree

    1. Direct children: the .contents and .children attributes

    .contents

    A Tag's .contents attribute returns the tag's direct children as a list:

    #!/usr/bin/python3
    # -*- coding:utf-8 -*-
    __author__ = 'mayi'
    
    from bs4 import BeautifulSoup
    
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    # Create the BeautifulSoup object, specifying the lxml parser
    soup = BeautifulSoup(html, "lxml")
    
    # The result is a list
    print(soup.head.contents)
    
    print(soup.head.contents[0])
    

    Output:

    [<title>The Dormouse's story</title>]
    <title>The Dormouse's story</title>
    

    .children

    This attribute does not return a list; it returns an iterator, so we can loop over it to get every direct child.

    #!/usr/bin/python3
    # -*- coding:utf-8 -*-
    __author__ = 'mayi'
    
    from bs4 import BeautifulSoup
    
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    # Create the BeautifulSoup object, specifying the lxml parser
    soup = BeautifulSoup(html, "lxml")
    
    # Prints an iterator object, not a list
    print(soup.head.children)
    
    # Iterate to get all direct children
    for child in soup.head.children:
        print(child)
    

    Output:

    <list_iterator object at 0x008FF950>
    <title>The Dormouse's story</title>
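
    One detail the head example does not show is that whitespace between tags also counts as a child. A minimal sketch on a made-up list fragment, assuming the standard-library "html.parser":

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment with newlines between the tags.
html = "<ul>\n<li>one</li>\n<li>two</li>\n</ul>"
soup = BeautifulSoup(html, "html.parser")  # assumption: stdlib parser
ul = soup.ul

# .contents is a list; the newline strings are children as well.
print(len(ul.contents))  # 5

# .children yields the same nodes lazily; keep only the Tag children.
child_tags = [c for c in ul.children if isinstance(c, Tag)]
print([t.get_text() for t in child_tags])  # ['one', 'two']
```

    Filtering with isinstance(c, Tag) is one simple way to skip the whitespace strings when only element children matter.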
    

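    2. All descendants: the .descendants attribute

    The .contents and .children attributes described above contain only a Tag's direct children. The .descendants attribute iterates recursively over all of a Tag's descendants (children, grandchildren, and so on). As with .children, we have to loop over it to get the contents.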

    #!/usr/bin/python3
    # -*- coding:utf-8 -*-
    __author__ = 'mayi'
    
    from bs4 import BeautifulSoup
    
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    # Create the BeautifulSoup object, specifying the lxml parser
    soup = BeautifulSoup(html, "lxml")
    
    # Prints a generator object
    print(soup.head.descendants)
    
    # Iterate to get all descendants
    for child in soup.head.descendants:
        print(child)
    

    Output:

    <generator object descendants at 0x00519AB0>
    <title>The Dormouse's story</title>
    The Dormouse's story
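
    To make the difference between direct children and descendants concrete, here is a small sketch on a made-up nested fragment, assuming the standard-library "html.parser":

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: div > p > (b, text).
html = "<div><p><b>bold</b> text</p></div>"
soup = BeautifulSoup(html, "html.parser")  # assumption: stdlib parser
div = soup.div

children = list(div.children)         # only the <p>
descendants = list(div.descendants)   # <p>, <b>, "bold", " text"

print(len(children))      # 1
print(len(descendants))   # 4
# Descendants are visited in document order, tags and strings alike.
print([type(d).__name__ for d in descendants])
# ['Tag', 'Tag', 'NavigableString', 'NavigableString']
```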
    

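    3. Node content: the .string attribute

    If a Tag has only one child and that child is a NavigableString, .string returns that child. If a Tag has only one child and that child is another Tag, .string returns the same result as the child's .string. If a Tag has more than one child, .string is None.

    Put simply: if a tag contains no other tags, .string returns the text inside it. If a tag contains exactly one tag and nothing else, .string returns the text inside that inner tag. For example: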

    #!/usr/bin/python3
    # -*- coding:utf-8 -*-
    __author__ = 'mayi'
    
    from bs4 import BeautifulSoup
    
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    # Create the BeautifulSoup object, specifying the lxml parser
    soup = BeautifulSoup(html, "lxml")
    
    print(soup.head.string)
    
    print(soup.head.title.string)
    

    Output:

    The Dormouse's story
    The Dormouse's story
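
    When a tag holds several children, .string is None, and .strings or .stripped_strings can be used instead. A minimal sketch on a made-up fragment, assuming the standard-library "html.parser":

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: <p> has three children (text, <b>, text).
html = "<p>Once <b>upon</b> a time</p>"
soup = BeautifulSoup(html, "html.parser")  # assumption: stdlib parser
p = soup.p

print(p.string)      # None: more than one child, so no single string
print(p.b.string)    # upon

all_strings = list(p.strings)            # every string, in document order
clean = list(p.stripped_strings)         # same, with whitespace stripped
print(all_strings)   # ['Once ', 'upon', ' a time']
print(clean)         # ['Once', 'upon', 'a time']
print(p.get_text())  # Once upon a time
```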
    

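    Searching the document tree

    1. find_all(name, attrs, recursive, text, **kwargs)

    1) The name parameter

    The name parameter finds all Tags whose name is name; text strings are ignored automatically.

    A. Passing a string

    The simplest filter is a string. When a string is passed to a search method, Beautiful Soup finds every tag whose name matches the string exactly and returns the results as a list.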

    #!/usr/bin/python3
    # -*- coding:utf-8 -*-
    __author__ = 'mayi'
    
    from bs4 import BeautifulSoup
    
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    # Create the BeautifulSoup object, specifying the lxml parser
    soup = BeautifulSoup(html, "lxml")
    
    print(soup.find_all("b"))
    
    print(soup.find_all("a"))
    

    Output:

    [<b>The Dormouse's story</b>]
    [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    

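    B. Passing a regular expression

    If a compiled regular expression is passed, Beautiful Soup filters tag names with the pattern's search() method. The pattern below is anchored with ^, so it matches only names that begin with b.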

    #!/usr/bin/python3
    # -*- coding:utf-8 -*-
    __author__ = 'mayi'
    
    from bs4 import BeautifulSoup
    import re
    
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    # Create the BeautifulSoup object, specifying the lxml parser
    soup = BeautifulSoup(html, "lxml")
    
    for tag in soup.find_all(re.compile("^b")):
        print(tag.name)
    

    Output:

    body
    b
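
    The effect of the ^ anchor is easier to see next to an unanchored pattern. A minimal sketch on a made-up document, assuming the standard-library "html.parser":

```python
import re

from bs4 import BeautifulSoup

# Hypothetical minimal document for this illustration.
html = "<html><head><title>t</title></head><body><b>x</b></body></html>"
soup = BeautifulSoup(html, "html.parser")  # assumption: stdlib parser

# Anchored: only tag names that start with "b".
anchored = [tag.name for tag in soup.find_all(re.compile("^b"))]

# Unanchored: any tag name that contains a "t" anywhere.
unanchored = [tag.name for tag in soup.find_all(re.compile("t"))]

print(anchored)     # ['body', 'b']
print(unanchored)   # ['html', 'title']
```

    The unanchored pattern matches html even though the t is not at the start of the name, which is why the anchor matters when a prefix match is intended.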
    

    C. Passing a list

    If a list is passed, Beautiful Soup returns, as a list, every tag that matches any element of the list.

    #!/usr/bin/python3
    # -*- coding:utf-8 -*-
    __author__ = 'mayi'
    
    from bs4 import BeautifulSoup
    
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    # Create the BeautifulSoup object, specifying the lxml parser
    soup = BeautifulSoup(html, "lxml")
    
    print(soup.find_all(['a', 'b']))
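
    As an illustration of what a list filter returns, here is a minimal sketch on a made-up fragment, assuming the standard-library "html.parser". Matches come back in document order, not in the order of the list:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: <b> appears before <a> in the document.
html = "<p><b>bold</b> <i>italic</i> <a>link</a></p>"
soup = BeautifulSoup(html, "html.parser")  # assumption: stdlib parser

# The filter lists "a" first, but results follow document order.
names = [tag.name for tag in soup.find_all(["a", "b"])]
print(names)  # ['b', 'a']
```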
    

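    2) Keyword arguments

    A keyword argument that is not one of the named parameters is treated as a filter on the tag attribute of the same name.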

    #!/usr/bin/python3
    # -*- coding:utf-8 -*-
    __author__ = 'mayi'
    
    from bs4 import BeautifulSoup
    
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    # Create the BeautifulSoup object, specifying the lxml parser
    soup = BeautifulSoup(html, "lxml")
    
    print(soup.find_all(id="link1"))
    

    Output:

    [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
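
    Keyword filters also cover classes, regular expressions, and attributes whose names are not valid Python identifiers. The following is a sketch on a made-up fragment (the data-kind attribute is invented for the example), assuming the standard-library "html.parser":

```python
import re

from bs4 import BeautifulSoup

# Hypothetical fragment; data-kind is an invented attribute.
html = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="/local" data-kind="internal" id="link3">Local</a>
</p>"""
soup = BeautifulSoup(html, "html.parser")  # assumption: stdlib parser


def ids(tags):
    """Collect the id attribute of each tag, for compact printing."""
    return [t["id"] for t in tags]


# class is a Python keyword, so the argument is spelled class_.
by_class = ids(soup.find_all("a", class_="sister"))
# Attribute values can be filtered with a regular expression.
by_href = ids(soup.find_all(href=re.compile(r"example\.com")))
# Names such as data-kind cannot be keywords; pass an attrs dict.
by_data = ids(soup.find_all(attrs={"data-kind": "internal"}))
# True matches any tag that has the attribute at all.
with_id = ids(soup.find_all(id=True))
# limit stops the search after the given number of matches.
limited = ids(soup.find_all("a", limit=1))

print(by_class, by_href, by_data, with_id, limited)
```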
    

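    3) The text parameter

    The text parameter searches the string content of the document rather than tags. Like the name parameter, it accepts a string, a regular expression, or a list. (Since Beautiful Soup 4.4.0 this parameter is also available under the name string.)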

    #!/usr/bin/python3
    # -*- coding:utf-8 -*-
    __author__ = 'mayi'
    
    from bs4 import BeautifulSoup
    import re
    
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    # Create the BeautifulSoup object, specifying the lxml parser
    soup = BeautifulSoup(html, "lxml")
    
    # String
    print(soup.find_all(text = " Elsie "))
    
    # List
    print(soup.find_all(text = ["Tillie", " Elsie ", "Lacie"]))
    
    # Regular expression
    print(soup.find_all(text = re.compile("Dormouse")))
    

    Output:

    [' Elsie ']
    [' Elsie ', 'Lacie', 'Tillie']
    ["The Dormouse's story", "The Dormouse's story"]
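
    A string search returns the strings themselves, not the tags around them. The sketch below shows two ways to get back to the tags; it uses a made-up fragment, the newer argument name string (which assumes Beautiful Soup 4.4.0 or later), and the standard-library "html.parser":

```python
import re

from bs4 import BeautifulSoup

# Hypothetical fragment for this illustration.
html = '<p><a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")  # assumption: stdlib parser

# Searching by string alone yields NavigableString objects.
strings = soup.find_all(string="Lacie")
# .parent walks from each string back to its enclosing tag.
parent_ids = [s.parent["id"] for s in strings]

# Combining a tag name with string yields the tags directly.
tags = soup.find_all("a", string=re.compile("^T"))
tag_ids = [t["id"] for t in tags]

print([str(s) for s in strings])  # ['Lacie']
print(parent_ids)                 # ['link2']
print(tag_ids)                    # ['link3']
```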
    

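    CSS selectors

    This is another way to search, serving much the same purpose as find_all().

    • In CSS, a tag name is written with no prefix, a class name is prefixed with ., and an id is prefixed with #
    • We can filter elements the same way here using soup.select(), which returns a list

    (1) Searching by tag name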

    #!/usr/bin/python3
    # -*- coding:utf-8 -*-
    __author__ = 'mayi'
    
    from bs4 import BeautifulSoup
    
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    # Create the BeautifulSoup object, specifying the lxml parser
    soup = BeautifulSoup(html, "lxml")
    
    print(soup.select("title"))
    
    print(soup.select("b"))
    
    print(soup.select("a"))
    

    Output:

    [<title>The Dormouse's story</title>]
    [<b>The Dormouse's story</b>]
    [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    

    (2) Searching by class name

    #!/usr/bin/python3
    # -*- coding:utf-8 -*-
    __author__ = 'mayi'
    
    from bs4 import BeautifulSoup
    
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    # Create the BeautifulSoup object, specifying the lxml parser
    soup = BeautifulSoup(html, "lxml")
    
    print(soup.select(".title"))
    

    Output:

    [<p class="title" name="dromouse"><b>The Dormouse's story</b></p>]
    

    (3) Searching by id

    #!/usr/bin/python3
    # -*- coding:utf-8 -*-
    __author__ = 'mayi'
    
    from bs4 import BeautifulSoup
    
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    # Create the BeautifulSoup object, specifying the lxml parser
    soup = BeautifulSoup(html, "lxml")
    
    print(soup.select("#link1"))
    

    Output:

    [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
    

    (4) Combined selectors

    #!/usr/bin/python3
    # -*- coding:utf-8 -*-
    __author__ = 'mayi'
    
    from bs4 import BeautifulSoup
    
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    # Create the BeautifulSoup object, specifying the lxml parser
    soup = BeautifulSoup(html, "lxml")
    
    print(soup.select("p #link1"))
    

    Output:

    [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
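
    A space in a selector means "any descendant"; standard CSS also has > for "direct child". The sketch below contrasts the two and shows select_one(), which returns only the first match. It uses a made-up fragment and assumes the standard-library "html.parser":

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one link is nested in <p>, one sits directly in <div>.
html = ('<div id="box"><p><a id="deep">deep</a></p>'
        '<a id="direct">direct</a></div>')
soup = BeautifulSoup(html, "html.parser")  # assumption: stdlib parser

# Space: every <a> anywhere below #box.
descendant_ids = [a["id"] for a in soup.select("#box a")]
# ">": only <a> tags that are direct children of #box.
child_ids = [a["id"] for a in soup.select("#box > a")]
# select_one() returns the first match, or None when nothing matches.
first = soup.select_one("#box a")
nothing = soup.select_one("#missing")

print(descendant_ids)   # ['deep', 'direct']
print(child_ids)        # ['direct']
print(first["id"])      # deep
print(nothing)          # None
```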
    

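    (5) Searching by attribute

    A selector can also include an attribute condition, written in square brackets. The tag name and the attribute condition refer to the same node, so there must be no space between them; with a space, the selector no longer matches the intended node.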

    #!/usr/bin/python3
    # -*- coding:utf-8 -*-
    __author__ = 'mayi'
    
    from bs4 import BeautifulSoup
    
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    # Create the BeautifulSoup object, specifying the lxml parser
    soup = BeautifulSoup(html, "lxml")
    
    print(soup.select("a[class='sister']"))
    

    Output:

    [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    

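    Attribute conditions can likewise be combined with the selectors above: parts that refer to different nodes are separated by a space, and parts that refer to the same node are written without a space.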

    #!/usr/bin/python3
    # -*- coding:utf-8 -*-
    __author__ = 'mayi'
    
    from bs4 import BeautifulSoup
    
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    # Create the BeautifulSoup object, specifying the lxml parser
    soup = BeautifulSoup(html, "lxml")
    
    print(soup.select("p a[class='sister']"))
    

    Output:

    [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    

    (6) Getting the content

    Every select() call above returns a list. We can loop over the list and call get_text() on each element to get its text.

    #!/usr/bin/python3
    # -*- coding:utf-8 -*-
    __author__ = 'mayi'
    
    from bs4 import BeautifulSoup
    
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    # Create the BeautifulSoup object, specifying the lxml parser
    soup = BeautifulSoup(html, "lxml")
    
    print(soup.select("p a[class='sister']"))
    
    for item in soup.select("p a[class='sister']"):
        print(item.get_text())
    

    Output:

    [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    
    Lacie
    Tillie
    

    Note: <!-- Elsie --> is a comment, so get_text() does not output it; the first link prints as an empty line.
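
    get_text() also accepts a separator and a strip flag, which help when the markup contains line breaks and indentation. A minimal sketch on a made-up fragment, assuming the standard-library "html.parser":

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with indentation and a comment inside <p>.
html = "<p>\n  Hello, <b>world</b>\n  <!-- hidden -->\n</p>"
soup = BeautifulSoup(html, "html.parser")  # assumption: stdlib parser
p = soup.p

# strip=True trims each piece of text and drops empty pieces.
compact = p.get_text(strip=True)
# A separator is inserted between the remaining pieces.
spaced = p.get_text(" ", strip=True)

print(repr(compact))   # 'Hello,world'
print(repr(spaced))    # 'Hello, world'
print("hidden" in p.get_text())  # False: the comment is not included
```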

  • Original article: https://www.cnblogs.com/mayi0312/p/7221717.html