zoukankan      html  css  js  c++  java
  • beautifulSoup基本用法及find选择器

       总结来源于官方文档:https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#find-all

    示例代码段

    复制代码
    html_doc = """
    <html>
    <head><title>The Dormouse's story <!--Hey, buddy. Want to buy a used parser?-->
    <a><!--Hey, buddy. Want to buy a used parser?--></a></title>
    </head>
    <body>
    <p class="title">
    <b>The Dormouse's story</b>
    <a><!--Hey, buddy. Want to buy a used parser?--></a>
    </p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1 link4">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    """
    复制代码

      1、快速操作:

    复制代码
    soup.title  == soup.find('title')
    # <title>The Dormouse's story</title>
    
    soup.title.name
    # u'title'
    
    soup.title.string  == soup.title.text  == soup.title.get_text()
    # u'The Dormouse's story'
    
    soup.title.parent.name
    # u'head'
    
    soup.p   == soup.find('p')  # . 点属性,只能获取当前标签下的第一个标签
    # <p class="title"><b>The Dormouse's story</b></p>
    
    soup.p['class']
    # u'title'
    
    soup.a  == soup.find('a')
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    
    soup.find_all('a')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    
    soup.find_all(['a','b']) # 查找所有的a标签和b标签
    soup.find_all(id=["link1","link2"]) # 查找所有id=link1 和id=link2的标签
    soup.find(id
    ="link3") # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>


    复制代码

      2、Beautiful Soup对象有四种类型:

        1、BeautifulSoup

        2、tag:标签

        3、NavigableString  : 标签中的文本,可包含注释内容

        4、Comment :标签中的注释,纯注释,没有正文内容

      标签属性的操做跟字典是一样一样的

      html多值属性(xml不适合):

        意思为一个属性名称,它是多值的,即包含多个属性值,即使属性中只有一个值也返回值为list,

        如:class,rel , rev , accept-charset , headers , accesskey

        其它属性为单值属性,即使属性值中有多个空格隔开的值,也是反回一个字符串

    soup.a['class']  #['sister']
    
    
    id_soup = BeautifulSoup('<p id="my id"></p>')
    id_soup.p['id']  #'my id'

      3、html中tag内容输出: 

        string:输出单一子标签文本内容或注释内容(选其一,标签中包含两种内容则输出为None)

        strings: 返回所有子孙标签的文本内容的生成器(不包含注释)

        stripped_strings:返回所有子孙标签的文本内容的生成器(不包含注释,并且在去掉了strings中的空行和空格)

        text:只输出文本内容,可同时输出多个子标签内容

        get_text():只输出文本内容,可同时输出多个子标签内容

      string:

    markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
    soup = BeautifulSoup(markup, 'html.parser')
    comm = soup.b.string
    print(comm)  # Hey, buddy. Want to buy a used parser?
    print(type(comm))  #<class 'bs4.element.Comment'>

       strings:

    复制代码
    head_tag = soup.body
    for s in head_tag.strings:
        print(repr(s))
    
    结果:
    '
    '
    "The Dormouse's story"
    '
    '
    'Once upon a time there were three little sisters; and their names were
            '
    'Elsie'
    ',
            '
    'Lacie'
    ' and
            '
    'Tillie'
    ';
            and they lived at the bottom of a well.
        '
    '
    '
    '...'
    '
    '
    复制代码

      stripped_strings:

    复制代码
    head_tag = soup.body
    for s in head_tag.stripped_strings:
        print(repr(s))
    
    结果:
    "The Dormouse's story"
    'Once upon a time there were three little sisters; and their names were'
    'Elsie'
    ','
    'Lacie'
    'and'
    'Tillie'
    ';
            and they lived at the bottom of a well.'
    '...'
    复制代码

      text:

    复制代码
    soup = BeautifulSoup(html_doc, 'html.parser')
    head_tag = soup.body
    print(head_tag.text)
    
    结果:
    The Dormouse's story
    Once upon a time there were three little sisters; and their names were
            Elsie,
            Lacie and
            Tillie;
            and they lived at the bottom of a well.
        
    ...
    复制代码
    复制代码
    soup = BeautifulSoup(html_doc, 'html.parser')
    head_tag = soup.body
    print(repr(head_tag.text))
    
    结果:
    "
    The Dormouse's story
    Once upon a time there were three little sisters; and their names were
            Elsie,
            Lacie and
            Tillie;
            and they lived at the bottom of a well.
        
    ...
    "
    复制代码

      4、返回子节点列表:

        .contents: 以列表的方式返回节点下的直接子节点

        .children:以生成器的方式反回节点下的直接子节点

    复制代码
    soup = BeautifulSoup(html_doc, 'html.parser')
    head_tag = soup.head
    print(head_tag)
    print(head_tag.contents)
    print(head_tag.contents[0])
    print(head_tag.contents[0].contents)
    
    for ch in head_tag.children:
        print(ch)
    
    结果:
    <head><title>The Dormouse's story</title></head>
    [<title>The Dormouse's story</title>]
    <title>The Dormouse's story</title>
    ["The Dormouse's story"]
    <title>The Dormouse's story</title>
    复制代码

      5、返回子孙节点的生成器:

         .descendants: 以列表的方式返回标签下的子孙节点

    复制代码
    for ch in head_tag.descendants:
        print(ch)
    
    结果:
    <title>The Dormouse's story</title>
    The Dormouse's story
    复制代码

      6、父标签(parent):如果是bs4对象,不管本来是标签还是文本都可以找到其父标签,但是文本对象不能找到父标签

    复制代码
    soup = BeautifulSoup(html_doc, 'html.parser')
    tag_title = soup.b  # b标签
    print(tag_title.parent)  # b标签的父标签 p
    print(type(tag_title.string))  # b标签中的文本的类型,文本中有注释时结果为None <class 'bs4.element.NavigableString'>
    print(tag_title.string.parent)  # b标签中文本的父标签 b
    print(type(tag_title.text))  # b 标签中的文本类型为str,无bs4属性找到父标签
    复制代码

      7、递归父标签(parents):递归得到元素的所有父辈节点

    复制代码
    soup = BeautifulSoup(html_doc, 'html.parser')
    link = soup.a
    for parent in link.parents:
        print(parent.name)

    结果:

    p
    body
    html
    [document]

    复制代码

      8、前后节点查询(不是前后标签哦,文本也是节点之一):previous_sibling,next_sibling

       9、以生成器的方式迭代返回所有兄弟节点

    复制代码
    for sib in soup.a.next_siblings:
        print(sib)
        print("---------")
    
    结果:
    -------------
    ,
            
    ---------
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    ---------
    
    
    ---------
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    ---------
    ;
            and they lived at the bottom of a well.
        
    ---------
    复制代码

      10、搜索文档树

        过滤器:

          1、字符串

          2、正则表达式

          3、列表

          4、True

          5、方法

    复制代码
    html_doc = """<html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were</p>
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
    
    <p class="story">...</p>
    </body>
    """
    from bs4 import BeautifulSoup
    import re
    soup = BeautifulSoup(html_doc, 'html.parser')
    soup.find_all("a")  # 字符串参数
    soup.find_all(re.compile("^b"))  # 正则参数
    soup.find_all(re.compile("a"))  # 正则参数
    soup.find_all(re.compile("l$"))  # 正则参数
    soup.find_all(["a", "b"])  # 标签的列表参数
    soup.find_all(True)  # 返回所有标签
    def has_class_no_id(tag):
        return tag.has_attr("class") and not tag.has_attr("id")
    soup.find_all(has_class_no_id)  # 方法参数
    复制代码

      11、find选择器:

        语法 :

        # find_all( name , attrs , recursive , text , **kwargs )
        #  name :要查找的标签名
        #  attrs: 标签的属性
        #  recursive: 递归
        #  text: 查找文本
        # **kwargs :其它 键值参数

      特殊情况:
        data-foo="value",因中横杠不识别的原因,只能写成attrs={"data-foo":"value"},
        class="value",因class是关键字,所以要写成class_="value"或attrs={"class":"value"}
    复制代码
    from bs4 import BeautifulSoup
    import re
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    
    <p class="title"><b>The Dormouse's story</b></p>
    
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    
    <p class="story">...</p>
    """
    
    # find_all( name , attrs , recursive , text , **kwargs )
    #  name :要查找的标签名(字符串、正则、方法、True)
    #  attrs: 标签的属性
    #  recursive: 递归
    #  text: 查找文本
    # **kwargs :其它 键值参数
    soup = BeautifulSoup(html_doc, 'html.parser')
    print(soup.find_all('p', 'title')) # p标签且class="title"
    soup.find_all('title')  # 以列表形式返回 所有title标签a
    soup.find_all(attrs={"class":"sister"})  # 以列表形式返回 所有class属性==sister的标签
    soup.find_all(id='link2')  # 返回所有id属性==link2的标签
    soup.find_all(href=re.compile("elsie")) # 返回所有href属性包含elsie的标签
    soup.find_all(id=True)  # 返回 所有包含id属性的标签
    soup.find_all(id="link1", href=re.compile('elsie'))  #  id=link1且href包含elsie
    复制代码

    关于class的搜索
    复制代码
    soup = BeautifulSoup(html_doc, 'html.parser')
    css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
    css_soup.find_all("p", class_="body")  # 多值class,指定其中一个即可
    css_soup.find_all("p", class_="strikeout")
    css_soup.find_all("p", class_="body strikeout")  # 精确匹配
    # text 参数可以是字符串,列表、方法、True
    soup.find_all("a", text="Elsie")  # text="Elsie"的a标签
    复制代码

      12、父节点方法:

        find_parents( name , attrs , recursive , text , **kwargs )

        find_parent( name , attrs , recursive , text , **kwargs )

    复制代码
    html_doc = """<html>
        <head>
            <title>The Dormouse's story</title>
        </head>
    <body>
        <p class="title"><b>The Dormouse's story</b></p>
        <p class="story">Once upon a time there were three little sisters; and their names were</p>
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <p>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        </p>
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
        <p class="story">...</p>
    </body>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, 'html.parser')
    a_string = soup.find(text="Lacie")  # 文本为Lacie的节点
    type(a_string), a_string  # <class 'bs4.element.NavigableString'> Lacie
    a_parent = a_string.find_parent()  # a_string的父节点中的第一个节点
    a_parent = a_string.find_parent("p")  # a_string的父节点中的第一个p节点
    a_parents = a_string.find_parents()  # a_string的父节点
    a_parents = a_string.find_parents("a")  # a_string的父点中所有a节点
    复制代码

      13、后面的邻居节点:

        find_next_siblings( name , attrs , recursive , text , **kwargs )

        find_next_sibling( name , attrs , recursive , text , **kwargs )

    复制代码
    html_doc = """<html><head><title>The Dormouse's story</title></head>
    <body>
        <p class="title"><b>The Dormouse's story</b></p>
        <p class="story">Once upon a time there were three little sisters; and their names were</p>
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <b href="http://example.com/elsie" class="sister" id="link1">Elsie</b>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
            and they lived at the bottom of a well.
        <p class="story">...</p>
    </body>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, 'html.parser')
    first_link = soup.a  # 第一个a标签
    a_sibling = first_link.find_next_sibling()  # 后面邻居的第一个
    a_sibling = first_link.find_next_sibling("a")  # 后面邻居的第一个a
    a_siblings = first_link.find_next_siblings()  # 后面的所有邻居
    a_siblings = first_link.find_next_siblings("a")  # 后面邻居的所有a邻居
    复制代码

       14、前面的邻居节点:

        find_previous_siblings( name , attrs , recursive , text , **kwargs )

        find_previous_sibling( name , attrs , recursive , text , **kwargs )

      15、后面的节点:

        find_all_next( name , attrs , recursive , text , **kwargs )

        find_next( name , attrs , recursive , text , **kwargs )

    复制代码
    html_doc = """<html>
        <head>
            <title>The Dormouse's story</title>
        </head>
    <body>
        <p class="title"><b>The Dormouse's story</b></p>
        <p class="story">Once upon a time there were three little sisters; and their names were</p>
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <p>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        </p>
        <p>
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        </p>
            and they lived at the bottom of a well.
        <p class="story">...</p>
    </body>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, 'html.parser')
    a_string = soup.find(text="Lacie")
    a_next = a_string.find_next()  # 后面所有子孙标签的第一个
    a_next = a_string.find_next('a')  # 后面所有子孙标签的第一个a标签
    a_nexts = a_string.find_all_next()  # 后面的所有子孙标签
    a_nexts = a_string.find_all_next('a')  # 后面的所有子孙标签中的所有a标签
    复制代码

       16、前面的节点:

        find_all_previous( name , attrs , recursive , text , **kwargs )

        find_previous( name , attrs , recursive , text , **kwargs )

      17、解析部分文档:

        如果仅仅因为想要查找文档中的<a>标签而将整片文档进行解析,实在是浪费内存和时间.最快的方法是从一开始就把<a>标签以外的东西都忽略掉. SoupStrainer 类可以定义文档的某段内容,这样搜索文档时就不必先解析整篇文档,只会解析在 SoupStrainer 中定义过的文档. 创建一个 SoupStrainer 对象并作为 parse_only 参数给 BeautifulSoup 的构造方法即可。

      SoupStrainer 类参数:name , attrs , recursive , text , **kwargs

    复制代码
    html_doc = """<html>
        <head>
            <title>The Dormouse's story</title>
        </head>
    <body>
        <p class="title"><b>The Dormouse's story</b></p>
        <p class="story">Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        </p>
            and they lived at the bottom of a well.
        <p class="story">...</p>
    </body>
    """
    from bs4 import SoupStrainer
    a_tags = SoupStrainer('a')  # 所有a标签
    id_tags = SoupStrainer(id="link2")  # id=link2的标签
    def is_short_string(string):
        return len(string) < 10  # string长度小于10,返回True
    short_string = SoupStrainer(text=is_short_string)  # 符合条件的文本
    
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, 'html.parser', parse_only=a_tags).prettify()
    soup = BeautifulSoup(html_doc, 'html.parser', parse_only=id_tags).prettify()
    soup = BeautifulSoup(html_doc, 'html.parser', parse_only=short_string).prettify()
    复制代码

    <div id="cnblogs_post_body" class="blogpost-body"><p>&nbsp;</p><p>  总结来源于官方文档:https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#find-all</p><p>&nbsp;</p><p>示例代码段</p><div class="cnblogs_code"><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div><pre>html_doc = """<br>&lt;html&gt;<br>    &lt;head&gt;&lt;title&gt;The Dormouse's story &lt;!--Hey, buddy. Want to buy a used parser?--&gt;<br>    &lt;a&gt;&lt;!--Hey, buddy. Want to buy a used parser?--&gt;&lt;/a&gt;&lt;/title&gt;<br>    &lt;/head&gt;<br>&lt;body&gt;<br>    &lt;p class="title"&gt;<br>        &lt;b&gt;The Dormouse's story&lt;/b&gt;<br>        &lt;a&gt;&lt;!--Hey, buddy. Want to buy a used parser?--&gt;&lt;/a&gt;<br>    &lt;/p&gt;<br>    &lt;p class="story"&gt;Once upon a time there were three little sisters; and their names were<br>        &lt;a href="http://example.com/elsie" class="sister" id="link1 link4"&gt;Elsie&lt;/a&gt;,<br>        &lt;a href="http://example.com/lacie" class="sister" id="link2"&gt;Lacie&lt;/a&gt; and<br>        &lt;a href="http://example.com/tillie" class="sister" id="link3"&gt;Tillie&lt;/a&gt;;<br>        and they lived at the bottom of a well.<br>    &lt;/p&gt;<br>    &lt;p class="story"&gt;...&lt;/p&gt;<br>"""</pre><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div></div><p>&nbsp;</p><p>  1、快速操作:</p><div class="cnblogs_code"><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div><pre>soup.title  == soup.find(<span style="color: #800000">'</span><span style="color: #800000">title</span><span style="color: #800000">'</span><span style="color: #000000">)# </span>&lt;title&gt;The Dormouse<span style="color: #800000">'</span><span style="color: #800000">s story&lt;/title&gt;</span><span style="color: #000000">soup.title.name# u</span><span style="color: #800000">'</span><span style="color: #800000">title</span><span style="color: #800000">'</span><span style="color: #000000">
    soup.title.</span><span style="color: #0000ff">string</span>  == soup.title.text  ==<span style="color: #000000"> soup.title.get_text()# u</span><span style="color: #800000">'</span><span style="color: #800000">The Dormouse</span><span style="color: #800000">'</span>s story<span style="color: #800000">'</span><span style="color: #000000">soup.title.parent.name# u</span><span style="color: #800000">'</span><span style="color: #800000">head</span><span style="color: #800000">'</span><span style="color: #000000">
    soup.p   </span>== soup.find(<span style="color: #800000">'</span><span style="color: #800000">p</span><span style="color: #800000">'</span><span style="color: #000000">)  # . 点属性,只能获取当前标签下的第一个标签# </span>&lt;p <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">title</span><span style="color: #800000">"</span>&gt;&lt;b&gt;The Dormouse<span style="color: #800000">'</span><span style="color: #800000">s story&lt;/b&gt;&lt;/p&gt;</span><span style="color: #000000">soup.p[</span><span style="color: #800000">'</span><span style="color: #800000">class</span><span style="color: #800000">'</span><span style="color: #000000">]# u</span><span style="color: #800000">'</span><span style="color: #800000">title</span><span style="color: #800000">'</span><span style="color: #000000">
    soup.a  </span>== soup.find(<span style="color: #800000">'</span><span style="color: #800000">a</span><span style="color: #800000">'</span><span style="color: #000000">)# </span>&lt;a <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">sister</span><span style="color: #800000">"</span> href=<span style="color: #800000">"</span><span style="color: #800000">http://example.com/elsie</span><span style="color: #800000">"</span> id=<span style="color: #800000">"</span><span style="color: #800000">link1</span><span style="color: #800000">"</span>&gt;Elsie&lt;/a&gt;<span style="color: #000000">
    soup.find_all(</span><span style="color: #800000">'</span><span style="color: #800000">a</span><span style="color: #800000">'</span><span style="color: #000000">)# [</span>&lt;a <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">sister</span><span style="color: #800000">"</span> href=<span style="color: #800000">"</span><span style="color: #800000">http://example.com/elsie</span><span style="color: #800000">"</span> id=<span style="color: #800000">"</span><span style="color: #800000">link1</span><span style="color: #800000">"</span>&gt;Elsie&lt;/a&gt;<span style="color: #000000">,#  </span>&lt;a <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">sister</span><span style="color: #800000">"</span> href=<span style="color: #800000">"</span><span style="color: #800000">http://example.com/lacie</span><span style="color: #800000">"</span> id=<span style="color: #800000">"</span><span style="color: #800000">link2</span><span style="color: #800000">"</span>&gt;Lacie&lt;/a&gt;<span style="color: #000000">,#  </span>&lt;a <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">sister</span><span style="color: #800000">"</span> href=<span style="color: #800000">"</span><span style="color: #800000">http://example.com/tillie</span><span style="color: #800000">"</span> id=<span style="color: #800000">"</span><span style="color: #800000">link3</span><span style="color: #800000">"</span>&gt;Tillie&lt;/a&gt;<span style="color: #000000">]<br>soup.find_all(['a','b'])  # 查找所有的a标签和b标签<br>soup.find_all(id=["link1","link2"])  # 查找所有id=link1 和id=link2的标签<br>soup.find(id</span>=<span style="color: #800000">"</span><span style="color: #800000">link3</span><span style="color: #800000">"</span><span style="color: #000000">)# </span>&lt;a <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">sister</span><span style="color: #800000">"</span> href=<span style="color: #800000">"</span><span style="color: #800000">http://example.com/tillie</span><span style="color: #800000">"</span> id=<span style="color: #800000">"</span><span style="color: #800000">link3</span><span style="color: #800000">"</span>&gt;Tillie&lt;/a&gt;<br><br><br></pre><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div></div><p>  2、Beautiful Soup对象有四种类型:</p><p>    1、BeautifulSoup</p><p>    2、tag:标签</p><p>    3、NavigableString&nbsp; : 标签中的文本,可包含注释内容</p><p>    4、Comment :标签中的注释,纯注释,没有正文内容</p><p>&nbsp;</p><p>  标签属性的操做跟字典是一样一样的</p><p>  html多值属性(xml不适合):</p><p>    意思为一个属性名称,它是多值的,即包含多个属性值,即使属性中只有一个值也返回值为list,</p><p>    如:class,<tt class="docutils literal"><span class="pre">rel</span></tt>&nbsp;,&nbsp;<tt class="docutils literal"><span class="pre">rev</span></tt>&nbsp;,&nbsp;<tt class="docutils literal"><span class="pre">accept-charset</span></tt>&nbsp;,&nbsp;<tt class="docutils literal"><span class="pre">headers</span></tt>&nbsp;,&nbsp;<tt class="docutils literal"><span class="pre">accesskey</span></tt></p><p>    其它属性为单值属性,即使属性值中有多个空格隔开的值,也是反回一个字符串</p><div class="cnblogs_code"><pre>soup.a[<span style="color: #800000">'</span><span style="color: #800000">class</span><span style="color: #800000">'</span>]  #[<span style="color: #800000">'</span><span style="color: #800000">sister</span><span style="color: #800000">'</span><span style="color: #000000">]

    id_soup </span>= BeautifulSoup(<span style="color: #800000">'</span><span style="color: #800000">&lt;p id="my id"&gt;&lt;/p&gt;</span><span style="color: #800000">'</span><span style="color: #000000">)id_soup.p[</span><span style="color: #800000">'</span><span style="color: #800000">id</span><span style="color: #800000">'</span>]  #<span style="color: #800000">'</span><span style="color: #800000">my id</span><span style="color: #800000">'</span></pre></div><p>&nbsp;</p><p>  3、html中tag内容输出: </p><p>    string:输出单一子标签文本内容或注释内容(选其一,标签中包含两种内容则输出为None)</p><p>    strings: 返回所有子孙标签的文本内容的生成器(不包含注释)</p><p>    stripped_strings:返回所有子孙标签的文本内容的生成器(不包含注释,并且在去掉了strings中的空行和空格)</p><p>    text:只输出文本内容,可同时输出多个子标签内容</p><p>    get_text():只输出文本内容,可同时输出多个子标签内容</p><p>  string:</p><div class="cnblogs_code"><pre>markup = <span style="color: #800000">"</span><span style="color: #800000">&lt;b&gt;&lt;!--Hey, buddy. Want to buy a used parser?--&gt;&lt;/b&gt;</span><span style="color: #800000">"</span><span style="color: #000000">soup </span>= BeautifulSoup(markup, <span style="color: #800000">'</span><span style="color: #800000">html.parser</span><span style="color: #800000">'</span><span style="color: #000000">)comm </span>= soup.b.<span style="color: #0000ff">string</span><span style="color: #000000">print(comm)  # Hey, buddy. Want to buy a used parser?print(type(comm))  #&lt;class 'bs4.element.Comment'&gt;</span></pre></div><p>&nbsp;  strings:</p><div class="cnblogs_code"><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div><pre>head_tag =<span style="color: #000000"> soup.body</span><span style="color: #0000ff">for</span> s <span style="color: #0000ff">in</span><span style="color: #000000"> head_tag.strings:    print(repr(s))
    结果:</span><span style="color: #800000">'</span><span style="color: #800000"> </span><span style="color: #800000">'</span><span style="color: #800000">"</span><span style="color: #800000">The Dormouse's story</span><span style="color: #800000">"</span><span style="color: #800000">'</span><span style="color: #800000"> </span><span style="color: #800000">'</span><span style="color: #800000">'</span><span style="color: #800000">Once upon a time there were three little sisters; and their names were         </span><span style="color: #800000">'</span><span style="color: #800000">'</span><span style="color: #800000">Elsie</span><span style="color: #800000">'</span><span style="color: #800000">'</span><span style="color: #800000">,         </span><span style="color: #800000">'</span><span style="color: #800000">'</span><span style="color: #800000">Lacie</span><span style="color: #800000">'</span><span style="color: #800000">'</span><span style="color: #800000"> and         </span><span style="color: #800000">'</span><span style="color: #800000">'</span><span style="color: #800000">Tillie</span><span style="color: #800000">'</span><span style="color: #800000">'</span><span style="color: #800000">;         and they lived at the bottom of a well.     </span><span style="color: #800000">'</span><span style="color: #800000">'</span><span style="color: #800000"> </span><span style="color: #800000">'</span><span style="color: #800000">'</span><span style="color: #800000">...</span><span style="color: #800000">'</span><span style="color: #800000">'</span><span style="color: #800000"> </span><span style="color: #800000">'</span></pre><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div></div><p>  stripped_strings:</p><div class="cnblogs_code"><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div><pre>head_tag =<span style="color: #000000"> soup.body</span><span style="color: #0000ff">for</span> s <span style="color: #0000ff">in</span><span style="color: #000000"> head_tag.stripped_strings:    print(repr(s))
    结果:</span><span style="color: #800000">"</span><span style="color: #800000">The Dormouse's story</span><span style="color: #800000">"</span><span style="color: #800000">'</span><span style="color: #800000">Once upon a time there were three little sisters; and their names were</span><span style="color: #800000">'</span><span style="color: #800000">'</span><span style="color: #800000">Elsie</span><span style="color: #800000">'</span><span style="color: #800000">'</span><span style="color: #800000">,</span><span style="color: #800000">'</span><span style="color: #800000">'</span><span style="color: #800000">Lacie</span><span style="color: #800000">'</span><span style="color: #800000">'</span><span style="color: #800000">and</span><span style="color: #800000">'</span><span style="color: #800000">'</span><span style="color: #800000">Tillie</span><span style="color: #800000">'</span><span style="color: #800000">'</span><span style="color: #800000">;         and they lived at the bottom of a well.</span><span style="color: #800000">'</span><span style="color: #800000">'</span><span style="color: #800000">...</span><span style="color: #800000">'</span></pre><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div></div><p>  text:</p><div class="cnblogs_code"><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div><pre>soup = BeautifulSoup(html_doc, <span style="color: #800000">'</span><span style="color: #800000">html.parser</span><span style="color: #800000">'</span><span style="color: #000000">)head_tag </span>=<span style="color: #000000"> soup.bodyprint(head_tag.text)
    结果:The Dormouse</span><span style="color: #800000">'</span><span style="color: #800000">s story</span><span style="color: #000000">Once upon a time there were three little sisters; and their names were        Elsie,        Lacie and        Tillie;        and they lived at the bottom of a well.    ...</span></pre><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div></div><div class="cnblogs_code"><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div><pre>soup = BeautifulSoup(html_doc, <span style="color: #800000">'</span><span style="color: #800000">html.parser</span><span style="color: #800000">'</span><span style="color: #000000">)head_tag </span>=<span style="color: #000000"> soup.bodyprint(repr(head_tag.text))
    结果:</span><span style="color: #800000">"</span><span style="color: #800000"> The Dormouse's story Once upon a time there were three little sisters; and their names were         Elsie,         Lacie and         Tillie;         and they lived at the bottom of a well.     ... </span><span style="color: #800000">"</span></pre><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div></div><p>&nbsp;</p><p>&nbsp;</p><p>  4、返回子节点列表:</p><p>    .contents: 以列表的方式返回节点下的直接子节点</p><p>    .children:以生成器的方式反回节点下的直接子节点</p><div class="cnblogs_code"><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div><pre>soup = BeautifulSoup(html_doc, <span style="color: #800000">'</span><span style="color: #800000">html.parser</span><span style="color: #800000">'</span><span style="color: #000000">)head_tag </span>=<span style="color: #000000"> soup.headprint(head_tag)print(head_tag.contents)print(head_tag.contents[</span><span style="color: #800080">0</span><span style="color: #000000">])print(head_tag.contents[</span><span style="color: #800080">0</span><span style="color: #000000">].contents)
    </span><span style="color: #0000ff">for</span> ch <span style="color: #0000ff">in</span><span style="color: #000000"> head_tag.children:    print(ch)
    结果:</span>&lt;head&gt;&lt;title&gt;The Dormouse<span style="color: #800000">'</span><span style="color: #800000">s story&lt;/title&gt;&lt;/head&gt;</span>[&lt;title&gt;The Dormouse<span style="color: #800000">'</span><span style="color: #800000">s story&lt;/title&gt;]</span>&lt;title&gt;The Dormouse<span style="color: #800000">'</span><span style="color: #800000">s story&lt;/title&gt;</span>[<span style="color: #800000">"</span><span style="color: #800000">The Dormouse's story</span><span style="color: #800000">"</span><span style="color: #000000">]</span>&lt;title&gt;The Dormouse<span style="color: #800000">'</span><span style="color: #800000">s story&lt;/title&gt;</span></pre><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div></div><p>&nbsp;</p><p>  5、返回子孙节点的生成器:</p><p>     .descendants: 以列表的方式返回标签下的子孙节点</p><div class="cnblogs_code"><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div><pre><span style="color: #0000ff">for</span> ch <span style="color: #0000ff">in</span><span style="color: #000000"> head_tag.descendants:    print(ch)
    结果:</span>&lt;title&gt;The Dormouse<span style="color: #800000">'</span><span style="color: #800000">s story&lt;/title&gt;</span>The Dormouse<span style="color: #800000">'</span><span style="color: #800000">s story</span></pre><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div></div><p>&nbsp;</p><p>  6、父标签(parent):如果是bs4对象,不管本来是标签还是文本都可以找到其父标签,但是文本对象不能找到父标签</p><div class="cnblogs_code"><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div><pre>soup = BeautifulSoup(html_doc, <span style="color: #800000">'</span><span style="color: #800000">html.parser</span><span style="color: #800000">'</span><span style="color: #000000">)tag_title </span>=<span style="color: #000000"> soup.b  # b标签print(tag_title.parent)  # b标签的父标签 pprint(type(tag_title.</span><span style="color: #0000ff">string</span>))  # b标签中的文本的类型,文本中有注释时结果为None &lt;<span style="color: #0000ff">class</span> <span style="color: #800000">'</span><span style="color: #800000">bs4.element.NavigableString</span><span style="color: #800000">'</span>&gt;<span style="color: #000000">print(tag_title.</span><span style="color: #0000ff">string</span><span style="color: #000000">.parent)  # b标签中文本的父标签 bprint(type(tag_title.text))  # b 标签中的文本类型为str,无bs4属性找到父标签</span></pre><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div></div><p>&nbsp;</p><p>  7、递归父标签(parents):递归得到元素的所有父辈节点</p><div class="cnblogs_code"><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div><pre>soup = BeautifulSoup(html_doc, <span style="color: #800000">'</span><span style="color: #800000">html.parser</span><span style="color: #800000">'</span><span style="color: #000000">)link </span>=<span style="color: #000000"> soup.a</span><span style="color: #0000ff">for</span> parent <span style="color: #0000ff">in</span><span style="color: #000000"> link.parents:    print(parent.name)<br><br>结果:<br></span></pre><p>p<br>body<br>html<br>[document]</p>















    <div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div></div><p>&nbsp;</p><p>  8、前后节点查询(不是前后标签哦,文本也是节点之一):previous_sibling,next_sibling</p><p><img src="https://images2017.cnblogs.com/blog/931154/201801/931154-20180124082140694-1377077553.png" alt=""></p><p>&nbsp;</p><p>&nbsp;  9、以生成器的方式迭代返回所有兄弟节点</p><div class="cnblogs_code"><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div><pre><span style="color: #0000ff">for</span> sib <span style="color: #0000ff">in</span><span style="color: #000000"> soup.a.next_siblings:    print(sib)    print(</span><span style="color: #800000">"</span><span style="color: #800000">---------</span><span style="color: #800000">"</span><span style="color: #000000">)
    结果:</span>-------------<span style="color: #000000">,        </span>---------&lt;a <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">sister</span><span style="color: #800000">"</span> href=<span style="color: #800000">"</span><span style="color: #800000">http://example.com/lacie</span><span style="color: #800000">"</span> id=<span style="color: #800000">"</span><span style="color: #800000">link2</span><span style="color: #800000">"</span>&gt;Lacie&lt;/a&gt;---------

    ---------&lt;a <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">sister</span><span style="color: #800000">"</span> href=<span style="color: #800000">"</span><span style="color: #800000">http://example.com/tillie</span><span style="color: #800000">"</span> id=<span style="color: #800000">"</span><span style="color: #800000">link3</span><span style="color: #800000">"</span>&gt;Tillie&lt;/a&gt;---------<span style="color: #000000">;        and they lived at the bottom of a well.    </span>---------</pre><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div></div><p>&nbsp;</p><p>  10、搜索文档树</p><p>    过滤器:</p><p>      1、字符串</p><p>      2、正则表达式</p><p>      3、列表</p><p>      4、True</p><p>      5、方法</p><div class="cnblogs_code"><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div><pre>html_doc = <span style="color: #800000">"""</span><span style="color: #800000">&lt;html&gt;&lt;head&gt;&lt;title&gt;The Dormouse's story&lt;/title&gt;&lt;/head&gt;</span>&lt;body&gt;&lt;p <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">title</span><span style="color: #800000">"</span>&gt;&lt;b&gt;The Dormouse<span style="color: #800000">'</span><span style="color: #800000">s story&lt;/b&gt;&lt;/p&gt;</span>&lt;p <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">story</span><span style="color: #800000">"</span>&gt;Once upon a time there were three little sisters; and their names were&lt;/p&gt;&lt;a href=<span style="color: #800000">"</span><span style="color: #800000">http://example.com/elsie</span><span style="color: #800000">"</span> <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">sister</span><span style="color: #800000">"</span> id=<span style="color: #800000">"</span><span style="color: #800000">link1</span><span style="color: #800000">"</span>&gt;Elsie&lt;/a&gt;<span style="color: #000000">,</span>&lt;a href=<span style="color: #800000">"</span><span style="color: #800000">http://example.com/lacie</span><span style="color: #800000">"</span> <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">sister</span><span style="color: #800000">"</span> id=<span style="color: #800000">"</span><span style="color: #800000">link2</span><span style="color: #800000">"</span>&gt;Lacie&lt;/a&gt;<span style="color: #000000"> and</span>&lt;a href=<span style="color: #800000">"</span><span style="color: #800000">http://example.com/tillie</span><span style="color: #800000">"</span> <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">sister</span><span style="color: #800000">"</span> id=<span style="color: #800000">"</span><span style="color: #800000">link3</span><span style="color: #800000">"</span>&gt;Tillie&lt;/a&gt;<span style="color: #000000">;and they lived at the bottom of a well.
    </span>&lt;p <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">story</span><span style="color: #800000">"</span>&gt;...&lt;/p&gt;&lt;/body&gt;<span style="color: #800000">"""</span><span style="color: #0000ff">from</span><span style="color: #000000"> bs4 import BeautifulSoupimport resoup </span>= BeautifulSoup(html_doc, <span style="color: #800000">'</span><span style="color: #800000">html.parser</span><span style="color: #800000">'</span><span style="color: #000000">)soup.find_all(</span><span style="color: #800000">"</span><span style="color: #800000">a</span><span style="color: #800000">"</span><span style="color: #000000">)  # 字符串参数soup.find_all(re.compile(</span><span style="color: #800000">"</span><span style="color: #800000">^b</span><span style="color: #800000">"</span><span style="color: #000000">))  # 正则参数soup.find_all(re.compile(</span><span style="color: #800000">"</span><span style="color: #800000">a</span><span style="color: #800000">"</span><span style="color: #000000">))  # 正则参数soup.find_all(re.compile(</span><span style="color: #800000">"</span><span style="color: #800000">l$</span><span style="color: #800000">"</span><span style="color: #000000">))  # 正则参数soup.find_all([</span><span style="color: #800000">"</span><span style="color: #800000">a</span><span style="color: #800000">"</span>, <span style="color: #800000">"</span><span style="color: #800000">b</span><span style="color: #800000">"</span><span style="color: #000000">])  # 标签的列表参数soup.find_all(True)  # 返回所有标签def has_class_no_id(tag):    </span><span style="color: #0000ff">return</span> tag.has_attr(<span style="color: #800000">"</span><span style="color: #800000">class</span><span style="color: #800000">"</span>) and not tag.has_attr(<span style="color: #800000">"</span><span style="color: #800000">id</span><span style="color: #800000">"</span><span style="color: #000000">)soup.find_all(has_class_no_id)  # 方法参数</span></pre><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div></div><p>&nbsp;</p><p><span style="font-family: 黑体; font-size: 14pt">  11、find选择器:</span></p><p>    语法 :</p><pre><span>    # find_all( name , attrs , recursive , text , **<span>kwargs )    #  name :要查找的标签名    #  attrs: 标签的属性    #  recursive: 递归    #  text: 查找文本    # **<span>kwargs :其它 键值参数<br><br>  特殊情况:<br>    </span></span></span><span class="s">data-foo="value",</span><span class="s">因中横杠不识别的原因,只能写成</span><span class="s">attrs={"data-foo":"value"},</span></pre><pre><span><span><span>    class="value",因class是关键字,所以要写成class_="value"或attrs={"class":"value"}</span></span></span></pre><div class="cnblogs_code"><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div><pre><span style="color: #0000ff">from</span><span style="color: #000000"> bs4 import BeautifulSoupimport rehtml_doc </span>= <span style="color: #800000">"""</span>&lt;html&gt;&lt;head&gt;&lt;title&gt;The Dormouse<span style="color: #800000">'</span><span style="color: #800000">s story&lt;/title&gt;&lt;/head&gt;</span>
    &lt;p <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">title</span><span style="color: #800000">"</span>&gt;&lt;b&gt;The Dormouse<span style="color: #800000">'</span><span style="color: #800000">s story&lt;/b&gt;&lt;/p&gt;</span>
    &lt;p <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">story</span><span style="color: #800000">"</span>&gt;<span style="color: #000000">Once upon a time there were three little sisters; and their names were</span>&lt;a href=<span style="color: #800000">"</span><span style="color: #800000">http://example.com/elsie</span><span style="color: #800000">"</span> <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">sister</span><span style="color: #800000">"</span> id=<span style="color: #800000">"</span><span style="color: #800000">link1</span><span style="color: #800000">"</span>&gt;Elsie&lt;/a&gt;<span style="color: #000000">,</span>&lt;a href=<span style="color: #800000">"</span><span style="color: #800000">http://example.com/lacie</span><span style="color: #800000">"</span> <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">sister</span><span style="color: #800000">"</span> id=<span style="color: #800000">"</span><span style="color: #800000">link2</span><span style="color: #800000">"</span>&gt;Lacie&lt;/a&gt;<span style="color: #000000"> and</span>&lt;a href=<span style="color: #800000">"</span><span style="color: #800000">http://example.com/tillie</span><span style="color: #800000">"</span> <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">sister</span><span style="color: #800000">"</span> id=<span style="color: #800000">"</span><span style="color: #800000">link3</span><span style="color: #800000">"</span>&gt;Tillie&lt;/a&gt;<span style="color: #000000">;and they lived at the bottom of a well.</span>&lt;/p&gt;
    &lt;p <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">story</span><span style="color: #800000">"</span>&gt;...&lt;/p&gt;<span style="color: #800000">"""</span><span style="color: #000000"># find_all( name , attrs , recursive , text , </span>**<span style="color: #000000">kwargs )#  name :要查找的标签名(字符串、正则、方法、True)#  attrs: 标签的属性#  recursive: 递归#  text: 查找文本# </span>**<span style="color: #000000">kwargs :其它 键值参数soup </span>= BeautifulSoup(html_doc, <span style="color: #800000">'</span><span style="color: #800000">html.parser</span><span style="color: #800000">'</span><span style="color: #000000">)print(soup.find_all(</span><span style="color: #800000">'</span><span style="color: #800000">p</span><span style="color: #800000">'</span>, <span style="color: #800000">'</span><span style="color: #800000">title</span><span style="color: #800000">'</span>)) # p标签且class=<span style="color: #800000">"</span><span style="color: #800000">title</span><span style="color: #800000">"</span><span style="color: #000000">soup.find_all(</span><span style="color: #800000">'</span><span style="color: #800000">title</span><span style="color: #800000">'</span><span style="color: #000000">)  # 以列表形式返回 所有title标签asoup.find_all(attrs</span>={<span style="color: #800000">"</span><span style="color: #800000">class</span><span style="color: #800000">"</span>:<span style="color: #800000">"</span><span style="color: #800000">sister</span><span style="color: #800000">"</span>})  # 以列表形式返回 所有class属性==<span style="color: #000000">sister的标签soup.find_all(id</span>=<span style="color: #800000">'</span><span style="color: #800000">link2</span><span style="color: #800000">'</span>)  # 返回所有id属性==<span style="color: #000000">link2的标签soup.find_all(href</span>=re.compile(<span style="color: #800000">"</span><span style="color: #800000">elsie</span><span style="color: #800000">"</span><span style="color: #000000">)) # 返回所有href属性包含elsie的标签soup.find_all(id</span>=<span style="color: #000000">True)  # 返回 所有包含id属性的标签soup.find_all(id</span>=<span style="color: #800000">"</span><span style="color: #800000">link1</span><span style="color: #800000">"</span>, href=re.compile(<span style="color: #800000">'</span><span style="color: #800000">elsie</span><span style="color: #800000">'</span>))  #  id=link1且href包含elsie</pre><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div></div><p><img src="https://images2017.cnblogs.com/blog/931154/201801/931154-20180128222706647-1457600468.png" alt=""></p><pre>关于class的搜索</pre><div class="cnblogs_code"><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div><pre>soup = BeautifulSoup(html_doc, <span style="color: #800000">'</span><span style="color: #800000">html.parser</span><span style="color: #800000">'</span><span style="color: #000000">)css_soup </span>= BeautifulSoup(<span style="color: #800000">'</span><span style="color: #800000">&lt;p class="body strikeout"&gt;&lt;/p&gt;</span><span style="color: #800000">'</span>, <span style="color: #800000">'</span><span style="color: #800000">html.parser</span><span style="color: #800000">'</span><span style="color: #000000">)css_soup.find_all(</span><span style="color: #800000">"</span><span style="color: #800000">p</span><span style="color: #800000">"</span>, class_=<span style="color: #800000">"</span><span style="color: #800000">body</span><span style="color: #800000">"</span><span style="color: #000000">)  # 多值class,指定其中一个即可css_soup.find_all(</span><span style="color: #800000">"</span><span style="color: #800000">p</span><span style="color: #800000">"</span>, class_=<span style="color: #800000">"</span><span style="color: #800000">strikeout</span><span style="color: #800000">"</span><span style="color: #000000">)css_soup.find_all(</span><span style="color: #800000">"</span><span style="color: #800000">p</span><span style="color: #800000">"</span>, class_=<span style="color: #800000">"</span><span style="color: #800000">body strikeout</span><span style="color: #800000">"</span><span style="color: #000000">)  # 精确匹配# text 参数可以是字符串,列表、方法、Truesoup.find_all(</span><span style="color: #800000">"</span><span style="color: #800000">a</span><span style="color: #800000">"</span>, text=<span style="color: #800000">"</span><span style="color: #800000">Elsie</span><span style="color: #800000">"</span>)  # text=<span style="color: #800000">"</span><span style="color: #800000">Elsie</span><span style="color: #800000">"</span>的a标签</pre><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div></div><p>&nbsp;</p><p>  12、父节点方法:</p><p>    find_parents(&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id32">name</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css">attrs</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#recursive">recursive</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text">text</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword">**kwargs</a>&nbsp;)</p><p>    find_parent(&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id32">name</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css">attrs</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#recursive">recursive</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text">text</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword">**kwargs</a>&nbsp;)</p><div class="cnblogs_code"><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div><pre>html_doc = <span style="color: #800000">"""</span><span style="color: #800000">&lt;html&gt;</span>    &lt;head&gt;        &lt;title&gt;The Dormouse<span style="color: #800000">'</span><span style="color: #800000">s story&lt;/title&gt;</span>    &lt;/head&gt;&lt;body&gt;    &lt;p <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">title</span><span style="color: #800000">"</span>&gt;&lt;b&gt;The Dormouse<span style="color: #800000">'</span><span style="color: #800000">s story&lt;/b&gt;&lt;/p&gt;</span>    &lt;p <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">story</span><span style="color: #800000">"</span>&gt;Once upon a time there were three little sisters; and their names were&lt;/p&gt;    &lt;a href=<span style="color: #800000">"</span><span style="color: #800000">http://example.com/elsie</span><span style="color: #800000">"</span> <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">sister</span><span style="color: #800000">"</span> id=<span style="color: #800000">"</span><span style="color: #800000">link1</span><span style="color: #800000">"</span>&gt;Elsie&lt;/a&gt;<span style="color: #000000">,    </span>&lt;p&gt;        &lt;a href=<span style="color: #800000">"</span><span style="color: #800000">http://example.com/lacie</span><span style="color: #800000">"</span> <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">sister</span><span style="color: #800000">"</span> id=<span style="color: #800000">"</span><span style="color: #800000">link2</span><span style="color: #800000">"</span>&gt;Lacie&lt;/a&gt;<span style="color: #000000"> and    </span>&lt;/p&gt;    &lt;a href=<span style="color: #800000">"</span><span style="color: #800000">http://example.com/tillie</span><span style="color: #800000">"</span> <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">sister</span><span style="color: #800000">"</span> id=<span style="color: #800000">"</span><span style="color: #800000">link3</span><span style="color: #800000">"</span>&gt;Tillie&lt;/a&gt;<span style="color: #000000">;    and they lived at the bottom of a well.    </span>&lt;p <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">story</span><span style="color: #800000">"</span>&gt;...&lt;/p&gt;&lt;/body&gt;<span style="color: #800000">"""</span><span style="color: #0000ff">from</span><span style="color: #000000"> bs4 import BeautifulSoupsoup </span>= BeautifulSoup(html_doc, <span style="color: #800000">'</span><span style="color: #800000">html.parser</span><span style="color: #800000">'</span><span style="color: #000000">)a_string </span>= soup.find(text=<span style="color: #800000">"</span><span style="color: #800000">Lacie</span><span style="color: #800000">"</span><span style="color: #000000">)  # 文本为Lacie的节点type(a_string), a_string  # </span>&lt;<span style="color: #0000ff">class</span> <span style="color: #800000">'</span><span style="color: #800000">bs4.element.NavigableString</span><span style="color: #800000">'</span>&gt;<span style="color: #000000"> Laciea_parent </span>=<span style="color: #000000"> a_string.find_parent()  # a_string的父节点中的第一个节点a_parent </span>= a_string.find_parent(<span style="color: #800000">"</span><span style="color: #800000">p</span><span style="color: #800000">"</span><span style="color: #000000">)  # a_string的父节点中的第一个p节点a_parents </span>=<span style="color: #000000"> a_string.find_parents()  # a_string的父节点a_parents </span>= a_string.find_parents(<span style="color: #800000">"</span><span style="color: #800000">a</span><span style="color: #800000">"</span>)  # a_string的父点中所有a节点</pre><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div></div><p>&nbsp;</p><p>  13、后面的邻居节点:</p><p>    find_next_siblings(&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id32">name</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css">attrs</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#recursive">recursive</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text">text</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword">**kwargs</a>&nbsp;)</p><p>    find_next_sibling(&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id32">name</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css">attrs</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#recursive">recursive</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text">text</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword">**kwargs</a>&nbsp;)</p><div class="cnblogs_code"><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div><pre>html_doc = <span style="color: #800000">"""</span><span style="color: #800000">&lt;html&gt;&lt;head&gt;&lt;title&gt;The Dormouse's story&lt;/title&gt;&lt;/head&gt;</span>&lt;body&gt;    &lt;p <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">title</span><span style="color: #800000">"</span>&gt;&lt;b&gt;The Dormouse<span style="color: #800000">'</span><span style="color: #800000">s story&lt;/b&gt;&lt;/p&gt;</span>    &lt;p <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">story</span><span style="color: #800000">"</span>&gt;Once upon a time there were three little sisters; and their names were&lt;/p&gt;    &lt;a href=<span style="color: #800000">"</span><span style="color: #800000">http://example.com/elsie</span><span style="color: #800000">"</span> <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">sister</span><span style="color: #800000">"</span> id=<span style="color: #800000">"</span><span style="color: #800000">link1</span><span style="color: #800000">"</span>&gt;Elsie&lt;/a&gt;<span style="color: #000000">,    </span>&lt;b href=<span style="color: #800000">"</span><span style="color: #800000">http://example.com/elsie</span><span style="color: #800000">"</span> <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">sister</span><span style="color: #800000">"</span> id=<span style="color: #800000">"</span><span style="color: #800000">link1</span><span style="color: #800000">"</span>&gt;Elsie&lt;/b&gt;<span style="color: #000000">,    </span>&lt;a href=<span style="color: #800000">"</span><span style="color: #800000">http://example.com/lacie</span><span style="color: #800000">"</span> <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">sister</span><span style="color: #800000">"</span> id=<span style="color: #800000">"</span><span style="color: #800000">link2</span><span style="color: #800000">"</span>&gt;Lacie&lt;/a&gt;<span style="color: #000000"> and    </span>&lt;a href=<span style="color: #800000">"</span><span style="color: #800000">http://example.com/tillie</span><span style="color: #800000">"</span> <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">sister</span><span style="color: #800000">"</span> id=<span style="color: #800000">"</span><span style="color: #800000">link3</span><span style="color: #800000">"</span>&gt;Tillie&lt;/a&gt;<span style="color: #000000">;        and they lived at the bottom of a well.    </span>&lt;p <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">story</span><span style="color: #800000">"</span>&gt;...&lt;/p&gt;&lt;/body&gt;<span style="color: #800000">"""</span><span style="color: #0000ff">from</span><span style="color: #000000"> bs4 import BeautifulSoupsoup </span>= BeautifulSoup(html_doc, <span style="color: #800000">'</span><span style="color: #800000">html.parser</span><span style="color: #800000">'</span><span style="color: #000000">)first_link </span>=<span style="color: #000000"> soup.a  # 第一个a标签a_sibling </span>=<span style="color: #000000"> first_link.find_next_sibling()  # 后面邻居的第一个a_sibling </span>= first_link.find_next_sibling(<span style="color: #800000">"</span><span style="color: #800000">a</span><span style="color: #800000">"</span><span style="color: #000000">)  # 后面邻居的第一个aa_siblings </span>=<span style="color: #000000"> first_link.find_next_siblings()  # 后面的所有邻居a_siblings </span>= first_link.find_next_siblings(<span style="color: #800000">"</span><span style="color: #800000">a</span><span style="color: #800000">"</span>)  # 后面邻居的所有a邻居</pre><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div></div><p>&nbsp;</p><p>&nbsp;  14、前面的邻居节点:</p><p>    find_previous_siblings(&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id32">name</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css">attrs</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#recursive">recursive</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text">text</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword">**kwargs</a>&nbsp;)</p><p>    find_previous_sibling(&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id32">name</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css">attrs</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#recursive">recursive</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text">text</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword">**kwargs</a>&nbsp;)</p><p>&nbsp;</p><p>  15、后面的节点:</p><p>    find_all_next(&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id32">name</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css">attrs</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#recursive">recursive</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text">text</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword">**kwargs</a>&nbsp;)</p><p>    find_next(&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id32">name</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css">attrs</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#recursive">recursive</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text">text</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword">**kwargs</a>&nbsp;)</p><div class="cnblogs_code"><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div><pre>html_doc = <span style="color: #800000">"""</span><span style="color: #800000">&lt;html&gt;</span>    &lt;head&gt;        &lt;title&gt;The Dormouse<span style="color: #800000">'</span><span style="color: #800000">s story&lt;/title&gt;</span>    &lt;/head&gt;&lt;body&gt;    &lt;p <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">title</span><span style="color: #800000">"</span>&gt;&lt;b&gt;The Dormouse<span style="color: #800000">'</span><span style="color: #800000">s story&lt;/b&gt;&lt;/p&gt;</span>    &lt;p <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">story</span><span style="color: #800000">"</span>&gt;Once upon a time there were three little sisters; and their names were&lt;/p&gt;    &lt;a href=<span style="color: #800000">"</span><span style="color: #800000">http://example.com/elsie</span><span style="color: #800000">"</span> <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">sister</span><span style="color: #800000">"</span> id=<span style="color: #800000">"</span><span style="color: #800000">link1</span><span style="color: #800000">"</span>&gt;Elsie&lt;/a&gt;<span style="color: #000000">,    </span>&lt;p&gt;        &lt;a href=<span style="color: #800000">"</span><span style="color: #800000">http://example.com/lacie</span><span style="color: #800000">"</span> <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">sister</span><span style="color: #800000">"</span> id=<span style="color: #800000">"</span><span style="color: #800000">link2</span><span style="color: #800000">"</span>&gt;Lacie&lt;/a&gt;<span style="color: #000000"> and    </span>&lt;/p&gt;    &lt;p&gt;        &lt;a href=<span style="color: #800000">"</span><span style="color: #800000">http://example.com/tillie</span><span style="color: #800000">"</span> <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">sister</span><span style="color: #800000">"</span> id=<span style="color: #800000">"</span><span style="color: #800000">link3</span><span style="color: #800000">"</span>&gt;Tillie&lt;/a&gt;<span style="color: #000000">;    </span>&lt;/p&gt;<span style="color: #000000">        and they lived at the bottom of a well.    </span>&lt;p <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">story</span><span style="color: #800000">"</span>&gt;...&lt;/p&gt;&lt;/body&gt;<span style="color: #800000">"""</span><span style="color: #0000ff">from</span><span style="color: #000000"> bs4 import BeautifulSoupsoup </span>= BeautifulSoup(html_doc, <span style="color: #800000">'</span><span style="color: #800000">html.parser</span><span style="color: #800000">'</span><span style="color: #000000">)a_string </span>= soup.find(text=<span style="color: #800000">"</span><span style="color: #800000">Lacie</span><span style="color: #800000">"</span><span style="color: #000000">)a_next </span>=<span style="color: #000000"> a_string.find_next()  # 后面所有子孙标签的第一个a_next </span>= a_string.find_next(<span style="color: #800000">'</span><span style="color: #800000">a</span><span style="color: #800000">'</span><span style="color: #000000">)  # 后面所有子孙标签的第一个a标签a_nexts </span>=<span style="color: #000000"> a_string.find_all_next()  # 后面的所有子孙标签a_nexts </span>= a_string.find_all_next(<span style="color: #800000">'</span><span style="color: #800000">a</span><span style="color: #800000">'</span>)  # 后面的所有子孙标签中的所有a标签</pre><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div></div><p>&nbsp;</p><p>&nbsp;  16、前面的节点:</p><p>    find_all_previous(&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id32">name</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css">attrs</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#recursive">recursive</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text">text</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword">**kwargs</a>&nbsp;)</p><p>    find_previous(&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id32">name</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css">attrs</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#recursive">recursive</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text">text</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword">**kwargs</a>&nbsp;)</p><p>&nbsp;</p><p>  17、解析部分文档:</p><p>    如果仅仅因为想要查找文档中的&lt;a&gt;标签而将整片文档进行解析,实在是浪费内存和时间.最快的方法是从一开始就把&lt;a&gt;标签以外的东西都忽略掉.&nbsp;<tt class="docutils literal"><span class="pre">SoupStrainer</span></tt>&nbsp;类可以定义文档的某段内容,这样搜索文档时就不必先解析整篇文档,只会解析在&nbsp;<tt class="docutils literal"><span class="pre">SoupStrainer</span></tt>&nbsp;中定义过的文档. 创建一个&nbsp;<tt class="docutils literal"><span class="pre">SoupStrainer</span></tt>&nbsp;对象并作为&nbsp;<tt class="docutils literal"><span class="pre">parse_only</span></tt>&nbsp;参数给&nbsp;<tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt>&nbsp;的构造方法即可。</p><p><tt class="docutils literal"><span class="pre">  SoupStrainer</span></tt>&nbsp;类参数:<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id32">name</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css">attrs</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#recursive">recursive</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text">text</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword">**kwargs</a></p><div class="cnblogs_code"><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div><pre>html_doc = <span style="color: #800000">"""</span><span style="color: #800000">&lt;html&gt;</span>    &lt;head&gt;        &lt;title&gt;The Dormouse<span style="color: #800000">'</span><span style="color: #800000">s story&lt;/title&gt;</span>    &lt;/head&gt;&lt;body&gt;    &lt;p <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">title</span><span style="color: #800000">"</span>&gt;&lt;b&gt;The Dormouse<span style="color: #800000">'</span><span style="color: #800000">s story&lt;/b&gt;&lt;/p&gt;</span>    &lt;p <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">story</span><span style="color: #800000">"</span>&gt;<span style="color: #000000">Once upon a time there were three little sisters; and their names were        </span>&lt;a href=<span style="color: #800000">"</span><span style="color: #800000">http://example.com/elsie</span><span style="color: #800000">"</span> <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">sister</span><span style="color: #800000">"</span> id=<span style="color: #800000">"</span><span style="color: #800000">link1</span><span style="color: #800000">"</span>&gt;Elsie&lt;/a&gt;<span style="color: #000000">,        </span>&lt;a href=<span style="color: #800000">"</span><span style="color: #800000">http://example.com/lacie</span><span style="color: #800000">"</span> <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">sister</span><span style="color: #800000">"</span> id=<span style="color: #800000">"</span><span style="color: #800000">link2</span><span style="color: #800000">"</span>&gt;Lacie&lt;/a&gt;<span style="color: #000000"> and        </span>&lt;a href=<span style="color: #800000">"</span><span style="color: #800000">http://example.com/tillie</span><span style="color: #800000">"</span> <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">sister</span><span style="color: #800000">"</span> id=<span style="color: #800000">"</span><span style="color: #800000">link3</span><span style="color: #800000">"</span>&gt;Tillie&lt;/a&gt;<span style="color: #000000">;    </span>&lt;/p&gt;<span style="color: #000000">        and they lived at the bottom of a well.    </span>&lt;p <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">story</span><span style="color: #800000">"</span>&gt;...&lt;/p&gt;&lt;/body&gt;<span style="color: #800000">"""</span><span style="color: #0000ff">from</span><span style="color: #000000"> bs4 import SoupStrainera_tags </span>= SoupStrainer(<span style="color: #800000">'</span><span style="color: #800000">a</span><span style="color: #800000">'</span><span style="color: #000000">)  # 所有a标签id_tags </span>= SoupStrainer(id=<span style="color: #800000">"</span><span style="color: #800000">link2</span><span style="color: #800000">"</span>)  # id=<span style="color: #000000">link2的标签def is_short_string(</span><span style="color: #0000ff">string</span><span style="color: #000000">):    </span><span style="color: #0000ff">return</span> len(<span style="color: #0000ff">string</span>) &lt; <span style="color: #800080">10</span><span style="color: #000000">  # string长度小于10,返回Trueshort_string </span>= SoupStrainer(text=<span style="color: #000000">is_short_string)  # 符合条件的文本
    </span><span style="color: #0000ff">from</span><span style="color: #000000"> bs4 import BeautifulSoupsoup </span>= BeautifulSoup(html_doc, <span style="color: #800000">'</span><span style="color: #800000">html.parser</span><span style="color: #800000">'</span>, parse_only=<span style="color: #000000">a_tags).prettify()soup </span>= BeautifulSoup(html_doc, <span style="color: #800000">'</span><span style="color: #800000">html.parser</span><span style="color: #800000">'</span>, parse_only=<span style="color: #000000">id_tags).prettify()soup </span>= BeautifulSoup(html_doc, <span style="color: #800000">'</span><span style="color: #800000">html.parser</span><span style="color: #800000">'</span>, parse_only=short_string).prettify()</pre><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div></div><p>&nbsp;</p></div>

  • 相关阅读:
    Installing Oracle Database 12c Release 2(12.2) RAC on RHEL7.3 in Silent Mode
    周四测试
    假期生活
    《人月神话》阅读笔记三
    《人月神话》阅读笔记二
    《人月神话》阅读笔记一
    软件进度7
    软件进度6
    软件进度5
    软件进度4
  • 原文地址:https://www.cnblogs.com/l-jie-n/p/9749562.html
Copyright © 2011-2022 走看看