  • Beautiful Soup: basic usage and the find selectors

       Summary based on the official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#find-all

    Sample document

    html_doc = """
    <html>
    <head><title>The Dormouse's story <!--Hey, buddy. Want to buy a used parser?-->
    <a><!--Hey, buddy. Want to buy a used parser?--></a></title>
    </head>
    <body>
    <p class="title">
    <b>The Dormouse's story</b>
    <a><!--Hey, buddy. Want to buy a used parser?--></a>
    </p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1 link4">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    """

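      1. Quick access:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, 'html.parser')

    # The results in the comments below are those of the plain "three sisters"
    # document from the official docs. The variant above adds comments and
    # extra <a> tags, so running it gives slightly different output.

    soup.title  == soup.find('title')
    # <title>The Dormouse's story</title>
    
    soup.title.name
    # 'title'
    
    soup.title.string  == soup.title.text  == soup.title.get_text()  # holds when <title> contains a single text node
    # "The Dormouse's story"
    
    soup.title.parent.name
    # 'head'
    
    soup.p   == soup.find('p')  # attribute access returns only the first matching tag under the current tag
    # <p class="title"><b>The Dormouse's story</b></p>
    
    soup.p['class']
    # ['title']  (class is multi-valued, so a list is returned)
    
    soup.a  == soup.find('a')
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    
    soup.find_all('a')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    
    soup.find_all(['a','b']) # all <a> tags and all <b> tags
    soup.find_all(id=["link1","link2"]) # all tags whose id is link1 or link2
    soup.find(id="link3") # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
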

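      2. Beautiful Soup has four kinds of objects:

        1. BeautifulSoup

        2. Tag: a tag

        3. NavigableString: the text inside a tag (Comment is a special kind of NavigableString)

        4. Comment: a comment inside a tag; comment only, no regular text

      Tag attributes are read and modified exactly like a dictionary.

      Multi-valued attributes in HTML (this does not apply to XML):

        A multi-valued attribute is an attribute name that can hold several values. Its value is always returned as a list, even when it holds only one value,

        e.g. class, rel , rev , accept-charset , headers , accesskey

        All other attributes are single-valued: even if the value contains several space-separated tokens, a single string is returned

    soup.a['class']  # ['sister']
    
    
    id_soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')
    id_soup.p['id']  # 'my id'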
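
      The dictionary-style access and the multi-valued rule above can be sketched as follows. This is a minimal illustration, assuming bs4 is installed and the built-in html.parser is used; the one-line markup is made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document, used only for this illustration.
soup = BeautifulSoup('<p class="body strikeout" id="my id">text</p>', 'html.parser')
p = soup.p

classes = p['class']      # multi-valued attribute: a list of values
p_id = p['id']            # single-valued attribute: one string, spaces kept
missing = p.get('title')  # .get() returns None instead of raising KeyError

p['lang'] = 'en'          # add an attribute, like setting a dict key
del p['id']               # remove an attribute, like deleting a dict key

print(classes, repr(p_id), missing, p.attrs)
```

      Using .get() rather than square brackets is usually the safer choice when an attribute may be absent.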

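      3. Getting the content of a tag:

        string: returns the text or the comment of a tag that has a single child (one or the other; if the tag holds both, or several children, the result is None)

        strings: a generator over the text of all descendants (comments excluded)

        stripped_strings: same as strings, but with surrounding whitespace stripped and whitespace-only strings skipped (comments excluded)

        text: the text only, concatenated across all descendant tags

        get_text(): same as text; concatenates the text of all descendant tags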

      string:

    markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
    soup = BeautifulSoup(markup, 'html.parser')
    comm = soup.b.string
    print(comm)  # Hey, buddy. Want to buy a used parser?
    print(type(comm))  #<class 'bs4.element.Comment'>

       strings:

    soup = BeautifulSoup(html_doc, 'html.parser')
    body_tag = soup.body
    for s in body_tag.strings:
        print(repr(s))
    
    Output:
    '\n'
    "The Dormouse's story"
    '\n'
    'Once upon a time there were three little sisters; and their names were\n        '
    'Elsie'
    ',\n        '
    'Lacie'
    ' and\n        '
    'Tillie'
    ';\n        and they lived at the bottom of a well.\n    '
    '\n'
    '...'
    '\n'

      stripped_strings:

    body_tag = soup.body
    for s in body_tag.stripped_strings:
        print(repr(s))
    
    Output:
    "The Dormouse's story"
    'Once upon a time there were three little sisters; and their names were'
    'Elsie'
    ','
    'Lacie'
    'and'
    'Tillie'
    ';\n        and they lived at the bottom of a well.'
    '...'

      text:

    soup = BeautifulSoup(html_doc, 'html.parser')
    body_tag = soup.body
    print(body_tag.text)
    
    Output:
    The Dormouse's story
    Once upon a time there were three little sisters; and their names were
            Elsie,
            Lacie and
            Tillie;
            and they lived at the bottom of a well.
        
    ...

    soup = BeautifulSoup(html_doc, 'html.parser')
    body_tag = soup.body
    print(repr(body_tag.text))
    
    Output:
    "\nThe Dormouse's story\nOnce upon a time there were three little sisters; and their names were\n        Elsie,\n        Lacie and\n        Tillie;\n        and they lived at the bottom of a well.\n    \n...\n"
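
      To compare the five accessors side by side, here is a small sketch on made-up markup. It assumes a reasonably recent Beautiful Soup 4 with html.parser; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one text node, one nested tag and one comment inside <p>.
markup = '<p>Hello <b>world</b><!-- a comment --></p>'
soup = BeautifulSoup(markup, 'html.parser')
p = soup.p

single = soup.b.string                # <b> has exactly one child, so this is its text
ambiguous = p.string                  # <p> has several children, so this is None
pieces = list(p.strings)              # every text node, comment excluded
stripped = list(p.stripped_strings)   # same, with surrounding whitespace removed
joined = p.get_text()                 # all text nodes concatenated
same = (p.text == joined)             # .text is a shortcut for get_text()

print(repr(single), ambiguous, pieces, stripped, repr(joined), same)
```

      When the text is needed as one string, get_text() also accepts a separator and strip=True, e.g. p.get_text(' ', strip=True).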

      4. Listing child nodes:

        .contents: returns the direct children of a node as a list

        .children: returns the direct children of a node as an iterator

    soup = BeautifulSoup(html_doc, 'html.parser')
    head_tag = soup.head
    print(head_tag)
    print(head_tag.contents)
    print(head_tag.contents[0])
    print(head_tag.contents[0].contents)
    
    for ch in head_tag.children:
        print(ch)
    
    Output:
    <head><title>The Dormouse's story</title></head>
    [<title>The Dormouse's story</title>]
    <title>The Dormouse's story</title>
    ["The Dormouse's story"]
    <title>The Dormouse's story</title>

      5. Iterating over all descendants:

         .descendants: returns a generator (not a list) over every descendant node of a tag, text nodes included

    for ch in head_tag.descendants:
        print(ch)
    
    Output:
    <title>The Dormouse's story</title>
    The Dormouse's story
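
      The difference between direct children and all descendants is easier to see on a nested fragment. A minimal sketch, assuming html.parser and made-up markup:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment with two levels of nesting.
soup = BeautifulSoup('<div><p>one <b>two</b></p><p>three</p></div>', 'html.parser')
div = soup.div

direct = div.contents                           # list of direct children
child_names = [c.name for c in div.children]    # same nodes, via an iterator

# .descendants walks the whole subtree in document order;
# tags are shown by name, text nodes by their content.
walk = [d.name if isinstance(d, Tag) else str(d) for d in div.descendants]

print(len(direct), child_names, walk)
```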

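      6. Parent (parent): any bs4 object, whether a Tag or a NavigableString, can reach its parent through .parent; a plain Python str (for example the value of .text) cannot

    soup = BeautifulSoup(html_doc, 'html.parser')
    tag_b = soup.b  # the <b> tag
    print(tag_b.parent)  # parent of <b>: the <p> tag
    print(type(tag_b.string))  # <class 'bs4.element.NavigableString'>; .string is None when <b> also contains a comment
    print(tag_b.string.parent)  # parent of the text inside <b>: the <b> tag itself
    print(type(tag_b.text))  # <class 'str'>: a plain string with no bs4 attributes, so it has no .parent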

      7. All ancestors (parents): walks upward and yields every ancestor of an element

    soup = BeautifulSoup(html_doc, 'html.parser')
    link = soup.a
    for parent in link.parents:
        print(parent.name)

    Output:

    p
    body
    html
    [document]

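
      A compact sketch of .parent and .parents, including the point that a plain str has lost its link to the tree. The markup is made up for the example, and html.parser is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document.
soup = BeautifulSoup('<html><body><p><b>bold</b></p></body></html>', 'html.parser')
b = soup.b

text_node = b.string                        # NavigableString, still attached to the tree
parent_of_tag = b.parent.name               # the tag that encloses <b>
parent_of_text = text_node.parent.name      # the tag that encloses the text
ancestors = [p.name for p in b.parents]     # nearest ancestor first

plain = b.get_text()                        # plain Python str
has_parent = hasattr(plain, 'parent')       # a str knows nothing about the tree

print(parent_of_tag, parent_of_text, ancestors, has_parent)
```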

      8. Previous and next node (nodes, not tags: text counts as a node too): previous_sibling, next_sibling

       9. Iterating over all sibling nodes with a generator: next_siblings, previous_siblings

    for sib in soup.a.next_siblings:
        print(sib)
        print("---------")
    
    Output:
    ,
            
    ---------
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    ---------
    
    
    ---------
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    ---------
    ;
            and they lived at the bottom of a well.
        
    ---------
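
      Because text nodes are siblings too, it is common to filter the generator down to tags. A minimal sketch on made-up markup, assuming html.parser:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical paragraph: three links separated by plain text.
html = '<p><a id="1">A</a>, <a id="2">B</a> and <a id="3">C</a></p>'
soup = BeautifulSoup(html, 'html.parser')
first = soup.a

nxt = first.next_sibling                     # the text node right after the first <a>
all_next = list(first.next_siblings)         # text nodes and tags, in order
tag_ids = [s['id'] for s in first.next_siblings if isinstance(s, Tag)]
prev = soup.find(id='3').previous_sibling    # the text node right before the last <a>
no_prev = first.previous_sibling             # nothing precedes the first <a>

print(repr(nxt), len(all_next), tag_ids, repr(prev), no_prev)
```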

      10. Searching the tree

        Filters:

          1. String

          2. Regular expression

          3. List

          4. True

          5. Function

    html_doc = """<html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were</p>
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
    
    <p class="story">...</p>
    </body>
    """
    from bs4 import BeautifulSoup
    import re
    soup = BeautifulSoup(html_doc, 'html.parser')
    soup.find_all("a")  # string filter
    soup.find_all(re.compile("^b"))  # regex: tag names starting with "b"
    soup.find_all(re.compile("a"))  # regex: tag names containing "a"
    soup.find_all(re.compile("l$"))  # regex: tag names ending with "l"
    soup.find_all(["a", "b"])  # list filter: any of the listed tag names
    soup.find_all(True)  # every tag
    def has_class_no_id(tag):
        return tag.has_attr("class") and not tag.has_attr("id")
    soup.find_all(has_class_no_id)  # function filter
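
      To see what each of the five filter kinds actually selects, the sketch below collects the matched tag names on a tiny made-up document (html.parser assumed; the names() helper is only for illustration).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical document with a handful of tags.
html = ('<html><head><title>T</title></head><body>'
        '<p class="x"><b>bold</b></p>'
        '<a id="l1" class="s" href="#">link</a></body></html>')
soup = BeautifulSoup(html, 'html.parser')

def names(tags):
    """Return the tag names of a result set, in document order."""
    return [t.name for t in tags]

def has_class_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

by_string = names(soup.find_all('a'))               # exact tag name
by_regex = names(soup.find_all(re.compile('^b')))   # names starting with "b"
by_list = names(soup.find_all(['a', 'b']))          # any name in the list
by_true = names(soup.find_all(True))                # every tag, no text nodes
by_func = names(soup.find_all(has_class_no_id))     # tags the function accepts

print(by_string, by_regex, by_list, by_true, by_func)
```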

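      11. The find selectors:

        Syntax:

        # find_all( name , attrs , recursive , text , **kwargs )
        #  name: tag name to search for
        #  attrs: attributes of the tag
        #  recursive: search all descendants (True, the default) or only direct children (False)
        #  text: text to search for (called string since Beautiful Soup 4.4.0; text is the older name)
        # **kwargs: any other keyword argument, treated as an attribute filter

      Special cases:
        data-foo="value": a hyphen is not allowed in a Python keyword argument name, so it has to be written as attrs={"data-foo":"value"}
        class="value": class is a reserved word in Python, so write class_="value" or attrs={"class":"value"}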
    from bs4 import BeautifulSoup
    import re
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    
    <p class="title"><b>The Dormouse's story</b></p>
    
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    
    <p class="story">...</p>
    """
    
    # find_all( name , attrs , recursive , text , **kwargs )
    #  name: tag name to search for (string, regex, list, function, or True)
    #  attrs: attributes of the tag
    #  recursive: search all descendants or only direct children
    #  text: text to search for
    # **kwargs: any other keyword argument, treated as an attribute filter
    soup = BeautifulSoup(html_doc, 'html.parser')
    print(soup.find_all('p', 'title')) # <p> tags with class="title"
    soup.find_all('title')  # list of all <title> tags
    soup.find_all(attrs={"class":"sister"})  # list of all tags whose class is sister
    soup.find_all(id='link2')  # all tags whose id is link2
    soup.find_all(href=re.compile("elsie")) # all tags whose href contains "elsie"
    soup.find_all(id=True)  # all tags that have an id attribute
    soup.find_all(id="link1", href=re.compile('elsie'))  # id is link1 and href contains "elsie"
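
      A short sketch of how find and find_all relate, plus the effect of recursive. It uses made-up markup and html.parser. Note that limit is a real find_all argument that the signature above does not list.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical document: two links nested inside a paragraph.
html = ('<html><body><p class="story">'
        '<a href="http://example.com/elsie" id="link1">Elsie</a>'
        '<a href="http://example.com/lacie" id="link2">Lacie</a>'
        '</p></body></html>')
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('a')                                 # first match, or None
limited = soup.find_all('a', limit=1)                  # list with at most one match
by_href = soup.find_all(href=re.compile('lacie'))      # attribute matched by regex
with_id = soup.find_all(id=True)                       # any tag that has an id
direct = soup.body.find_all('a', recursive=False)      # direct children of <body> only
absent = soup.find('table')                            # no such tag in the document

print(first['id'], len(limited), [a['id'] for a in by_href],
      len(with_id), len(direct), absent)
```

      find returns a single element or None, while find_all always returns a list, so an empty result needs a different check in each case.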

    Searching by class
    soup = BeautifulSoup(html_doc, 'html.parser')
    css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
    css_soup.find_all("p", class_="body")  # multi-valued class: matching any one value is enough
    css_soup.find_all("p", class_="strikeout")
    css_soup.find_all("p", class_="body strikeout")  # exact match against the full attribute string
    # the text argument accepts a string, a regular expression, a list, a function, or True
    soup.find_all("a", text="Elsie")  # <a> tags whose text is "Elsie"
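
      The two special cases (class and hyphenated attribute names) can be checked with a sketch like this one. The markup is made up, and html.parser is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same two classes in different orders.
html = ('<div data-foo="value" class="body strikeout">x</div>'
        '<div class="strikeout body">y</div>')
soup = BeautifulSoup(html, 'html.parser')

by_data = soup.find_all(attrs={'data-foo': 'value'})        # hyphenated name needs attrs
one_class = soup.find_all('div', class_='body')             # either class value matches
exact = soup.find_all('div', class_='body strikeout')       # full string, order matters
via_attrs = soup.find_all('div', attrs={'class': 'strikeout'})

print(len(by_data), len(one_class), [d.get_text() for d in exact], len(via_attrs))
```

      The exact-string form only matches when the classes appear in the same order as in the markup, so matching on a single class value is usually more robust.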

      12. Searching among ancestors:

        find_parents( name , attrs , recursive , text , **kwargs )

        find_parent( name , attrs , recursive , text , **kwargs )

    html_doc = """<html>
        <head>
            <title>The Dormouse's story</title>
        </head>
    <body>
        <p class="title"><b>The Dormouse's story</b></p>
        <p class="story">Once upon a time there were three little sisters; and their names were</p>
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <p>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        </p>
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
        <p class="story">...</p>
    </body>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, 'html.parser')
    a_string = soup.find(text="Lacie")  # the text node "Lacie"
    type(a_string), a_string  # <class 'bs4.element.NavigableString'> Lacie
    a_parent = a_string.find_parent()  # nearest ancestor of a_string (the <a> tag)
    a_parent = a_string.find_parent("p")  # nearest ancestor that is a <p>
    a_parents = a_string.find_parents()  # all ancestors of a_string
    a_parents = a_string.find_parents("a")  # all ancestors that are <a> tags
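
      A minimal sketch of the upward search, on made-up markup with html.parser. It uses string= (the current name of the text= argument), so it assumes Beautiful Soup 4.4.0 or later.

```python
from bs4 import BeautifulSoup

# Hypothetical document: a link nested in a paragraph nested in a div.
html = '<html><body><div id="outer"><p><a id="link2">Lacie</a></p></div></body></html>'
soup = BeautifulSoup(html, 'html.parser')

a_string = soup.find(string='Lacie')                   # start from the text node
nearest = a_string.find_parent().name                  # closest enclosing tag
nearest_div = a_string.find_parent('div')['id']        # closest enclosing <div>
names = [t.name for t in a_string.find_parents()]      # nearest ancestor first
paragraphs = a_string.find_parents('p')                # only the <p> ancestors

print(nearest, nearest_div, names[:3], len(paragraphs))
```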

      13. Searching among following siblings:

        find_next_siblings( name , attrs , recursive , text , **kwargs )

        find_next_sibling( name , attrs , recursive , text , **kwargs )

    html_doc = """<html><head><title>The Dormouse's story</title></head>
    <body>
        <p class="title"><b>The Dormouse's story</b></p>
        <p class="story">Once upon a time there were three little sisters; and their names were</p>
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <b href="http://example.com/elsie" class="sister" id="link1">Elsie</b>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
            and they lived at the bottom of a well.
        <p class="story">...</p>
    </body>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, 'html.parser')
    first_link = soup.a  # the first <a> tag
    a_sibling = first_link.find_next_sibling()  # first following sibling
    a_sibling = first_link.find_next_sibling("a")  # first following sibling that is an <a>
    a_siblings = first_link.find_next_siblings()  # all following siblings
    a_siblings = first_link.find_next_siblings("a")  # all following siblings that are <a> tags

       14. Searching among preceding siblings:

        find_previous_siblings( name , attrs , recursive , text , **kwargs )

        find_previous_sibling( name , attrs , recursive , text , **kwargs )
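
      The sibling searches in sections 13 and 14 mirror each other, as this sketch on made-up markup shows (html.parser assumed). Every call passes a tag name, so text nodes between the tags are skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph: links and a <b> tag that are all siblings.
html = ('<p><a id="link1">Elsie</a>, <b>bold</b> '
        '<a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>')
soup = BeautifulSoup(html, 'html.parser')
first = soup.find(id='link1')
last = soup.find(id='link3')

next_b = first.find_next_sibling('b').name                       # first later <b>
next_a = first.find_next_sibling('a')['id']                      # first later <a>
later_a = [t['id'] for t in first.find_next_siblings('a')]       # document order
prev_a = last.find_previous_sibling('a')['id']                   # closest earlier <a>
earlier_a = [t['id'] for t in last.find_previous_siblings('a')]  # closest first

print(next_b, next_a, later_a, prev_a, earlier_a)
```

      The "previous" variants return results starting from the nearest sibling, so the list comes out in reverse document order.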

      15. Searching everything that comes later in the document:

        find_all_next( name , attrs , recursive , text , **kwargs )

        find_next( name , attrs , recursive , text , **kwargs )

    html_doc = """<html>
        <head>
            <title>The Dormouse's story</title>
        </head>
    <body>
        <p class="title"><b>The Dormouse's story</b></p>
        <p class="story">Once upon a time there were three little sisters; and their names were</p>
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <p>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        </p>
        <p>
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        </p>
            and they lived at the bottom of a well.
        <p class="story">...</p>
    </body>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, 'html.parser')
    a_string = soup.find(text="Lacie")
    a_next = a_string.find_next()  # first tag after a_string in document order
    a_next = a_string.find_next('a')  # first <a> tag after a_string in document order
    a_nexts = a_string.find_all_next()  # all tags after a_string
    a_nexts = a_string.find_all_next('a')  # all <a> tags after a_string

       16. Searching everything that comes earlier in the document:

        find_all_previous( name , attrs , recursive , text , **kwargs )

        find_previous( name , attrs , recursive , text , **kwargs )
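
      Unlike the sibling searches, find_next and find_previous follow document order and are not limited to nodes with the same parent. A minimal sketch on made-up markup, assuming html.parser and Beautiful Soup 4.4.0 or later for string=:

```python
from bs4 import BeautifulSoup

# Hypothetical document: each link sits in its own paragraph.
html = ('<body><p><a id="link1">Elsie</a></p>'
        '<p><a id="link2">Lacie</a> and</p>'
        '<p class="story">...</p></body>')
soup = BeautifulSoup(html, 'html.parser')
start = soup.find(string='Lacie')

next_p = start.find_next('p')['class']                          # the later <p>
later_a = start.find_all_next('a')                              # no <a> comes later
prev_ids = [a['id'] for a in start.find_all_previous('a')]      # closest first
no_sibling = start.find_next_sibling('p')                       # the text has no <p> sibling

print(next_p, len(later_a), prev_ids, no_sibling)
```

      Note that find_all_previous also returns the <a> that encloses the starting text, because a tag's opening comes before its content in document order.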

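      17. Parsing only part of a document:

        Parsing a whole document just to find its <a> tags wastes memory and time. The fastest approach is to ignore everything except the <a> tags from the start. The SoupStrainer class describes which parts of a document to keep, so only the parts that match the SoupStrainer are parsed. Create a SoupStrainer object and pass it to the BeautifulSoup constructor as the parse_only argument.

      SoupStrainer arguments: name , attrs , text , **kwargs (unlike find_all, it has no recursive argument)
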
    html_doc = """<html>
        <head>
            <title>The Dormouse's story</title>
        </head>
    <body>
        <p class="title"><b>The Dormouse's story</b></p>
        <p class="story">Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        </p>
            and they lived at the bottom of a well.
        <p class="story">...</p>
    </body>
    """
    from bs4 import SoupStrainer
    a_tags = SoupStrainer('a')  # keep only <a> tags
    id_tags = SoupStrainer(id="link2")  # keep only tags with id="link2"
    def is_short_string(string):
        return string is not None and len(string) < 10  # True for strings shorter than 10 characters
    short_string = SoupStrainer(text=is_short_string)  # keep only text that satisfies the function
    
    from bs4 import BeautifulSoup
    # prettify() returns a str, so each assignment below holds formatted markup, not a soup object
    soup = BeautifulSoup(html_doc, 'html.parser', parse_only=a_tags).prettify()
    soup = BeautifulSoup(html_doc, 'html.parser', parse_only=id_tags).prettify()
    soup = BeautifulSoup(html_doc, 'html.parser', parse_only=short_string).prettify()
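
      A quick way to confirm what survives partial parsing is to look for tags that should have been dropped. The sketch below uses made-up markup and html.parser; according to the official documentation, parse_only is not supported by the html5lib parser.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical document with a title, a paragraph and two links.
html = ('<html><head><title>T</title></head><body>'
        '<p>intro <a id="link1" href="/a">A</a> and '
        '<a id="link2" href="/b">B</a></p></body></html>')

# Keep only <a> tags: everything else is skipped while parsing.
links_only = BeautifulSoup(html, 'html.parser', parse_only=SoupStrainer('a'))
hrefs = [a['href'] for a in links_only.find_all('a')]
title_dropped = links_only.title is None
paragraphs = links_only.find_all('p')

# Keep only the tag whose id is link2.
one_link = BeautifulSoup(html, 'html.parser', parse_only=SoupStrainer(id='link2'))
kept_ids = [a['id'] for a in one_link.find_all('a')]

print(hrefs, title_dropped, len(paragraphs), kept_ids)
```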

    <div id="cnblogs_post_body" class="blogpost-body"><p>&nbsp;</p><p>  总结来源于官方文档:https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#find-all</p><p>&nbsp;</p><p>示例代码段</p><div class="cnblogs_code"><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div><pre>html_doc = """<br>&lt;html&gt;<br>    &lt;head&gt;&lt;title&gt;The Dormouse's story &lt;!--Hey, buddy. Want to buy a used parser?--&gt;<br>    &lt;a&gt;&lt;!--Hey, buddy. Want to buy a used parser?--&gt;&lt;/a&gt;&lt;/title&gt;<br>    &lt;/head&gt;<br>&lt;body&gt;<br>    &lt;p class="title"&gt;<br>        &lt;b&gt;The Dormouse's story&lt;/b&gt;<br>        &lt;a&gt;&lt;!--Hey, buddy. Want to buy a used parser?--&gt;&lt;/a&gt;<br>    &lt;/p&gt;<br>    &lt;p class="story"&gt;Once upon a time there were three little sisters; and their names were<br>        &lt;a href="http://example.com/elsie" class="sister" id="link1 link4"&gt;Elsie&lt;/a&gt;,<br>        &lt;a href="http://example.com/lacie" class="sister" id="link2"&gt;Lacie&lt;/a&gt; and<br>        &lt;a href="http://example.com/tillie" class="sister" id="link3"&gt;Tillie&lt;/a&gt;;<br>        and they lived at the bottom of a well.<br>    &lt;/p&gt;<br>    &lt;p class="story"&gt;...&lt;/p&gt;<br>"""</pre><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div></div><p>&nbsp;</p><p>  1、快速操作:</p><div class="cnblogs_code"><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div><pre>soup.title  == soup.find(<span style="color: #800000">'</span><span style="color: 
#800000">title</span><span style="color: #800000">'</span><span style="color: #000000">)# </span>&lt;title&gt;The Dormouse<span style="color: #800000">'</span><span style="color: #800000">s story&lt;/title&gt;</span><span style="color: #000000">soup.title.name# u</span><span style="color: #800000">'</span><span style="color: #800000">title</span><span style="color: #800000">'</span><span style="color: #000000">
    soup.title.</span><span style="color: #0000ff">string</span>  == soup.title.text  ==<span style="color: #000000"> soup.title.get_text()# u</span><span style="color: #800000">'</span><span style="color: #800000">The Dormouse</span><span style="color: #800000">'</span>s story<span style="color: #800000">'</span><span style="color: #000000">soup.title.parent.name# u</span><span style="color: #800000">'</span><span style="color: #800000">head</span><span style="color: #800000">'</span><span style="color: #000000">
    soup.p   </span>== soup.find(<span style="color: #800000">'</span><span style="color: #800000">p</span><span style="color: #800000">'</span><span style="color: #000000">)  # . 点属性,只能获取当前标签下的第一个标签# </span>&lt;p <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">title</span><span style="color: #800000">"</span>&gt;&lt;b&gt;The Dormouse<span style="color: #800000">'</span><span style="color: #800000">s story&lt;/b&gt;&lt;/p&gt;</span><span style="color: #000000">soup.p[</span><span style="color: #800000">'</span><span style="color: #800000">class</span><span style="color: #800000">'</span><span style="color: #000000">]# u</span><span style="color: #800000">'</span><span style="color: #800000">title</span><span style="color: #800000">'</span><span style="color: #000000">
    soup.a  </span>== soup.find(<span style="color: #800000">'</span><span style="color: #800000">a</span><span style="color: #800000">'</span><span style="color: #000000">)# </span>&lt;a <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">sister</span><span style="color: #800000">"</span> href=<span style="color: #800000">"</span><span style="color: #800000">http://example.com/elsie</span><span style="color: #800000">"</span> id=<span style="color: #800000">"</span><span style="color: #800000">link1</span><span style="color: #800000">"</span>&gt;Elsie&lt;/a&gt;<span style="color: #000000">
    soup.find_all(</span><span style="color: #800000">'</span><span style="color: #800000">a</span><span style="color: #800000">'</span><span style="color: #000000">)# [</span>&lt;a <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">sister</span><span style="color: #800000">"</span> href=<span style="color: #800000">"</span><span style="color: #800000">http://example.com/elsie</span><span style="color: #800000">"</span> id=<span style="color: #800000">"</span><span style="color: #800000">link1</span><span style="color: #800000">"</span>&gt;Elsie&lt;/a&gt;<span style="color: #000000">,#  </span>&lt;a <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">sister</span><span style="color: #800000">"</span> href=<span style="color: #800000">"</span><span style="color: #800000">http://example.com/lacie</span><span style="color: #800000">"</span> id=<span style="color: #800000">"</span><span style="color: #800000">link2</span><span style="color: #800000">"</span>&gt;Lacie&lt;/a&gt;<span style="color: #000000">,#  </span>&lt;a <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">sister</span><span style="color: #800000">"</span> href=<span style="color: #800000">"</span><span style="color: #800000">http://example.com/tillie</span><span style="color: #800000">"</span> id=<span style="color: #800000">"</span><span style="color: #800000">link3</span><span style="color: #800000">"</span>&gt;Tillie&lt;/a&gt;<span style="color: #000000">]<br>soup.find_all(['a','b'])  # 查找所有的a标签和b标签<br>soup.find_all(id=["link1","link2"])  # 查找所有id=link1 和id=link2的标签<br>soup.find(id</span>=<span style="color: #800000">"</span><span style="color: #800000">link3</span><span style="color: #800000">"</span><span style="color: #000000">)# </span>&lt;a <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span 
style="color: #800000">sister</span><span style="color: #800000">"</span> href=<span style="color: #800000">"</span><span style="color: #800000">http://example.com/tillie</span><span style="color: #800000">"</span> id=<span style="color: #800000">"</span><span style="color: #800000">link3</span><span style="color: #800000">"</span>&gt;Tillie&lt;/a&gt;<br><br><br></pre><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div></div><p>  2、Beautiful Soup对象有四种类型:</p><p>    1、BeautifulSoup</p><p>    2、tag:标签</p><p>    3、NavigableString&nbsp; : 标签中的文本,可包含注释内容</p><p>    4、Comment :标签中的注释,纯注释,没有正文内容</p><p>&nbsp;</p><p>  标签属性的操做跟字典是一样一样的</p><p>  html多值属性(xml不适合):</p><p>    意思为一个属性名称,它是多值的,即包含多个属性值,即使属性中只有一个值也返回值为list,</p><p>    如:class,<tt class="docutils literal"><span class="pre">rel</span></tt>&nbsp;,&nbsp;<tt class="docutils literal"><span class="pre">rev</span></tt>&nbsp;,&nbsp;<tt class="docutils literal"><span class="pre">accept-charset</span></tt>&nbsp;,&nbsp;<tt class="docutils literal"><span class="pre">headers</span></tt>&nbsp;,&nbsp;<tt class="docutils literal"><span class="pre">accesskey</span></tt></p><p>    其它属性为单值属性,即使属性值中有多个空格隔开的值,也是反回一个字符串</p><div class="cnblogs_code"><pre>soup.a[<span style="color: #800000">'</span><span style="color: #800000">class</span><span style="color: #800000">'</span>]  #[<span style="color: #800000">'</span><span style="color: #800000">sister</span><span style="color: #800000">'</span><span style="color: #000000">]

    id_soup </span>= BeautifulSoup(<span style="color: #800000">'</span><span style="color: #800000">&lt;p id="my id"&gt;&lt;/p&gt;</span><span style="color: #800000">'</span><span style="color: #000000">)id_soup.p[</span><span style="color: #800000">'</span><span style="color: #800000">id</span><span style="color: #800000">'</span>]  #<span style="color: #800000">'</span><span style="color: #800000">my id</span><span style="color: #800000">'</span></pre></div><p>&nbsp;</p><p>  3、html中tag内容输出: </p><p>    string:输出单一子标签文本内容或注释内容(选其一,标签中包含两种内容则输出为None)</p><p>    strings: 返回所有子孙标签的文本内容的生成器(不包含注释)</p><p>    stripped_strings:返回所有子孙标签的文本内容的生成器(不包含注释,并且在去掉了strings中的空行和空格)</p><p>    text:只输出文本内容,可同时输出多个子标签内容</p><p>    get_text():只输出文本内容,可同时输出多个子标签内容</p><p>  string:</p><div class="cnblogs_code"><pre>markup = <span style="color: #800000">"</span><span style="color: #800000">&lt;b&gt;&lt;!--Hey, buddy. Want to buy a used parser?--&gt;&lt;/b&gt;</span><span style="color: #800000">"</span><span style="color: #000000">soup </span>= BeautifulSoup(markup, <span style="color: #800000">'</span><span style="color: #800000">html.parser</span><span style="color: #800000">'</span><span style="color: #000000">)comm </span>= soup.b.<span style="color: #0000ff">string</span><span style="color: #000000">print(comm)  # Hey, buddy. Want to buy a used parser?print(type(comm))  #&lt;class 'bs4.element.Comment'&gt;</span></pre></div><p>&nbsp;  strings:</p><div class="cnblogs_code"><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div><pre>head_tag =<span style="color: #000000"> soup.body</span><span style="color: #0000ff">for</span> s <span style="color: #0000ff">in</span><span style="color: #000000"> head_tag.strings:    print(repr(s))
    结果:</span><span style="color: #800000">'</span><span style="color: #800000"> </span><span style="color: #800000">'</span><span style="color: #800000">"</span><span style="color: #800000">The Dormouse's story</span><span style="color: #800000">"</span><span style="color: #800000">'</span><span style="color: #800000"> </span><span style="color: #800000">'</span><span style="color: #800000">'</span><span style="color: #800000">Once upon a time there were three little sisters; and their names were         </span><span style="color: #800000">'</span><span style="color: #800000">'</span><span style="color: #800000">Elsie</span><span style="color: #800000">'</span><span style="color: #800000">'</span><span style="color: #800000">,         </span><span style="color: #800000">'</span><span style="color: #800000">'</span><span style="color: #800000">Lacie</span><span style="color: #800000">'</span><span style="color: #800000">'</span><span style="color: #800000"> and         </span><span style="color: #800000">'</span><span style="color: #800000">'</span><span style="color: #800000">Tillie</span><span style="color: #800000">'</span><span style="color: #800000">'</span><span style="color: #800000">;         and they lived at the bottom of a well.     
</span><span style="color: #800000">'</span><span style="color: #800000">'</span><span style="color: #800000"> </span><span style="color: #800000">'</span><span style="color: #800000">'</span><span style="color: #800000">...</span><span style="color: #800000">'</span><span style="color: #800000">'</span><span style="color: #800000"> </span><span style="color: #800000">'</span></pre><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div></div><p>  stripped_strings:</p><div class="cnblogs_code"><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div><pre>head_tag =<span style="color: #000000"> soup.body</span><span style="color: #0000ff">for</span> s <span style="color: #0000ff">in</span><span style="color: #000000"> head_tag.stripped_strings:    print(repr(s))
    结果:</span><span style="color: #800000">"</span><span style="color: #800000">The Dormouse's story</span><span style="color: #800000">"</span><span style="color: #800000">'</span><span style="color: #800000">Once upon a time there were three little sisters; and their names were</span><span style="color: #800000">'</span><span style="color: #800000">'</span><span style="color: #800000">Elsie</span><span style="color: #800000">'</span><span style="color: #800000">'</span><span style="color: #800000">,</span><span style="color: #800000">'</span><span style="color: #800000">'</span><span style="color: #800000">Lacie</span><span style="color: #800000">'</span><span style="color: #800000">'</span><span style="color: #800000">and</span><span style="color: #800000">'</span><span style="color: #800000">'</span><span style="color: #800000">Tillie</span><span style="color: #800000">'</span><span style="color: #800000">'</span><span style="color: #800000">;         and they lived at the bottom of a well.</span><span style="color: #800000">'</span><span style="color: #800000">'</span><span style="color: #800000">...</span><span style="color: #800000">'</span></pre><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div></div><p>  text:</p><div class="cnblogs_code"><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div><pre>soup = BeautifulSoup(html_doc, <span style="color: #800000">'</span><span style="color: #800000">html.parser</span><span style="color: #800000">'</span><span style="color: #000000">)head_tag </span>=<span style="color: #000000"> soup.bodyprint(head_tag.text)
    结果:The Dormouse</span><span style="color: #800000">'</span><span style="color: #800000">s story</span><span style="color: #000000">Once upon a time there were three little sisters; and their names were        Elsie,        Lacie and        Tillie;        and they lived at the bottom of a well.    ...</span></pre><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div></div><div class="cnblogs_code"><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div><pre>soup = BeautifulSoup(html_doc, <span style="color: #800000">'</span><span style="color: #800000">html.parser</span><span style="color: #800000">'</span><span style="color: #000000">)head_tag </span>=<span style="color: #000000"> soup.bodyprint(repr(head_tag.text))
    结果:</span><span style="color: #800000">"</span><span style="color: #800000"> The Dormouse's story Once upon a time there were three little sisters; and their names were         Elsie,         Lacie and         Tillie;         and they lived at the bottom of a well.     ... </span><span style="color: #800000">"</span></pre><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div></div><p>&nbsp;</p><p>&nbsp;</p><p>  4、返回子节点列表:</p><p>    .contents: 以列表的方式返回节点下的直接子节点</p><p>    .children:以生成器的方式反回节点下的直接子节点</p><div class="cnblogs_code"><div class="cnblogs_code_toolbar"><span class="cnblogs_code_copy"><a href="javascript:void(0);" onclick="copyCnblogsCode(this)" title="复制代码"><img src="//common.cnblogs.com/images/copycode.gif" alt="复制代码"></a></span></div><pre>soup = BeautifulSoup(html_doc, <span style="color: #800000">'</span><span style="color: #800000">html.parser</span><span style="color: #800000">'</span><span style="color: #000000">)head_tag </span>=<span style="color: #000000"> soup.headprint(head_tag)print(head_tag.contents)print(head_tag.contents[</span><span style="color: #800080">0</span><span style="color: #000000">])print(head_tag.contents[</span><span style="color: #800080">0</span><span style="color: #000000">].contents)
for ch in head_tag.children:
    print(ch)
Result:
&lt;head&gt;&lt;title&gt;The Dormouse's story&lt;/title&gt;&lt;/head&gt;
[&lt;title&gt;The Dormouse's story&lt;/title&gt;]
&lt;title&gt;The Dormouse's story&lt;/title&gt;
["The Dormouse's story"]
&lt;title&gt;The Dormouse's story&lt;/title&gt;</span></pre></div><p>&nbsp;</p><p>  5. Iterating over all descendants:</p><p>     .descendants: a generator (not a list) that yields every descendant of a tag: its children, their children, and so on</p><div class="cnblogs_code"><pre><span style="color: #000000">for ch in head_tag.descendants:
    print(ch)
Result:
&lt;title&gt;The Dormouse's story&lt;/title&gt;
The Dormouse's story</span></pre></div><p>&nbsp;</p><p>  6. Parent tag (.parent): any bs4 object, whether a Tag or a NavigableString, can reach its parent through .parent; a plain Python str (such as the value of .text) cannot</p><div class="cnblogs_code"><pre>soup = BeautifulSoup(html_doc, 'html.parser')
tag_title = soup.b  # the b tag
print(tag_title.parent)  # parent of the b tag: p
print(type(tag_title.string))  # type of the text inside b: &lt;class 'bs4.element.NavigableString'&gt; (.string is None when the tag also contains a comment)
print(tag_title.string.parent)  # parent of the text inside b: the b tag
print(type(tag_title.text))  # .text is a plain str, so it has no bs4 attributes for reaching a parent</pre></div><p>&nbsp;</p><p>  7. Walking up through ancestors (.parents): iterates over all of an element's ancestors</p><div class="cnblogs_code"><pre><span style="color: #000000">soup = BeautifulSoup(html_doc, 'html.parser')
link = soup.a
for parent in link.parents:
    print(parent.name)

Result:
p
body
html
[document]</span></pre>
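<p>As a small sketch of the two ideas above (.parent on a NavigableString and iterating .parents), the snippet below collects ancestor names into a list. The shortened HTML string and the variable names are assumptions made for illustration; they are not part of the original example.</p>

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened document used only for this illustration.
html = '<html><body><p class="story"><a id="link1">Elsie</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

link = soup.a
# .parents walks upward from the nearest parent to the BeautifulSoup object itself.
ancestor_names = [parent.name for parent in link.parents]
print(ancestor_names)  # ['p', 'body', 'html', '[document]']

# A NavigableString is still a bs4 object, so .parent works on it.
text_parent = link.string.parent
print(text_parent is link)  # True

# .text returns a plain str, which carries no tree navigation attributes.
print(hasattr(link.text, "parent"))  # False
```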
</div><p>&nbsp;</p><p>  8. Previous and next nodes (nodes, not just tags: text counts as a node too): previous_sibling, next_sibling</p><p><img src="https://images2017.cnblogs.com/blog/931154/201801/931154-20180124082140694-1377077553.png" alt=""></p><p>&nbsp;</p><p>  9. Iterating over all sibling nodes with a generator (here, .next_siblings)</p><div class="cnblogs_code"><pre><span style="color: #000000">for sib in soup.a.next_siblings:
    print(sib)
    print("---------")
Result:</span>
-------------
,
---------
&lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;
---------

---------
&lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;Tillie&lt;/a&gt;
---------
;
and they lived at the bottom of a well.
---------</pre></div><p>&nbsp;</p><p>  10. Searching the document tree</p><p>    Filters:</p><p>      1. String</p><p>      2. Regular expression</p><p>      3. List</p><p>      4. True</p><p>      5. Function</p><div class="cnblogs_code"><pre>html_doc = """&lt;html&gt;&lt;head&gt;&lt;title&gt;The Dormouse's story&lt;/title&gt;&lt;/head&gt;
&lt;body&gt;
&lt;p class="title"&gt;&lt;b&gt;The Dormouse's story&lt;/b&gt;&lt;/p&gt;
&lt;p class="story"&gt;Once upon a time there were three little sisters; and their names were&lt;/p&gt;
&lt;a href="http://example.com/elsie" class="sister" id="link1"&gt;Elsie&lt;/a&gt;,
&lt;a href="http://example.com/lacie" class="sister" id="link2"&gt;Lacie&lt;/a&gt; and
&lt;a href="http://example.com/tillie" class="sister" id="link3"&gt;Tillie&lt;/a&gt;;
<span style="color: #000000">and they lived at the bottom of a well.
</span>&lt;p class="story"&gt;...&lt;/p&gt;
&lt;/body&gt;
"""
from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html_doc, 'html.parser')
soup.find_all("a")  # string filter: every a tag
soup.find_all(re.compile("^b"))  # regex filter: tag names starting with b
soup.find_all(re.compile("a"))  # regex filter: tag names containing a
soup.find_all(re.compile("l$"))  # regex filter: tag names ending in l
soup.find_all(["a", "b"])  # list filter: every a tag and every b tag
soup.find_all(True)  # True: every tag


def has_class_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")


soup.find_all(has_class_no_id)  # function filter: tags that have a class attribute but no id</pre></div><p>&nbsp;</p><p><span style="font-family: 黑体; font-size: 14pt">  11. The find selectors:</span></p><p>    Syntax:</p><pre>    # find_all( name , attrs , recursive , text , **kwargs )
    #  name: the tag name to look for
    #  attrs: the tag's attributes
    #  recursive: search all descendants (the default) or, when False, only direct children
    #  text: text to look for (Beautiful Soup 4.4.0 and later also accept it under the name string)
    # **kwargs: other keyword arguments, used as attribute filters

  Special cases:
    data-foo="value": a hyphen is not allowed in a Python keyword argument name, so it has to be written as attrs={"data-foo": "value"}</pre><pre>    class="value": class is a reserved word in Python, so write class_="value" or attrs={"class": "value"}</pre><div class="cnblogs_code"><pre>from bs4 import BeautifulSoup
import re

html_doc = """&lt;html&gt;&lt;head&gt;&lt;title&gt;The Dormouse's story&lt;/title&gt;&lt;/head&gt;
&lt;p class="title"&gt;&lt;b&gt;The Dormouse's story&lt;/b&gt;&lt;/p&gt;
&lt;p class="story"&gt;Once upon a time there were three little sisters; and their names were
&lt;a href="http://example.com/elsie" class="sister" id="link1"&gt;Elsie&lt;/a&gt;,
&lt;a href="http://example.com/lacie" class="sister" id="link2"&gt;Lacie&lt;/a&gt; and
&lt;a href="http://example.com/tillie" class="sister" id="link3"&gt;Tillie&lt;/a&gt;;
and they lived at the bottom of a well.&lt;/p&gt;
&lt;p class="story"&gt;...&lt;/p&gt;
"""

# find_all( name , attrs , recursive , text , **kwargs )
#  name: the tag name to look for (string, regular expression, function, or True)
#  attrs: the tag's attributes
#  recursive: whether to search all descendants
#  text: text to look for
# **kwargs: other keyword arguments
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all('p', 'title'))  # p tags with class="title"
soup.find_all('title')  # all title tags, returned as a list
soup.find_all(attrs={"class": "sister"})  # all tags whose class attribute is sister, returned as a list
soup.find_all(id='link2')  # all tags whose id attribute is link2
soup.find_all(href=re.compile("elsie"))  # all tags whose href attribute contains elsie
soup.find_all(id=True)  # all tags that have an id attribute
soup.find_all(id="link1", href=re.compile('elsie'))  # id is link1 and href contains elsie</pre></div><p><img src="https://images2017.cnblogs.com/blog/931154/201801/931154-20180128222706647-1457600468.png" alt=""></p><pre>Searching by class</pre><div class="cnblogs_code"><pre>soup = BeautifulSoup(html_doc, 'html.parser')
css_soup = BeautifulSoup('&lt;p class="body strikeout"&gt;&lt;/p&gt;', 'html.parser')
css_soup.find_all("p", class_="body")  # multi-valued class: matching any one of the values is enough
css_soup.find_all("p", class_="strikeout")
css_soup.find_all("p", class_="body strikeout")  # exact match against the full class string
# text accepts a string, regular expression, list, function, or True
soup.find_all("a", text="Elsie")  # a tags whose text is "Elsie"</pre></div><p>&nbsp;</p><p>  12. Parent-node methods:</p><p>    find_parents(&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id32">name</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css">attrs</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#recursive">recursive</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text">text</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword">**kwargs</a>&nbsp;)</p><p>    find_parent(&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id32">name</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css">attrs</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#recursive">recursive</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text">text</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword">**kwargs</a>&nbsp;)</p><div class="cnblogs_code"><pre>html_doc = """&lt;html&gt;
    &lt;head&gt;
        &lt;title&gt;The Dormouse's story&lt;/title&gt;
    &lt;/head&gt;
&lt;body&gt;
    &lt;p class="title"&gt;&lt;b&gt;The Dormouse's story&lt;/b&gt;&lt;/p&gt;
    &lt;p class="story"&gt;Once upon a time there were three little sisters; and their names were&lt;/p&gt;
    &lt;a href="http://example.com/elsie" class="sister" id="link1"&gt;Elsie&lt;/a&gt;,
    &lt;p&gt;
        &lt;a href="http://example.com/lacie" class="sister" id="link2"&gt;Lacie&lt;/a&gt; and
    &lt;/p&gt;
    &lt;a href="http://example.com/tillie" class="sister" id="link3"&gt;Tillie&lt;/a&gt;;
    <span style="color: #000000">and they lived at the bottom of a well.
    </span>&lt;p class="story"&gt;...&lt;/p&gt;
&lt;/body&gt;
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
a_string = soup.find(text="Lacie")  # the node whose text is Lacie
type(a_string), a_string  # &lt;class 'bs4.element.NavigableString'&gt; Lacie
a_parent = a_string.find_parent()  # the nearest ancestor of a_string
a_parent = a_string.find_parent("p")  # the nearest p ancestor of a_string
a_parents = a_string.find_parents()  # all ancestors of a_string
a_parents = a_string.find_parents("a")  # all a ancestors of a_string</pre></div><p>&nbsp;</p><p>  13. Following sibling nodes:</p><p>    find_next_siblings(&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id32">name</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css">attrs</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#recursive">recursive</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text">text</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword">**kwargs</a>&nbsp;)</p><p>    find_next_sibling(&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id32">name</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css">attrs</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#recursive">recursive</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text">text</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword">**kwargs</a>&nbsp;)</p><div class="cnblogs_code"><pre>html_doc = """&lt;html&gt;&lt;head&gt;&lt;title&gt;The Dormouse's story&lt;/title&gt;&lt;/head&gt;
&lt;body&gt;
    &lt;p class="title"&gt;&lt;b&gt;The Dormouse's story&lt;/b&gt;&lt;/p&gt;
    &lt;p class="story"&gt;Once upon a time there were three little sisters; and their names were&lt;/p&gt;
    &lt;a href="http://example.com/elsie" class="sister" id="link1"&gt;Elsie&lt;/a&gt;,
    &lt;b href="http://example.com/elsie" class="sister" id="link1"&gt;Elsie&lt;/b&gt;,
    &lt;a href="http://example.com/lacie" class="sister" id="link2"&gt;Lacie&lt;/a&gt; and
    &lt;a href="http://example.com/tillie" class="sister" id="link3"&gt;Tillie&lt;/a&gt;;
    and they lived at the bottom of a well.
    &lt;p class="story"&gt;...&lt;/p&gt;
&lt;/body&gt;
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
first_link = soup.a  # the first a tag
a_sibling = first_link.find_next_sibling()  # the first sibling tag after it
a_sibling = first_link.find_next_sibling("a")  # the first following sibling that is an a tag
a_siblings = first_link.find_next_siblings()  # all following sibling tags
a_siblings = first_link.find_next_siblings("a")  # all following siblings that are a tags</pre></div><p>&nbsp;</p><p>  14. Preceding sibling nodes:</p><p>    find_previous_siblings(&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id32">name</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css">attrs</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#recursive">recursive</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text">text</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword">**kwargs</a>&nbsp;)</p><p>    find_previous_sibling(&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id32">name</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css">attrs</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#recursive">recursive</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text">text</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword">**kwargs</a>&nbsp;)</p><p>&nbsp;</p><p>  15. Following nodes:</p><p>    find_all_next(&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id32">name</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css">attrs</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#recursive">recursive</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text">text</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword">**kwargs</a>&nbsp;)</p><p>    find_next(&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id32">name</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css">attrs</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#recursive">recursive</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text">text</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword">**kwargs</a>&nbsp;)</p><div class="cnblogs_code"><pre>html_doc = """&lt;html&gt;
    &lt;head&gt;
        &lt;title&gt;The Dormouse's story&lt;/title&gt;
    &lt;/head&gt;
&lt;body&gt;
    &lt;p class="title"&gt;&lt;b&gt;The Dormouse's story&lt;/b&gt;&lt;/p&gt;
    &lt;p class="story"&gt;Once upon a time there were three little sisters; and their names were&lt;/p&gt;
    &lt;a href="http://example.com/elsie" class="sister" id="link1"&gt;Elsie&lt;/a&gt;,
    &lt;p&gt;
        &lt;a href="http://example.com/lacie" class="sister" id="link2"&gt;Lacie&lt;/a&gt; and
    &lt;/p&gt;
    &lt;p&gt;
        &lt;a href="http://example.com/tillie" class="sister" id="link3"&gt;Tillie&lt;/a&gt;;
    &lt;/p&gt;
    <span style="color: #000000">and they lived at the bottom of a well.
</span>&lt;p <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">story</span><span style="color: #800000">"</span>&gt;...&lt;/p&gt;
&lt;/body&gt;
<span style="color: #800000">"""</span>
<span style="color: #0000ff">from</span><span style="color: #000000"> bs4 import BeautifulSoup

soup </span>= BeautifulSoup(html_doc, <span style="color: #800000">'</span><span style="color: #800000">html.parser</span><span style="color: #800000">'</span><span style="color: #000000">)
a_string </span>= soup.find(text=<span style="color: #800000">"</span><span style="color: #800000">Lacie</span><span style="color: #800000">"</span><span style="color: #000000">)
a_next </span>=<span style="color: #000000"> a_string.find_next()  # the first tag that appears after this node in document order
a_next </span>= a_string.find_next(<span style="color: #800000">'</span><span style="color: #800000">a</span><span style="color: #800000">'</span><span style="color: #000000">)  # the first &lt;a&gt; tag that appears after this node
a_nexts </span>=<span style="color: #000000"> a_string.find_all_next()  # every tag that appears after this node
a_nexts </span>= a_string.find_all_next(<span style="color: #800000">'</span><span style="color: #800000">a</span><span style="color: #800000">'</span>)  # every &lt;a&gt; tag that appears after this node</pre></div>
<p>&nbsp;</p>
<p>&nbsp;  16. Preceding nodes:</p>
<p>    find_all_previous(&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id32">name</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css">attrs</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text">text</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword">**kwargs</a>&nbsp;)</p>
<p>    find_previous(&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id32">name</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css">attrs</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text">text</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword">**kwargs</a>&nbsp;)</p>
<p>&nbsp;</p>
<p>  17. Parsing only part of a document:</p>
<p>    If all you want is the &lt;a&gt; tags in a document, parsing the whole document wastes memory and time. It is faster to ignore everything except the &lt;a&gt; tags from the start. The&nbsp;<tt class="docutils literal"><span class="pre">SoupStrainer</span></tt>&nbsp;class lets you specify which parts of the document to parse, so a search does not have to parse the whole document first; only the content matched by the&nbsp;<tt class="docutils literal"><span class="pre">SoupStrainer</span></tt>&nbsp;is parsed. 
Create a&nbsp;<tt class="docutils literal"><span class="pre">SoupStrainer</span></tt>&nbsp;object and pass it to the&nbsp;<tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt>&nbsp;constructor as the&nbsp;<tt class="docutils literal"><span class="pre">parse_only</span></tt>&nbsp;argument.</p>
<p><tt class="docutils literal"><span class="pre">  SoupStrainer</span></tt>&nbsp;parameters:&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id32">name</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css">attrs</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text">text</a>&nbsp;,&nbsp;<a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword">**kwargs</a></p>
<div class="cnblogs_code"><pre>html_doc = <span style="color: #800000">"""</span><span style="color: #800000">&lt;html&gt;</span>
    &lt;head&gt;
        &lt;title&gt;The Dormouse<span style="color: #800000">'</span><span style="color: #800000">s story&lt;/title&gt;</span>
    &lt;/head&gt;
&lt;body&gt;
    &lt;p <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">title</span><span style="color: #800000">"</span>&gt;&lt;b&gt;The Dormouse<span style="color: #800000">'</span><span style="color: #800000">s story&lt;/b&gt;&lt;/p&gt;</span>
    &lt;p <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">story</span><span style="color: #800000">"</span>&gt;<span style="color: #000000">Once upon a time there were three little sisters; and their names were
        </span>&lt;a href=<span style="color: #800000">"</span><span style="color: #800000">http://example.com/elsie</span><span style="color: #800000">"</span> <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">sister</span><span style="color: #800000">"</span> id=<span style="color: #800000">"</span><span style="color: #800000">link1</span><span style="color: #800000">"</span>&gt;Elsie&lt;/a&gt;<span style="color: #000000">,
        </span>&lt;a href=<span style="color: #800000">"</span><span style="color: #800000">http://example.com/lacie</span><span style="color: #800000">"</span> <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">sister</span><span style="color: #800000">"</span> id=<span style="color: #800000">"</span><span style="color: #800000">link2</span><span style="color: #800000">"</span>&gt;Lacie&lt;/a&gt;<span style="color: #000000"> and
        </span>&lt;a href=<span style="color: #800000">"</span><span style="color: #800000">http://example.com/tillie</span><span style="color: #800000">"</span> <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">sister</span><span style="color: #800000">"</span> id=<span style="color: #800000">"</span><span style="color: #800000">link3</span><span style="color: #800000">"</span>&gt;Tillie&lt;/a&gt;<span style="color: #000000">;
    </span>&lt;/p&gt;<span style="color: #000000">
        and they lived at the bottom of a well.    
</span>&lt;p <span style="color: #0000ff">class</span>=<span style="color: #800000">"</span><span style="color: #800000">story</span><span style="color: #800000">"</span>&gt;...&lt;/p&gt;
&lt;/body&gt;
<span style="color: #800000">"""</span>
<span style="color: #0000ff">from</span><span style="color: #000000"> bs4 import SoupStrainer

a_tags </span>= SoupStrainer(<span style="color: #800000">'</span><span style="color: #800000">a</span><span style="color: #800000">'</span><span style="color: #000000">)  # keep only &lt;a&gt; tags
id_tags </span>= SoupStrainer(id=<span style="color: #800000">"</span><span style="color: #800000">link2</span><span style="color: #800000">"</span>)  # keep only tags whose id is <span style="color: #000000">link2


def is_short_string(</span><span style="color: #0000ff">string</span><span style="color: #000000">):
    </span><span style="color: #0000ff">return</span> len(<span style="color: #0000ff">string</span>) &lt; <span style="color: #800080">10</span><span style="color: #000000">  # True when the string is shorter than 10 characters


short_string </span>= SoupStrainer(text=<span style="color: #000000">is_short_string)  # keep only text nodes that pass the check

</span><span style="color: #0000ff">from</span><span style="color: #000000"> bs4 import BeautifulSoup

soup </span>= BeautifulSoup(html_doc, <span style="color: #800000">'</span><span style="color: #800000">html.parser</span><span style="color: #800000">'</span>, parse_only=<span style="color: #000000">a_tags).prettify()
soup </span>= BeautifulSoup(html_doc, <span style="color: #800000">'</span><span style="color: #800000">html.parser</span><span style="color: #800000">'</span>, parse_only=<span style="color: #000000">id_tags).prettify()
soup </span>= BeautifulSoup(html_doc, <span style="color: #800000">'</span><span style="color: #800000">html.parser</span><span style="color: #800000">'</span>, parse_only=short_string).prettify()</pre></div>
<p>&nbsp;</p></div>
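Sections 14 and 16 list the backward-searching methods without an example. The following is a minimal sketch, assuming a shortened version of the sample document used in this post; the variable names (`last_link`, `prev_sibling`, and so on) are illustrative only. It shows the difference between the sibling methods, which stay under the same parent, and find_previous / find_all_previous, which walk backwards through the whole document.

```python
from bs4 import BeautifulSoup

# Shortened sample document (an assumption made for this sketch).
html_doc = """<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>"""

soup = BeautifulSoup(html_doc, "html.parser")
last_link = soup.find("a", id="link3")

# Sibling searches only look at nodes that share the same parent.
# Results are ordered from the nearest sibling to the farthest.
prev_sibling = last_link.find_previous_sibling("a")
prev_siblings = last_link.find_previous_siblings("a")

# find_previous / find_all_previous walk backwards through everything
# parsed before this tag, so they can also reach the enclosing <p>.
prev_tag = last_link.find_previous("p")
prev_links = last_link.find_all_previous("a")

print(prev_sibling["id"])
print([a["id"] for a in prev_siblings])
print(prev_tag["class"])
print([a["id"] for a in prev_links])
```

In this flat document both approaches return the same links; they differ once the tags sit under different parents, as in the example for section 15.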

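To see what a SoupStrainer actually keeps, the sketch below parses a shortened sample document with parse_only and inspects the resulting tree rather than just calling prettify(). It assumes the built-in html.parser, and the names `only_a_tags` and `partial_soup` are illustrative.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Shortened sample document (an assumption made for this sketch).
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>"""

# Only <a> tags (and their contents) are turned into tree nodes.
only_a_tags = SoupStrainer("a")
partial_soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags)

link_ids = [a["id"] for a in partial_soup.find_all("a")]
print(link_ids)

# Everything outside the strainer was skipped during parsing,
# so these lookups find nothing.
print(partial_soup.title)
print(partial_soup.find("p"))
```

Because the skipped content is never built into the tree, later searches on `partial_soup` cannot see it; parse the full document if you also need the surrounding tags.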
  • Original article: https://www.cnblogs.com/l-jie-n/p/9749562.html