  • Python Web Scraping Notes 5: Basic Usage of the BeautifulSoup Library

    Topic 1: The BeautifulSoup library explained, and its basic usage

    • What is BeautifulSoup

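    A flexible and convenient web page parsing library. It is efficient and supports several parsers, and it lets you extract information from a web page without writing regular expressions.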

    • Common parsers used with BeautifulSoup
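    A minimal sketch of how the parser is chosen, assuming the commonly used parser names 'html.parser' (bundled with Python), 'lxml' and 'html5lib' (both third-party installs). The snippet string and variable names are made up for illustration; a parser that is not installed is simply skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound

snippet = "<p>Hello<b>world"  # illustrative input, deliberately missing closing tags

# Try each parser name; BeautifulSoup raises FeatureNotFound when the
# requested parser is not installed on this machine.
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        soup = BeautifulSoup(snippet, parser)
    except FeatureNotFound:
        print(parser, "-> not installed")
        continue
    print(parser, "->", soup)

# html.parser ships with Python, so this call needs no extra install.
builtin_soup = BeautifulSoup(snippet, "html.parser")
```

    Different parsers may repair the same broken markup differently, so it is worth printing the result once before relying on it.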

            

    • Basic usage:

      html = '''
      <html><head><title>The Domouse's story</title></head>
      <body>
      <p class="title"name="dromouse"><b>The Dormouse's story</b></p>
      <p class="story">Once upon a time there were little sisters;and their names were
      <a href="http://example.com/elsie"class="sister"id="link1"><!--Elsie--></a>
      <a hred="http://example.com/lacle"class="sister"id="link2">Lacle</a>and
      <a hred="http://example.com/tilie"class="sister"id="link3">Tillie</a>
      and they lived at bottom of a well.</p>
      <p class="story">...</p>
      '''
       
      from bs4 import BeautifulSoup
      soup= BeautifulSoup(html,'lxml')
       
      print(soup.prettify())  # pretty-print the document; missing tags are completed automatically
      print(soup.title.string)  # the text inside the <title> tag
      <html>
       <head>
        <title>
         The Domouse's story
        </title>
       </head>
       <body>
        <p class="title" name="dromouse">
         <b>
          The Dormouse's story
         </b>
        </p>
        <p class="story">
         Once upon a time there were little sisters;and their names were
         <a class="sister" href="http://example.com/elsie" id="link1">
          <!--Elsie-->
         </a>
         <a class="sister" hred="http://example.com/lacle" id="link2">
          Lacle
         </a>
         and
         <a class="sister" hred="http://example.com/tilie" id="link3">
          Tillie
         </a>
         and they lived at bottom of a well.
        </p>
        <p class="story">
         ...
        </p>
       </body>
      </html>
      The Domouse's story
      (Output shown above)
    1. Tag selectors

      1. Selecting elements

        html = '''
        <html><head><title>The Domouse's story</title></head>
        <body>
        <p class="title"name="dromouse"><b>The Dormouse's story</b></p>
        <p class="story">Once upon a time there were little sisters;and their names were
        <a href="http://example.com/elsie"class="sister"id="link1"><!--Elsie--></a>
        <a hred="http://example.com/lacle"class="sister"id="link2">Lacle</a>and
        <a hred="http://example.com/tilie"class="sister"id="link3">Tillie</a>
        and they lived at bottom of a well.</p>
        <p class="story">...</p>
        '''
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(html,'lxml')
        print(soup.title)
                #<title>The Domouse's story</title>
        print(type(soup.title))
                #<class 'bs4.element.Tag'>
        print(soup.head)
                #<head><title>The Domouse's story</title></head>
        print(soup.p)  # when there are several matches, only the first is returned
                #<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
      2. Getting the tag name

        html = '''
        <html><head><title>The Domouse's story</title></head>
        <body>
        <p class="title"name="dromouse"><b>The Dormouse's story</b></p>
        <p class="story">Once upon a time there were little sisters;and their names were
        <a href="http://example.com/elsie"class="sister"id="link1"><!--Elsie--></a>
        <a hred="http://example.com/lacle"class="sister"id="link2">Lacle</a>and
        <a hred="http://example.com/tilie"class="sister"id="link3">Tillie</a>
        and they lived at bottom of a well.</p>
        <p class="story">...</p>
        '''
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(html,'lxml')
        print(soup.title.name) #title
      3. Getting attributes

        html = '''
        <html><head><title>The Domouse's story</title></head>
        <body>
        <p class="title"name="dromouse"><b>The Dormouse's story</b></p>
        <p class="story">Once upon a time there were little sisters;and their names were
        <a href="http://example.com/elsie"class="sister"id="link1"><!--Elsie--></a>
        <a hred="http://example.com/lacle"class="sister"id="link2">Lacle</a>and
        <a hred="http://example.com/tilie"class="sister"id="link3">Tillie</a>
        and they lived at bottom of a well.</p>
        <p class="story">...</p>
        '''
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(html,'lxml')
        
        print(soup.p.attrs['name'])
                #dromouse
        print(soup.p['name'])
                #dromouse
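        As a small sketch of the two access styles, the example below uses made-up one-line HTML and the built-in 'html.parser' (an assumption, to avoid extra installs). It also shows .get(), which returns None for a missing attribute, and that class comes back as a list because it can hold several values.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not the article's sample document.
html = '<p class="title main" name="dromouse"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

name = p["name"]        # square brackets raise KeyError if the attribute is missing
missing = p.get("id")   # .get() returns None instead of raising
classes = p["class"]    # class is multi-valued, so a list is returned
print(name, missing, classes)
```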
      4. Getting the tag's content

        html = '''
        <html><head><title>The Domouse's story</title></head>
        <body>
        <p class="title"name="dromouse"><b>The Dormouse's story</b></p>
        <p class="story">Once upon a time there were little sisters;and their names were
        <a href="http://example.com/elsie"class="sister"id="link1"><!--Elsie--></a>
        <a hred="http://example.com/lacle"class="sister"id="link2">Lacle</a>and
        <a hred="http://example.com/tilie"class="sister"id="link3">Tillie</a>
        and they lived at bottom of a well.</p>
        <p class="story">...</p>
        '''
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(html,'lxml')
        
        print(soup.p.string)
                #The Dormouse's story
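        One thing worth knowing about .string, sketched below with made-up HTML and the built-in 'html.parser' (an assumption): it only returns text when the tag leads to a single text node. When a tag has several children, .string is None, and get_text() can be used instead.

```python
from bs4 import BeautifulSoup

# Illustrative markup: the first <p> wraps one <b>, the second has mixed children.
html = "<p><b>The Dormouse's story</b></p><p>Once <a>Elsie</a> and <a>Lacie</a></p>"
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("p")

single = first.string       # one child all the way down, so its text is returned
several = second.string     # several children, so .string is None
joined = second.get_text()  # concatenates every text node under the tag
print(single, several, joined)
```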
      5. Nested selection

        html = '''
        <html><head><title>The Domouse's story</title></head>
        <body>
        <p class="title"name="dromouse"><b>The Dormouse's story</b></p>
        <p class="story">Once upon a time there were little sisters;and their names were
        <a href="http://example.com/elsie"class="sister"id="link1"><!--Elsie--></a>
        <a hred="http://example.com/lacle"class="sister"id="link2">Lacle</a>and
        <a hred="http://example.com/tilie"class="sister"id="link3">Tillie</a>
        and they lived at bottom of a well.</p>
        <p class="story">...</p>
        '''
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(html,'lxml')
        
        print(type(soup.title))
                #<class 'bs4.element.Tag'>
        print(soup.head.title.string)  # the HTML has a containment relationship, head(title), so the text can be reached by chaining attributes; the same works for body (p or a)
                #The Domouse's story
      6. Child nodes and descendant nodes

        # Get a tag's child nodes
        html2 = '''
        <html>
            <head>
                <title>The Domouse's story</title>
            </head>
            <body>
            <p class="story">
                Once upon a time there were little sisters;and their names were
                <a href="http://example.com/elsie" class="sister"id="link1">
                <span>Elsle</span>
                </a>
                <a hred="http://example.com/lacle"class="sister" id="link2">Lacle</a>
                and
                <a hred="http://example.com/tilie"class="sister" id="link3">Tillie</a>
                and they lived at bottom of a well.
                </p>
                <p class="story">...</p>
        '''
        from bs4 import BeautifulSoup
        soup2 = BeautifulSoup(html2,'lxml')
        print(soup2.p.contents)
        ['
                Once upon a time there were little sisters;and their names were
                ', <a class="sister" href="http://example.com/elsie" id="link1">
        <span>Elsle</span>
        </a>, '
        ', <a class="sister" hred="http://example.com/lacle" id="link2">Lacle</a>, '
                and
                ', <a class="sister" hred="http://example.com/tilie" id="link3">Tillie</a>, '
                and they lived at bottom of a well.
                ']
        (Output shown above)

        Another approach:

        # Get a tag's child nodes
        html2 = '''
        <html>
            <head>
                <title>The Domouse's story</title>
            </head>
            <body>
            <p class="story">
                Once upon a time there were little sisters;and their names were
                <a href="http://example.com/elsie" class="sister"id="link1">
                <span>Elsle</span>
                </a>
                <a hred="http://example.com/lacle"class="sister" id="link2">Lacle</a>
                and
                <a hred="http://example.com/tilie"class="sister" id="link3">Tillie</a>
                and they lived at bottom of a well.
                </p>
                <p class="story">...</p>
        '''
        from bs4 import BeautifulSoup
        
        soup = BeautifulSoup(html2,'lxml')
         
        print(soup.children)  # the difference: children is an iterator, so a loop is needed to pull out its items
         
        for i,child in enumerate(soup.p.children):
            print(i,child)
        <list_iterator object at 0x00000208F026B400>
        0 
                Once upon a time there were little sisters;and their names were
                
        1 <a class="sister" href="http://example.com/elsie" id="link1">
        <span>Elsle</span>
        </a>
        2 
        
        3 <a class="sister" hred="http://example.com/lacle" id="link2">Lacle</a>
        4 
                and
                
        5 <a class="sister" hred="http://example.com/tilie" id="link3">Tillie</a>
        6 
                and they lived at bottom of a well.
                
        (Output shown above)

        The difference: children is an iterator, so you need a loop to pull out its items, whereas contents is simply a list.
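        A small sketch of that difference, using made-up HTML and the built-in 'html.parser' (an assumption). It also filters out the text nodes, which both contents and children include alongside the tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Illustrative markup, not the article's sample document.
html = "<p>Once <a id='link1'>Elsie</a>, <a id='link2'>Lacie</a> and more</p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.p

as_list = p.contents        # a real list: supports len() and indexing
as_iter = list(p.children)  # an iterator, materialised here for comparison

# Keep only Tag children, skipping the text nodes between them.
tag_ids = [child["id"] for child in p.children if isinstance(child, Tag)]
print(len(as_list), tag_ids)
```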

        # Get a tag's descendant nodes
        html2 = '''
        <html>
            <head>
                <title>The Domouse's story</title>
            </head>
            <body>
            <p class="story">
                Once upon a time there were little sisters;and their names were
                <a href="http://example.com/elsie" class="sister"id="link1">
                <span>Elsle</span>
                </a>
                <a hred="http://example.com/lacle"class="sister" id="link2">Lacle</a>
                and
                <a hred="http://example.com/tilie"class="sister" id="link3">Tillie</a>
                and they lived at bottom of a well.
                </p>
                <p class="story">...</p>
        '''
        from bs4 import BeautifulSoup
        
        soup = BeautifulSoup(html2,'lxml')
            
        print(soup.p.descendants)  # all descendant nodes; this is also an iterator (a generator)
         
        for i,child in enumerate(soup.p.descendants):
            print(i,child)
        Descendant nodes
        <generator object descendants at 0x00000208F0240AF0>
        0 
                Once upon a time there were little sisters;and their names were
                
        1 <a class="sister" href="http://example.com/elsie" id="link1">
        <span>Elsle</span>
        </a>
        2 
        
        3 <span>Elsle</span>
        4 Elsle
        5 
        
        6 
        
        7 <a class="sister" hred="http://example.com/lacle" id="link2">Lacle</a>
        8 Lacle
        9 
                and
                
        10 <a class="sister" hred="http://example.com/tilie" id="link3">Tillie</a>
        11 Tillie
        12 
                and they lived at bottom of a well.
        ---> Output shown above
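        To make the children/descendants contrast easier to see, here is a compact sketch with made-up HTML and the built-in 'html.parser' (an assumption). It lists only tag names, so the nested span shows up in descendants but not in children.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Illustrative markup: the first <a> contains a nested <span>.
html = "<p>Once <a><span>Elsie</span></a> and <a>Lacie</a></p>"
soup = BeautifulSoup(html, "html.parser")

# children stops at the first level below <p>.
child_tags = [c.name for c in soup.p.children if isinstance(c, Tag)]
# descendants walks the whole subtree in document order.
descendant_tags = [d.name for d in soup.p.descendants if isinstance(d, Tag)]
print(child_tags, descendant_tags)
```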
      7. Parent node and ancestor nodes

        # Parent node
        html = '''
        <html>
            <head>
                <title>The Domouse's story</title>
            </head>
            <body>
            <p class="story">
                Once upon a time there were little sisters;and their names were
                <a href="http://example.com/elsie" class="sister"id="link1">
                <span>Elsle</span>
                </a>
                <a hred="http://example.com/lacle"class="sister" id="link2">Lacle</a>
                and
                <a hred="http://example.com/tilie"class="sister" id="link3">Tillie</a>
                and they lived at bottom of a well.
                </p>
                <p class="story">...</p>
        '''
        from bs4 import BeautifulSoup
        
        soup = BeautifulSoup(html,'lxml')
        
        print(soup.a.parent)
        Parent node
        <p class="story">
                Once upon a time there were little sisters;and their names were
                <a class="sister" href="http://example.com/elsie" id="link1">
        <span>Elsle</span>
        </a>
        <a class="sister" hred="http://example.com/lacle" id="link2">Lacle</a>
                and
                <a class="sister" hred="http://example.com/tilie" id="link3">Tillie</a>
                and they lived at bottom of a well.
                </p>
        ---> Output shown above
        # Get the ancestor nodes
        html = '''
        <html>
            <head>
                <title>The Domouse's story</title>
            </head>
            <body>
            <p class="story">
                Once upon a time there were little sisters;and their names were
                <a href="http://example.com/elsie" class="sister"id="link1">
                <span>Elsle</span>
                </a>
                <a hred="http://example.com/lacle"class="sister" id="link2">Lacle</a>
                and
                <a hred="http://example.com/tilie"class="sister" id="link3">Tillie</a>
                and they lived at bottom of a well.
                </p>
                <p class="story">...</p>
        '''
        from bs4 import BeautifulSoup
        
        soup = BeautifulSoup(html,'lxml')
        print(list(enumerate(soup.a.parents)))  # all ancestor nodes (the direct parent counts too)
        Ancestor nodes
        [(0, <p class="story">
                Once upon a time there were little sisters;and their names were
                <a class="sister" href="http://example.com/elsie" id="link1">
        <span>Elsle</span>
        </a>
        <a class="sister" hred="http://example.com/lacle" id="link2">Lacle</a>
                and
                <a class="sister" hred="http://example.com/tilie" id="link3">Tillie</a>
                and they lived at bottom of a well.
                </p>), (1, <body>
        <p class="story">
                Once upon a time there were little sisters;and their names were
                <a class="sister" href="http://example.com/elsie" id="link1">
        <span>Elsle</span>
        </a>
        <a class="sister" hred="http://example.com/lacle" id="link2">Lacle</a>
                and
                <a class="sister" hred="http://example.com/tilie" id="link3">Tillie</a>
                and they lived at bottom of a well.
                </p>
        <p class="story">...</p>
        </body>), (2, <html>
        <head>
        <title>The Domouse's story</title>
        </head>
        <body>
        <p class="story">
                Once upon a time there were little sisters;and their names were
                <a class="sister" href="http://example.com/elsie" id="link1">
        <span>Elsle</span>
        </a>
        <a class="sister" hred="http://example.com/lacle" id="link2">Lacle</a>
                and
                <a class="sister" hred="http://example.com/tilie" id="link3">Tillie</a>
                and they lived at bottom of a well.
                </p>
        <p class="story">...</p>
        </body></html>), (3, <html>
        <head>
        <title>The Domouse's story</title>
        </head>
        <body>
        <p class="story">
                Once upon a time there were little sisters;and their names were
                <a class="sister" href="http://example.com/elsie" id="link1">
        <span>Elsle</span>
        </a>
        <a class="sister" hred="http://example.com/lacle" id="link2">Lacle</a>
                and
                <a class="sister" hred="http://example.com/tilie" id="link3">Tillie</a>
                and they lived at bottom of a well.
                </p>
        <p class="story">...</p>
        </body></html>)]
        ---> Output shown above
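        The last two entries in that output look identical because the final ancestor is the BeautifulSoup object itself, which prints the same as the html tag. A short sketch with made-up HTML and the built-in 'html.parser' (an assumption) makes this visible by printing names instead of whole nodes.

```python
from bs4 import BeautifulSoup

# Illustrative markup with a single chain of nesting.
html = "<html><body><p><a>Elsie</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Walk upwards from <a>; the BeautifulSoup object is named "[document]".
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)
```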
      8. Sibling nodes

        # Get the preceding sibling nodes
        html = '''
        <html>
            <head>
                <title>The Domouse's story</title>
            </head>
            <body>
            <p class="story">
                Once upon a time there were little sisters;and their names were
                <a href="http://example.com/elsie" class="sister"id="link1">
                <span>Elsle</span>
                </a>
                <a hred="http://example.com/lacle"class="sister" id="link2">Lacle</a>
                and
                <a hred="http://example.com/tilie"class="sister" id="link3">Tillie</a>
                and they lived at bottom of a well.
                </p>
                <p class="story">...</p>
        '''
        from bs4 import BeautifulSoup
        
        soup = BeautifulSoup(html,'lxml')
         
        # Sibling nodes (nodes at the same level)
        print(list(enumerate(soup.a.previous_siblings)))  # the siblings that come before
        Preceding sibling nodes
        [(0, '
                Once upon a time there were little sisters;and their names were
                ')]
        ---> Output shown above
        html = '''
        <html>
            <head>
                <title>The Domouse's story</title>
            </head>
            <body>
            <p class="story">
                Once upon a time there were little sisters;and their names were
                <a href="http://example.com/elsie" class="sister"id="link1">
                <span>Elsle</span>
                </a>
                <a hred="http://example.com/lacle"class="sister" id="link2">Lacle</a>
                and
                <a hred="http://example.com/tilie"class="sister" id="link3">Tillie</a>
                and they lived at bottom of a well.
                </p>
                <p class="story">...</p>
        '''
        from bs4 import BeautifulSoup
        
        soup = BeautifulSoup(html,'lxml')
         
        # Sibling nodes (nodes at the same level)
        print(list(enumerate(soup.a.next_siblings)))  # the siblings that come after
        Following sibling nodes
        [(0, '
        '), (1, <a class="sister" hred="http://example.com/lacle" id="link2">Lacle</a>), (2, '
                and
                '), (3, <a class="sister" hred="http://example.com/tilie" id="link3">Tillie</a>), (4, '
                and they lived at bottom of a well.
                ')]
        ---> Output shown above
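        As the output shows, siblings include whitespace text nodes. If only tags are wanted, find_next_sibling() and find_next_siblings() accept a tag name and skip the text. A minimal sketch, with made-up HTML and the built-in 'html.parser' as assumptions:

```python
from bs4 import BeautifulSoup

# Illustrative markup; the "\n" after the first link becomes a text node.
html = "<p>Once <a id='link1'>Elsie</a>\n<a id='link2'>Lacie</a> and <a id='link3'>Tillie</a></p>"
soup = BeautifulSoup(html, "html.parser")
first = soup.a

raw_next = first.next_sibling            # the whitespace text node
next_tag = first.find_next_sibling("a")  # skips text nodes, returns the next <a>
later_ids = [a["id"] for a in first.find_next_siblings("a")]
print(repr(raw_next), next_tag["id"], later_ids)
```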
    2. Standard selectors

      find_all(name,attrs,recursive,text,**kwargs)

        Searches the document by tag name, attributes, or text content.

      1. Searching by name

        html = '''
        <div class="panel">
            <div class="panel-heading"name="elements">
                <h4>Hello</h4>
            </div>
            <div class="panel-body">
                <ul class="list"Id="list-1">
                    <li class="element">Foo</li>
                    <li class="element">Bar</li>
                    <li class="element">Jay</li>
                </ul>
                <ul class="list list-small"Id="list-2">
                    <li class="element">Foo</li>
                    <li class="element">Bar</li>
                </ul>
            </div>
        <div>
        '''
        from bs4 import BeautifulSoup 
        soup = BeautifulSoup(html,'lxml')
         
        print(soup.find_all('ul'))  # the result is a list
        print(type(soup.find_all('ul')[0]))
        [<ul class="list" id="list-1">
        <li class="element">Foo</li>
        <li class="element">Bar</li>
        <li class="element">Jay</li>
        </ul>, <ul class="list list-small" id="list-2">
        <li class="element">Foo</li>
        <li class="element">Bar</li>
        </ul>]
        <class 'bs4.element.Tag'>
        (Output shown above)
        html = '''
        <div class="panel">
            <div class="panel-heading"name="elements">
                <h4>Hello</h4>
            </div>
            <div class="panel-body">
                <ul class="list"Id="list-1">
                    <li class="element">Foo</li>
                    <li class="element">Bar</li>
                    <li class="element">Jay</li>
                </ul>
                <ul class="list list-small"Id="list-2">
                    <li class="element">Foo</li>
                    <li class="element">Bar</li>
                </ul>
            </div>
        <div>
        '''
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(html,'lxml')
         
        for ul in soup.find_all('ul'):
            print(ul.find_all('li'))  # nested search, one level inside another
        [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
        [<li class="element">Foo</li>, <li class="element">Bar</li>]
        (Output shown above)
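        The name argument and the recursive flag have a few variations that are easy to try out. The sketch below uses made-up HTML and the built-in 'html.parser' (both assumptions); limit is a find_all keyword that caps the number of results.

```python
from bs4 import BeautifulSoup

# Illustrative markup, a trimmed-down version of a panel with one list.
html = "<div><h4>Hello</h4><ul><li>Foo</li><li>Bar</li><li>Jay</li></ul></div>"
soup = BeautifulSoup(html, "html.parser")

both = soup.find_all(["h4", "li"])        # a list of names matches any of them
first_two = soup.find_all("li", limit=2)  # stop after two matches
# recursive=False looks only at direct children; <li> sits one level deeper.
direct = soup.div.find_all("li", recursive=False)
print([t.name for t in both], [t.get_text() for t in first_two], direct)
```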
      2. Searching by attrs

        html = '''
        <div class="panel">
            <div class="panel-heading">
                <h4>Hello</h4>
            </div>
            <div class="panel-body">
                <ul class="list"id="list-1" name="elements">
                    <li class="element">Foo</li>
                    <li class="element">Bar</li>
                    <li class="element">Jay</li>
                </ul>
                <ul class="list list-small"id="list-2">
                    <li class="element">Foo</li>
                    <li class="element">Bar</li>
                </ul>
            </div>
        <div>
        '''
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(html,'lxml')
        
        print(soup.find_all(attrs={'id':'list-1'}))
        print(soup.find_all(attrs={'name':'elements'}))
        [<ul class="list" id="list-1" name="elements">
        <li class="element">Foo</li>
        <li class="element">Bar</li>
        <li class="element">Jay</li>
        </ul>]
        [<ul class="list" id="list-1" name="elements">
        <li class="element">Foo</li>
        <li class="element">Bar</li>
        <li class="element">Jay</li>
        </ul>]
        (Output shown above)

        Another way

        html = '''
        <div class="panel">
            <div class="panel-heading">
                <h4>Hello</h4>
            </div>
            <div class="panel-body">
                <ul class="list"id="list-1">
                    <li class="element">Foo</li>
                    <li class="element">Bar</li>
                    <li class="element">Jay</li>
                </ul>
                <ul class="list list-small"id="list-2">
                    <li class="element">Foo</li>
                    <li class="element">Bar</li>
                </ul>
            </div>
        <div>
        '''
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(html,'lxml')
         
        print(soup.find_all(id='list-1'))
        print(soup.find_all(class_='element'))
         
        Another way
        [<ul class="list" id="list-1">
        <li class="element">Foo</li>
        <li class="element">Bar</li>
        <li class="element">Jay</li>
        </ul>]
        [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
        ---> Output shown above
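        One detail of class_ that the example hints at: it matches a tag if any one of its classes equals the value, so class_='list' also finds the list with class "list list-small". A short sketch with made-up HTML and the built-in 'html.parser' (both assumptions):

```python
from bs4 import BeautifulSoup

# Illustrative markup: two lists, the second carrying two classes.
html = '<ul class="list" id="list-1"></ul><ul class="list list-small" id="list-2"></ul>'
soup = BeautifulSoup(html, "html.parser")

any_list = soup.find_all("ul", class_="list")     # matches both lists
small = soup.find_all("ul", class_="list-small")  # matches only the second

any_ids = [ul["id"] for ul in any_list]
small_ids = [ul["id"] for ul in small]
print(any_ids, small_ids)
```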
      3. Searching by text

        #text
        html = '''
        <div class="panel">
            <div class="panel-heading">
                <h4>Hello</h4>
            </div>
            <div class="panel-body"name="elelments">
                <ul class="list"Id="list-1">
                    <li class="element">Foo</li>
                    <li class="element">Bar</li>
                    <li class="element">Jay</li>
                </ul>
                <ul class="list list-small"Id="list-2">
                    <li class="element">Foo</li>
                    <li class="element">Bar</li>
                </ul>
            </div>
        <div>
        ''' 
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(html,'lxml')
         
        print(soup.find_all(text='Foo'))
                #['Foo', 'Foo']
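        The text search can also take a regular expression. The sketch below is written with string=, which newer Beautiful Soup releases use as the name for the same argument (text= is the older spelling used above). The HTML is made up and the built-in 'html.parser' is an assumption.

```python
import re

from bs4 import BeautifulSoup

# Illustrative markup with two items that start with "Foo".
html = "<ul><li>Foo</li><li>Bar</li><li>Food</li></ul>"
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="Foo")                  # exact match on the text node
prefixed = soup.find_all(string=re.compile("^Foo"))  # regex match on text nodes
# Combined with a tag name, the tags themselves are returned.
items = soup.find_all("li", string=re.compile("^Foo"))
print(exact, prefixed, items)
```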
        find(name,attrs,recursive,text,**kwargs) returns a single element, while find_all returns all matching elements
        html = '''
        <div class="panel">
            <div class="panel-heading">
                <h4>Hello</h4>
            </div>
            <div class="panel-body"name="elelments">
                <ul class="list"Id="list-1">
                    <li class="element">Foo</li>
                    <li class="element">Bar</li>
                    <li class="element">Jay</li>
                </ul>
                <ul class="list list-small"Id="list-2">
                    <li class="element">Foo</li>
                    <li class="element">Bar</li>
                </ul>
            </div>
        <div>
        '''
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(html,'lxml')
              
        print(soup.find('ul'))
        print(type(soup.find('ul')))
        print(soup.find('page'))
        <ul class="list" id="list-1">
        <li class="element">Foo</li>
        <li class="element">Bar</li>
        <li class="element">Jay</li>
        </ul>
        <class 'bs4.element.Tag'>
        None
        (Output shown above)
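        Because find() returns None when nothing matches, chaining straight onto its result can raise AttributeError. A minimal guard might look like the sketch below; first_text is a hypothetical helper name, the HTML is made up, and the built-in 'html.parser' is an assumption.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not the article's sample document.
html = "<div><ul id='list-1'><li>Foo</li></ul></div>"
soup = BeautifulSoup(html, "html.parser")


def first_text(doc, name):
    """Hypothetical helper: text of the first <name> tag, or None if absent."""
    tag = doc.find(name)
    return tag.get_text() if tag is not None else None


found = first_text(soup, "li")
absent = first_text(soup, "page")  # no such tag, so None comes back
print(found, absent)
```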
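      4. Other methods

        The find method returns a single element.
        
        find_parents() returns all ancestor nodes
        find_parent() returns the direct parent node
        find_next_siblings() returns all following sibling nodes
        find_next_sibling() returns the first following sibling node
        find_previous_siblings() returns all preceding sibling nodes
        find_previous_sibling() returns the first preceding sibling node
        find_all_next() returns all matching nodes after the node
        find_next() returns the first matching node after the node
        find_all_previous() returns all matching nodes before the node
        find_previous() returns the first matching node before the node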
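        A few of these methods in action, as a sketch only: the HTML is made up, the built-in 'html.parser' is an assumption, and the starting point is the middle list item so that there is something on both sides of it.

```python
from bs4 import BeautifulSoup

# Illustrative markup: a heading followed by a three-item list.
html = ("<div id='panel'><h4>Hello</h4>"
        "<ul id='list-1'><li>Foo</li><li>Bar</li><li>Jay</li></ul></div>")
soup = BeautifulSoup(html, "html.parser")
bar = soup.find("li", string="Bar")  # start from the middle item

parent_id = bar.find_parent("ul")["id"]                       # nearest matching ancestor
div_ids = [d["id"] for d in bar.find_parents("div")]          # all matching ancestors
next_item = bar.find_next_sibling("li").get_text()            # first sibling after
prev_item = bar.find_previous_sibling("li").get_text()        # first sibling before
after = [t.get_text() for t in bar.find_all_next("li")]       # matches later in the document
before = [t.get_text() for t in bar.find_all_previous("li")]  # matches earlier in the document
heading = bar.find_previous("h4").get_text()                  # first match going backwards
print(parent_id, div_ids, next_item, prev_item, after, before, heading)
```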
    3. CSS selectors (pass a CSS selector straight to select() to make the selection)

      1. Selecting elements

        html = '''
        <div class="panel">
            <div class="panel-heading">
                <h4>Hello</h4>
            </div>
            <div class="panel-body"name="elelments">
                <ul class="list"id="list-1">
                    <li class="element">Foo</li>
                    <li class="element">Bar</li>
                    <li class="element">Jay</li>
                </ul>
                <ul class="list list-small"id="list-2">
                    <li class="element">Foo</li>
                    <li class="element">Bar</li>
                </ul>
            </div>
        <div>
        '''
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(html,'lxml')
        
        print(soup.select('.panel .panel-heading'))  # a class needs a leading "."
        print(soup.select('ul li'))  # select by tag name
        print(soup.select('#list-2 .element'))
        print(type(soup.select('ul')[0]))  
        [<div class="panel-heading">
        <h4>Hello</h4>
        </div>]
        [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
        [<li class="element">Foo</li>, <li class="element">Bar</li>]
        <class 'bs4.element.Tag'>
        (Output shown above)

        Another approach:

        html = '''
        <div class="panel">
            <div class="panel-heading">
                <h4>Hello</h4>
            </div>
            <div class="panel-body"name="elelments">
                <ul class="list"Id="list-1">
                    <li class="element">Foo</li>
                    <li class="element">Bar</li>
                    <li class="element">Jay</li>
                </ul>
                <ul class="list list-small"Id="list-2">
                    <li class="element">Foo</li>
                    <li class="element">Bar</li>
                </ul>
            </div>
        <div>
        '''
         
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(html,'lxml')
         
        for ul in soup.select('ul'):  # or simply print(soup.select('ul li'))
            print(ul.select('li'))
        Another approach
        [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
        [<li class="element">Foo</li>, <li class="element">Bar</li>]
        ---> Output shown above
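        Two more selector forms that fit here, sketched with made-up HTML and the built-in 'html.parser' (both assumptions): select_one() returns the first match or None, and the > combinator restricts the match to direct children.

```python
from bs4 import BeautifulSoup

# Illustrative markup: two lists inside a panel.
html = ('<div class="panel">'
        '<ul class="list" id="list-1"><li class="element">Foo</li>'
        '<li class="element">Bar</li></ul>'
        '<ul class="list list-small" id="list-2"><li class="element">Baz</li></ul>'
        '</div>')
soup = BeautifulSoup(html, "html.parser")

first_li = soup.select_one("#list-1 li")        # first match only
small = soup.select("ul.list.list-small > li")  # both classes, direct children
nothing = soup.select_one("ol")                 # no match gives None
print(first_li.get_text(), [li.get_text() for li in small], nothing)
```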
      2. Getting attributes

        html = '''
        <div class="panel">
            <div class="panel-heading">
                <h4>Hello</h4>
            </div>
            <div class="panel-body"name="elelments">
                <ul class="list"id="list-1">
                    <li class="element">Foo</li>
                    <li class="element">Bar</li>
                    <li class="element">Jay</li>
                </ul>
                <ul class="list list-small"id="list-2">
                    <li class="element">Foo</li>
                    <li class="element">Bar</li>
                </ul>
            </div>
        <div>
        '''
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(html,'lxml')
        
        for ul in soup.select('ul'):
            print(ul['id'])  # use [] directly
            print(ul.attrs['id'])  # or attrs followed by []
        list-1
        list-1
        list-2
        list-2
        (Output shown above)
      3. Getting the content

        html = '''
        <div class="panel">
            <div class="panel-heading">
                <h4>Hello</h4>
            </div>
            <div class="panel-body"name="elelments">
                <ul class="list"Id="list-1">
                    <li class="element">Foo</li>
                    <li class="element">Bar</li>
                    <li class="element">Jay</li>
                </ul>
                <ul class="list list-small"Id="list-2">
                    <li class="element">Foo</li>
                    <li class="element">Bar</li>
                </ul>
            </div>
        <div>
        '''
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(html,'lxml')
        
        for li in soup.select('li'):
            print(li['class'], li.get_text())
        ['element'] Foo
        ['element'] Bar
        ['element'] Jay
        ['element'] Foo
        ['element'] Bar
        (Output shown above)
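        When the source HTML has stray whitespace, get_text() accepts a separator and strip=True to tidy the result. The sketch below uses made-up HTML and the built-in 'html.parser' (both assumptions), plus an attribute selector to pick only tags that have an id.

```python
from bs4 import BeautifulSoup

# Illustrative markup with extra spaces and line breaks around the text.
html = '<ul id="list-1">\n  <li class="element"> Foo </li>\n  <li class="element">Bar</li>\n</ul>'
soup = BeautifulSoup(html, "html.parser")

texts = [li.get_text(strip=True) for li in soup.select("li.element")]
ul_text = soup.ul.get_text(" ", strip=True)           # join stripped pieces with a space
ids = [ul.get("id") for ul in soup.select("ul[id]")]  # only <ul> tags that have an id
print(texts, ul_text, ids)
```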
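    • Summary

    The 'lxml' parser is recommended; fall back to html.parser when necessary.

    Tag selectors have weak filtering capabilities but are fast.

    Use find() and find_all() to match a single result or multiple results.

    If you are comfortable with CSS selectors, select() is a good choice.

    Remember the common ways of getting attributes and text values.

     

    This is all my own understanding from my learning process; please point out any mistakes! I'm still a beginner.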
  • Original article: https://www.cnblogs.com/darwinli/p/9445307.html