  • Python Web Scraping: Using the Beautiful Soup Parsing Library (Part 5)

    Beautiful Soup - Introduction


    Beautiful Soup is a third-party Python library for extracting data from HTML or XML.
    Official site: http://www.crummy.com/software/BeautifulSoup/

    Install: pip install beautifulsoup4

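    Beautiful Soup - Syntax

    soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')

    First argument: the HTML document, as a string

    Second argument: the HTML parser to use

    Third argument: the encoding of the HTML document (only used when the document is passed as bytes; it is ignored for an already decoded str)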
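
As a hedged illustration of the three constructor arguments: from_encoding only has an effect when the markup is passed as bytes. The sample markup and variable names below are made up for this sketch, and html.parser is used so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, encoded to bytes so that from_encoding applies.
raw_bytes = '<p class="greeting">你好</p>'.encode('utf-8')

# Arguments: the markup, the parser name, the encoding of the byte input.
soup = BeautifulSoup(raw_bytes, 'html.parser', from_encoding='utf-8')

greeting = soup.p.string
print(greeting)                # 你好
print(soup.original_encoding)  # the encoding Beautiful Soup actually used
```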

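    Beautiful Soup - Usage


    Tag selector operations

    Note: a tag selector returns only the first tag that matches the given name; this is how tag selectors behave by design.

    Selecting elements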

    from bs4 import BeautifulSoup
    html_doc='''
    <div class="container"> <a href="/pc/home?sign=360_79aabe15" class="logo"></a> <nav id="nnav" data-mod="nnav"> <div class="nnav-wrap"> <ul class="nnav-items" id="nnav_main"> <li data-index="0"><a class="nnav-item" href="/pc/home?ch=youlike&sign=360_79aabe15" target="_blank" data-ch="youlike">推荐<span></span></a></li><li data-index="1"><a class="nnav-item" href="/pc/home?ch=good_safe2toera&sign=360_79aabe15" target="_blank" data-ch="good_safe2toera">新时代<span></span></a></li><li data-index="2"><a class="nnav-item" href="/pc/home?ch=fun&sign=360_79aabe15" target="_blank" data-ch="fun">娱乐<span></span></a></li><li data-index="3"><a class="nnav-item" href="/pc/home?
    data-index="4"><a class="nnav-item" href="/pc/home?ch=economy&sign=360_79aabe15" target="_blank" data-ch="economy">财经<span></span></a></li>
    '''
    soup = BeautifulSoup(html_doc,'lxml')
    # Fill in the missing tags and print the HTML with standard indentation
    print(soup.prettify())
    # Print the first <a> tag
    print(soup.a)
    # Print the first <span> tag
    print(soup.span)

      

    Output:

    <html>
     <body>
      <div class="container">
       <a class="logo" href="/pc/home?sign=360_79aabe15">
       </a>
       <nav data-mod="nnav" id="nnav">
        <div class="nnav-wrap">
         <ul class="nnav-items" id="nnav_main">
          <li data-index="0">
           <a class="nnav-item" data-ch="youlike" href="/pc/home?ch=youlike&sign=360_79aabe15" target="_blank">
            推荐
            <span>
            </span>
           </a>
          </li>
          <li data-index="1">
           <a class="nnav-item" data-ch="good_safe2toera" href="/pc/home?ch=good_safe2toera&sign=360_79aabe15" target="_blank">
            新时代
            <span>
            </span>
           </a>
          </li>
          <li data-index="2">
           <a class="nnav-item" data-ch="fun" href="/pc/home?ch=fun&sign=360_79aabe15" target="_blank">
            娱乐
            <span>
            </span>
           </a>
          </li>
          <li data-index="3">
           <a class="nnav-item" href="/pc/home?
    data-index=">
           </a>
           <a class="nnav-item" data-ch="economy" href="/pc/home?ch=economy&sign=360_79aabe15" target="_blank">
            财经
            <span>
            </span>
           </a>
          </li>
         </ul>
        </div>
       </nav>
      </div>
     </body>
    </html>
    <a class="logo" href="/pc/home?sign=360_79aabe15"></a>
    <span></span>
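
The "first match only" behaviour can be checked with a small sketch. It assumes a made-up two-link document: soup.a gives the same node as soup.find('a'), find_all('a') gives every match, and a tag name that does not occur gives None, so it is worth guarding before chaining further attribute access.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html_doc = '<div><a href="/first">one</a><a href="/second">two</a></div>'
soup = BeautifulSoup(html_doc, 'html.parser')

first_link = soup.a             # same node as soup.find('a')
all_links = soup.find_all('a')  # every <a> tag in the document
missing = soup.table            # no <table> in the markup, so this is None

print(first_link['href'])       # /first
print(len(all_links))           # 2
print(missing)                  # None

# Guard before chaining, otherwise None.tr would raise AttributeError.
if missing is not None:
    print(missing.tr)
```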
    

      

    Getting the name

    Getting attributes

    Getting the content

    from bs4 import BeautifulSoup
    html_doc='''
    <div class="container"> <a href="/pc/home?sign=360_79aabe15" class="logo"></a> <nav id="nnav" data-mod="nnav"> <div class="nnav-wrap"> <ul class="nnav-items" id="nnav_main"> <li data-index="0"><a class="nnav-item" href="/pc/home?ch=youlike&sign=360_79aabe15" target="_blank" data-ch="youlike">推荐<span></span></a></li><li data-index="1"><a class="nnav-item" href="/pc/home?ch=good_safe2toera&sign=360_79aabe15" target="_blank" data-ch="good_safe2toera">新时代<span></span></a></li><li data-index="2"><a class="nnav-item" href="/pc/home?ch=fun&sign=360_79aabe15" target="_blank" data-ch="fun">娱乐<span></span></a></li><li data-index="3"><a class="nnav-item" href="/pc/home?
    data-index="4"><a class="nnav-item" href="/pc/home?ch=economy&sign=360_79aabe15" target="_blank" data-ch="economy">财经<span></span></a></li>
    '''
    soup = BeautifulSoup(html_doc,'lxml')
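    # Print the name of the first <a> tag
    print(soup.a.name)
    # Print the class attribute of the first <a> tag; the two forms below are equivalent
    print(soup.a.attrs['class'])
    print(soup.a['class'])
    # Print the content of the first <a> tag (None here, because the tag is empty)
    print(soup.a.string)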
    

      

    Output:

    a
    ['logo']
    ['logo']
    None
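
The output above shows two things that are easy to miss: class comes back as a list, because it is a multi-valued attribute, and .string is None for the empty first <a> tag. The following sketch, on a made-up tag, contrasts these accessors with .get() and .get_text().

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html_doc = '<a class="nnav-item big" href="/x">推荐<span>new</span></a>'
soup = BeautifulSoup(html_doc, 'html.parser')
link = soup.a

classes = link['class']      # multi-valued attribute, returned as a list
href = link['href']          # ordinary attribute, returned as a str
title = link.get('title')    # missing attribute: None instead of KeyError
only_string = link.string    # None, because <a> has two children
full_text = link.get_text()  # text of all descendants joined together

print(classes, href, title, only_string, full_text)
```

When a tag may contain nested tags, get_text() is usually the safer way to read its text; .string only works when the tag has exactly one child.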
    

      

    Nested selection

    from bs4 import BeautifulSoup
    html_doc='''
    <a class="nnav-item" href="/pc/home?ch=youlike&sign=360_79aabe15" target="_blank" data-ch="youlike"><span>推荐</span></a>
    '''
    soup = BeautifulSoup(html_doc,'lxml')
    print(soup.a.span.string)
    

      

    Output:

    推荐
    

      

    Child and descendant node operations

    Getting all child nodes

    from bs4 import BeautifulSoup
    html='''
    <div class="bc">
        <span class="fl" style="padding-top: 1px;"><a href="http://www.koolearn.com/" target="_blank" title="新东方在线网络课堂"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105" height="48" alt="新东方在线网络课堂"></a></span>
        <span class="fl" style="padding-top: 6px;">
            <a href="http://cet4.koolearn.com/" target="_blank" rel="nofollow" class="ky">四级</a>
            <a title="新东方在线网络课堂" href="http://www.koolearn.com/" target="_self">新东方在线</a> >
            <a title="四级网络课堂" href="http://cet4.koolearn.com/" target="_self">四级</a> >
            <a href="http://cet4.koolearn.com/cihui/" title="英语四级词汇">英语四级词汇</a> > 正文
        </span>
        <a href="http://www.xdf.cn/" target="_blank" rel="nofollow" class="fr logo_p2"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208" height="24"></a> 
    </div>
    '''
    
    soup = BeautifulSoup(html,'lxml')
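    # Method 1: .contents returns the direct children as a list
    print(soup.div.contents)
    # Method 2: .children returns an iterator over the same direct children
    print(soup.div.children)
    for i, child in enumerate(soup.div.children):
        print(i, child)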
    

    Output:

    ['
    ', <span class="fl" style="padding-top: 1px;"><a href="http://www.koolearn.com/" target="_blank" title="新东方在线网络课堂"><img alt="新东方在线网络课堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/></a></span>, '
    ', <span class="fl" style="padding-top: 6px;">
    <a class="ky" href="http://cet4.koolearn.com/" rel="nofollow" target="_blank">四级</a>
    <a href="http://www.koolearn.com/" target="_self" title="新东方在线网络课堂">新东方在线</a> >
            <a href="http://cet4.koolearn.com/" target="_self" title="四级网络课堂">四级</a> >
            <a href="http://cet4.koolearn.com/cihui/" title="英语四级词汇">英语四级词汇</a> > 正文
        </span>, '
    ', <a class="fr logo_p2" href="http://www.xdf.cn/" rel="nofollow" target="_blank"><img height="24" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208"/></a>, '
    ']
    <list_iterator object at 0x0000000002E498D0>
    0 
    
    1 <span class="fl" style="padding-top: 1px;"><a href="http://www.koolearn.com/" target="_blank" title="新东方在线网络课堂"><img alt="新东方在线网络课堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/></a></span>
    2 
    
    3 <span class="fl" style="padding-top: 6px;">
    <a class="ky" href="http://cet4.koolearn.com/" rel="nofollow" target="_blank">四级</a>
    <a href="http://www.koolearn.com/" target="_self" title="新东方在线网络课堂">新东方在线</a> >
            <a href="http://cet4.koolearn.com/" target="_self" title="四级网络课堂">四级</a> >
            <a href="http://cet4.koolearn.com/cihui/" title="英语四级词汇">英语四级词汇</a> > 正文
        </span>
    4 
    
    5 <a class="fr logo_p2" href="http://www.xdf.cn/" rel="nofollow" target="_blank"><img height="24" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208"/></a>
    6 
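
As the output shows, the whitespace between tags also counts as a child (items 0, 2, 4 and 6 are newline strings). A minimal sketch of one way to keep only the tag children, assuming a small made-up document:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup for illustration only.
html = '<div>\n<span>a</span>\n<a href="#">b</a>\n</div>'
soup = BeautifulSoup(html, 'html.parser')

all_children = soup.div.contents
# Keep only Tag objects, dropping the newline strings between tags.
tag_children = [child for child in soup.div.children if isinstance(child, Tag)]
tag_names = [child.name for child in tag_children]
# An alternative: search for any tag, but only among direct children.
direct_tags = soup.div.find_all(True, recursive=False)

print(len(all_children))               # 5: three newline strings plus two tags
print(tag_names)                       # ['span', 'a']
print([t.name for t in direct_tags])   # ['span', 'a']
```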
    

      

    Getting all descendant nodes

    from bs4 import BeautifulSoup
    html='''
    <div class="bc">
        <span class="fl" style="padding-top: 1px;">
          <a href="http://www.koolearn.com/" target="_blank" title="新东方在线网络课堂"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105" height="48" alt="新东方在线网络课堂"></a></span>
          <span class="fl" style="padding-top: 6px;">
        <a href="http://cet4.koolearn.com/" target="_blank" rel="nofollow" class="ky">四级</a>
        <a title="新东方在线网络课堂" href="http://www.koolearn.com/" target="_self">新东方在线</a> >
        <a title="四级网络课堂" href="http://cet4.koolearn.com/" target="_self">四级</a> >
        <a href="http://cet4.koolearn.com/cihui/" title="英语四级词汇">英语四级词汇</a> > 正文</span>
        <a href="http://www.xdf.cn/" target="_blank" rel="nofollow" class="fr logo_p2"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208" height="24"></a>  </div>
    '''
    soup = BeautifulSoup(html,'lxml')
    print(soup.div.descendants)
    for i,child in enumerate(soup.div.descendants):
       print(i,child)
    

     

    Output:

    <generator object descendants at 0x00000000028F5AF0>
    0 
    
    1 <span class="fl" style="padding-top: 1px;">
    <a href="http://www.koolearn.com/" target="_blank" title="新东方在线网络课堂"><img alt="新东方在线网络课堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/></a></span>
    2 
    
    3 <a href="http://www.koolearn.com/" target="_blank" title="新东方在线网络课堂"><img alt="新东方在线网络课堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/></a>
    4 <img alt="新东方在线网络课堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/>
    5 
    
    6 <span class="fl" style="padding-top: 6px;">
    <a class="ky" href="http://cet4.koolearn.com/" rel="nofollow" target="_blank">四级</a>
    <a href="http://www.koolearn.com/" target="_self" title="新东方在线网络课堂">新东方在线</a> >
        <a href="http://cet4.koolearn.com/" target="_self" title="四级网络课堂">四级</a> >
        <a href="http://cet4.koolearn.com/cihui/" title="英语四级词汇">英语四级词汇</a> > 正文</span>
    7 
    
    8 <a class="ky" href="http://cet4.koolearn.com/" rel="nofollow" target="_blank">四级</a>
    9 四级
    10 
    
    11 <a href="http://www.koolearn.com/" target="_self" title="新东方在线网络课堂">新东方在线</a>
    12 新东方在线
    13  >
        
    14 <a href="http://cet4.koolearn.com/" target="_self" title="四级网络课堂">四级</a>
    15 四级
    16  >
        
    17 <a href="http://cet4.koolearn.com/cihui/" title="英语四级词汇">英语四级词汇</a>
    18 英语四级词汇
    19  > 正文
    20 
    
    21 <a class="fr logo_p2" href="http://www.xdf.cn/" rel="nofollow" target="_blank"><img height="24" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208"/></a>
    22 <img height="24" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208"/>
    23  
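
To make the difference between children and descendants easier to see, here is a short sketch on a made-up document without whitespace. .children stops at the first level, .descendants walks the whole subtree in document order, and .stripped_strings yields only the non-empty text.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup for illustration only.
html = '<div><span><a href="#">x</a></span><p>y</p></div>'
soup = BeautifulSoup(html, 'html.parser')

child_names = [node.name for node in soup.div.children]
descendant_tags = [node.name for node in soup.div.descendants
                   if isinstance(node, Tag)]
descendant_count = len(list(soup.div.descendants))
texts = list(soup.div.stripped_strings)

print(child_names)       # ['span', 'p']
print(descendant_tags)   # ['span', 'a', 'p']
print(descendant_count)  # 5: three tags plus the strings 'x' and 'y'
print(texts)             # ['x', 'y']
```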
    

      

    Parent and ancestor node operations

    Getting the parent node and the ancestor nodes

    from bs4 import BeautifulSoup
    html='''
    <div class="bc">
        <span class="fl" style="padding-top: 1px;">
          <a href="http://www.koolearn.com/" target="_blank" title="新东方在线网络课堂"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105" height="48" alt="新东方在线网络课堂"></a></span>
          <span class="fl" style="padding-top: 6px;">
        <a href="http://cet4.koolearn.com/" target="_blank" rel="nofollow" class="ky">四级</a>
        <a title="新东方在线网络课堂" href="http://www.koolearn.com/" target="_self">新东方在线</a> >
        <a title="四级网络课堂" href="http://cet4.koolearn.com/" target="_self">四级</a> >
        <a href="http://cet4.koolearn.com/cihui/" title="英语四级词汇">英语四级词汇</a> > 正文</span>
        <a href="http://www.xdf.cn/" target="_blank" rel="nofollow" class="fr logo_p2"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208" height="24"></a>  </div>
    '''
    soup = BeautifulSoup(html,'lxml')
    print(soup.a.parent)   # the direct parent node
    print(soup.a.parents)  # a generator over all ancestor nodes
    

    Output:

    <span class="fl" style="padding-top: 1px;">
    <a href="http://www.koolearn.com/" target="_blank" title="新东方在线网络课堂"><img alt="新东方在线网络课堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/></a></span>
    <generator object parents at 0x00000000028C5B48>
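
Printing .parents only shows the generator object. To see the ancestors themselves, iterate over it. The sketch below uses a shortened, made-up version of the markup and html.parser; with lxml the list would also contain the body and html tags that lxml adds.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the markup above.
html = ('<div class="bc"><span class="fl">'
        '<a href="http://www.koolearn.com/">link</a>'
        '</span></div>')
soup = BeautifulSoup(html, 'html.parser')

parent_name = soup.a.parent.name
# Walk upwards from the <a> tag to the document root.
ancestor_names = [ancestor.name for ancestor in soup.a.parents]

print(parent_name)     # span
print(ancestor_names)  # ['span', 'div', '[document]']
```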
    

      

    Sibling node operations

    Getting sibling nodes

    from bs4 import BeautifulSoup
    html='''
    <div class="more_box" id="moreBox">
           <h3>360识图</h3>
            <a href="javascript:;" id="btnLoadMore" class="btn_loadmore">加载更多</a>
            <p id="imgTotal" class="img_total">找到相关图片约 2637 张</p>
    </div>
    '''
    soup = BeautifulSoup(html,'lxml')
    print(soup.a.next_siblings)      # generator over the siblings that follow the node
    print(soup.a.previous_siblings)  # generator over the siblings that precede the node
    

      

    Output:

    <generator object next_siblings at 0x0000000002885B48>
    <generator object previous_siblings at 0x0000000002885B48>
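
Both properties are generators, so they have to be iterated to see the actual nodes. A minimal sketch, assuming the same markup with the whitespace between tags removed so that only tags appear as siblings:

```python
from bs4 import BeautifulSoup

# The markup from above, written without whitespace between the tags.
html = ('<div class="more_box" id="moreBox">'
        '<h3>360识图</h3>'
        '<a href="javascript:;" id="btnLoadMore">加载更多</a>'
        '<p id="imgTotal">找到相关图片约 2637 张</p>'
        '</div>')
soup = BeautifulSoup(html, 'html.parser')

following = [node.name for node in soup.a.next_siblings]
preceding = [node.name for node in soup.a.previous_siblings]

print(following)  # ['p']
print(preceding)  # ['h3']
```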
    

      

     

    Python generators

    l = [x * x for x in range(10)]
    g = (x * x for x in range(10))
    print(l)
    print(g)
    

    Output:

    [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
    <generator object <genexpr> at 0x000000000251C468>
    

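    l is a list, while g is a generator. The most basic difference when creating them is that a list comprehension uses [ ] and a generator expression uses ( ).

    To print the values one at a time, call next() to get the generator's next value: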

    g = (x * x for x in range(10))
    for i in range(10):
       print(next(g))
    

      

    Output:

    0
    1
    4
    9
    16
    25
    36
    49
    64
    81
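
In practice a generator is usually consumed with a for loop or list() rather than with repeated next() calls. The sketch below also shows that a generator can be consumed only once, which matters for properties such as .parents and .next_siblings shown earlier.

```python
squares = (x * x for x in range(3))

first_pass = [value for value in squares]  # the loop calls next() for us
second_pass = list(squares)                # the generator is already exhausted
fallback = next(squares, None)             # default instead of StopIteration

print(first_pass)   # [0, 1, 4]
print(second_pass)  # []
print(fallback)     # None
```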
    

      

    Standard selector operations



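    # Searches the document by tag name, attributes, or text content and returns all matches
    # (the string argument was called text in older releases)
    find_all(name, attrs, recursive, string, limit, **kwargs)
    
    
    # Find all nodes whose tag is a
    soup.find_all('a')
    
    # Find all a nodes whose link has the form /view/123.htm (the regex form requires import re)
    soup.find_all('a', href='/view/123.htm')
    soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))
    
    # Find all div nodes whose class is abc and whose text is python
    soup.find_all('div', class_='abc', string='python')
    
    Properties of a matched node:
    # The tag name of the matched node
    node.name
    
    # The href attribute of a matched a node
    node['href']
    
    # The link text of a matched a node
    node.get_text()
    
    
    find(name, attrs, recursive, string, **kwargs)
    Searches by tag name, attributes, or text content in the same way as find_all, but returns only the first match (or None if nothing matches)
    
    find_parents() find_parent()
    find_parents() returns all matching ancestor nodes; find_parent() returns the nearest matching ancestor (the direct parent when no filter is given)
    
    find_next_siblings() find_next_sibling()
    find_next_siblings() returns all matching siblings after the node; find_next_sibling() returns the first matching sibling after the node
    
    find_previous_siblings() find_previous_sibling()
    find_previous_siblings() returns all matching siblings before the node; find_previous_sibling() returns the first matching sibling before the node
    
    find_all_next() find_next()
    find_all_next() returns all matching nodes after the node; find_next() returns the first matching node after the node
    
    find_all_previous() find_previous()
    find_all_previous() returns all matching nodes before the node; find_previous() returns the first matching node before the node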
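
The calls listed above can be tried together in one small script. This is a sketch on made-up markup, not the author's test data; the hrefs, class names and variable names are assumptions chosen to mirror the snippets above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = '''
<div class="abc">python</div>
<ul>
  <li><a href="/view/123.htm">first</a></li>
  <li><a href="/view/456.htm">second</a></li>
  <li><a href="/about.htm">about</a></li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

# Regex filter on an attribute: only the /view/<digits>.htm links match.
view_links = soup.find_all('a', href=re.compile(r'^/view/\d+\.htm$'))
view_hrefs = [a['href'] for a in view_links]

# Tag name, class and exact text combined.
python_divs = soup.find_all('div', class_='abc', string='python')

# limit stops the search after the given number of matches.
first_two = soup.find_all('a', limit=2)

# Relative searches starting from one node.
first_link = soup.find('a')
parent_li = first_link.find_parent('li')
next_li_text = parent_li.find_next_sibling('li').get_text()
next_link_href = first_link.find_next('a')['href']
previous_div_text = first_link.find_previous('div').get_text()

print(view_hrefs)         # ['/view/123.htm', '/view/456.htm']
print(len(python_divs))   # 1
print(len(first_two))     # 2
print(next_li_text)       # second
print(next_link_href)     # /view/456.htm
print(previous_div_text)  # python
```

Note that the sibling searches skip the whitespace strings between the li tags, because a tag name filter only matches tags.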
    

      

    Test example:

    import bs4
    html_doc='''
    <div class="container"> <a href="/pc/home?sign=360_79aabe15" class="logo"></a> <nav id="nnav" data-mod="nnav"> <div class="nnav-wrap"> <ul class="nnav-items" id="nnav_main"> <li data-index="0"><a class="nnav-item" href="/pc/home?ch=youlike&sign=360_79aabe15" target="_blank" data-ch="youlike">推荐<span></span></a></li><li data-index="1"><a class="nnav-item" href="/pc/home?ch=good_safe2toera&sign=360_79aabe15" target="_blank" data-ch="good_safe2toera">新时代<span></span></a></li><li data-index="2"><a class="nnav-item" href="/pc/home?ch=fun&sign=360_79aabe15" target="_blank" data-ch="fun">娱乐<span></span></a></li><li data-index="3"><a class="nnav-item" href="/pc/home?
    data-index="4"><a class="nnav-item" href="/pc/home?ch=economy&sign=360_79aabe15" target="_blank" data-ch="economy">财经<span></span></a></li><li data-index="5"><a class="nnav-item" href="/pc/home?ch=estate&sign=360_79aabe15" target="_blank" data-ch="estate">房产<span></span></a></li><li data-index="6"><a class="nnav-item" href="/pc/home?ch=car&sign=360_79aabe15" target="_blank" data-ch="car">汽车<span></span></a></li><li data-index="7"><a class="nnav-item" href="/pc/home?ch=sport&sign=360_79aabe15" target="_blank" data-ch="sport">体育<span></span></a></li><li data-index="8"><a class="nnav-item" href="/pc/home?ch=domestic&sign=360_79aabe15" target="_blank" data-ch="domestic">国内
    '''
    # Create the BeautifulSoup object
    soup = bs4.BeautifulSoup(html_doc,'html.parser')


    # Get all links
    links = soup.find_all('a')
    for link in links:
        print(link.name,link['href'],link.get_text())

    # Get the link whose href is /pc/home?sign=360_79aabe15
    link_node = soup.find('a',href='/pc/home?sign=360_79aabe15')
    print(link_node.name,link_node['href'],link_node.get_text())

      

    Output:

    a /pc/home?sign=360_79aabe15 
    a /pc/home?ch=youlike&sign=360_79aabe15 推荐
    a /pc/home?ch=good_safe2toera&sign=360_79aabe15 新时代
    a /pc/home?ch=fun&sign=360_79aabe15 娱乐
    a /pc/home?
    data-index= 财经
    a /pc/home?ch=economy&sign=360_79aabe15 财经
    a /pc/home?ch=estate&sign=360_79aabe15 房产
    a /pc/home?ch=car&sign=360_79aabe15 汽车
    a /pc/home?ch=sport&sign=360_79aabe15 体育
    a /pc/home?ch=domestic&sign=360_79aabe15 国内
    
    a /pc/home?sign=360_79aabe15 
    

      

  • Original article: https://www.cnblogs.com/-wenli/p/9878610.html