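  • BeautifulSoup Usage Summary

    I. Introduction

    BeautifulSoup is a Python library that takes an HTML or XML string or file and returns a BeautifulSoup object. You can then use the many methods that object provides to parse and navigate the document's content.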
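    As a minimal sketch of the two input forms mentioned above, the snippet below parses the same markup once from a string and once from a file-like object. The sample markup and variable names are made up for illustration, io.StringIO stands in for a real file on disk, and the built-in "html.parser" is used only so the sketch runs without extra parser packages.

```python
import io

from bs4 import BeautifulSoup

markup = "<html><body><p>Hello, <b>world</b></p></body></html>"

# Parse directly from a string. "html.parser" ships with Python itself,
# so this works even when lxml is not installed.
soup_from_string = BeautifulSoup(markup, "html.parser")

# Parse from an open file-like object. io.StringIO is a stand-in here;
# with a real file you would pass open(path, encoding="utf-8") instead.
soup_from_file = BeautifulSoup(io.StringIO(markup), "html.parser")

print(type(soup_from_string).__name__)  # BeautifulSoup
print(soup_from_file.b.string)          # world
```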

    II. Installation

    1. Installing with pip

    pip install beautifulsoup4
    # Install the parsers used by BeautifulSoup
    pip install lxml
    pip install html5lib
    

    2. Installing with apt-get

    sudo apt-get install python3-bs4
    # Install the parsers used by BeautifulSoup
    sudo apt-get install python3-lxml
    sudo apt-get install python3-html5lib
    

    lxml is the recommended parser because it parses faster than the alternatives.
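    If lxml may not be installed everywhere the code runs, one possible approach is to try it first and fall back to the parser bundled with Python. The helper name make_soup below is hypothetical and not part of BeautifulSoup; this is a sketch, not the library's recommended pattern.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Hypothetical helper: prefer lxml, fall back to the built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is missing.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello</p>")
print(soup.p.string)  # Hello
```

    Keep in mind that different parsers can build slightly different trees from malformed markup, so a fallback like this is safest with well-formed input.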

    III. Common Methods

    The examples below parse the following string:

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    
    <p class="story">...</p>
    """
    

    1. Wrapping the string in a BeautifulSoup object

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "lxml")
    # Print the document with standard indentation
    print(soup.prettify())
    

    Output:

    <html>
     <head>
      <title>
       The Dormouse's story
      </title>
     </head>
     <body>
      <p class="title">
       <b>
        The Dormouse's story
       </b>
      </p>
      <p class="story">
       Once upon a time there were three little sisters; and their names were
       <a class="sister" href="http://example.com/elsie" id="link1">
        Elsie
       </a>
       ,
       <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
       </a>
       and
       <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
       </a>
       ;
    and they lived at the bottom of a well.
      </p>
      <p class="story">
       ...
      </p>
     </body>
    </html>
    

    2. Getting a tag's name with name

    print(soup.a)
    print(soup.a.name)
    

    Output:

    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    a
    

    Note that accessing a tag as soup.[tag] returns only the first tag with that name. To get every match, or to filter by some condition, use the find_all() method.
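    The difference between first-match access and find_all() can be sketched as follows. The markup is a shortened copy of the sample document, and "html.parser" is used only to keep the sketch free of extra dependencies.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.a                      # attribute access: first <a> only
same = soup.find("a")               # find() returns that same first match
all_links = soup.find_all("a")      # every <a>, in document order
two = soup.find_all("a", limit=2)   # stop after the first two matches
missing = soup.table                # no <table> in the document: None

print(first["id"])                  # link1
print(len(all_links))               # 3
print([a["id"] for a in two])       # ['link1', 'link2']
print(missing)                      # None
```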

    3. Getting a tag's content with string

    A tag's string attribute gives you the text it contains.

    print(soup.title.string)
    

    Output:

    The Dormouse's story
    

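    Note that string only works when the tag contains a single child node. If there are several children, string returns None, because it is ambiguous which child's content should be returned.

    print(soup.body.string)
    

    Output: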

    None
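    To make the single-child rule concrete, here is a small sketch on made-up markup. It also shows get_text(), a BeautifulSoup method the original article does not cover, as one way to collect all the text under a tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><p><b>Only child</b></p><p>One <i>two</i> three</p></div>",
    "html.parser",
)
first_p, second_p = soup.find_all("p")

# One child at every level: .string follows the chain <p> -> <b> -> text.
print(first_p.string)       # Only child

# Three children (text, <i>, text): .string cannot pick one.
print(second_p.string)      # None

# get_text() concatenates every text node under the tag instead.
print(second_p.get_text())  # One two three
```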
    

    To read the text of a tag with several children, iterate over strings instead:

    strings = soup.body.strings
    for string in strings:
        print(string)
    

    Output:

    
    
    The Dormouse's story
    
    
    Once upon a time there were three little sisters; and their names were
    
    Elsie
    ,
    
    Lacie
     and
    
    Tillie
    ;
    and they lived at the bottom of a well.
    
    
    ...
    
    

    The output contains many superfluous blank lines and spaces. Use stripped_strings to remove them:

    strings = soup.body.stripped_strings
    for string in strings:
        print(string)
    

    Output:

    The Dormouse's story
    Once upon a time there were three little sisters; and their names were
    Elsie
    ,
    Lacie
    and
    Tillie
    ;
    and they lived at the bottom of a well.
    ...
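    When the goal is a single cleaned-up string rather than a loop, the stripped pieces can be joined. The sketch below uses shortened sample markup and also shows get_text() with strip=True, which is not covered in the original article but should produce the same result here.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
  <a href="http://example.com/elsie">Elsie</a>,
  <a href="http://example.com/lacie">Lacie</a> and
  <a href="http://example.com/tillie">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# stripped_strings is a generator; list() materializes the pieces.
pieces = list(soup.stripped_strings)
print(pieces)  # ['Elsie', ',', 'Lacie', 'and', 'Tillie']

# Join the stripped pieces with a single space, in one call.
text = soup.get_text(" ", strip=True)
print(text)    # Elsie , Lacie and Tillie
```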
    

    4. Getting a tag's attribute values

    # Get the class attribute of the first <p> tag
    soup.p["class"]
    

    Output:

    ['title']
    

    The result is a list because class can hold multiple values.

    # Get the href attribute of the first <a> tag
    soup.a["href"]
    

    Output:

    'http://example.com/elsie'
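    The list-versus-string behavior, and what happens when an attribute is absent, can be sketched on a made-up tag as follows.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="story intro" id="first">Hi</p>', "html.parser")
p = soup.p

# class is multi-valued, so each space-separated name becomes a list item.
print(p["class"])      # ['story', 'intro']

# id is single-valued, so it comes back as a plain string.
print(p["id"])         # first

# get() returns None for a missing attribute; p["title"] would raise KeyError.
print(p.get("title"))  # None

# attrs exposes all attributes as a dictionary.
print(p.attrs)         # {'class': ['story', 'intro'], 'id': 'first'}
```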
    

    5. Changing a tag's attribute values

    # Change the class attribute of the first <p> tag
    soup.p["class"] = "new-class"
    print(soup.p["class"])
    
    # Change the href attribute of the first <a> tag
    soup.a["href"] = "www.google.com"
    print(soup.a["href"])
    
    print(soup.prettify())
    

    Output:

    new-class
    www.google.com
    <html>
     <head>
      <title>
       The Dormouse's story
      </title>
     </head>
     <body>
      <p class="new-class">
       <b>
        The Dormouse's story
       </b>
      </p>
      <p class="story">
       Once upon a time there were three little sisters; and their names were
       <a class="sister" href="www.google.com" id="link1">
        Elsie
       </a>
       ,
       <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
       </a>
       and
       <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
       </a>
       ;
    and they lived at the bottom of a well.
      </p>
      <p class="story">
       ...
      </p>
     </body>
    </html>
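    Since a tag behaves like a dictionary for its attributes, the same syntax can also add and delete them. The sketch below uses a single made-up tag; the attribute values are illustrative only.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister">Elsie</a>',
    "html.parser",
)
a = soup.a

a["class"] = ["sister", "active"]  # a list sets several class names at once
a["target"] = "_blank"             # assigning a new key adds the attribute
del a["href"]                      # del removes the attribute entirely

print(a.attrs)
print(a)
```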
    

    6. The find_all method

    6.1. Returning all matching tags

    # Return every <a> tag in the document; the result is a list
    links = soup.find_all("a")
    print(links)
    

    Output:

    [<a class="sister" href="www.google.com" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    

    6.2. Returning tags by attribute

    # Return every <a> tag whose class is sister; the result is a list
    # class is a Python keyword, so use class_ instead
    links = soup.find_all("a", class_="sister")
    print(links)
    print('-' * 20)
    # Same as above
    links = soup.find_all("a", attrs={"class": "sister"})
    print(links)
    print('-' * 20)
    # Return every <a> tag whose id is link2; the result is a list
    links = soup.find_all("a", id="link2")
    print(links)
    

    Output:

    [<a class="sister" href="www.google.com" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    --------------------
    [<a class="sister" href="www.google.com" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    --------------------
    [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
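    Attribute filters are not limited to exact strings. As a sketch that goes slightly beyond the article, the snippet below passes a compiled regular expression, a list of candidate values, and True to find_all(); the markup is a shortened copy of the sample document.

```python
import re

from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Regular expression: searched against each href value.
by_regex = soup.find_all("a", href=re.compile(r"/(elsie|tillie)$"))
print([a["id"] for a in by_regex])  # ['link1', 'link3']

# List: the attribute may equal any one of the items.
by_list = soup.find_all("a", id=["link1", "link2"])
print([a["id"] for a in by_list])   # ['link1', 'link2']

# True: any tag that has the attribute at all, whatever its value.
with_id = soup.find_all(id=True)
print([t.name for t in with_id])    # ['a', 'a', 'a']
```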
    

    6.3. Getting the href attribute of every <a> tag

    links = soup.find_all("a")
    for a in links:
        print(a["href"])
    

    Output:

    www.google.com
    http://example.com/lacie
    http://example.com/tillie
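    On real pages some <a> tags have no href, and a["href"] would then raise KeyError. One way to guard against that, sketched below on made-up markup, is to filter with href=True and resolve relative links with urljoin from the standard library. The base_url value is a hypothetical page address.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

html = """
<a href="/elsie">Elsie</a>
<a name="anchor-only">no link here</a>
<a href="http://example.com/tillie">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")
base_url = "http://example.com/"  # hypothetical address of the page

# href=True keeps only the <a> tags that actually carry an href attribute.
hrefs = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
print(hrefs)  # ['http://example.com/elsie', 'http://example.com/tillie']
```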
    

    IV. References

    1. https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

  • Original article: https://www.cnblogs.com/sench/p/9450407.html