zoukankan      html  css  js  c++  java
  • Beautiful Soup 4.4.0 基本使用方法

    Beautiful Soup 4.4.0 基本使用方法
    Beautiful Soup 安装 pip install  beautifulsoup4 标准库有html.parser解析器但速度不是很快一般还需安装第三方的解析器:
    pip install lxml pip install html5lib
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """
    from bs4 import BeautifulSoup
    soup=BeautifulSoup(html_doc,'html.parser')
    soup.title   #title标签 <title>The Dormouse's story</title>
    soup.title.string  #title标签内容 The Dormouse's story
    soup.title.name#title标签名称 title
    soup.title.parent #head标签 <head><title>The Dormouse's story</title></head> children
    soup.head.contents#取head标签里的所有子标签 <title>The Dormouse's story</title>
    soup.head.contents[0]#取head标签的第一个子标签 
    soup.p #第一个p标签 <p class="title"><b>The Dormouse's story</b></p>
    soup.p['class']#第一个p标签class名称 html解析器返回结果是一个列表[title] xml返回是一个字符串“title”
    soup.prettify() #按照标准的缩进格式的结构完整输出(自动补结尾的</body></html>
    soup.find_all('a')#找所有的a标签
    soup.find(id='link3')#找id为link3的标签
    soup.find_all(["a", "b"])#查找a b标签
    soup.find_all("p", "title")#查找p标签并且属性为title
    soup.find_all("a", class_="sister")
    soup.find_all("a", limit=2)
    soup.get_text()#从文档中获取所有文字内容:
    for a in soup.find_all('a'):
        print a.get('href')  #取得所有a标签的href值
    调用tag的 find_all() 方法时,Beautiful Soup会检索当前tag的所有子孙节点,如果只想搜索tag的直接子节点,可以使用参数 recursive=False 
    css选择器:

    通过tag标签逐层查找:

    soup.select("body a")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.select("html head title")
    # [<title>The Dormouse's story</title>]
    找到某个tag标签下的直接子标签 [6] :

    soup.select("head > title")
    # [<title>The Dormouse's story</title>]

    soup.select("p > a")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.select("p > a:nth-of-type(2)")
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

    soup.select("p > #link1")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

    soup.select("body > a")
    # []
    找到兄弟节点标签:

    soup.select("#link1 ~ .sister")
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie"  id="link3">Tillie</a>]

    soup.select("#link1 + .sister")
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
    通过CSS的类名查找:

    soup.select(".sister")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.select("[class~=sister]")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    通过tag的id查找:

    soup.select("#link1")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

    soup.select("a#link2")
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
    同时用多种CSS选择器查询元素:

    soup.select("#link1,#link2")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
    通过是否存在某个属性来查找:

    soup.select('a[href]')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    通过属性的值来查找:

    soup.select('a[href="http://example.com/elsie"]')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

    soup.select('a[href^="http://example.com/"]')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.select('a[href$="tillie"]')
    # [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.select('a[href*=".com/el"]')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]



  • 相关阅读:
    Leecode刷题之旅-C语言/python-67二进制求和
    maven 聚合
    maven 继承
    maven 常用命令
    maven 术语
    maven安装
    RabbitMQ 消费消息
    RabbitMQ 生产消息并放入队列
    RabbitMQ 在 web 页面 创建 exchange, queue, routing key
    mybatis 通过实体类进行查询
  • 原文地址:https://www.cnblogs.com/howhy/p/7562425.html
Copyright © 2011-2022 走看看