zoukankan      html  css  js  c++  java
  • Beautiful Soup 4.4.0 基本使用方法

    Beautiful Soup 4.4.0 基本使用方法
    Beautiful Soup 安装 pip install  beautifulsoup4 标准库有html.parser解析器但速度不是很快一般还需安装第三方的解析器:
    pip install lxml pip install html5lib
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """
    from bs4 import BeautifulSoup
    soup=BeautifulSoup(html_doc,'html.parser')
    soup.title   #title标签 <title>The Dormouse's story</title>
    soup.title.string  #title标签内容 The Dormouse's story
    soup.title.name#title标签名称 title
    soup.title.parent #head标签 <head><title>The Dormouse's story</title></head> children
    soup.head.contents#取head标签里的所有子标签 <title>The Dormouse's story</title>
    soup.head.contents[0]#取head标签的第一个子标签 
    soup.p #第一个p标签 <p class="title"><b>The Dormouse's story</b></p>
    soup.p['class']#第一个p标签class名称 html解析器返回结果是一个列表[title] xml返回是一个字符串“title”
    soup.prettify() #按照标准的缩进格式的结构完整输出(自动补结尾的</body></html>
    soup.find_all('a')#找所有的a标签
    soup.find(id='link3')#找id为link3的标签
    soup.find_all(["a", "b"])#查找a b标签
    soup.find_all("p", "title")#查找p标签并且属性为title
    soup.find_all("a", class_="sister")
    soup.find_all("a", limit=2)
    soup.get_text()#从文档中获取所有文字内容:
    for a in soup.find_all('a'):
        print a.get('href')  #取得所有a标签的href值
    调用tag的 find_all() 方法时,Beautiful Soup会检索当前tag的所有子孙节点,如果只想搜索tag的直接子节点,可以使用参数 recursive=False 
    css选择器:

    通过tag标签逐层查找:

    soup.select("body a")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.select("html head title")
    # [<title>The Dormouse's story</title>]
    找到某个tag标签下的直接子标签 [6] :

    soup.select("head > title")
    # [<title>The Dormouse's story</title>]

    soup.select("p > a")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.select("p > a:nth-of-type(2)")
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

    soup.select("p > #link1")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

    soup.select("body > a")
    # []
    找到兄弟节点标签:

    soup.select("#link1 ~ .sister")
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie"  id="link3">Tillie</a>]

    soup.select("#link1 + .sister")
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
    通过CSS的类名查找:

    soup.select(".sister")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.select("[class~=sister]")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    通过tag的id查找:

    soup.select("#link1")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

    soup.select("a#link2")
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
    同时用多种CSS选择器查询元素:

    soup.select("#link1,#link2")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
    通过是否存在某个属性来查找:

    soup.select('a[href]')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    通过属性的值来查找:

    soup.select('a[href="http://example.com/elsie"]')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

    soup.select('a[href^="http://example.com/"]')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.select('a[href$="tillie"]')
    # [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.select('a[href*=".com/el"]')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]



  • 相关阅读:
    2、容器初探
    3、二叉树:先序,中序,后序循环遍历详解
    Hebbian Learning Rule
    论文笔记 Weakly-Supervised Spatial Context Networks
    在Caffe添加Python layer详细步骤
    论文笔记 Learning to Compare Image Patches via Convolutional Neural Networks
    Deconvolution 反卷积理解
    论文笔记 Feature Pyramid Networks for Object Detection
    Caffe2 初识
    论文笔记 Densely Connected Convolutional Networks
  • 原文地址:https://www.cnblogs.com/howhy/p/7562425.html
Copyright © 2011-2022 走看看