  • Python Web Scraping Tutorial from Scratch (Day 05)

    Introduction to BeautifulSoup

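    Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the parse tree, and it can save hours or even days of work. (Quoted from the Chinese edition of the Beautiful Soup documentation.)

    Like lxml, BeautifulSoup is a powerful parsing library that makes it easy to extract individual elements from a web page, which makes it a valuable tool for scraping.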

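    Install BeautifulSoup. The package on PyPI is named beautifulsoup4 (the bs4 name on PyPI is a placeholder package that simply depends on it), and the examples below also need the lxml parser, which is a separate package:

    pip install beautifulsoup4 lxml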
    

    BeautifulSoup needs a parser to process a page. It supports the following parsers:

    Parser                    Usage
    Python standard library   BeautifulSoup(html, "html.parser")
    lxml HTML parser          BeautifulSoup(html, "lxml")
    lxml XML parser           BeautifulSoup(html, "xml")
    html5lib                  BeautifulSoup(html, "html5lib")
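
    Because lxml and html5lib are optional dependencies, a script may want to fall back to the built-in parser when they are missing. The following is a minimal sketch of that idea; the helper name make_soup and its preferred parameter are hypothetical, not part of the library. It relies on BeautifulSoup raising FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser.

    make_soup is a hypothetical helper written for this example.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised when the requested parser library is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello, <b>soup</b></p>")
print(soup.b.string)
```

    Different parsers can build slightly different trees from invalid markup, so a fallback like this is best kept to cases where the input is reasonably well-formed.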

    Using BeautifulSoup

    Import the BeautifulSoup library:

    from bs4 import BeautifulSoup
    import re
    

    Sample HTML page:

    html = """
    <html>
     <head>
      <title>
       The Dormouse's story
      </title>
     </head>
     <body>
      <p class="title" name="dromouse">
       <b>
        The Dormouse's story
       </b>
      </p>
      <p class="story">
       Once upon a time there were three little sisters; and their names were
       <a class="sister" href="http://example.com/elsie" id="link1">
        <!-- Elsie -->
       </a>
       ,
       <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
       </a>
       and
       <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
       </a>
       ;
    and they lived at the bottom of a well.
      </p>
      <p class="story">
       ...
      </p>
     </body>
    </html>
    """
    

    Basic usage

    soup = BeautifulSoup(html, 'lxml')
    print(soup.title.string.strip()) # .string returns the text inside the tag
    print(soup.p.name) # .name returns the tag name
    print(soup.p.attrs) # .attrs returns the tag's attributes as a dict
    
    The Dormouse's story
    p
    {'class': ['title'], 'name': 'dromouse'}
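
    Individual attributes can also be read straight from a tag, and .string has a corner case worth knowing: it returns None when a tag has more than one child. The sketch below uses a shortened, hypothetical document and the built-in html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# A shortened stand-in for the sample page (assumption for this example).
doc = (
    '<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>'
    "<p>one <b>two</b></p>"
)
soup = BeautifulSoup(doc, "html.parser")
first, second = soup.find_all("p")

css_classes = first["class"]   # class is multi-valued, so this is a list
name_attr = first.get("name")  # the HTML attribute called "name"
missing = first.get("id")      # .get() returns None for an absent attribute

single = first.string          # one child chain, so the text is returned
mixed = second.string          # several children, so .string is None
full_text = second.get_text()  # get_text() joins all nested text

print(css_classes, name_attr, missing)
print(single, mixed, full_text)
```

    Note the difference between first.name, which is the tag name, and first.get("name"), which reads the attribute.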
    

    Nested selection

    print(soup.head.title.string.strip()) # tag1.tag2 selects the first tag2 nested inside tag1
    
    The Dormouse's story
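
    Two details of this dotted access are easy to miss: it only ever returns the first matching tag, and it returns None when nothing matches. A small sketch, using a hypothetical one-line document and the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical document for this example.
doc = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p>first</p><p>second</p></body></html>"
)
soup = BeautifulSoup(doc, "html.parser")

title_text = soup.head.title.string  # chained access walks down the tree
first_p = soup.body.p.string         # only the first <p> is returned
no_table = soup.body.table           # there is no <table>, so this is None

print(title_text)
print(first_p)
print(no_table)
```

    Because a missing tag yields None, chaining further (for example soup.body.table.tr) would raise AttributeError, so it is worth checking for None before going deeper.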
    

    Related selection (getting the direct children of the selected element)

    print(soup.p.contents)
    
    ['\n', <b>
        The Dormouse's story
       </b>, '\n']
    
    print(soup.body.children)
    for child in soup.body.children:
        print(child)
        print('---------')
    
    <list_iterator object at 0xaddcbe70>
    
    
    
    ---------
    <p class="title" name="dromouse">
    <b>
        The Dormouse's story
       </b>
    </p>
    ---------
    
    
    ---------
    <p class="story">
       Once upon a time there were three little sisters; and their names were
       <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
    </a>
       ,
       <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
       </a>
       and
       <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
       </a>
       ;
    and they lived at the bottom of a well.
      </p>
    ---------
    
    
    
    ---------
    <p class="story">
       ...
      </p>
    ---------
    
    
    
    ---------
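
    The empty entries in the output above are whitespace-only strings sitting between the tags. One way to skip them is to keep only Tag objects, as in the sketch below, which also shows .descendants and .parent for comparison. The document is a shortened, hypothetical version of the sample page, parsed with the built-in html.parser.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened stand-in for the sample page (assumption for this example).
doc = """<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">...</p>
</body>"""
soup = BeautifulSoup(doc, "html.parser")

all_children = list(soup.body.children)  # tags and whitespace strings
tag_children = [c for c in all_children if isinstance(c, Tag)]

# .descendants walks the whole subtree, not only the direct children.
descendant_tags = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

# .parent goes one level up from any node.
parent_name = soup.b.parent.name

print(len(all_children), [t.name for t in tag_children])
print(descendant_tags, parent_name)
```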
    

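    find_all

    find_all(name, attrs, recursive, string, limit, **kwargs)

    The string argument was called text before Beautiful Soup 4.4.0; the older name is still accepted and is the one used in the examples below.

    (1) name

    print(soup.find_all(name='b')) # name is the tag name to match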
    
    [<b>
        The Dormouse's story
       </b>]
    

    (2) attrs

    print(soup.find_all(attrs={'class': 'sister'})) # select tags by attribute value
    
    [<a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
    </a>, <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
       </a>, <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
       </a>]
    
    print(soup.find_all(class_='sister')) # class_ is equivalent; the trailing underscore is needed because class is a reserved word in Python
    
    [<a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
    </a>, <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
       </a>, <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
       </a>]
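
    Other attributes can be passed as plain keyword arguments without an underscore, and find_all accepts a few more forms that are handy in practice. The sketch below assumes a shortened, hypothetical version of the sample page and uses the built-in html.parser.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample page (assumption for this example).
doc = """<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>"""
soup = BeautifulSoup(doc, "html.parser")

by_id = soup.find_all(id="link2")        # any attribute as a keyword
first_two = soup.find_all("a", limit=2)  # stop after two matches
several = soup.find_all(["a", "p"])      # a list matches any of the names
first_a = soup.find("a")                 # find() returns the first match
nothing = soup.find("table")             # or None when nothing matches

print([a["href"] for a in first_two])
print([t.name for t in several])
print(by_id[0].get_text(), first_a["id"], nothing)
```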
    

    (3) text

    print(soup.find_all('a', {'href': re.compile(r'http://(.*?)')})) # match an attribute value with a regular expression
    
    [<a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
    </a>, <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
       </a>, <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
       </a>]
    
    print(soup.find_all(text=re.compile('Dormouse'))) # match the text content of tags (text is the older name for the string argument)
    
    ["\n   The Dormouse's story\n  ", "\n    The Dormouse's story\n   "]
    
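
    Searching by text alone returns string objects rather than tags. The sketch below, written against a hypothetical two-link document, shows how to get back to the enclosing tag through .parent, and how combining a tag name with the string argument returns the tags themselves. It uses the string spelling of the argument, which assumes Beautiful Soup 4.4.0 or later.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical document for this example.
doc = '<p><a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(doc, "html.parser")

# Text-only search: the results are the matching strings.
strings = soup.find_all(string=re.compile(r"ie$"))

# Each string knows which tag contains it.
owner_ids = [s.parent["id"] for s in strings]

# With a tag name as well, the results are tags whose .string matches.
tags = soup.find_all("a", string="Lacie")

print(strings)
print(owner_ids)
print(tags)
```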

    CSS selectors

    print(soup.select('.sister')) # select every tag whose class is sister
    
    [<a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
    </a>, <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
       </a>, <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
       </a>]
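
    select accepts more than class selectors. The sketch below tries a child combinator, an id selector, and an attribute selector on a shortened, hypothetical version of the sample page; it assumes a Beautiful Soup release that bundles the soupsieve selector engine, which recent versions install by default.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample page (assumption for this example).
doc = """<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>"""
soup = BeautifulSoup(doc, "html.parser")

# Direct <a> children of <p class="story">.
links = soup.select("p.story > a")

# select_one() returns the first match, or None when nothing matches.
lacie = soup.select_one("#link2")
absent = soup.select_one("#no-such-id")

# Attribute selector: href values ending in "tillie".
tillie_hrefs = [a["href"] for a in soup.select('a[href$="tillie"]')]

print(len(links), lacie.get_text(), absent)
print(tillie_hrefs)
```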
    
  • Original article: https://www.cnblogs.com/lbr12218/p/14609053.html