zoukankan      html  css  js  c++  java
  • python爬虫之BeautifulSoup的HTML解析

      BeautifulSoup是一个用于从HTML和XML文件中提取数据的python库,它提供一些简单的函数来处理导航、搜索、修改分析树等功能。BeautifulSoup能自动将文档转换成Unicode编码,输出文档转换为UTF-8编码。

      本例直接创建模拟HTML代码,进行美化:

    # 导入BeautifulSoup库
    from bs4 import BeautifulSoup
    
    # 创建模拟HTML代码的字符串
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    
    <p class="story">...</p>
    """
    # 创建一个BeautifulSoup对象,获取页面正文
    soup = BeautifulSoup(html_doc, features="lxml")
    # 打印解析的HTML代码
    print('经BeautifulSoup美化的代码:',soup)
    print('===================================')
    print('通过prettify()方法进行代码的格式化处理:',soup.prettify())

    结果:

    经BeautifulSoup美化的代码: <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    </body></html>
    ===================================
    通过prettify()方法进行代码的格式化处理: <html>
     <head>
      <title>
       The Dormouse's story
      </title>
     </head>
     <body>
      <p class="title">
       <b>
        The Dormouse's story
       </b>
      </p>
      <p class="story">
       Once upon a time there were three little sisters; and their names were
       <a class="sister" href="http://example.com/elsie" id="link1">
        Elsie
       </a>
       ,
       <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
       </a>
       and
       <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
       </a>
       ;
    and they lived at the bottom of a well.
      </p>
      <p class="story">
       ...
      </p>
     </body>
    </html>

  • 相关阅读:
    Flex动画
    八大排序算法
    Android switch语句“case expressions must be constant expressions”
    MySQL修改root密码的多种方法
    Android中ListView控件onItemClick事件中获取listView传递的数据
    超详细Android接入支付宝支付实现,有图有真相
    Android蓝牙开发---与蓝牙模块进行通信
    Leecode no.19 删除链表的倒数第 N 个结点
    玩转java静态/动态代理
    Leecode no.198. 打家劫舍
  • 原文地址:https://www.cnblogs.com/xiao02fang/p/12933835.html
Copyright © 2011-2022 走看看