  • Python Web Scraping Study Notes (6)

    BS4:

    Reference documentation:

    https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

    Test1 (basic usage):

    Sample HTML:

    """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    
    <p class="story">...</p>
    """

    Test code:

    # coding=gbk
    from bs4 import BeautifulSoup
    
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    
    <p class="story">...</p>
    """
    
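    # 1. Convert the HTML string into a BeautifulSoup object
    # If no parser is named, bs4 picks the best one installed (lxml if available) and emits a warning
    # Naming the parser explicitly avoids that warning
    soup = BeautifulSoup(html_doc, 'lxml')
    
    # 2. Pretty-print the document (missing closing tags are filled in)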
    result = soup.prettify()
    print(result)

    Output:

    <html>
     <head>
      <title>
       The Dormouse's story
      </title>
     </head>
     <body>
      <p class="title">
       <b>
        The Dormouse's story
       </b>
      </p>
      <p class="story">
       Once upon a time there were three little sisters; and their names were
       <a class="sister" href="http://example.com/elsie" id="link1">
        Elsie
       </a>
       ,
       <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
       </a>
       and
       <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
       </a>
       ;
    and they lived at the bottom of a well.
      </p>
      <p class="story">
       ...
      </p>
     </body>
    </html>
    

    Test2 (reading content):

    Code:

    # coding=gbk
    from bs4 import BeautifulSoup
    
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    
    <p class="story">...</p>
    """
    
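    # 1. Convert the HTML string into a BeautifulSoup object
    # If no parser is named, bs4 picks the best one installed (lxml if available) and emits a warning
    # Naming the parser explicitly avoids that warning
    soup = BeautifulSoup(html_doc, 'lxml')
    
    # 2. Access tags by name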
    result1 = soup.head
    result2 = soup.p
    result3 = soup.a
    print(result1)
    print(result2)
    print(result3)
    
    # 3. Read the text inside a tag
    result4 = soup.a.string
    print(result4)
    # 4. Read an attribute of a tag
    result5 = soup.a['href']
    print(result5)

    Output:

    <head><title>The Dormouse's story</title></head>
    <p class="title"><b>The Dormouse's story</b></p>
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    Elsie
    http://example.com/elsie

    Note:

    As the output shows, accessing a tag by name (for example soup.a) returns only the first matching tag in the document.
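
    To make the note above concrete, here is a minimal sketch contrasting attribute-style access with find_all(). The shortened html_doc and the choice of Python's built-in 'html.parser' (instead of 'lxml') are assumptions made only to keep the example small and dependency-light.

```python
from bs4 import BeautifulSoup

# Shortened sample document (an assumption for illustration).
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""

# 'html.parser' ships with Python; the notes themselves use 'lxml'.
soup = BeautifulSoup(html_doc, "html.parser")

first_link = soup.a               # attribute access: the first <a> only
all_links = soup.find_all("a")    # every <a>, in document order

first_text = first_link.string
all_texts = [link.string for link in all_links]

print(first_text)
print(all_texts)
```

    When every matching tag is needed, find_all() (or select()) is the tool to reach for; soup.a is a shortcut for the first match only.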

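    Test3 (the four object types):

    The four object types:

    Beautiful Soup converts a complex HTML document into a tree structure
    in which every node is a Python object.
    All of these objects fall into four types:
    Tag, NavigableString, BeautifulSoup, and Comment.

    Code: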

    # coding=gbk
    from bs4 import BeautifulSoup
    
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="story"><!--s1mpL3...--></p>
    <p class="title"><b>The Dormouse's story</b></p>
    
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    
    """
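    # 1. Convert the HTML string into a BeautifulSoup object
    # If no parser is named, bs4 picks the best one installed (lxml if available) and emits a warning
    # Naming the parser explicitly avoids that warning
    soup = BeautifulSoup(html_doc, 'lxml')
    print(type(soup))
    # 2. Access tags by name
    # Tag objects have type bs4.element.Tag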
    result1 = soup.head
    result2 = soup.p.string
    print(result2)
    result3 = soup.a
    print(type(result1))
    
    # The content of an HTML comment has type bs4.element.Comment
    print(type(result2))
    
    print(type(result3))
    
    # 3. Read the text inside a tag: NavigableString
    result4 = soup.a.string
    print(type(result4))
    
    # 4. Read an attribute of a tag: plain str
    result5 = soup.a['href']
    print(type(result5))
    print(type(soup))

    Output:

    <class 'bs4.BeautifulSoup'>
    s1mpL3...
    <class 'bs4.element.Tag'>
    <class 'bs4.element.Comment'>
    <class 'bs4.element.Tag'>
    <class 'bs4.element.NavigableString'>
    <class 'str'>
    <class 'bs4.BeautifulSoup'>
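
    One practical consequence of the output above: .string can hand back a Comment, which prints just like ordinary text. A minimal sketch of telling the two apart follows. The helper name visible_string is hypothetical, and 'html.parser' is used here instead of 'lxml' only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

html_doc = """
<p class="story"><!--s1mpL3...--></p>
<p class="title"><b>The Dormouse's story</b></p>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def visible_string(tag):
    """Hypothetical helper: return tag.string as str, or None if it is
    missing or is an HTML comment."""
    value = tag.string
    if value is None or isinstance(value, Comment):
        return None
    return str(value)


comment_node = soup.p.string   # first <p> holds only a comment
text_node = soup.b.string      # <b> holds ordinary text

# Comment is a subclass of NavigableString, so check for Comment first.
comment_is_navigable = isinstance(comment_node, NavigableString)

print(type(comment_node).__name__, type(text_node).__name__)
print(visible_string(soup.p), visible_string(soup.b))
```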

    Test4 (general method: find()):

    Overview:

    find -- returns the first tag object that matches the query

    Code:

    # coding=gbk
    from bs4 import BeautifulSoup
    
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="story"><!--s1mpL3...--></p>
    <p class="title"><b>The Dormouse's story</b></p>
    
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    
    """
    # 1. Convert the HTML string into a BeautifulSoup object
    soup = BeautifulSoup(html_doc, 'lxml')
    # 2. General search methods
    # find -- returns the first tag that matches the query
    result1 = soup.find(name="p")
    result2 = soup.find(attrs={"class": "title"})
    result3 = soup.find(string="Tillie")  # string= is the newer name for text= (bs4 4.4.0+)
    result4 = soup.find(
        name="p",
        attrs={"class": "title"},
    )
    print(result1)
    print(result2)
    print(result3)
    print(result4)

    Output:

    <p class="story"><!--s1mpL3...--></p>
    <p class="title"><b>The Dormouse's story</b></p>
    Tillie
    <p class="title"><b>The Dormouse's story</b></p>

    Test5 (general method: find_all()):

    Overview:

    find_all -- returns a list of all matching tag objects

    Code:

    # coding=gbk
    from bs4 import BeautifulSoup
    
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="story"><!--s1mpL3...--></p>
    <p class="title"><b>The Dormouse's story</b></p>
    
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    
    """
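    # 1. Convert the HTML string into a BeautifulSoup object
    soup = BeautifulSoup(html_doc, 'lxml')
    # 2. General search methods
    # find_all -- returns a list of all matching tag objects
    result1 = soup.find_all('a')
    result2 = soup.find_all("a", limit=1)[0]  # roughly what find() does internally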
    result3 = soup.find_all(attrs={"class": "sister"})
    
    print(result1)
    print(result2)
    print(result3)

    Output:

    [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
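
    The find_all("a", limit=1)[0] idiom raises IndexError when nothing matches, whereas find() returns None. The following is a minimal sketch of that difference, not the library's actual source. The helper name first_or_none and the tiny html_doc are hypothetical, and 'html.parser' stands in for 'lxml'.

```python
from bs4 import BeautifulSoup

html_doc = '<p><a id="link1">Elsie</a> <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html_doc, "html.parser")


def first_or_none(node, name):
    """Hypothetical helper mimicking find(): take at most one result
    from find_all() and fall back to None when the list is empty."""
    results = node.find_all(name, limit=1)
    return results[0] if results else None


# Both calls return the very same Tag object from the tree.
same_tag = first_or_none(soup, "a") is soup.find("a")

# No <table> exists: find() and the helper give None instead of raising.
missing_via_helper = first_or_none(soup, "table")
missing_via_find = soup.find("table")

print(same_tag, missing_via_helper, missing_via_find)
```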

    Test6 (general method: select_one()):

    Overview:

    select_one -- CSS selector, returns the first match

    Code:

    # coding=gbk
    from bs4 import BeautifulSoup
    
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="story"><!--s1mpL3...--></p>
    <p class="title"><b>The Dormouse's story</b></p>
    
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    
    """
    # 1. Convert the HTML string into a BeautifulSoup object
    soup = BeautifulSoup(html_doc, 'lxml')
    # 2. General search methods
    # select_one -- CSS selector, returns the first match
    # It behaves like select() with limit=1
    result1 = soup.select_one('.sister')
    
    print(result1)

    Output:

    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
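
    Because select_one() returns a single tag rather than a list, a selector with no match yields None, so it is worth checking before indexing into the result. A minimal sketch follows; the '.brother' class is a made-up selector that matches nothing, and 'html.parser' is used instead of 'lxml'.

```python
from bs4 import BeautifulSoup

html_doc = (
    '<p class="story">'
    '<a class="sister" id="link1">Elsie</a>'
    '<a class="sister" id="link2">Lacie</a>'
    '</p>'
)
soup = BeautifulSoup(html_doc, "html.parser")

first_sister = soup.select_one(".sister")   # first matching Tag
no_match = soup.select_one(".brother")      # nothing matches: None
empty_list = soup.select(".brother")        # select() gives an empty list

# Guard against None before reading attributes.
first_id = first_sister["id"] if first_sister is not None else None

print(first_id, no_match, list(empty_list))
```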

    Test7 (general method: select()):

    Overview:

    select -- CSS selector, returns a list of all matches

    Code:

    # coding=gbk
    from bs4 import BeautifulSoup
    
    html_doc = """
    <html><head><title id=one>The Dormouse's story</title></head>
    <body>
    <p class="story"><!--s1mpL3...--></p>
    <p class="title"><b>The Dormouse's story</b></p>
    
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    
    """
    # 1. Convert the HTML string into a BeautifulSoup object
    soup = BeautifulSoup(html_doc, 'lxml')
    # 2. General search methods
    # select -- CSS selector, returns a list of all matches
    result1 = soup.select('.sister')
    result2 = soup.select('#one')
    result3 = soup.select('head title')
    result4 = soup.select('title, .title')
    result5 = soup.select('a[id="link3"]')
    
    print(result1)
    print(result2)
    print(result3)
    print(result4)
    print(result5)

    Output:

    [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    [<title id="one">The Dormouse's story</title>]
    [<title id="one">The Dormouse's story</title>]
    [<title id="one">The Dormouse's story</title>, <p class="title"><b>The Dormouse's story</b></p>]
    [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
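
    The selectors above cover class, id, descendant, grouping, and exact attribute match. As a small extension, the sketch below tries two more standard CSS forms, the child combinator (>) and the attribute suffix match ($=). It assumes a bs4 release that delegates CSS selection to Soup Sieve (4.7 or later), and it uses 'html.parser' rather than 'lxml'.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Child combinator: <a> tags that are direct children of p.story.
child_links = soup.select("p.story > a")

# Attribute suffix match: href values ending in "tillie".
ends_with = soup.select('a[href$="tillie"]')

child_ids = [a["id"] for a in child_links]
ends_ids = [a["id"] for a in ends_with]

print(child_ids)
print(ends_ids)
```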

    Test8 (general method: get_text()):

    Code:

    # coding=gbk
    from bs4 import BeautifulSoup
    
    html_doc = """
    <html><head><title id=one>The Dormouse's story</title></head>
    <body>
    <p class="story"><!--s1mpL3...--></p>
    <p class="title"><b>The Dormouse's story</b></p>
    
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    
    """
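    # 1. Convert the HTML string into a BeautifulSoup object
    soup = BeautifulSoup(html_doc, 'lxml')
    # 2. General search methods
    # Text wrapped by a tag (select() returns a list, so index into it first)
    result1 = soup.select('b')[0].get_text()
    # An attribute of a tag
    result2 = soup.select('#link1')[0].get('href')
    print(result1)
    print(result2)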

    Output:

    The Dormouse's story
    http://example.com/elsie
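
    Two details of the calls above are easy to miss: get_text() gathers the text of all descendants, not just the tag's direct text, and get() returns None for a missing attribute where square-bracket access raises KeyError. A minimal sketch follows; the one-line html_doc, with a second link that deliberately has no href, is an assumption for illustration, and 'html.parser' stands in for 'lxml'.

```python
from bs4 import BeautifulSoup

html_doc = (
    '<p class="story">Once upon a time '
    '<a href="http://example.com/elsie" id="link1">Elsie</a> and '
    '<a id="link2">Lacie</a></p>'
)
soup = BeautifulSoup(html_doc, "html.parser")
paragraph = soup.select_one("p.story")

# get_text() concatenates every text node under the tag.
full_text = paragraph.get_text()

# A separator plus strip=True trims each piece before joining.
piped_text = paragraph.get_text("|", strip=True)

links = paragraph.select("a")
hrefs = [a.get("href") for a in links]            # None when absent
hrefs_default = [a.get("href", "") for a in links]  # explicit default

# Square-bracket access raises KeyError for a missing attribute.
try:
    links[1]["href"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(full_text)
print(piped_text)
print(hrefs, hrefs_default, raised_key_error)
```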

    XML:

    Data interchange format:

    A format for exchanging data between the front end, mobile clients, and the back end.

    Parameters:

    Server, [ ], dict = {}

    key = value

    <key>value</key>
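
    The key = value to <key>value</key> correspondence can be sketched with the standard library's xml.etree.ElementTree. This is only an illustration under assumptions: the root tag "data", the sample payload, and the helper names dict_to_xml and xml_to_dict are all hypothetical, and only a flat dict of string values is handled.

```python
import xml.etree.ElementTree as ET


def dict_to_xml(root_tag, data):
    """Hypothetical helper: turn a flat dict into <key>value</key> children."""
    root = ET.Element(root_tag)
    for key, value in data.items():
        child = ET.SubElement(root, key)
        child.text = str(value)
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(root, encoding="unicode")


def xml_to_dict(xml_text):
    """Hypothetical helper: read the children of the root back into a dict."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}


payload = {"name": "Elsie", "city": "well"}   # made-up sample data
xml_text = dict_to_xml("data", payload)
round_trip = xml_to_dict(xml_text)

print(xml_text)
print(round_trip)
```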

  • Original article: https://www.cnblogs.com/3cH0-Nu1L/p/14487352.html