zoukankan      html  css  js  c++  java
  • BeautifulSoup库的使用

    BeautifulSoup

    • 网页解析库

    • 提取网页数据

    • 第三方库,使用的时候进行安装: pip3 install beautifulsoup4

    解析库
    解析器使用方法
    python标准库 BeautifulSoup(markup,"html.parser")
    lxml HTML解析器 BeautifulSoup(markup,"lxml")
    lxml XML解析器 BeautifulSoup(markup,"xml")
    html5lib BeautifulSoup(markup,"html5lib")
    基本使用
    html ="""
    <html lang="zh-cn">
    <head>
      <meta charset="utf-8" />
      <meta name="viewport" content="width=device-width, initial-scale=1" />
    <meta name="referrer" content="origin" />
      <title>liudemeng - 博客园</title>
      <link type="text/css" rel="stylesheet" href="/bundles/blog-common.css?v=svlZeZMvc3He7PqOD4T7SOuQn0_kIfLBYiy3idqd35Y1"/>
    <link id="MainCss" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden.css?v=R6EW1cwbYc7SqZ5y0CMKPNjYaFnIdEGDIwRo4NL-lHw1"/>
    <link id="mobile-style" media="only screen and (max- 767px)" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden-mobile.css?v=0rD5dztz_pRczoCFf4jzWyb4-oTf_yCZ8ttZxIagC2s1"/>
      <link title="RSS" type="application/rss+xml" rel="alternate" href="https://www.cnblogs.com/liudemeng/rss"/>
      <link title="RSD" type="application/rsd+xml" rel="EditURI" href="https://www.cnblogs.com/liudemeng/rsd.xml"/>
    <link type="application/wlwmanifest+xml" rel="wlwmanifest" href="https://www.cnblogs.com/liudemeng/wlwmanifest.xml"/>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.prettify())
    print(soup.title.string)
    标签选择器
    选择元素
     html ="""
    <html lang="zh-cn">
    <head>
      <meta charset="utf-8" />
      <meta name="viewport" content="width=device-width, initial-scale=1" />
    <meta name="referrer" content="origin" />
      <title>liudemeng - 博客园</title>
      <link type="text/css" rel="stylesheet" href="/bundles/blog-common.css?v=svlZeZMvc3He7PqOD4T7SOuQn0_kIfLBYiy3idqd35Y1"/>
    <link id="MainCss" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden.css?v=R6EW1cwbYc7SqZ5y0CMKPNjYaFnIdEGDIwRo4NL-lHw1"/>
    <link id="mobile-style" media="only screen and (max- 767px)" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden-mobile.css?v=0rD5dztz_pRczoCFf4jzWyb4-oTf_yCZ8ttZxIagC2s1"/>
      <link title="RSS" type="application/rss+xml" rel="alternate" href="https://www.cnblogs.com/liudemeng/rss"/>
      <link title="RSD" type="application/rsd+xml" rel="EditURI" href="https://www.cnblogs.com/liudemeng/rsd.xml"/>
    <link type="application/wlwmanifest+xml" rel="wlwmanifest" href="https://www.cnblogs.com/liudemeng/wlwmanifest.xml"/>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.title)
    print(soup.link)
    print(soup.head)
    获取名称
     html ="""
    <html lang="zh-cn">
    <head>
      <meta charset="utf-8" />
      <meta name="viewport" content="width=device-width, initial-scale=1" />
    <meta name="referrer" content="origin" />
      <title>liudemeng - 博客园</title>
      <link type="text/css" rel="stylesheet" href="/bundles/blog-common.css?v=svlZeZMvc3He7PqOD4T7SOuQn0_kIfLBYiy3idqd35Y1"/>
    <link id="MainCss" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden.css?v=R6EW1cwbYc7SqZ5y0CMKPNjYaFnIdEGDIwRo4NL-lHw1"/>
    <link id="mobile-style" media="only screen and (max- 767px)" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden-mobile.css?v=0rD5dztz_pRczoCFf4jzWyb4-oTf_yCZ8ttZxIagC2s1"/>
      <link title="RSS" type="application/rss+xml" rel="alternate" href="https://www.cnblogs.com/liudemeng/rss"/>
      <link title="RSD" type="application/rsd+xml" rel="EditURI" href="https://www.cnblogs.com/liudemeng/rsd.xml"/>
    <link type="application/wlwmanifest+xml" rel="wlwmanifest" href="https://www.cnblogs.com/liudemeng/wlwmanifest.xml"/>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.title.name)  # title
    获取内容
    html ="""
    <html lang="zh-cn">
    <head>
      <meta charset="utf-8" />
      <meta content="width=device-width, initial-scale=1" />
    <meta name="referrer" content="origin" />
      <title>liudemeng - 博客园</title>
      <link type="text/css" rel="stylesheet" href="/bundles/blog-common.css?v=svlZeZMvc3He7PqOD4T7SOuQn0_kIfLBYiy3idqd35Y1"/>
    <link id="MainCss" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden.css?v=R6EW1cwbYc7SqZ5y0CMKPNjYaFnIdEGDIwRo4NL-lHw1"/>
    <link id="mobile-style" media="only screen and (max- 767px)" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden-mobile.css?v=0rD5dztz_pRczoCFf4jzWyb4-oTf_yCZ8ttZxIagC2s1"/>
      <link title="RSS" type="application/rss+xml" rel="alternate" href="https://www.cnblogs.com/liudemeng/rss"/>
      <link title="RSD" type="application/rsd+xml" rel="EditURI" href="https://www.cnblogs.com/liudemeng/rsd.xml"/>
    <link type="application/wlwmanifest+xml" rel="wlwmanifest" href="https://www.cnblogs.com/liudemeng/wlwmanifest.xml"/>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.link.attrs['rel'])
    获取内容
    html ="""
    <html lang="zh-cn">
    <head>
      <meta charset="utf-8" />
      <meta content="width=device-width, initial-scale=1" />
    <meta name="referrer" content="origin" />
      <title>liudemeng - 博客园</title>
      <link type="text/css" rel="stylesheet" href="/bundles/blog-common.css?v=svlZeZMvc3He7PqOD4T7SOuQn0_kIfLBYiy3idqd35Y1"/>
    <link id="MainCss" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden.css?v=R6EW1cwbYc7SqZ5y0CMKPNjYaFnIdEGDIwRo4NL-lHw1"/>
    <link id="mobile-style" media="only screen and (max- 767px)" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden-mobile.css?v=0rD5dztz_pRczoCFf4jzWyb4-oTf_yCZ8ttZxIagC2s1"/>
      <link title="RSS" type="application/rss+xml" rel="alternate" href="https://www.cnblogs.com/liudemeng/rss"/>
      <link title="RSD" type="application/rsd+xml" rel="EditURI" href="https://www.cnblogs.com/liudemeng/rsd.xml"/>
    <link type="application/wlwmanifest+xml" rel="wlwmanifest" href="https://www.cnblogs.com/liudemeng/wlwmanifest.xml"/>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.title.string)
    嵌套选择
    html ="""
    <html lang="zh-cn">
    <head>
      <meta charset="utf-8" />
      <meta content="width=device-width, initial-scale=1" />
    <meta name="referrer" content="origin" />
      <title>liudemeng - 博客园</title>
      <link type="text/css" rel="stylesheet" href="/bundles/blog-common.css?v=svlZeZMvc3He7PqOD4T7SOuQn0_kIfLBYiy3idqd35Y1"/>
    <link id="MainCss" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden.css?v=R6EW1cwbYc7SqZ5y0CMKPNjYaFnIdEGDIwRo4NL-lHw1"/>
    <link id="mobile-style" media="only screen and (max- 767px)" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden-mobile.css?v=0rD5dztz_pRczoCFf4jzWyb4-oTf_yCZ8ttZxIagC2s1"/>
      <link title="RSS" type="application/rss+xml" rel="alternate" href="https://www.cnblogs.com/liudemeng/rss"/>
      <link title="RSD" type="application/rsd+xml" rel="EditURI" href="https://www.cnblogs.com/liudemeng/rsd.xml"/>
    <link type="application/wlwmanifest+xml" rel="wlwmanifest" href="https://www.cnblogs.com/liudemeng/wlwmanifest.xml"/>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.head.title.string) # 嵌套选择
    # liudemeng - 博客园
    子节点和子孙节点
    html = """<div class="postCon"><div class="c_b_p_desc">摘要: 数据提取方法 json 数据交换格式,看起来像python中的(字典)的字符串 使用之前进行导包处理 import json 哪里会有json的数据 浏览器切换到手机版 抓包app json.loads 把json字符串转化为python类型 json.loads(json字符串) json.dum<a href="https://www.cnblogs.com/liudemeng/p/10715075.html" class="c_b_p_desc_readmore">阅读全文</a></div></div>"""
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.div.contents)
    html = """<div class="postCon"><div class="c_b_p_desc">摘要: 数据提取方法 json 数据交换格式,看起来像python中的(字典)的字符串 使用之前进行导包处理 import json 哪里会有json的数据 浏览器切换到手机版 抓包app json.loads 把json字符串转化为python类型 json.loads(json字符串) json.dum<a href="https://www.cnblogs.com/liudemeng/p/10715075.html" class="c_b_p_desc_readmore">阅读全文</a></div></div>"""
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.div.children)
    for i, child in enumerate(soup.div.children):
       print(i, child)
    父节点和祖先节点

    html ="""
    <html lang="zh-cn">    
    <head>
      <meta charset="utf-8" />
      <meta content="width=device-width, initial-scale=1" />
    <meta name="referrer" content="origin" />
      <title>liudemeng - 博客园</title>
      <link type="text/css" rel="stylesheet" href="/bundles/blog-common.css?v=svlZeZMvc3He7PqOD4T7SOuQn0_kIfLBYiy3idqd35Y1"/>
    <link id="MainCss" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden.css?v=R6EW1cwbYc7SqZ5y0CMKPNjYaFnIdEGDIwRo4NL-lHw1"/>
    <link id="mobile-style" media="only screen and (max- 767px)" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden-mobile.css?v=0rD5dztz_pRczoCFf4jzWyb4-oTf_yCZ8ttZxIagC2s1"/>
      <link title="RSS" type="application/rss+xml" rel="alternate" href="https://www.cnblogs.com/liudemeng/rss"/>
      <link title="RSD" type="application/rsd+xml" rel="EditURI" href="https://www.cnblogs.com/liudemeng/rsd.xml"/>
    <link type="application/wlwmanifest+xml" rel="wlwmanifest" href="https://www.cnblogs.com/liudemeng/wlwmanifest.xml"/>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.meta.parent)
    兄弟节点
    html ="""
    <html lang="zh-cn">
    <head>
      <meta charset="utf-8" />
      <meta content="width=device-width, initial-scale=1" />
    <meta name="referrer" content="origin" />
      <title>liudemeng - 博客园</title>
      <link type="text/css" rel="stylesheet" href="/bundles/blog-common.css?v=svlZeZMvc3He7PqOD4T7SOuQn0_kIfLBYiy3idqd35Y1"/>
    <link id="MainCss" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden.css?v=R6EW1cwbYc7SqZ5y0CMKPNjYaFnIdEGDIwRo4NL-lHw1"/>
    <link id="mobile-style" media="only screen and (max- 767px)" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden-mobile.css?v=0rD5dztz_pRczoCFf4jzWyb4-oTf_yCZ8ttZxIagC2s1"/>
      <link title="RSS" type="application/rss+xml" rel="alternate" href="https://www.cnblogs.com/liudemeng/rss"/>
      <link title="RSD" type="application/rsd+xml" rel="EditURI" href="https://www.cnblogs.com/liudemeng/rsd.xml"/>
    <link type="application/wlwmanifest+xml" rel="wlwmanifest" href="https://www.cnblogs.com/liudemeng/wlwmanifest.xml"/>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.meta.next_siblings)
    print(soup.meta.previous_siblings)
    标准选择器
    find_all(name, attrs, recursive,text,**kwargs)

    可根据标签名.属性.内容查找文档

    name
    html ="""
    <html lang="zh-cn">
    <head>
      <meta charset="utf-8" />
      <meta content="width=device-width, initial-scale=1" />
    <meta name="referrer" content="origin" />
      <title>liudemeng - 博客园</title>
      <link type="text/css" rel="stylesheet" href="/bundles/blog-common.css?v=svlZeZMvc3He7PqOD4T7SOuQn0_kIfLBYiy3idqd35Y1"/>
    <link id="MainCss" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden.css?v=R6EW1cwbYc7SqZ5y0CMKPNjYaFnIdEGDIwRo4NL-lHw1"/>
    <link id="mobile-style" media="only screen and (max- 767px)" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden-mobile.css?v=0rD5dztz_pRczoCFf4jzWyb4-oTf_yCZ8ttZxIagC2s1"/>
      <link title="RSS" type="application/rss+xml" rel="alternate" href="https://www.cnblogs.com/liudemeng/rss"/>
      <link title="RSD" type="application/rsd+xml" rel="EditURI" href="https://www.cnblogs.com/liudemeng/rsd.xml"/>
    <link type="application/wlwmanifest+xml" rel="wlwmanifest" href="https://www.cnblogs.com/liudemeng/wlwmanifest.xml"/>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all('link'))
    print(soup.find_all('link')[0])
    attrs
    html ="""
    <html lang="zh-cn">
    <head>
      <meta charset="utf-8" />
      <meta content="width=device-width, initial-scale=1" />
    <meta name="referrer" content="origin" />
      <title>liudemeng - 博客园</title>
      <link class='element' type="text/css" rel="stylesheet" href="/bundles/blog-common.css?v=svlZeZMvc3He7PqOD4T7SOuQn0_kIfLBYiy3idqd35Y1"/>
    <link id="MainCss" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden.css?v=R6EW1cwbYc7SqZ5y0CMKPNjYaFnIdEGDIwRo4NL-lHw1"/>
    <link id="mobile-style" media="only screen and (max- 767px)" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden-mobile.css?v=0rD5dztz_pRczoCFf4jzWyb4-oTf_yCZ8ttZxIagC2s1"/>
      <link title="RSS" type="application/rss+xml" rel="alternate" href="https://www.cnblogs.com/liudemeng/rss"/>
      <link id="list-1" title="RSD" type="application/rsd+xml" rel="EditURI" href="https://www.cnblogs.com/liudemeng/rsd.xml"/>
    <link type="application/wlwmanifest+xml" rel="wlwmanifest" href="https://www.cnblogs.com/liudemeng/wlwmanifest.xml"/>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(attrs={'id':"list-1"}))
    print(soup.find_all(class_='element'))
    # 以上两种写法都是正确的
    text
    html ="""
    <html lang="zh-cn">
    <head>
      <meta charset="utf-8" />
      <meta content="width=device-width, initial-scale=1" />
    <meta name="referrer" content="origin" />
      <title>liudemeng - 博客园</title>
      <link class='element' type="text/css" rel="stylesheet" href="/bundles/blog-common.css?v=svlZeZMvc3He7PqOD4T7SOuQn0_kIfLBYiy3idqd35Y1"/>
    <link id="MainCss" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden.css?v=R6EW1cwbYc7SqZ5y0CMKPNjYaFnIdEGDIwRo4NL-lHw1"/>
    <link id="mobile-style" media="only screen and (max- 767px)" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden-mobile.css?v=0rD5dztz_pRczoCFf4jzWyb4-oTf_yCZ8ttZxIagC2s1"/>
      <link title="RSS" type="application/rss+xml" rel="alternate" href="https://www.cnblogs.com/liudemeng/rss"/>
      <link id="list-1" title="RSD" type="application/rsd+xml" rel="EditURI" href="https://www.cnblogs.com/liudemeng/rsd.xml"/>
    <link type="application/wlwmanifest+xml" rel="wlwmanifest" href="https://www.cnblogs.com/liudemeng/wlwmanifest.xml"/>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(text="liudemeng - 博客园"))
    # ['liudemeng - 博客园']
    find(name, attrs, recursive,text,**kwargs)
    find返回单个元素,find_all返回所有的元素
    find_parents() 和 find_parent()
    find_parents() 返回所有的父节点, find_parent()返回父节点
    总结:
    • 建议使用lxml解析库, 必要时使用html.parser

    • 标签选择器筛选功能弱但是速度快

    • 建议使用find(), find_all() 查询匹配单个结果或者多个结果

    • 记住常用的获取属性以及文本值的方法

  • 相关阅读:
    HDOj-1412
    HDOJ-2153
    HDOJ-1002
    紫书 例题 11-5 UVa 10048 (Floyd求最大权值最小的路径)
    紫书 例题11-4 UVa247 (Floyd判断联通)
    最短路模板
    紫书 例题 11-3 UVa 1151 (有边集的最小生成树+二进制枚举子集)
    紫书 例题 11-2 UVa 1395(最大边减最小边最小的生成树)
    紫书 例题 11-1 UVa 12219 (表达式树)
    紫书 习题 8-25 UVa 11175 (结论证明)(配图)
  • 原文地址:https://www.cnblogs.com/liudemeng/p/10724281.html
Copyright © 2011-2022 走看看