  • Python Web Scraping Basics: BeautifulSoup

    I. Basic Usage of BeautifulSoup

      from bs4 import BeautifulSoup
      from bs4 import SoupStrainer
      import re


      html_doc = """
      <html>
       <head>
        <title>
         The Dormouse's story
        </title>
       </head>
       <body>
        <p class="title">
         <b>
          The Dormouse's story
         </b>
        </p>
        <p class="story">
         Once upon a time there were three little sisters; and their names were
         <a class="sister" href="http://example.com/elsie" id="link1">
          Elsie
         </a>
         ,
         <a class="sister" href="http://example.com/lacie" id="link2">
          Lacie
         </a>
         and
         <a class="sister" href="http://example.com/tillie" id="link3">
          Tillie
         </a>
         ; and they lived at the bottom of a well.
        </p>
        <p class="story">
         ...
        </p>
       </body>
      </html>
      """
      soup = BeautifulSoup(html_doc, "html.parser")
      # print(soup.prettify())  # print the whole document as normalized, indented HTML
      print('-----------------------------')
      print(soup.title)
      print('----------------------------')
      print(soup.title.name)
      print('----------------------------')
      print(soup.title.string)
      print('----------------------------')
      print(soup.title.parent.name)
      print('----------------------------')
      print(soup.p)
      # item_b = soup.p.
      print('----------------------------')
      print(soup.p['class'])
      print('----------------------------')
      print(soup.find_all('a'))
      print('----------------------------')
      print(soup.find(id='link3'))
      print(soup.find(id='link3')['class'])
      print(soup.find(id='link3')['href'])  # print the value of a given attribute
      print(soup.find(id='link3')['id'])
      print(soup.find(id='link3').get_text())  # print the tag's text

      # find_all(name, attrs, recursive, text, limit, **kwargs)
      # (the text argument is named string since Beautiful Soup 4.4.0)
      # the name argument
      soup.find_all('title')

      # keyword arguments
      soup.find_all(id='link2')
      soup.find_all(href=re.compile("elsie"))
      soup.find_all(id=True)  # every tag that has an id attribute, whatever its value
      soup.find_all(href=re.compile("elsie"), id='link1')  # several keyword arguments filter on several attributes at once
      soup.find_all(attrs={"data-foo": "value"})  # pass a dict via attrs for attributes such as data-* whose names are not valid keyword arguments
      soup.find_all('a', limit=2)  # stop searching once limit results have been found

      # searching by CSS class
      soup.find_all("a", class_="sister")
      soup.find_all(class_=re.compile("itl"))  # class_ accepts the same kinds of filter: a string, a regular expression, and so on

      # CSS selectors
      title_list = soup.select('head > title')  # all elements matching the selector
      title_list_one = soup.select_one('head > title')  # the first element matching the selector
      print(title_list)  # prints a one-element list holding the <title> tag
      print(title_list[0].string)  # prints the title text, The Dormouse's story (surrounding whitespace included)

      # find the links of all <a> tags in the document:
      for link in soup.find_all('a'):
          print(link.get('href'))
      # http://example.com/elsie
      # http://example.com/lacie
      # http://example.com/tillie

      # find returns the first <p> tag whose class is story
      p_story = soup.find('p', class_='story')
      # print(p_story.a)

      # using a regular expression: matches every tag whose name contains "p"
      p_re_all = soup.find_all(re.compile('p'))
      print(p_re_all)

      # find_all with class_=True matches every <p> tag that has a class attribute, whatever its value
      p_all = soup.find_all('p', class_=True)
      # print(p_all)  # prints a list
      # [<p class="title">
      # <b>
      #     The Dormouse's story
      #    </b>
      # </p>, <p class="story">
      #    Once upon a time there were three little sisters; and their names were
      #    <a class="sister" href="http://example.com/elsie" id="link1">
      #     Elsie
      #    </a>
      #    ,
      #    <a class="sister" href="http://example.com/lacie" id="link2">
      #     Lacie
      #    </a>
      #    and
      #    <a class="sister" href="http://example.com/tillie" id="link3">
      #     Tillie
      #    </a>
      #    ; and they lived at the bottom of a well.
      #   </p>, <p class="story">
      #    ...
      #   </p>]
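
    The listing above imports SoupStrainer but never uses it. As a minimal sketch of one way it might be used, the example below passes a SoupStrainer through the parse_only argument so that only the <a> tags are kept in the tree. The condensed html_doc and the names only_a_tags and link_soup are chosen here for illustration and are not part of the original listing.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Condensed version of the sample document used above (illustrative).
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"> Elsie </a>,
<a class="sister" href="http://example.com/lacie" id="link2"> Lacie </a> and
<a class="sister" href="http://example.com/tillie" id="link3"> Tillie </a>;
and they lived at the bottom of a well.</p>
</body></html>
"""

# Keep only <a> tags while parsing; all other markup is skipped.
only_a_tags = SoupStrainer("a")
link_soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags)

# Collect the link targets and the link texts with whitespace stripped.
hrefs = [a["href"] for a in link_soup.find_all("a")]
names = [a.get_text(strip=True) for a in link_soup.find_all("a")]

print(hrefs)
print(names)
print(link_soup.title)  # None, because <title> was never added to the tree
```

    Filtering at parse time like this can save work on large pages when only a few tags matter. get_text(strip=True) is used because the text inside each tag is surrounded by whitespace.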

    II. BeautifulSoup in Practice

    1. Parsing the HTML source of NetEase Cloud Music

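    This is the link to the Chinese-language (华语) playlist category on NetEase Cloud Music: http://music.163.com/#/discover/playlist/?order=hot&cat=华语&limit=35&offset=0 . Opening the Elements panel in Chrome (F12) shows that the playlists on each page are rendered inside an iframe, and that each playlist is one li tag containing the playlist's cover image, link, name, and so on.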
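
    The category link carries order, cat, limit and offset query parameters. As a rough sketch, the helper below rebuilds that link for a given page number. It assumes that paging works by advancing offset in steps of limit, which the original does not state, and the function name playlist_page_url is invented for illustration.

```python
from urllib.parse import urlencode, unquote

# Base of the category link quoted above.
BASE = "http://music.163.com/#/discover/playlist/"


def playlist_page_url(page, cat="华语", order="hot", limit=35):
    """Build the category link for a zero-based page number.

    Assumption: each page holds `limit` playlists and the next page
    starts at offset = page * limit.
    """
    query = urlencode({
        "order": order,
        "cat": cat,          # non-ASCII values are percent-encoded as UTF-8
        "limit": limit,
        "offset": page * limit,
    })
    return f"{BASE}?{query}"


for page in range(3):
    # unquote() is used only to display the category name readably.
    print(unquote(playlist_page_url(page)))
```

    Because the playlists live inside an iframe, the HTML fragment used below was copied from the browser's Elements panel rather than downloaded from this link.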

    First, extract a fragment of the HTML source:

      <ul class="m-cvrlst f-cb" id="m-pl-container">
        <li>
         <div class="u-cover u-cover-1">
          <img class="j-flag" src="http://p1.music.126.net/FGe-rVrHlBTbnOvhMR99PQ==/109951162989189558.jpg?param=140y140" />
          <a title="【说唱】留住你一面,画在我心间" href="/playlist?id=832790627" class="msk"></a>
          <div class="bottom">
           <a class="icon-play f-fr" title="播放" href="javascript:;" data-res-type="13" data-res-id="832790627" data-res-action="play"></a>
           <span class="icon-headset"></span>
           <span class="nb">1615</span>
          </div>
         </div>
         <p class="dec"> <a title="【说唱】留住你一面,画在我心间" href="/playlist?id=832790627" class="tit f-thide s-fc0">【说唱】留住你一面,画在我心间</a> </p>
         <p><span class="s-fc4">by</span> <a title="JediMindTricks" href="/user/home?id=17647877" class="nm nm-icn f-thide s-fc3">JediMindTricks</a> <sup class="u-icn u-icn-84 "></sup> </p>
        </li>
        <li>
         <div class="u-cover u-cover-1">
          <img class="j-flag" src="http://p1.music.126.net/If644P7ZrfPm_qcvtYyfzg==/18936888765458653.jpg?param=140y140" />
          <a title="鞋子好看|国产自赏摇滚噪音流行" href="/playlist?id=721462105" class="msk"></a>
          <div class="bottom">
           <a class="icon-play f-fr" title="播放" href="javascript:;" data-res-type="13" data-res-id="721462105" data-res-action="play"></a>
           <span class="icon-headset"></span>
           <span class="nb">77652</span>
          </div>
         </div>
         <p class="dec"> <a title="鞋子好看|国产自赏摇滚噪音流行" href="/playlist?id=721462105" class="tit f-thide s-fc0">鞋子好看|国产自赏摇滚噪音流行</a> </p>
         <p><span class="s-fc4">by</span> <a title="原创君" href="/user/home?id=201586" class="nm nm-icn f-thide s-fc3">原创君</a> <sup class="u-icn u-icn-1 "></sup> </p>
        </li>
      </ul>

    Now parse the HTML source

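    First, instantiate a BeautifulSoup object with html.parser as the parser. Then use the object's CSS-selector method select_one() with an ID selector to locate the unordered list ul, call find_all to get every li tag under that ul, and loop over the li tags to extract each playlist's cover image URL, playlist URL, and playlist name.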

      from bs4 import BeautifulSoup

      html = '''the HTML source extracted above'''
      soup = BeautifulSoup(html, 'html.parser')
      ul = soup.select_one('#m-pl-container')  # locate the playlist <ul> by its id
      for li in ul.find_all('li'):
          img_url = li.img['src']  # cover image URL
          a_msk = li.find('a', class_='msk')
          # href is site-relative (/playlist?id=...), so prefix the site root
          musicList_url = 'http://music.163.com%s' % a_msk['href']
          musicList_name = a_msk['title']
          print(img_url, musicList_url, musicList_name)
          # first line printed: http://p1.music.126.net/FGe-rVrHlBTbnOvhMR99PQ==/109951162989189558.jpg?param=140y140 http://music.163.com/playlist?id=832790627 【说唱】留住你一面,画在我心间
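
    The same fragment also carries the playlist creator and a number next to the headset icon. The sketch below is one possible variation on the loop above: it uses CSS selectors throughout, builds absolute links with urllib.parse.urljoin, and collects the results in dictionaries. The function name parse_playlists, the dictionary keys, and the reading of span.nb as a play count are assumptions made here, not something the original states. The HTML is a trimmed copy of the first li from the fragment above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://music.163.com/"  # site root, taken from the category link

# Trimmed copy of the first <li> from the fragment above.
html = '''
<ul class="m-cvrlst f-cb" id="m-pl-container">
 <li>
  <div class="u-cover u-cover-1">
   <img class="j-flag" src="http://p1.music.126.net/FGe-rVrHlBTbnOvhMR99PQ==/109951162989189558.jpg?param=140y140" />
   <a title="【说唱】留住你一面,画在我心间" href="/playlist?id=832790627" class="msk"></a>
   <div class="bottom">
    <span class="icon-headset"></span>
    <span class="nb">1615</span>
   </div>
  </div>
  <p><span class="s-fc4">by</span> <a title="JediMindTricks" href="/user/home?id=17647877" class="nm nm-icn f-thide s-fc3">JediMindTricks</a></p>
 </li>
</ul>
'''


def parse_playlists(html):
    """Return one dict per <li> under #m-pl-container."""
    soup = BeautifulSoup(html, "html.parser")
    playlists = []
    for li in soup.select("#m-pl-container > li"):
        cover = li.select_one("a.msk")    # link carrying the title and href
        author = li.select_one("a.nm")    # creator link
        count = li.select_one("span.nb")  # assumed to be the play count
        playlists.append({
            "name": cover["title"],
            "url": urljoin(BASE_URL, cover["href"]),
            "image": li.img["src"],
            "author": author.get_text(strip=True) if author else None,
            "plays": int(count.get_text(strip=True)) if count else None,
        })
    return playlists


playlists = parse_playlists(html)
for item in playlists:
    print(item)
```

    urljoin resolves the site-relative href against the site root, which avoids assembling the URL by hand with string formatting.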

    III. Beautiful Soup 4.4.0

    Beautiful Soup is a Python library for extracting data from HTML and XML files. For detailed usage, see the official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

  • Original article: https://www.cnblogs.com/taotaoblogs/p/7241282.html