  • 9.3.4 BeautifulSoup4

      BeautifulSoup is an excellent third-party Python library for extracting data of interest from HTML or XML files, and it lets you choose which parser to use.

      Install the module with pip install beautifulsoup4. After installation, import it with from bs4 import BeautifulSoup.
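The parser argument matters most when the markup is incomplete. As a minimal sketch, the following uses only Python's built-in html.parser, so no extra package is assumed. The helper name `repair_fragment` is hypothetical and is not part of bs4.

```python
from bs4 import BeautifulSoup


def repair_fragment(fragment: str, parser: str = "html.parser") -> str:
    """Parse a possibly incomplete HTML fragment and return it as a string.

    `repair_fragment` is a hypothetical helper used for illustration only.
    """
    return str(BeautifulSoup(fragment, parser))


if __name__ == "__main__":
    # html.parser closes the dangling <p> and <b> tags when serializing.
    print(repair_fragment("<p>hello <b>world"))
    # Plain text is not wrapped in <html><body> by html.parser.
    print(repair_fragment("hello world"))
```

Other parsers such as lxml may repair the same input differently, for example by adding the surrounding html and body tags.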

      The following is a brief demonstration of what BeautifulSoup4 can do. For more detailed and complete documentation, see https://www.crummy.com/software/BeautifulSoup/bs4/doc/.

      >>> from bs4 import BeautifulSoup
      >>>
      >>> # Missing tags are added and completed automatically (the 'lxml' parser requires the separate lxml package)
      >>> BeautifulSoup('hello world','lxml')
      <html><body><p>hello world</p></body></html>
      >>>
      >>> # Define a sample HTML document
      >>> html_doc = """
      <html><head><title>The Dormouse's story</title></head>
      <body>
      <p class="title"><b>The Dormouse's story</b></p>
      <p class="story">Once upon a time there were three little sisters;and their names were
      <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
      <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
      <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
      and they lived at the bottom of a well.</p>

      <p class="story">...</p>
      """
      >>>
      >>> # Parse the HTML document and print it in a neatly indented form
      >>> soup = BeautifulSoup(html_doc,'html.parser')
      >>> print(soup.prettify())
      <html>
       <head>
        <title>
         The Dormouse's story
        </title>
       </head>
       <body>
        <p class="title">
         <b>
          The Dormouse's story
         </b>
        </p>
        <p class="story">
         Once upon a time there were three little sisters;and their names were
         <a class="sister" href="http://example.com/elsie" id="link1">
          Elsie
         </a>
         ,
         <a class="sister" href="http://example.com/lacie" id="link2">
          Lacie
         </a>
         and
         <a class="sister" href="http://example.com/tillie" id="link3">
          Tillie
         </a>
         ;
      and they lived at the bottom of a well.
        </p>
        <p class="story">
         ...
        </p>
       </body>
      </html>
      >>>
      >>> # Access a specific tag
      >>> soup.title
      <title>The Dormouse's story</title>
      >>>
      >>> # Tag name
      >>> soup.title.name
      'title'
      >>>
      >>> # Tag text
      >>> soup.title.text
      "The Dormouse's story"
      >>>
      >>> # Parent of the title tag
      >>> soup.title.parent
      <head><title>The Dormouse's story</title></head>
      >>>
      >>> soup.head
      <head><title>The Dormouse's story</title></head>
      >>>
      >>> soup.b
      <b>The Dormouse's story</b>
      >>>
      >>> soup.b.name
      'b'
      >>> soup.b.text
      "The Dormouse's story"
      >>>
      >>> # The whole BeautifulSoup object can itself be treated as a tag object
      >>> soup.name
      '[document]'
      >>>
      >>> soup.body
      <body>
      <p class="title"><b>The Dormouse's story</b></p>
      <p class="story">Once upon a time there were three little sisters;and their names were
      <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
      <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and
      <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
      and they lived at the bottom of a well.</p>
      <p class="story">...</p>
      </body>
      >>>
      >>> soup.p
      <p class="title"><b>The Dormouse's story</b></p>
      >>>
      >>> # Tag attributes
      >>> soup.p['class']
      ['title']
      >>>
      >>> soup.p.get('class')         # attributes can also be read this way
      ['title']
      >>>
      >>> soup.p.text
      "The Dormouse's story"
      >>>
      >>> soup.p.contents
      [<b>The Dormouse's story</b>]
      >>>
      >>> soup.a
      <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
      >>>
      >>> # View all attributes of the a tag
      >>> soup.a.attrs
      {'class': ['sister'], 'id': 'link1', 'href': 'http://example.com/elsie'}
      >>>
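The attribute lookups above differ in what they return and in how they treat a missing attribute. The standalone sketch below, separate from the session, collects the cases in one place. `SAMPLE` and `describe_tag` are hypothetical names introduced for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the session above.
SAMPLE = '<a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a>'


def describe_tag(markup: str) -> dict:
    """Collect a few attribute lookups on the first <a> tag of `markup`."""
    tag = BeautifulSoup(markup, "html.parser").a
    return {
        "name": tag.name,
        "class": tag["class"],        # multi-valued attribute: list of str
        "id": tag["id"],              # single-valued attribute: plain str
        "title": tag.get("title"),    # missing attribute: .get() returns None
        "has_href": tag.has_attr("href"),
    }


if __name__ == "__main__":
    print(describe_tag(SAMPLE))
```

Indexing a missing attribute directly, as in tag['title'], raises KeyError, so .get() is the safer choice when an attribute may be absent.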
      >>> # Find all a tags
      >>> soup.find_all('a')
      [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
      >>>
      >>> # Find <a> and <b> tags at the same time
      >>> soup.find_all(['a','b'])
      [<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
      >>>
      >>> import re
      >>> # Find tags whose href contains a given keyword
      >>> soup.find_all(href=re.compile("elsie"))
      [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
      >>>
      >>> soup.find(id='link3')
      <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
      >>>
      >>> soup.find_all('a',id='link3')
      [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
      >>>
      >>> for link in soup.find_all('a'):
      ...     print(link.text,':',link.get('href'))
      ...
      Elsie : http://example.com/elsie
      Lacie : http://example.com/lacie
      Tillie : http://example.com/tillie
      >>>
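The find_all calls above can be combined into a small reusable function. This is a sketch under stated assumptions: `SAMPLE` and `extract_links` are hypothetical names, and the markup is only modelled on the story document.

```python
import re
from typing import List, Optional, Tuple

from bs4 import BeautifulSoup

# Hypothetical sample markup modelled on the story document.
SAMPLE = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
<a id="no-href">ignored</a>
</p>
"""


def extract_links(markup: str, pattern: Optional[str] = None) -> List[Tuple[str, str]]:
    """Return (text, href) pairs for <a> tags, optionally filtered by a regex on href."""
    soup = BeautifulSoup(markup, "html.parser")
    # href=True keeps only tags that have an href attribute at all;
    # a compiled regex is searched for within the attribute value.
    href_filter = re.compile(pattern) if pattern else True
    return [(a.get_text(strip=True), a["href"])
            for a in soup.find_all("a", href=href_filter)]


if __name__ == "__main__":
    for text, href in extract_links(SAMPLE, r"lacie|tillie"):
        print(text, ":", href)
```

Filtering on href=True first means a["href"] cannot raise KeyError for anchors that lack the attribute.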
      >>> print(soup.get_text())           # return all text
      
      The Dormouse's story
      
      The Dormouse's story
      Once upon a time there were three little sisters;and their names were
      Elsie,
      Lacieand
      Tillie;
      and they lived at the bottom of a well.
      ...
      
      >>>
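The raw get_text() output above keeps the original line breaks and runs adjacent strings together (for example "Lacieand"). A minimal sketch of one way to tidy this, using the separator and strip parameters of get_text and the stripped_strings generator. `SAMPLE`, `clean_text`, and `text_pieces` are hypothetical names.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the story paragraph.
SAMPLE = ('<p class="story"><a>Elsie</a>,\n'
          '<a>Lacie</a>and\n'
          '<a>Tillie</a>;</p>')


def clean_text(markup: str, separator: str = " ") -> str:
    """Return the text with each string stripped and joined by `separator`."""
    soup = BeautifulSoup(markup, "html.parser")
    return soup.get_text(separator, strip=True)


def text_pieces(markup: str) -> list:
    """Return the individual non-empty strings with surrounding whitespace removed."""
    soup = BeautifulSoup(markup, "html.parser")
    return list(soup.stripped_strings)


if __name__ == "__main__":
    print(clean_text(SAMPLE))
    print(text_pieces(SAMPLE))
```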
      >>> # Modify a tag attribute
      >>> soup.a['id']='test_link1'
      >>> soup.a
      <a class="sister" href="http://example.com/elsie" id="test_link1">Elsie</a>
      >>>
      >>> # Modify the tag text
      >>> soup.a.string.replace_with('test_Elsie')
      'Elsie'
      >>>
      >>> soup.a.string
      'test_Elsie'
      >>>
      >>> print(soup.prettify())
      <html>
       <head>
        <title>
         The Dormouse's story
        </title>
       </head>
       <body>
        <p class="title">
         <b>
          The Dormouse's story
         </b>
        </p>
        <p class="story">
         Once upon a time there were three little sisters;and their names were
         <a class="sister" href="http://example.com/elsie" id="test_link1">
          test_Elsie
         </a>
         ,
         <a class="sister" href="http://example.com/lacie" id="link2">
          Lacie
         </a>
         and
         <a class="sister" href="http://example.com/tillie" id="link3">
          Tillie
         </a>
         ;
      and they lived at the bottom of a well.
        </p>
        <p class="story">
         ...
        </p>
       </body>
      </html>
      >>>
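The same two modifications can be applied to every link rather than only the first one. The standalone sketch below assumes hypothetical names `SAMPLE` and `prefix_links`, and guards against tags whose .string is None.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the session above.
SAMPLE = ('<p><a href="http://example.com/elsie" id="link1">Elsie</a>'
          '<a href="http://example.com/lacie" id="link2">Lacie</a></p>')


def prefix_links(markup: str, prefix: str = "test_") -> BeautifulSoup:
    """Prefix the id and the text of every <a> tag and return the modified soup."""
    soup = BeautifulSoup(markup, "html.parser")
    for a in soup.find_all("a"):
        if a.has_attr("id"):
            a["id"] = prefix + a["id"]
        # .string is None when a tag has more than one child, so guard the call.
        if a.string is not None:
            a.string.replace_with(prefix + a.string)
    return soup


if __name__ == "__main__":
    print(prefix_links(SAMPLE))
```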
      >>> # Iterate over the child nodes
      >>> for child in soup.body.children:
      ...     print(child)
      ...


      <p class="title"><b>The Dormouse's story</b></p>


      <p class="story">Once upon a time there were three little sisters;and their names were
      <a class="sister" href="http://example.com/elsie" id="test_link1">test_Elsie</a>,
      <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and
      <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
      and they lived at the bottom of a well.</p>


      <p class="story">...</p>


      >>>
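The blank lines in the output above come from the newline strings that sit between the paragraphs: .children yields text nodes as well as tags. A minimal sketch of filtering down to tags only, assuming the hypothetical names `SAMPLE` and `child_tags`:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup modelled on the story document.
SAMPLE = """<body>
<p class="title">The Dormouse's story</p>
<p class="story">...</p>
</body>"""


def child_tags(markup: str) -> list:
    """Return the direct child tags of <body>, skipping the text nodes between them."""
    soup = BeautifulSoup(markup, "html.parser")
    return [child for child in soup.body.children if isinstance(child, Tag)]


if __name__ == "__main__":
    for tag in child_tags(SAMPLE):
        print(tag.name, tag["class"])
```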
  • Original article: https://www.cnblogs.com/avention/p/8991818.html