  • beautifulsoup

Table of contents
1. Parsing library
2. Basic usage
3. Tag selectors
3.1 Selecting elements
3.2 Getting the tag name
3.3 Getting attributes
3.4 Getting content
3.5 Nested selection
3.6 Children and descendants
3.7 Parent and ancestors
3.8 Siblings
4. Standard selectors
4.1 find_all(name, attrs, recursive, text, **kwargs)
4.1.1 name
4.1.2 attrs
4.1.3 text
4.2 find(name, attrs, recursive, text, **kwargs)
4.3 find_parents() and find_parent()
4.4 find_next_siblings() and find_next_sibling()
4.5 find_previous_siblings() and find_previous_sibling()
4.6 find_all_next() and find_next()
4.7 find_all_previous() and find_previous()
5. CSS selectors
5.1 Getting attributes
5.2 Getting content
6. Summary
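1. Parsing library
Beautiful Soup is a flexible, convenient library for parsing web pages. It is efficient and supports several parsers.
With it you can extract information from a page without writing regular expressions.
Install: pip3 install beautifulsoup4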
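
Parser | Usage | Advantages | Disadvantages
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; lenient with malformed documents | Less lenient in versions before Python 2.7.3 and 3.2.2
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; lenient with malformed documents | Requires an external C library
lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires an external C library
html5lib | BeautifulSoup(markup, "html5lib") | Most lenient; parses the document the way a browser does; produces valid HTML5 | Slow; requires an external Python package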
2. Basic usage
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # parse with the lxml HTML parser
print(soup.prettify())              # re-indented markup; missing closing tags are filled in
print(soup.title.string)            # text inside <title>

Output:
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story
3. Tag selectors
3.1 Selecting elements
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)  # when several tags match, only the first one is returned

Output:
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
3.2 Getting the tag name
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title.name)

Output:
title
3.3 Getting attributes
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs['name'])  # via the attrs dictionary
print(soup.p['name'])        # shorthand: index the tag directly

Output:
dromouse
dromouse
3.4 Getting content
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.string)

Output:
The Dormouse's story
3.5 Nested selection
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.head.title.string)  # each selection returns a Tag, so selections can be chained

Output:
The Dormouse's story
3.6 Children and descendants
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)  # direct children, returned as a list

Output:
['\n            Once upon a time there were three little sisters; and their names were\n            ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n            and\n            ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n            and they lived at the bottom of a well.\n        ']

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)  # an iterator over the direct children
for i, child in enumerate(soup.p.children):
    print(i, child)

Output:
<list_iterator object at 0x1064f7dd8>
0 
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4  
            and
            
5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 
            and they lived at the bottom of a well.

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)  # a generator over all descendants, not just direct children
for i, child in enumerate(soup.p.descendants):
    print(i, child)

Output:
<generator object descendants at 0x10650e678>
0 
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <span>Elsie</span>
4 Elsie
5 

6 

7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9  
            and
            
10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12 
            and they lived at the bottom of a well.
3.7 Parent and ancestors
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)  # the direct parent of the first <a>

Output:
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.parents)))  # every ancestor, nearest first

Output:
[(0, <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>), (1, <body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
<p class="story">...</p>
</body>), (2, <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
<p class="story">...</p>
</body></html>), (3, <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
<p class="story">...</p>
</body></html>)]
3.8 Siblings
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.next_siblings)))      # siblings after the first <a>
print(list(enumerate(soup.a.previous_siblings)))  # siblings before the first <a>

Output:
[(0, '\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' \n            and\n            '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n            and they lived at the bottom of a well.\n        ')]
[(0, '\n            Once upon a time there were three little sisters; and their names were\n            ')]
4. Standard selectors
4.1 find_all(name, attrs, recursive, text, **kwargs)
find_all() searches the document by tag name, attributes, or text content.

4.1.1 name
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))           # a list of every <ul> tag
print(type(soup.find_all('ul')[0]))  # each item is a Tag

Output:
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# Each result is a Tag, so find_all() can be called again on it.
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))

Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
4.1.2 attrs
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))

Output:
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))       # attributes can also be passed as keyword arguments
print(soup.find_all(class_='element'))  # class is a Python keyword, so use class_

Output:
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
4.1.3 text
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# Matches text nodes, not tags. Since Beautiful Soup 4.4.0 this argument is named string; text is the older name.
print(soup.find_all(text='Foo'))

Output:
['Foo', 'Foo']
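
The find_all() signature above also lists recursive, which the examples do not exercise. The following is a minimal sketch of its effect, assuming a shortened one-line version of the sample markup (hypothetical, written here for illustration) and the built-in 'html.parser' instead of 'lxml' so that no extra package is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; 'html.parser' ships with Python.
html = '<div class="panel"><ul id="list-1"><li>Foo</li><li>Bar</li></ul></div>'
soup = BeautifulSoup(html, 'html.parser')
panel = soup.find('div')

# Default (recursive=True): every descendant of <div> is searched.
nested_items = panel.find_all('li')
# recursive=False: only the direct children of <div> are considered.
direct_items = panel.find_all('li', recursive=False)
direct_lists = panel.find_all('ul', recursive=False)

print(len(nested_items), len(direct_items), len(direct_lists))
```

Here the li tags are grandchildren of the div, so they are found only by the recursive search, while the ul is a direct child and is found either way.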
4.2 find(name, attrs, recursive, text, **kwargs)
find() returns the first matching element, or None when nothing matches; find_all() returns all matching elements.

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find('ul'))
print(type(soup.find('ul')))
print(soup.find('page'))  # there is no <page> tag, so this prints None

Output:
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<class 'bs4.element.Tag'>
None
4.3 find_parents() and find_parent()
find_parents() returns all ancestor nodes; find_parent() returns the direct parent.
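
A minimal sketch of the two calls, assuming a one-line hypothetical snippet modeled on the sample markup and the built-in 'html.parser' (the rest of this article uses 'lxml'):

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = '<div class="panel"><ul id="list-1"><li class="element">Foo</li></ul></div>'
soup = BeautifulSoup(html, 'html.parser')
li = soup.find('li')

parent = li.find_parent()            # direct parent: the <ul>
ancestors = li.find_parents()        # all ancestors, nearest first
ancestor_names = [tag.name for tag in ancestors]
nearest_div = li.find_parent('div')  # with a filter: nearest matching ancestor

print(parent.name)
print(ancestor_names)
print(nearest_div['class'])
```

Both methods accept the same filters as find_all(), so a tag name or attribute can narrow the search to a particular ancestor.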

4.4 find_next_siblings() and find_next_sibling()
find_next_siblings() returns all siblings after the node; find_next_sibling() returns the first sibling after it.
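
As a rough illustration, assuming a hypothetical one-line list (so that no whitespace text nodes sit between the tags) and the built-in 'html.parser':

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = '<ul><li>Foo</li><li>Bar</li><li>Jay</li></ul>'
soup = BeautifulSoup(html, 'html.parser')
first = soup.find('li')

next_one = first.find_next_sibling('li')     # first following <li> sibling
all_after = first.find_next_siblings('li')   # every following <li> sibling
after_texts = [li.string for li in all_after]

print(next_one.string)
print(after_texts)
```

Passing 'li' as the filter is a choice made here so that any text nodes between tags in real, indented markup would be skipped.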

4.5 find_previous_siblings() and find_previous_sibling()
find_previous_siblings() returns all siblings before the node; find_previous_sibling() returns the first sibling before it.
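
A small sketch of the backward direction, again assuming a hypothetical one-line list and the built-in 'html.parser':

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = '<ul><li>Foo</li><li>Bar</li><li>Jay</li></ul>'
soup = BeautifulSoup(html, 'html.parser')
last = soup.find_all('li')[-1]

previous_one = last.find_previous_sibling('li')    # nearest preceding <li> sibling
all_before = last.find_previous_siblings('li')     # every preceding <li> sibling
before_texts = [li.string for li in all_before]

print(previous_one.string)
print(before_texts)
```

The results are ordered by distance from the starting node, so the nearest sibling comes first rather than the one that appears first in the document.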

4.6 find_all_next() and find_next()
find_all_next() returns all matching nodes that come after the node; find_next() returns the first matching node after it.
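
Unlike the sibling methods, these search everything that follows in document order. A minimal sketch, assuming a hypothetical one-line snippet and the built-in 'html.parser':

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = '<div><h4>Hello</h4><ul><li>Foo</li><li>Bar</li></ul><p>End</p></div>'
soup = BeautifulSoup(html, 'html.parser')
h4 = soup.find('h4')

first_li = h4.find_next('li')         # first <li> after <h4>, although it is not a sibling
later_tags = h4.find_all_next(True)   # True matches every tag that follows
later_names = [tag.name for tag in later_tags]

print(first_li.string)
print(later_names)
```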

4.7 find_all_previous() and find_previous()
find_all_previous() returns all matching nodes that come before the node; find_previous() returns the first matching node before it.
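
The same idea in the opposite direction, sketched under the same assumptions (hypothetical one-line markup, built-in 'html.parser'):

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = '<div><h4>Hello</h4><ul><li>Foo</li><li>Bar</li></ul><p>End</p></div>'
soup = BeautifulSoup(html, 'html.parser')
p = soup.find('p')

nearest_li = p.find_previous('li')       # closest <li> before <p>
earlier_items = p.find_all_previous('li')
earlier_texts = [li.string for li in earlier_items]
heading = p.find_previous('h4')

print(nearest_li.string)
print(earlier_texts)
print(heading.string)
```

Results come back nearest first, which is the reverse of document order.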
    1150 
5.CSS selectors
Pass a CSS selector string directly to select() to make a selection. It returns a list of all matching elements.

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# Class selectors: .panel-heading nested inside .panel
print(soup.select('.panel .panel-heading'))
# Tag selectors: every <li> nested inside a <ul>
print(soup.select('ul li'))
# ID selector combined with a class selector
print(soup.select('#list-2 .element'))
# Each item in the returned list is a Tag
print(type(soup.select('ul')[0]))
[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# select() can also be called on a Tag to search only inside that element
for ul in soup.select('ul'):
    print(ul.select('li'))
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
5.1 Getting attributes
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    # Both forms read the same attribute
    print(ul['id'])
    print(ul.attrs['id'])
list-1
list-1
list-2
list-2
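Indexing with ul['id'] raises a KeyError when the attribute is absent. As a hedged sketch, Tag.get() can be used instead when an attribute may be missing; the 'data-role' attribute below is a hypothetical name that does not appear in the HTML, and 'html.parser' is used here only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

html = '''
<ul class="list" id="list-1">
    <li class="element">Foo</li>
</ul>
<ul class="list list-small" id="list-2">
    <li class="element">Bar</li>
</ul>
'''

# 'html.parser' is an assumption made to avoid the lxml dependency
soup = BeautifulSoup(html, 'html.parser')
uls = soup.select('ul')

ids = [ul.get('id') for ul in uls]
print(ids)  # ['list-1', 'list-2']

# class is a multi-valued attribute, so it comes back as a list
classes = [ul['class'] for ul in uls]
print(classes)  # [['list'], ['list', 'list-small']]

# 'data-role' is a hypothetical attribute; get() falls back to the default
roles = [ul.get('data-role', 'n/a') for ul in uls]
print(roles)  # ['n/a', 'n/a']
```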
5.2 Getting content
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    # get_text() returns the text inside the element
    print(li.get_text())
Foo
Bar
Jay
Foo
Bar
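When only one element is expected, select_one() returns the first match instead of a list, and get_text() accepts a separator and a strip flag for cleaning up whitespace. A minimal sketch under those assumptions, using the built-in 'html.parser' and illustrative variable names:

```python
from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

# 'html.parser' is an assumption made to avoid the lxml dependency
soup = BeautifulSoup(html, 'html.parser')

# select_one() returns a single Tag (or None), not a list
heading = soup.select_one('.panel-heading h4').get_text()
print(heading)  # Hello

# Join the stripped text pieces of the whole list with a space
list_text = soup.select_one('#list-2').get_text(' ', strip=True)
print(list_text)  # Foo Bar
```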
     https://blog.csdn.net/qq_42554007/article/details/90675142

Original article: https://www.cnblogs.com/wangbin2020/p/13696529.html