  • MOOC crawler

    https://www.crummy.com/software/BeautifulSoup/

    #!/usr/bin/env python3
    # coding=utf-8

    import re

    from bs4 import BeautifulSoup

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

    # html_doc is already a str, so from_encoding is not needed
    soup = BeautifulSoup(html_doc, 'html.parser')

    print('Get all links')
    links = soup.find_all('a')
    for link in links:
        print(link.name, link['href'], link.get_text())

    print('Get the lacie link')
    link_node = soup.find('a', href='http://example.com/lacie')
    print(link_node.name, link_node['href'], link_node.get_text())

    print('Regex match on "ill"')
    # With a raw string r"", a backslash in the pattern only needs to be written once
    link_node = soup.find('a', href=re.compile(r"ill"))
    print(link_node.name, link_node['href'], link_node.get_text())

    print('Get the text of the p paragraph')
    # class is a reserved word in Python, so bs4 takes the class_ keyword argument instead
    p_node = soup.find('p', class_="title")
    print(p_node.name, p_node.get_text())

    Output:

    Get all links
    a http://example.com/elsie Elsie
    a http://example.com/lacie Lacie
    a http://example.com/tillie Tillie
    Get the lacie link
    a http://example.com/lacie Lacie
    Regex match on "ill"
    a http://example.com/tillie Tillie
    Get the text of the p paragraph
    p The Dormouse's story
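
The searches above index straight into the node that find() returns, which fails when nothing matches, because find() returns None in that case. Below is a minimal Python 3 sketch of a guarded version, assuming bs4 is installed. The helper name describe and the shortened HTML_DOC are made up for this illustration and are not part of the original post. It also uses a pattern containing a backslash, where the raw string actually matters.

```python
import re

from bs4 import BeautifulSoup

# Shortened copy of the sample markup, kept only for this illustration.
HTML_DOC = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""


def describe(node):
    """Hypothetical helper: 'name href text' for a tag, or None if find() matched nothing."""
    if node is None:
        return None
    return "{} {} {}".format(node.name, node.get("href"), node.get_text())


soup = BeautifulSoup(HTML_DOC, "html.parser")

# Raw string: the backslash reaches the regex engine unchanged, so \. matches a literal dot.
tillie = soup.find("a", href=re.compile(r"example\.com/t"))

# No href contains example.org, so find() returns None instead of raising.
missing = soup.find("a", href=re.compile(r"example\.org"))

# find_all() returns a list-like result, which is simply empty when nothing matches.
sister_hrefs = [a["href"] for a in soup.find_all("a", class_="sister")]

print(describe(tillie))   # a http://example.com/tillie Tillie
print(describe(missing))  # None
print(sister_hrefs)
```

Checking for None before indexing keeps a crawler from crashing with a TypeError on pages that lack the expected link.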
  • Original article: https://www.cnblogs.com/njczy2010/p/5551976.html