  • Scraping web page content with Python's BeautifulSoup library

    BeautifulSoup is a library for parsing and manipulating the DOM tree. It can be used to build web crawlers. The module is imported as bs4.

    Installing the library

    easy_install beautifulsoup4  
      
    pip install beautifulsoup4  

    Some usage examples follow

    from bs4 import BeautifulSoup
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister text-bold text-danger" id="link3" title="this is title!">Tillie</a>;
    and they lived at the bottom of a well.</p>
    
    <p class="red">...</p>
    <p class="green">...</p>
    <p class="red green">...</p>
    </body>
    </html>
    """
    
    soup = BeautifulSoup(html_doc, "html.parser")
    
    link3 = soup.find(id='link3')
    
    #  <a class="sister text-bold text-danger" href="http://example.com/tillie" id="link3" title="this is title!">Tillie</a>
    #  (attribute order in the printed tag may vary between bs4 versions)
    print(link3)
    
    #  <class 'bs4.element.Tag'>
    print(type(link3))
    
    # {'href': 'http://example.com/tillie', 'title': 'this is title!', 'id': 'link3', 'class': ['sister', 'text-bold', 'text-danger']}
    print(link3.attrs)
    
    # Tillie
    print(link3.get_text())
    
    # this is title!
    print(link3["title"])
    
    
    
    all_a = soup.find_all('a')
    
    #  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    print(all_a[0])
    
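    # Matches text nodes equal to any of the listed strings, returned in document order.
    # (string= replaces the older text= argument in bs4 4.4.0 and later.)
    #  ['Elsie', 'Lacie', 'Tillie']
    print(soup.find_all(string=["Tillie", "Elsie", "Lacie"]))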
    
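    # A dict literal keeps only the last value for a repeated key, so
    # {"class": "red green"} is all that is needed; it matches that exact class string.
    #  [<p class="red green">...</p>]
    print(soup.find_all("p", {"class": "red green"}))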
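
    The last call above matches only the exact class string "red green". As a minimal sketch of the other ways to filter by class, the snippet below uses a shortened, hypothetical HTML fragment (the paragraph texts are placeholders, not from the original page) and compares `class_`, a chained CSS selector, and the exact-string form.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the texts "one", "two", "three" are placeholders.
html_doc = """
<p class="red">one</p>
<p class="green">two</p>
<p class="red green">three</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# class_="red" matches any <p> whose class list contains "red".
has_red = [p.get_text() for p in soup.find_all("p", class_="red")]

# A selector with two chained classes requires both, in any order.
red_and_green = [p.get_text() for p in soup.select("p.red.green")]

# The full attribute string matches only that exact value.
exact = [p.get_text() for p in soup.find_all("p", class_="red green")]

print(has_red)
print(red_and_green)
print(exact)
```

    Note that the exact-string form depends on the order of the classes in the attribute, so the selector form is usually the safer choice when both classes are required.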

    An example

    Collect the title attribute of every img tag

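    # -*- coding: utf-8 -*-

    from urllib.request import urlopen
    from urllib.error import HTTPError
    from bs4 import BeautifulSoup

    # Earlier target, overridden by the assignment below:
    # url = "http://qa.beloved999.com/category/view?id=2"
    url = "http://beloved.finley.com/category/view?id=24"
    try:
        html = urlopen(url)
    except HTTPError as e:
        # e.g. 403 Forbidden when the server rejects the request
        raise SystemExit("Request failed: %s" % e)
    bs = BeautifulSoup(html.read(), "html.parser")
    # The second positional argument filters by CSS class: <img class="item-image">
    res = bs.find_all("img", "item-image")
    print(len(res))
    for img in res:
        # .get() avoids a KeyError when an <img> has no title attribute
        print(img.get("title"))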
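
    To try the same extraction without network access, here is a minimal sketch that runs against an inline HTML string. The markup, file names, and titles are made up for illustration; only the "item-image" class comes from the example above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page.
sample_html = """
<div>
  <img class="item-image" src="a.jpg" title="First item">
  <img class="item-image" src="b.jpg">
  <img class="banner" src="c.jpg" title="Not an item">
</div>
"""

bs = BeautifulSoup(sample_html, "html.parser")

# Same filter as above: <img> tags whose class list contains "item-image".
images = bs.find_all("img", "item-image")

# Tag.get() returns None when the attribute is missing, so such tags are skipped.
titles = [img.get("title") for img in images if img.get("title") is not None]

print(len(images))
print(titles)
```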
        

    Note: some sites will fail and return 403 Forbidden, for example OSChina (开源中国), which I tried. This is probably related to the request headers.

    On inspection, the HTTP_USER_AGENT that gets sent is Python-urllib/3.4. The HTTP-related entries include:

    'HTTP_ACCEPT_ENCODING' => 'identity'
    'HTTP_CONNECTION' => 'close'
    'HTTP_HOST' => 'beloved.finley.com'
    'HTTP_USER_AGENT' => 'Python-urllib/3.4'
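
    If the default Python-urllib User-Agent is what triggers the 403, one possible workaround is to send a different header through `urllib.request.Request`. The sketch below is written under that assumption: the User-Agent string and the helper names `build_request` and `fetch` are hypothetical, and a site may still refuse the request for other reasons.

```python
from urllib.request import Request, urlopen

# Hypothetical browser-like value; any string the target site accepts would do.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries a custom User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Download the page body as bytes using the custom header."""
    with urlopen(build_request(url)) as response:
        return response.read()


# Inspect the header without touching the network.
# Request stores header names capitalized, hence "User-agent".
req = build_request("http://example.com/")
print(req.get_header("User-agent"))
```

    The bytes returned by `fetch(url)` can be passed to `BeautifulSoup(..., "html.parser")` in the same way as `html.read()` in the example above.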

  • Original article: https://www.cnblogs.com/mafeifan/p/4655782.html