  • Using BeautifulSoup in Python

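    Basic use of the BS4 library:
    1. It works best together with the lxml library. Install: pip install lxml
    2. It works best together with the Requests library. Install: pip install requests
    3. Install bs4: pip install bs4 (the package is published on PyPI as beautifulsoup4)
    4. Typing pip directly does nothing? Fix: Environment Variables -> System variables -> Path -> New: C:\Python27\Scripts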
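
    After the install steps above, a quick check can confirm that all three packages are visible to the interpreter you are actually running. This is only a minimal sketch: the helper name missing_packages is made up for this illustration, and it uses nothing beyond the standard library.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # bs4, lxml and requests are the import names of the three packages above.
    missing = missing_packages(["bs4", "lxml", "requests"])
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All packages are importable.")
```

    If pip itself is not on the Path, running it as a module (python -m pip install lxml) usually works without editing any environment variable.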
     
    Example: fetching a website's title
    # -*- coding:utf-8 -*-
    from bs4 import BeautifulSoup
    import requests
     
    url = "https://www.baidu.com"
     
    response = requests.get(url)
     
    soup = BeautifulSoup(response.content, 'lxml')
     
    print(soup.title.text)
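
    The script above assumes the request succeeds and that the page has a title tag. A slightly more defensive variant might look like the sketch below. The function names page_title and fetch_title, the 10-second timeout, and the use of the built-in 'html.parser' (so the sketch also runs without lxml) are assumptions of this illustration, not part of the original example.

```python
from bs4 import BeautifulSoup


def page_title(markup, parser="html.parser"):
    """Return the stripped <title> text of `markup`, or None if there is no title."""
    soup = BeautifulSoup(markup, parser)
    if soup.title is None:
        return None
    return soup.title.get_text(strip=True)


def fetch_title(url, timeout=10):
    """Download `url` and return its title; raises requests.HTTPError on 4xx/5xx."""
    import requests  # imported here so page_title works without requests installed

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    # Passing bytes (response.content) lets BeautifulSoup detect the encoding.
    return page_title(response.content)
```

    Calling fetch_title("https://www.baidu.com") would then replace the body of the script above; page_title can also be used on HTML that is already in memory.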
     
    Locating tags
    Example 1:
    # -*- coding:utf-8 -*-
    from bs4 import BeautifulSoup
     
    html = '''
    <html>
    <head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    </body>
    </html>
    '''
    soup = BeautifulSoup(html, 'lxml')
     
    # BeautifulSoup has a built-in method for pretty-printed output
    print(soup.prettify())
     
    # Text of the title tag
    print(soup.title.string)
     
    # Name of the title tag's parent node
    print(soup.title.parent.name)
     
    # The first p tag
    print(soup.p)
     
    # The class attribute of the first p tag (returned as a list)
    print(soup.p["class"])
     
    # The first a tag
    print(soup.a)
     
    # Find all a tags
    print(soup.find_all('a'))
     
    # Find the tag with id='link3'
    print(soup.find(id='link3'))
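
    A few of the accessors in Example 1 behave in ways that are easy to trip over: class comes back as a list, .string is None when a tag has mixed children, and find() returns None when nothing matches. The sketch below illustrates this on a shortened, slightly altered copy of the markup (the extra class "big" and the id "link9" are made up for the illustration), parsed with the built-in 'html.parser' so that it does not depend on lxml.

```python
from bs4 import BeautifulSoup

html = """
<p class="title big"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time <a href="http://example.com/elsie" id="link1">Elsie</a> lived.</p>
"""
soup = BeautifulSoup(html, "html.parser")

first_p, second_p = soup.find_all("p")

# class is a multi-valued attribute, so it comes back as a list.
classes = first_p["class"]

# .string follows a chain of single children; with mixed children it is None.
single = first_p.string
mixed = second_p.string

# get_text() concatenates every text node below the tag.
full_text = second_p.get_text()

# find() returns None when nothing matches, so check before using the result.
missing = soup.find(id="link9")

# .get() returns None for an absent attribute instead of raising KeyError.
href = soup.a.get("href")
title_attr = soup.a.get("title")

print(classes, single, mixed, full_text, missing, href, title_attr)
```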
     
    Example 2:
    # -*- coding:utf-8 -*-
    from bs4 import BeautifulSoup
     
    html = '''
    <html>
    <head><title>The Dormouse's story</title></head>
    <body>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    </body>
    </html>
    '''
     
    soup = BeautifulSoup(html, 'lxml')
     
    # .contents returns all direct children of the first p tag (tags and text nodes) as a list
    print(soup.p.contents)
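
    Because .contents mixes text nodes with tags and only covers direct children, it can help to compare it with .descendants and with a filtered view. The following is a small sketch on a made-up one-line paragraph (not the markup from Example 2), again using the built-in 'html.parser'.

```python
from bs4 import BeautifulSoup, Tag

html = '<p class="story">Names: <a id="link1"><b>Elsie</b></a>, <a id="link2">Lacie</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# .contents: direct children only, text nodes included.
direct_children = p.contents

# Keep only the child tags by filtering out the text nodes.
child_tags = [child for child in p.contents if isinstance(child, Tag)]
child_ids = [tag["id"] for tag in child_tags]

# The same result via find_all: True matches any tag, recursive=False stays one level deep.
child_tags_again = p.find_all(True, recursive=False)

# .descendants walks the whole subtree, so the nested <b> and its text appear too.
descendant_tag_names = [node.name for node in p.descendants if isinstance(node, Tag)]
descendant_count = len(list(p.descendants))

print(len(direct_children), child_ids, descendant_tag_names, descendant_count)
```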
     
    find_all example:
    # -*- coding:utf-8 -*-
    from bs4 import BeautifulSoup
     
    html = '''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''
     
    soup = BeautifulSoup(html, 'lxml')
     
    # Find all ul tags
    print(soup.find_all('ul'))
     
    # Call find_all again on each result to get all the li tags
    for ul in soup.find_all('ul'):
        print(ul.find_all('li'))
     
    # Find the tag whose id is list-1
    print(soup.find_all(attrs={'id': 'list-1'}))
     
    # Find the tags whose class is element
    print(soup.find_all(attrs={'class': 'element'}))
     
    # Find all text nodes equal to 'Foo' (string= replaces the older text= argument since bs4 4.4.0)
    print(soup.find_all(string='Foo'))
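
    find_all also accepts a few keyword forms that are often shorter than an attrs dictionary. The sketch below shows class_, limit, and string= with a regular expression on the list part of the markup above. It assumes bs4 4.4.0 or later (for string=) and uses the built-in 'html.parser' so that it runs without lxml.

```python
import re

from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# class is a Python keyword, so the keyword argument is spelled class_.
items = soup.find_all("li", class_="element")

# class_ matches any one of a tag's classes, so both lists match "list".
lists = soup.find_all("ul", class_="list")

# limit stops the search after the given number of matches.
first_two = soup.find_all("li", limit=2)

# string= accepts a compiled regex; the results are text nodes, not tags.
bar_or_jay = soup.find_all(string=re.compile("^(Bar|Jay)$"))

# Combining a tag name with string= returns the matching tags instead.
foo_tags = soup.find_all("li", string="Foo")

print(len(items), len(lists), len(first_two), bar_or_jay, foo_tags)
```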
     
    CSS selector example:
    # -*- coding:utf-8 -*-
    from bs4 import BeautifulSoup
     
    html = '''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''
     
    soup = BeautifulSoup(html, 'lxml')
     
    # Tags with class panel-heading nested inside a tag with class panel
    print(soup.select('.panel .panel-heading'))
     
    # All li tags nested inside a ul tag
    print(soup.select('ul li'))
     
    # Tags with class element nested inside the tag with id list-2
    print(soup.select('#list-2 .element'))
     
    # Use get_text() to get the text content
    for li in soup.select('li'):
        print(li.get_text())
     
    # An attribute can be read with [name] or with attrs[name]
    for ul in soup.select('ul'):
        print(ul['id'])
        # print(ul.attrs['id'])
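
    Beyond the descendant selectors used above, select() understands most everyday CSS syntax, and select_one() returns a single tag (or None) instead of a list. The sketch below is one way to try a few of these forms on the list part of the same markup. It assumes a bs4 release that bundles the soupsieve selector engine (4.7 or later), and the id "no-such-id" is made up for the illustration.

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one returns the first match, or None when nothing matches.
first_item = soup.select_one("#list-1 li")
nothing = soup.select_one("#no-such-id")

# tag.class combined with > selects direct children only.
small_items = soup.select("ul.list-small > li")

# Attribute selectors work as they do in CSS.
by_attribute = soup.select('ul[id="list-2"] li')

# Chained classes require every class to be present on the same tag.
both_classes = soup.select("ul.list.list-small")

# :nth-of-type(2) picks the second li inside each parent.
second_items = [li.get_text() for li in soup.select("li:nth-of-type(2)")]

print(first_item.get_text(), nothing, len(small_items), second_items)
```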
     
  • Original article: https://www.cnblogs.com/xuyiqing/p/10295367.html