  • The Beautiful Soup parsing library

    I. Introduction to Beautiful Soup

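    Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document tree, and it can save hours or even days of work. If you are looking for the Beautiful Soup 3 documentation, note that Beautiful Soup 3 is no longer being developed; the official site recommends using Beautiful Soup 4 in current projects and porting existing code to BS4.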

    # Install Beautiful Soup and the lxml parser
    pip install beautifulsoup4
    pip install lxml
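To check that the installation works, a minimal sketch along the following lines may help. The sample markup is made up for illustration. It parses the string with Python's built-in `html.parser` and then tries `lxml`, falling back quietly if that parser is not available.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample markup, used only to check that parsing works.
sample = "<p class='greeting'>Hello, <b>world</b></p>"

# 'html.parser' ships with Python, so it needs no extra install.
soup = BeautifulSoup(sample, "html.parser")
builtin_text = soup.p.get_text()
print(builtin_text)

# 'lxml' is the parser installed above; bs4 raises FeatureNotFound
# when the requested parser is not installed.
try:
    lxml_text = BeautifulSoup(sample, "lxml").p.get_text()
except FeatureNotFound:
    lxml_text = None
print(lxml_text)
```

The second argument to `BeautifulSoup` names the parser; the rest of this article passes `'lxml'`.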

    II. Usage

    1. find and find_all

    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.autohome.com.cn/news/1/#liststart'  # Autohome news listing
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'lxml')
    div = soup.find(id='auto-channel-lazyload-article')  # container holding the first page of news items

    # find() returns None when nothing matches, so guard before searching inside it
    li_list = div.find_all(name='li') if div else []
    for li in li_list:
        h3 = li.find(name='h3')
        if h3:
            title = h3.text  # text of the h3 tag (the headline)
            print(title)

        a = li.find(name='a')
        if a:
            article_url = a.get('href')  # href attribute of the a tag
            print(article_url)

        img = li.find(name='img')
        if img:
            img_url = img.get('src')  # image address
            print(img_url)

        p = li.find(name='p')
        if p:
            content = p.text  # text inside the p tag
            print(content)
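    find:
      - name="tag name": match by tag name
      - id="...", class_="...": match by attribute and return that tag
      - tag.text: the text content of the tag
      - tag.get(attribute name): the value of one of the tag's attributes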
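The points above can be tried offline with a short sketch. The markup below is hypothetical, loosely shaped like a news list rather than copied from the real page, and `html.parser` is used only so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup loosely shaped like a news list; not the real page.
html = """
<div id="news">
  <ul>
    <li><h3>First headline</h3><a href="/a/1.html"><img src="/img/1.jpg"></a><p>Summary one</p></li>
    <li><h3>Second headline</h3><a href="/a/2.html"></a><p>Summary two</p></li>
    <li class="ad">No heading here</li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

container = soup.find(id="news")        # first tag whose id is "news"
items = container.find_all(name="li")   # every <li> below that tag
ads = soup.find_all(class_="ad")        # class_ avoids the reserved word "class"

rows = []
for li in items:
    h3 = li.find(name="h3")
    a = li.find(name="a")
    img = li.find(name="img")
    # find() returns None when nothing matches, so check before reading.
    rows.append({
        "title": h3.text if h3 else None,
        "href": a.get("href") if a else None,
        "src": img.get("src") if img else None,
    })

for row in rows:
    print(row)
```

`find` returns a single tag or `None`, while `find_all` always returns a list, which is why only the `find` results need the `None` checks.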

    2. Traversing, searching, and CSS selectors

    from bs4 import BeautifulSoup
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    
    <p class="title" id="bbaa"><b name="xx" age="18">The Dormouse's story</b><b>xxxx</b></p>
    <p class="xxx" a="xxx">asdfasdf</p>
    
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    
    <p class="story">...</p>
    """
    
    
    soup=BeautifulSoup(html_doc,'lxml')
    
    
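    # Pretty-print
    res = soup.prettify()  # re-indent the markup, one tag or string per line
    soup = BeautifulSoup(res, 'lxml')  # re-parse; the added whitespace becomes part of the text nodes
    print(res)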
    
    
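    # Traversing the document tree
    print(soup.p.name)  # tag name of the first <p>
    print(soup.p.attrs)  # its attributes as a dict
    print(soup.p.string)  # None here, because this <p> has more than one child
    print(list(soup.p.strings))  # every string inside the tag
    print(soup.p.text)  # all the text inside the tag, concatenated
    
    print(soup.body.p.text)
    print(soup.body.p.contents)  # direct children as a list
    print(list(soup.body.p.children))  # direct children as an iterator
    print(list(soup.body.p.descendants))  # children, grandchildren, and so on
    print(soup.body.p.parent)  # the enclosing tag
    print(list(soup.body.p.parents))  # every ancestor up to the document itself
    print(len(list(soup.body.p.parents)))
    print(soup.body.p.previous_sibling)  # the node just before, at the same level
    print(soup.body.p.next_sibling)  # the node just after, at the same level
    print(soup.find(class_="xxx").previous_sibling)
    print(soup.a.next_sibling)
    print(soup.a.previous_sibling)
    print(type(soup.p))  # <class 'bs4.element.Tag'>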
    
    
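    # Searching the document
    # Five kinds of filter: string, regular expression, boolean, function, list
    import re
    print(soup.find_all(name='b'))  # string: exact tag name
    
    
    print(soup.find_all(name=re.compile('^b')))  # regex: tag names starting with "b"
    print(soup.find_all(id=re.compile('^b')))  # regex on an attribute: ids starting with "b"
    
    
    print(soup.find_all(name=['a','b']))  # list: any of the listed tag names
    print(soup.find_all(name=True))  # True: every tag
    
    def has_class_but_no_id(tag):
        return tag.has_attr('class') and not tag.has_attr('id')
    print(soup.find_all(name=has_class_but_no_id))  # function: tags for which it returns True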
    
    
    
    
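    # CSS selectors (Beautiful Soup supports CSS selectors through select(), not XPath)
    # print(soup.select(".title"))
    # print(soup.select("#bbaa"))
    
    # print(soup.select('#bbaa b')[0].attrs.get('name'))
    
    # recursive=False: search only the tag's direct children
    # limit=n: stop searching once n matches have been found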
    
    # <b> and <c> are siblings inside <a>
    sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>",'lxml')
    print(sibling_soup.b.next_sibling)  # <c>text2</c>
    print(sibling_soup.c.previous_sibling)  # <b>text1</b>
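The listing above only mentions `select()`, `recursive`, and `limit` in comments. The sketch below exercises them on a trimmed copy of the article's `html_doc` (the `href` attributes are dropped and the whitespace between tags removed, both assumptions made to keep the example short). It uses `html.parser` so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the article's html_doc, shortened for illustration.
html = """
<html><body>
<p class="title" id="bbaa"><b name="xx" age="18">The Dormouse's story</b><b>xxxx</b></p>
<p class="story"><a class="sister" id="link1">Elsie</a><a class="sister" id="link2">Lacie</a><a class="sister" id="link3">Tillie</a></p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS selectors: ".title" matches by class, "#bbaa" by id,
# and "#bbaa b" matches <b> tags nested inside the tag with that id.
title_ps = soup.select(".title")
by_id = soup.select("#bbaa")
first_b_name = soup.select("#bbaa b")[0].attrs.get("name")

# limit=2 stops the search after two matches.
first_two = soup.find_all("a", limit=2)

# recursive=False looks only at direct children of <body>:
# the <p> tags qualify, the <b> tags nested inside them do not.
direct_p = soup.body.find_all("p", recursive=False)
direct_b = soup.body.find_all("b", recursive=False)

print(first_b_name)
print([a.get("id") for a in first_two])
print(len(direct_p), len(direct_b))
```

`select()` always returns a list, so an index such as `[0]` is needed to reach a single tag; `select_one()` returns the first match directly.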
  • Original article: https://www.cnblogs.com/xiongying4/p/11936164.html