  • Python Series (1): How to Use BeautifulSoup

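    It has been a long time since I updated this blog. I plan to write a series on Python web scraping and data analysis. That said, I shouldn't set a flag too casually, or I may end up eating my words.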

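    With Python scraping, fetching the content is the process, analyzing the data is the result, and drawing a conclusion is the real goal. A crawler usually fetches its content from web pages, so how do we extract the information we want from an HTML page? By parsing it. The commonly used tools today are BeautifulSoup, PyQuery, XPath, and regular expressions. Regular expressions are error-prone and have always been a weak spot of mine, so I'll cover the other three. Today we start with BeautifulSoup.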

    1. Introduction

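    BeautifulSoup translates literally as "beautiful soup". It is a Python library for extracting data from HTML or XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document.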

    2. Installation

     pip install beautifulsoup4
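
    Installing beautifulsoup4 does not install third-party parsers such as lxml, which some of the later examples request with features="lxml" (it can be added with pip install lxml). A minimal sketch of one way to cope, assuming you are happy to fall back to the standard library's html.parser when lxml is missing. The helper name make_soup is hypothetical and not part of bs4:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml if it is installed, otherwise with html.parser."""
    try:
        return BeautifulSoup(markup, features="lxml")
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is not installed.
        return BeautifulSoup(markup, features="html.parser")


soup = make_soup("<p>hello</p>")
print(soup.p.string)
```

    The two parsers can build slightly different trees for incomplete markup, so a fallback like this is best kept to simple lookups such as the one shown.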

    3. Preparing the test markup

    This is a passage from Alice in Wonderland (referred to below as the "Alice" document).

    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    
    <p class="story">...</p>
    </body>
    </html>

    We will use the markup above for our tests.

    4. Usage

    from bs4 import BeautifulSoup
    
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    
    <p class="story">...</p>
    </body>
    </html>
    """
    soup = BeautifulSoup(html_doc, features="html.parser")
    #print(soup.prettify())
    
    print(soup.title)
    #<title>The Dormouse's story</title>
    print(soup.title.name)
    #title
    print(soup.title.string)
    #The Dormouse's story
    print(soup.title.parent.name)
    #head
    
    print(soup.p)
    #<p class="title"><b>The Dormouse's story</b></p>
    print(soup.p['class'])
    #['title']
    
    print(soup.a)
    #<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    print(soup.find_all('a'))
    #[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    print(soup.find(id='link3'))
    #<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    
    for link in soup.find_all('a'):
        print(link.get('href'))
    #http://example.com/elsie
    #http://example.com/lacie
    #http://example.com/tillie
    
    print(soup.get_text())
    #The Dormouse's story
    
    #The Dormouse's story
    #Once upon a time there were three little sisters; and their names were
    #Elsie,
    #Lacie and
    #Tillie;
    #and they lived at the bottom of a well.
    #...

    Each comment above shows the output of the line before it.
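
    The find_all() and get() calls above combine naturally when you want structured data rather than printed lines. A small sketch, assuming a trimmed copy of the Alice document and that only the links with class sister are of interest (the variable name links is just for illustration):

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, features="html.parser")

# class is a reserved word in Python, so bs4 takes the keyword class_ instead.
links = [
    (a.get_text(), a.get("href"))
    for a in soup.find_all("a", class_="sister")
]
print(links)
# [('Elsie', 'http://example.com/elsie'), ('Lacie', 'http://example.com/lacie'), ('Tillie', 'http://example.com/tillie')]
```

    Collecting (text, href) pairs like this keeps the parsing step separate from whatever analysis comes afterwards.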

    5. BeautifulSoup accepts either a string or a file handle

    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', features="lxml")
    tag = soup.b
    print(tag)
    #<b class="boldest">Extremely bold</b>
    tag.name = "blockquote"
    print(tag)
    #<blockquote class="boldest">Extremely bold</blockquote>
    print(tag['class'])
    #['boldest']
    print(tag.attrs)
    #{'class': ['boldest']}
    tag['id']="stylebs"
    print(tag)
    #<blockquote class="boldest" id="stylebs">Extremely bold</blockquote>
    del tag['id'] 
    print(tag)
    #<blockquote class="boldest">Extremely bold</blockquote>
            
    css_soup = BeautifulSoup('<p class="body strikeout"></p>', features="lxml")
    print(css_soup.p['class'])
    #['body', 'strikeout']
    
    id_soup = BeautifulSoup('<p id="my id"></p>', features="lxml")
    print(id_soup.p['id'])
    #my id 
        
    rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', features="lxml")
    print(rel_soup.a['rel'])
    #['index']
    rel_soup.a['rel'] = ['index', 'contents']
    print(rel_soup.p)
    #<p>Back to the <a rel="index contents">homepage</a></p>
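
    The examples above all pass a string. Passing an open file handle works the same way. A minimal sketch, assuming a throwaway file created with tempfile so the example is self-contained (the file name index.html and its contents are made up for illustration):

```python
import os
import tempfile

from bs4 import BeautifulSoup

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a small HTML file so there is something to open.
    path = os.path.join(tmp_dir, "index.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write('<b class="boldest">Extremely bold</b>')

    # Pass the open file handle instead of a string.
    with open(path, encoding="utf-8") as fp:
        file_soup = BeautifulSoup(fp, features="html.parser")

print(file_soup.b["class"])
# ['boldest']
print(file_soup.b.string)
# Extremely bold
```

    The document is fully parsed inside the with block, so the resulting soup stays usable after the file is closed.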
            

    Reference: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#id40

  • Original article: https://www.cnblogs.com/kumufengchun/p/11699687.html