zoukankan      html  css  js  c++  java
  • python标准库Beautiful Soup与MongoDb爬喜马拉雅电台的总结

    Beautiful Soup标准库是一个可以从HTML/XML文件中提取数据的Python库,它能够通过你喜欢的转换器实现惯用的文档导航,查找,修改文档的方式,Beautiful Soup将会节省数小时的工作时间。pymongo标准库是MongoDb NoSql数据库与python语言之间的桥梁,通过pymongo将数据保存到MongoDb中。结合使用这两者来爬去喜马拉雅电台的数据...

    Beautiful Soup支持Python标准库中的HTML解析器,还支持一些第三方的解析器,其中一个是 lxml。本文使用的就是lxml,对于这个的安装,请看 python 3.6 lxml标准库lxml的安装及etree的使用注意
    同时,本文使用了XPath来解析我们想要的部分,对于XPath与Beautiful Soup的介绍与使用请看 Beautiful Soup 4.4.0 文档 XPath 简介
    本文涉及到的Beautiful Soup与XPath的知识不是很深,看看官方文档就能理解,而且我还加上了注释...
    对于pymongo标准库,我就不多扯淡了,详情请看 python标准库之pymongo模块次体验

    有时候,我们需要判断当前向服务器发出请求的客户端的类型,也就是通常所说的User-Agent,简称UA,我们在浏览网页时所使用的浏览器就是UA的一种,换言之,UA就是浏览器,在HTTP协议中,通过User-Agent请求头说明用户浏览器的类型,操作系统,浏览器内核等信息的标识。通过这个标识,用过所访问的网站可以显示不同的版本,从而为用户提供更好的体验或者进行信息统计。而有些网站正式利用UA来防止黑客或是像我们这种无聊的人来爬去网站的数据信息。
    因此,本文代码首先就把所有的UA都给列取出来,以方便后续的爬取工作。

    好了,下面来明确下我们要爬取得数据是什么:

    我们需要的是图片的链接,alt等

    随后我们点击图片链接之后,获取里面的详情,如果有些电台是多页的,那么我们用过xpath来依次访问。同时我们获取页面中专辑里的声音模块的sound_id...

    程序如下:

    import random
    import requests
    from bs4 import BeautifulSoup
    import json
    from lxml import etree
    import pymongo
    
    
    clients = pymongo.MongoClient("localhost", 27017)
    db = clients["XiMaLaYa"]
    collection_1 = db["album"]
    collection_2 = db["detail"]
    
    UA_LIST = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]
    headers1 = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
        'Cache-Control': 'max-age=0',
        'Proxy-Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': random.choice(UA_LIST)  # User_agence表示用户代理
    }
    headers2 = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
        'Cache-Control': 'max-age=0',
        'Proxy-Connection': 'keep-alive',
        'Referer': 'http://www.ximalaya.com/dq/all/2',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': random.choice(UA_LIST)
    }
    
    
    # Beautiful库用来处理XML和HTML...
    # 主要就是利用BeautifulSoup模块来处理requests模块获取的Html源码
    # 利用lxml模块将html源码解析成树结构,xpath来处理树节点.
    def get_url():
        start_urls = ["http://www.ximalaya.com/dq/all/{}".format(num) for num in range(1,85)]
        # start_urls = ["http://www.ximalaya.com/dq/all/1"]
        for start_url in start_urls:
            html = requests.get(start_url, headers=headers1).text
            soup = BeautifulSoup(html, "lxml")  # 使用lxml来处理
            for item in soup.find_all(class_="albumfaceOutter"):  # 解析并查找xml节点
                content = {
                    'href': item.a["href"],
                    'title': item.img['alt'],
                    'img_url': item.img['src']
                }
                collection_1.insert(content)
                # another(item.a["href"])
        print('写入完成...')
    
    
    # 进入电台具体页面 http://www.ximalaya.com/15836959/album/303085,并处理分页录音...
    def another(url):
        html = requests.get(url, headers=headers1).text
        # / :表示从根节点选取....
        # // :表示匹配选择的当前节点选择文档中的节点,而不考虑他们的位置...
        ifanother = etree.HTML(html).xpath('//div[@class="pagingBar_wrapper"]/a[last()-1]/@data-page')  # 页面链接地址  ifanother是list类型...
        if len(ifanother):  # 判断一个video的录音是否分割成了多页....
            num = ifanother[0]  # 获取页面数...
            print('本频道保存在' + num + '个页面')
            for n in range(1, int(num)):
                url2 = url + '?page={}'.format(n)
                get_m4a(url2)
            get_m4a(url)
    
    
    # 获取分页录音页面的详细数据...
    def get_m4a(url):
        html = requests.get(url, headers=headers2).text
        numlist = etree.HTML(html).xpath('//div[@class="personal_body"]/@sound_ids')[0].split(',')
        for i in numlist:
            murl = 'http://www.ximalaya.com/tracks/{}.json'.format(i)
            html = requests.get(murl, headers=headers1).text
            dic = json.loads(html)
            collection_2.insert(dic)
    
    
    if __name__ == "__main__":
        get_url()
  • 相关阅读:
    Saltstack module acl 详解
    Saltstack python client
    Saltstack简单使用
    P5488 差分与前缀和 NTT Lucas定理 多项式
    CF613D Kingdom and its Cities 虚树 树形dp 贪心
    7.1 NOI模拟赛 凸包套凸包 floyd 计算几何
    luogu P5633 最小度限制生成树 wqs二分
    7.1 NOI模拟赛 dp floyd
    springboot和springcloud
    springboot集成mybatis
  • 原文地址:https://www.cnblogs.com/zhiyong-ITNote/p/7172433.html
Copyright © 2011-2022 走看看