zoukankan      html  css  js  c++  java
  • Python 爬取B站(Bilibili.com)UP主的所有公开视频链接及信息

    原文链接:https://blog.xieqiaokang.com/posts/36033.html
    Github:https://github.com/xieqk/Bilibili_Spider_by_UserID
    Gitee:https://gitee.com/xieqk/Bilibili_Spider_by_UserID

    环境准备

    • selenium
    • bs4

    安装

    这里使用 conda 安装,也可使用 pip

    conda install selenium bs4
    

    selenium是一个操作浏览器的 Python 库,需要安装相应的浏览器驱动,如 firefox:

    conda install gtk3 firefox -c conda-forge
    

    此外还需要 geckodriver ,可前往 github 下载,并放置于 /usr/local/bin/

    • 也可以放置在自定义路径下(但须为环境变量能够找到的地方),如非管理员用户可放置于自己 home 目录下的 ~/bin 目录下,并将该路径添加进环境变量:
    export PATH=~/bin${PATH:+:${PATH}}
    

    如果需要永久将 ~/bin 路径添加进环境变量,则将上述语句添加进 ~/.bashrc 文件末尾即可(重启命令行生效,或手动输入source ~/.bashrc 在当前命令行激活)。

    • Windows 需下载对应 windows 版本并放置于环境变量能够找到的地方,或手动将 geckodriver 所在路径加入 PATH 中,并重启。

    快速使用

    1. 安装依赖

    见上一节环境准备部分,安装对应依赖环境。

    2. Clone 代码

    # Github (国内访问网速不佳者可使用 Gitee)
    git clone https://github.com/xieqk/Bilibili_Spider_by_UserID.git
    # Gitee
    git clone https://gitee.com/xieqk/Bilibili_Spider_by_UserID.git
    

    3. 查看 B 站用户 uid

    如下图所示,进入该用户主页,地址栏后面红框中的数字即为该用户的 uid

    产看用户uid

    4. 爬取用户视频数据

    进入代码目录中,直接执行 main.py,传入 uid 参数即可:

    python main.py --uid 362548791
    

    爬取结果将保存于当前目录下的 json 目录,以 json 格式保存,为一个列表,内容如下:

    [
        {
            "user_name": "歪西歪小哥哥",	// UP主名字
            "bv": "BV1Wa4y1e7yy",	// BV号
            "url": "https://www.bilibili.com/video/BV1Wa4y1e7yy",	// 视频链接
            "title": "【新冠肺炎:全球各国+中美各省/州】累计确诊人数 & 累计死亡人数数据可视化:俄罗斯情况不容乐观",	// 标题
            "play": "3888",		// 播放量
            "duration": 796,	// 总时长
            "pub_date": "2020-05-16",	// 发布日期
            "now": "2020-11-18 15:47:28"	// 当前日期
        },
        ...
    ]
    

    5. 其它参数

    • --save_dir:保存 json 结果的目录,默认为 json
    • --save_by_page:按页保存用户视频信息,默认为 False(B站用户视频页一页一般为30个视频)。
    • --time:爬取时,浏览器获取页面的等待时间,默认为 2(秒)。网络状况不佳时等待时间过短可能会导致爬取的数据不完全。
    • --detailed:进一步爬取每一个链接的详细信息(弹幕数、是否为播放列表、发布日期及时刻),默认为 False

    当加入 --detailed 参数后每个 url 的爬取结果为:

    [
        {
            "user_name": "歪西歪小哥哥",
            "bv": "BV1Wa4y1e7yy",
            "url": "https://www.bilibili.com/video/BV1Wa4y1e7yy",
            "title": "【新冠肺炎:全球各国+中美各省/州】累计确诊人数 & 累计死亡人数数据可视化:俄罗斯情况不容乐观",
            "play": "3888",
            "duration": 796,
            "pub_date": "2020-05-16 02:17:16",	// 发布日期精确到时分秒
            "now": "2020-11-18 15:47:28",
            "danmu": "85",
            "type": "playlist",		// 链接类型:'video'代表单个视频,'playlist'代表播放列表
            "num": 4	// 分P数,如果为'video'则为1,'playlist'则为播放列表的视频集数
        },
        ...
    ]
    

    详细说明

    详见 utils/bilibili_spider.pyBilibili_Spider()

    1. 初始化

    options = webdriver.FirefoxOptions()
    options.add_argument('--headless')
    self.browser = webdriver.Firefox(options=options)
    

    2. 获取用户视频页数及用户名

    • 获取用户主页下视频页的第一页:
    self.user_url = 'https://space.bilibili.com/{}'.format(uid)
    page_url = self.user_url + '/video?tid=0&page={}&keyword=&order=pubdate'.format(1)
    self.browser.get(page_url)
    time.sleep(self.t+2*random.random())
    html = BeautifulSoup(self.browser.page_source, features="html.parser")
    
    • 获取视频页数:找到页数所在位置,浏览器打开页面后在对应位置检查即可。

    获取用户视频页数

    page_number = html.find('span', attrs={'class':'be-pager-total'}).text
    page_number = int(page_number.split(' ')[1])
    
    • 获取用户名

    获取用户名

    user_name = html.find('span', id = 'h-name').text
    

    3. 获取每一页的视频信息

    获取该页视频列表,并遍历

    查看该页视频列表

    page_url = self.user_url + '/video?tid=0&page={}&keyword=&order=pubdate'.format(idx+1)	# idx 为视频第几页
    self.browser.get(page_url)
    time.sleep(self.t+2*random.random())
    html = BeautifulSoup(self.browser.page_source, features="html.parser")
    
    ul_data = html.find('div', id = 'submit-video-list').find('ul', attrs= {'class': 'clearfix cube-list'})
    
    for li in ul_data.find_all('li'):
        # 获取每个视频的信息:url、标题、日期等
    

    4. 获取每个视频信息

    获取每个视频的相关信息

    for li in ul_data.find_all('li'):
        # 链接和标题
        a = li.find('a', attrs = {'target':'_blank', 'class':'title'})
        a_url = 'https:{}'.format(a['href'])
        a_title = a.text
        # 发布日期及播放数
        date_str = li.find('span', attrs = {'class':'time'}).text.strip()
        pub_date = self.date_convert(date_str)
        now = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        play = int(li.find('span', attrs = {'class':'play'}).text.strip())
        # 总时长
        time_str = li.find('span', attrs = {'class':'length'}).text
        duration = self.time_convert(time_str)
    

    5. 进入视频页获取信息

    • 获取视频详细数据

    进入视频页查看视频详细数据:播放量、弹幕数、发布日期

    # e.g. url = 'https://www.bilibili.com/video/BV1Wa4y1e7yy'
    self.browser.get(url)
    time.sleep(self.t+2*random.random())
    html = BeautifulSoup(self.browser.page_source, features="html.parser")
    
    video_data = html.find('div', id = 'viewbox_report').find_all('span')
    play = int(video_data[1]['title'][4:])
    danmu = int(video_data[2]['title'][7:])
    date = video_data[3].text
    
    • 判断是否为播放列表

    检查是否有 multi_page 字段即可判断是否为播放列表。

    查看是否为播放列表

    # 接上一段代码:“进入视频页获取视频的详细信息”
    multi_page = html.find('div', id = 'multi_page')
    if multi_page is not None:
        url_type = 'playlist'
        pages = multi_page.find('span', attrs= {'class': 'cur-page'}).text
        page_total = int(pages.split('/')[-1])
    else:
        url_type = 'video'
        page_total = 1
    # 也可继续获取播放列表其它信息,如分P的标题,但如果爬取时sleep时间过短可能导致爬取失败。
    

    完整代码

    main.py

    import os
    import os.path as osp
    import argparse
    
    from utils.bilibili_spider import Bilibili_Spider
    
    
    def main(args):
        bilibili_spider = Bilibili_Spider(args.uid, args.save_dir, args.save_by_page, args.time)
        bilibili_spider.get()
        if args.detailed:
            bilibili_spider.get_detail()
    
    
    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument('--uid', type=str, default='362548791')
        parser.add_argument('--save_dir', type=str, default='json')
        parser.add_argument('--save_by_page', action='store_true', default=False)
        parser.add_argument('--time', type=int, default=2, help='waiting time for browser.get(url) by seconds')
        parser.add_argument('--detailed', action='store_true', default=False)
        args = parser.parse_args()
        print(args)
        
        main(args)
    

    utils/bilibili_spider.py

    import re
    import os
    import os.path as osp
    import sys
    import json
    import time
    import argparse
    import datetime
    from selenium import webdriver
    from bs4 import BeautifulSoup
    from urllib import parse as url_parse
    import random
    
    from .tools import mkdir_if_missing, write_json, read_json
    
    
    class Bilibili_Spider():
    
        def __init__(self, uid, save_dir_json='json', save_by_page=False, t=2):
            self.t = t
            self.uid = uid
            self.user_url = 'https://space.bilibili.com/{}'.format(uid)
            self.save_dir_json = save_dir_json
            self.save_by_page = save_by_page
            options = webdriver.FirefoxOptions()
            options.add_argument('--headless')
            self.browser = webdriver.Firefox(options=options)
            print('spider init done.')
    
        def close(self):
            # 关闭浏览器驱动
            self.browser.quit()
    
        def time_convert(self, time_str):
            time_item = time_str.split(':')
            assert len(time_item) == 2, 'time format error: {}, x:x expected!'.format(time_str)
            seconds = int(time_item[0])*60 + int(time_item[1])
            return seconds
    
        def date_convert(self, date_str):
            date_item = date_str.split('-')
            assert len(date_item) == 2 or len(date_item) == 3, 'date format error: {}, x-x or x-x-x expected!'.format(date_str)
            if len(date_item) == 2:
                year = datetime.datetime.now().strftime('%Y')
                date_str = '{}-{:>02d}-{:>02d}'.format(year, int(date_item[0]), int(date_item[1]))
            else:
                date_str = '{}-{:>02d}-{:>02d}'.format(date_item[0], int(date_item[1]), int(date_item[2]))
            return date_str
    
        def get_page_num(self):
            page_url = self.user_url + '/video?tid=0&page={}&keyword=&order=pubdate'.format(1)
            self.browser.get(page_url)
            time.sleep(self.t+2*random.random())
            html = BeautifulSoup(self.browser.page_source, features="html.parser")
    
            page_number = html.find('span', attrs={'class':'be-pager-total'}).text
            user_name = html.find('span', id = 'h-name').text
    
            return int(page_number.split(' ')[1]), user_name
    
        def get_videos_by_page(self, idx):
            # 获取第 page_idx 页的视频信息
            urls_page, titles_page, plays_page, dates_page, durations_page = [], [], [], [], []
            page_url = self.user_url + '/video?tid=0&page={}&keyword=&order=pubdate'.format(idx+1)
            self.browser.get(page_url)
            time.sleep(self.t+2*random.random())
            html = BeautifulSoup(self.browser.page_source, features="html.parser")
    
            ul_data = html.find('div', id = 'submit-video-list').find('ul', attrs= {'class': 'clearfix cube-list'})
    
            for li in ul_data.find_all('li'):
                # url & title
                a = li.find('a', attrs = {'target':'_blank', 'class':'title'})
                a_url = 'https:{}'.format(a['href'])
                a_title = a.text
                # pub_date & play
                date_str = li.find('span', attrs = {'class':'time'}).text.strip()
                pub_date = self.date_convert(date_str)
                now = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
                play = int(li.find('span', attrs = {'class':'play'}).text.strip())
                # duration
                time_str = li.find('span', attrs = {'class':'length'}).text
                duration = self.time_convert(time_str)
                # append
                urls_page.append(a_url)
                titles_page.append(a_title)
                dates_page.append((pub_date, now))
                plays_page.append(play)
                durations_page.append(duration)
    
            return urls_page, titles_page, plays_page, dates_page, durations_page
    
        def save(self, json_path, bvs, urls, titles, plays, durations, dates):
            data_list = []
            for i in range(len(urls)):
                result = {}
                result['user_name'] = self.user_name
                result['bv'] = bvs[i]
                result['url'] = urls[i]
                result['title'] = titles[i]
                result['play'] = plays[i]
                result['duration'] = durations[i]
                result['pub_date'] = dates[i][0]
                result['now'] = dates[i][1]
                data_list.append(result)
            
            print('write json to {}'.format(json_path))
            dir_name = osp.dirname(json_path)
            mkdir_if_missing(dir_name)
            write_json(data_list, json_path)
            print('dump json file done. total {} urls. 
    '.format(len(urls)))
    
        def get(self):
            # 获取该 up 主的所有基础视频信息
            print('Start ... 
    ')
            self.page_num, self.user_name = self.get_page_num()
            while self.page_num == 0:
                print('Failed to get user page num, poor network condition, retrying ... ')
                self.page_num, self.user_name = self.get_page_num()
            print('Pages Num {}, User Name: {}'.format(self.page_num, self.user_name))
    
            bvs = []
            urls = []
            titles = []
            plays = []
            dates = []
            durations = []   # by seconds
    
            for idx in range(self.page_num):
                print('>>>> page {}/{}'.format(idx+1, self.page_num))
                urls_page, titles_page, plays_page, dates_page, durations_page = self.get_videos_by_page(idx)
                while len(urls_page) == 0:
                    print('failed, try again page {}/{}'.format(idx+1, self.page_num))
                    urls_page, titles_page, plays_page, dates_page, durations_page = self.get_videos_by_page(idx)
                bvs_page = [x.split('/')[-1] for x in urls_page]
                assert len(urls_page) == len(titles_page), '{} != {}'.format(len(urls_page), len(titles_page)) 
                assert len(urls_page) == len(plays_page), '{} != {}'.format(len(urls_page), len(titles_page)) 
                assert len(urls_page) == len(dates_page), '{} != {}'.format(len(urls_page), len(dates_page))  
                assert len(urls_page) == len(durations_page), '{} != {}'.format(len(urls_page), len(durations_page))  
                print('result:')
                print('{}_{}: '.format(self.user_name, self.uid), bvs_page, ', {} in total'.format(len(urls_page)))
                sys.stdout.flush()
                bvs.extend(bvs_page)
                urls.extend(urls_page)
                titles.extend(titles_page)
                plays.extend(plays_page)
                dates.extend(dates_page)
                durations.extend(durations_page)
                if self.save_by_page:
                    json_path_page = osp.join(self.save_dir_json, '{}_{}'.format(self.user_name, self.uid), 'primary', 'page_{}.json'.format(idx+1))
                    self.save(json_path_page, bvs_page, urls_page, titles_page, plays_page, durations_page, dates_page)
    
            json_path = osp.join(self.save_dir_json, '{}_{}'.format(self.user_name, self.uid), 'primary', 'full.json')
            self.save(json_path, bvs, urls, titles, plays, durations, dates)
    
        def get_url(self, url):
            self.browser.get(url)
            time.sleep(self.t+2*random.random())
            html = BeautifulSoup(self.browser.page_source, features="html.parser")
    
            video_data = html.find('div', id = 'viewbox_report').find_all('span')
            play = int(video_data[1]['title'][4:])
            danmu = int(video_data[2]['title'][7:])
            date = video_data[3].text
    
            multi_page = html.find('div', id = 'multi_page')
            if multi_page is not None:
                url_type = 'playlist'
                pages = multi_page.find('span', attrs= {'class': 'cur-page'}).text
                page_total = int(pages.split('/')[-1])
            else:
                url_type = 'video'
                page_total = 1
            
            return play, danmu, date, url_type, page_total
        
        def get_detail(self):
            print('Start to get detailed information for each url.')
            if self.save_by_page:
                data = []
                for idx in range(self.page_num):
                    json_path = osp.join(self.save_dir_json, '{}_{}'.format(self.user_name, self.uid), 'primary', 'page_{}.json'.format(idx+1))
                    data_page = read_json(json_path)
                    for j, item in enumerate(data_page):
                        url = item['url']
                        print('>>>> page {}/{}, No. {}/{}'.format(idx+1, self.page_num, j+1, len(data_page)))
                        play, danmu, date, url_type, page_total = self.get_url(url)
                        # print(play, danmu, date, url_type, page_total)
                        assert page_total > 0, page_total
                        if page_total == 1:
                            assert url_type == 'video', (url_type, page_total)
                            data_page[j]['play'] = play
                            data_page[j]['danmu'] = danmu
                            data_page[j]['pub_date'] = date
                            data_page[j]['type'] = url_type
                            data_page[j]['num'] = page_total
                        else:
                            assert url_type == 'playlist', (url_type, page_total)
                            data_page[j]['play'] = play
                            data_page[j]['danmu'] = danmu
                            data_page[j]['pub_date'] = date
                            data_page[j]['type'] = url_type
                            data_page[j]['num'] = page_total
    
                    json_path_save = osp.join(self.save_dir_json, '{}_{}'.format(self.user_name, self.uid), 'detailed', 'page_{}.json'.format(idx+1))
                    print('write json to {}'.format(json_path_save))
                    write_json(data_page, json_path_save)
                    print('dump json file done. total {} urls. 
    '.format(len(data_page)))
                    data.extend(data_page)
                
                json_path_save = osp.join(self.save_dir_json, '{}_{}'.format(self.user_name, self.uid), 'detailed', 'full.json')
                print('write json to {}'.format(json_path_save))
                write_json(data, json_path_save)
                print('dump json file done. total {} urls. 
    '.format(len(data)))
            else:
                json_path = osp.join(self.save_dir_json, '{}_{}'.format(self.user_name, self.uid), 'primary', 'full.json')
                data = read_json(json_path)
                for j, item in enumerate(data):
                    url = item['url']
                    print('>>>> No. {}/{}'.format(j+1, len(data)))
                    play, danmu, date, url_type, page_total = self.get_url(url)
                    assert page_total > 0, page_total
                    if page_total == 1:
                        assert url_type == 'video', (url_type, page_total)
                        data[j]['play'] = play
                        data[j]['danmu'] = danmu
                        data[j]['pub_date'] = date
                        data[j]['type'] = url_type
                        data[j]['num'] = page_total
                    else:
                        assert url_type == 'playlist', (url_type, page_total)
                        data[j]['play'] = play
                        data[j]['danmu'] = danmu
                        data[j]['pub_date'] = date
                        data[j]['type'] = url_type
                        data[j]['num'] = page_total
                
                json_path_save = osp.join(self.save_dir_json, '{}_{}'.format(self.user_name, self.uid), 'detailed', 'full.json')
                print('write json to {}'.format(json_path_save))
                write_json(data, json_path_save)
                print('dump json file done. total {} urls. 
    '.format(len(data)))
    

    utils/tools.py

    import sys
    import os
    import os.path as osp
    import time
    import errno
    import json
    import warnings
    
    
    def mkdir_if_missing(dirname):
        """Creates dirname if it is missing."""
        if not osp.exists(dirname):
            try:
                os.makedirs(dirname)
            except OSError as e:
                if e.errno != errno.EEXIST:
                    raise
    
    
    def check_isfile(fpath):
        """Checks if the given path is a file."""
        isfile = osp.isfile(fpath)
        if not isfile:
            warnings.warn('No file found at "{}"'.format(fpath))
        return isfile
    
    
    def read_json(fpath):
        """Reads json file from a path."""
        with open(fpath, 'r') as f:
            obj = json.load(f)
        return obj
    
    
    def write_json(obj, fpath):
        """Writes to a json file."""
        mkdir_if_missing(osp.dirname(fpath))
        with open(fpath, 'w', encoding='utf-8') as f:
            json.dump(obj, f, indent=4, separators=(',', ': '), ensure_ascii=False) # 添加中文支持
    
  • 相关阅读:
    ASP.Net设计时需要考虑的性能优化问题 转载自http://blog.sina.com.cn/s/blog_3d7bed650100055p.html
    Jqeruy dropdownlist 联动
    Sound Recording in Windows Phone 7
    Windows Phone Performance 系列网址集合
    【xml]: Read XML with Namespace resolution using XLinq.XElement
    编程基础字符篇(2)
    每日总结一:
    每日总结5:
    Control usage: (1) Windows Phone 7: Popup control
    编程基础字符篇(3)
  • 原文地址:https://www.cnblogs.com/xieqk/p/14001293.html
Copyright © 2011-2022 走看看