  • Scraping the Maoyan Movies TOP100 data with Python

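    I recently needed to do some web scraping, so I found a video on Bilibili (link at the end of the post), watched it, and put together a small program. It follows the video largely unchanged, except that the final storage step writes to Excel instead of a txt file.

    • Requirement in brief: scrape the Maoyan Movies TOP100 board data
    • Language: Python
    • Tool: PyCharm
    • Libraries: requests, re, openpyxl (a library for the newer Excel format, .xlsx)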

    Implementation

    Maoyan Movies robots.txt

    # -*- coding: utf-8 -*-
    # @Author  : yocichen
    # @Email   : yocichen@126.com
    # @File    : maoyan100.py
    # @Software: PyCharm
    # @Time    : 2019
    # @UpdateTime : 2020/4/26
    
    import requests
    from requests import RequestException
    import re
    import openpyxl
    import traceback
    
    # Get page's html by requests module
    def get_one_page(url):
        try:
            headers = {
                'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4882.400 QQBrowser/9.7.13059.400'
            }
            # Sometimes, the proxies need to be replaced.
            # You can get them by accessing https://www.kuaidaili.com/free/inha/
            proxies = {
                'http': '60.190.250.120:8080'
            }
            # use headers to avoid 403 Forbidden Error(reject spider)
            response = requests.get(url, headers=headers, proxies=proxies)
            if response.status_code == 200 :
                return response.text
            return None
        except RequestException:
            traceback.print_exc()
            return None
    
    # Get useful info from html of a page by re module
    def parse_one_page(html):
        try:
            # Raw strings keep the backslashes intact; (\d+) captures the rank number.
            pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)<.*?<a.*?title="(.*?)"'
                                 r'.*?data-src="(.*?)".*?</a>.*?star">[\s]*(.*?)[\n][\s]*</p>.*?'
                                 r'releasetime">(.*?)</p>.*?integer">(.*?)</i>.*?'
                                 r'fraction">(.*?)</i>.*?</dd>', re.S)
            items = re.findall(pattern, html)
            return items
        except Exception:
            traceback.print_exc()
            return []
    
    # Main call function
    def main(url):
        page_html = get_one_page(url)
        parse_res = parse_one_page(page_html)
        return parse_res
    
    # Write the useful info in excel(*.xlsx file)
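    def write_excel_xlsx(items):
        # Bail out early: items[0] below would raise IndexError on an empty list.
        if not items:
            print('nothing scraped, no file written.')
            return
        wb = openpyxl.Workbook()
        ws = wb.active
        rows = len(items)
        cols = len(items[0])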
        # First, write col's title.
        ws.cell(1, 1).value = 'No.'
        ws.cell(1, 2).value = 'Title'
        ws.cell(1, 3).value = 'Poster'
        ws.cell(1, 4).value = 'Starring'
        ws.cell(1, 5).value = 'Release date'
        ws.cell(1, 6).value = 'Score'
        # Write film's info
        for i in range(0, rows):
            for j in range(0, cols):
                if j != 5:
                    ws.cell(i+2, j+1).value = items[i][j]
                else:
                    ws.cell(i+2, j+1).value = items[i][j]+items[i][j+1]
                    break
        # Save the work book as *.xlsx
        wb.save('maoyan_top100.xlsx')
    
    if __name__ == '__main__':
        print('spider working...')
        res = []
        url = 'https://maoyan.com/board/4?'
        for i in range(0, 10):
            if i == 0:
                res = main(url)
            else:
                newUrl = url+'offset='+str(i*10)
                res.extend(main(newUrl))
        print('writing into excel...')
        write_excel_xlsx(res)
        print('work done!\nNote: the data is in the current directory.')
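
The `j != 5` branch in the write loop merges the last two captured groups (the integer and fractional parts of the score) into one column. Below is a flat sketch of the same conversion. The helper name `build_rows` and the English header labels are illustrative assumptions, and each item is assumed to be the 7-field tuple produced by `parse_one_page`.

```python
# Hypothetical helper: turn the 7-field tuples into 6-column rows.
HEADER = ['No.', 'Title', 'Poster', 'Starring', 'Release date', 'Score']


def build_rows(items):
    """Return a header row followed by one row per film.

    Each item is assumed to be a 7-tuple:
    (rank, title, poster_url, stars, release_time, score_integer, score_fraction).
    """
    rows = [HEADER]
    for rank, title, poster, stars, release, integer, fraction in items:
        # The page splits a score such as "9.5" into "9." and "5"; join them back.
        rows.append([rank, title, poster, stars, release, integer + fraction])
    return rows


if __name__ == '__main__':
    sample = [('1', 'Example Film', 'https://example.com/p.jpg',
               'Starring: A, B', 'Release date: 1993-01-01', '9.', '5')]
    for row in build_rows(sample):
        print(row)
```

Each row could then be passed to openpyxl's `ws.append(row)` instead of addressing cells one by one.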

    [Figure: updated screenshot of the resulting spreadsheet]

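    Afterword

    Once I got a little way into this, I found that when scraping with regular expressions and the requests library, the key work is analyzing the HTML page structure and constructing the regular expression. Everything else is little more than swapping out the URL.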
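
As a small illustration of the "swapping out the URL" step, the page addresses can be generated up front. This is a sketch only: `page_urls` is a hypothetical helper, the page size of 10 is inferred from the offset step in the script, and it assumes the site treats `offset=0` the same as the bare board URL.

```python
from urllib.parse import urlencode

BASE_URL = 'https://maoyan.com/board/4'  # board URL used in the script
PAGE_SIZE = 10                           # assumed: ten films per page


def page_urls(total=100, page_size=PAGE_SIZE):
    """Build one URL per page of the board."""
    urls = []
    for offset in range(0, total, page_size):
        urls.append(BASE_URL + '?' + urlencode({'offset': offset}))
    return urls


if __name__ == '__main__':
    for url in page_urls():
        print(url)
```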

    The GitHub link, in case you need it


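    One more example: analyzing the HTML to build a regex

    Maoyan classic sci-fi films, sorted by rating

    Inspecting the elements shows that every entry has the form <dd>****</dd>.

    I want the film title and the score, so let's first look at the HTML.

    Try constructing the regex: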

    '.*?<dd>.*?movie-item-title.*?title="(.*?)">.*?integer">(.*?)<.*?fraction">(.*?)<.*?</dd>' (written off the cuff, not verified)
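
One way to check a regex like this before pointing it at the live site is to run it against a small saved fragment. The sketch below uses made-up markup that only imitates the structure described above (it is not copied from Maoyan), and it compiles the pattern with `re.S`, which is needed because `.` does not match newlines by default. The function name `extract_titles_and_scores` is hypothetical.

```python
import re

# The regex from the text, compiled with re.S so that "." also matches newlines.
PATTERN = re.compile(
    r'.*?<dd>.*?movie-item-title.*?title="(.*?)">.*?integer">(.*?)<'
    r'.*?fraction">(.*?)<.*?</dd>',
    re.S,
)

# Hypothetical markup for illustration, NOT copied from the live site.
SAMPLE_HTML = '''
<dd>
  <div class="channel-detail movie-item-title" title="Example Film A">
    <a href="/films/1">Example Film A</a>
  </div>
  <div class="channel-detail channel-detail-orange">
    <i class="integer">9.</i><i class="fraction">1</i>
  </div>
</dd>
<dd>
  <div class="channel-detail movie-item-title" title="Example Film B">
    <a href="/films/2">Example Film B</a>
  </div>
  <div class="channel-detail channel-detail-orange">
    <i class="integer">8.</i><i class="fraction">7</i>
  </div>
</dd>
'''


def extract_titles_and_scores(html):
    """Return (title, score) pairs, joining the split score digits."""
    return [(title, integer + fraction)
            for title, integer, fraction in PATTERN.findall(html)]


if __name__ == '__main__':
    for title, score in extract_titles_and_scores(SAMPLE_HTML):
        print(title, score)
```

Because every `.*?` may cross a `</dd>` boundary, an entry that has no score would pull its score from the following entry, so results from real pages are worth spot-checking.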


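    References

    [Bilibili video: 2018 Latest Python 3.6 Web Crawler in Practice] https://www.bilibili.com/video/av19057145/?p=14

    [Maoyan Movies robots.txt] https://maoyan.com/robots.txt (best to read it before crawling, to see which paths may be crawled and which may not)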
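
The robots.txt check can also be done in code with the standard library's `urllib.robotparser`. The sketch below parses a made-up rule set so that it runs offline. The rules and the `example.com` URLs are assumptions for illustration and do not reflect Maoyan's actual robots.txt, and `is_allowed` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; NOT the contents of maoyan.com/robots.txt.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(rules_text, url, user_agent='*'):
    """Check a URL against robots.txt rules supplied as text."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == '__main__':
    print(is_allowed(SAMPLE_RULES, 'https://example.com/board/4'))
    print(is_allowed(SAMPLE_RULES, 'https://example.com/private/data'))
```

Against a live site, calling `set_url(...)` followed by `read()` on the parser fetches the real file instead of using `parse`.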

  • Original article: https://www.cnblogs.com/yocichen/p/11812637.html