zoukankan      html  css  js  c++  java
  • 爬虫综合大作业

    本次作业爬取的是最近上映的很火热的电影《反贪风暴》。希望可以爬取一些有意义的东西。

    最新电影票房排行明细:

    Scrapy使用的基本流程:

    1. 引擎从调度器中取出一个链接(URL)用于接下来的抓取
    2. 引擎把URL封装成一个请求(Request)传给下载器
    3. 下载器把资源下载下来,并封装成应答包(Response)
    4. 爬虫解析Response
    5. 解析出实体(Item),则交给实体管道进行进一步的处理
    6. 解析出的是链接(URL),则把URL交给调度器等待抓取

    一.把爬取的内容保存取MySQL数据库

    主要代码如下:

    城市,评论,号码,昵称,评论时间,用户等级

    import scrapy
     
     
    class MaoyanItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        city = scrapy.Field()  # 城市
        content = scrapy.Field()  # 评论
        user_id = scrapy.Field()  # 用户id
        nick_name = scrapy.Field()  # 昵称
        score = scrapy.Field()  # 评分
        time = scrapy.Field()  # 评论时间
        user_level = scrapy.Field()  # 用户等级

    comment.py

    import scrapy
    import random
    from scrapy.http import Request
    import datetime
    import json
    from maoyan.items import MaoyanItem
     
    class CommentSpider(scrapy.Spider):
        name = 'comment'
        allowed_domains = ['maoyan.com']
        uapools = [
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1',
            'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
            'Opera/9.80 (Windows NT 6.1; U; zh-cn) Presto/2.9.168 Version/11.50',
            'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.3)',
            'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.0)',
            'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
            'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)',
            'Mozilla/5.0 (Windows; U; Windows NT 6.1; ) AppleWebKit/534.12 (KHTML, like Gecko) Maxthon/3.0 Safari/534.12',
            'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E)',
            'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)',
            'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.33 Safari/534.3 SE 2.X MetaSr 1.0',
            'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E)',
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.41 Safari/535.1 QQBrowser/6.9.11079.201',
            'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E) QQBrowser/6.9.11079.201',
            'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)',
            'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36',
            'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0'
        ]
        thisua = random.choice(uapools)
        header = {'User-Agent': thisua}
        current_time = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        current_time = '2019-04-4 18:50:22'
        end_time = '2019-04-4 00:08:00'  # 电影上映时间
        url = 'http://m.maoyan.com/mmdb/comments/movie/248172.json?_v_=yes&offset=0&startTime=' +current_time.replace(' ','%20')
     
        def start_requests(self):
            current_t = str(self.current_time)
            if current_t > self.end_time:
                try:
                    yield Request(self.url, headers=self.header, callback=self.parse)
                except Exception as error:
                    print('请求1出错-----' + str(error))
            else:
                print('全部有关信息已经搜索完毕')
     
        def parse(self, response):
            item = MaoyanItem()
            data = response.body.decode('utf-8''ignore')
            json_data = json.loads(data)['cmts']
            count = 0
            for item1 in json_data:
                if 'cityName' in item1 and 'nickName' in item1 and 'userId' in item1 and 'content' in item1 and 'score' in item1 and 'startTime' in item1 and 'userLevel' in item1:
                    try:
                        city = item1['cityName']
                        comment = item1['content']
                        user_id = item1['userId']
                        nick_name = item1['nickName']
                        score = item1['score']
                        time = item1['startTime']
                        user_level = item1['userLevel']
                        item['city'= city
                        item['content'= comment
                        item['user_id'= user_id
                        item['nick_name'= nick_name
                        item['score'= score
                        item['time'= time
                        item['user_level'= user_level
                        yield item
                        count += 1
                        if count >= 15:
                            temp_time = item['time']
                            current_t = datetime.datetime.strptime(temp_time, '%Y-%m-%d %H:%M:%S'+ datetime.timedelta(
                                seconds=-1)
                            current_t = str(current_t)
                            if current_t > self.end_time:
                                url1 = 'http://m.maoyan.com/mmdb/comments/movie/248172.json?_v_=yes&offset=0&startTime=' + current_t.replace(
                                    ' ''%20')
                                yield Request(url1, headers=self.header, callback=self.parse)
                            else:
                                print('全部有关信息已经搜索完毕')
                    except Exception as error:
                        print('提取信息出错1-----' + str(error))
                else:
                    print('信息不全,已滤除')
    pipelines文件
    import pandas as pd
    class MaoyanPipeline(object):
        def process_item(self, item, spider):
            dict_info = {'city': item['city'], 'content': item['content'], 'user_id': item['user_id'],
                         'nick_name': item['nick_name'],
                         'score': item['score'], 'time': item['time'], 'user_level': item['user_level']}
            try:
                data = pd.DataFrame(dict_info, index=[0])  # 为data创建一个表格形式 ,注意加index = [0]
                data.to_csv('G:info.csv', header=False, index=True, mode='a',
                            encoding='utf_8_sig')  # 模式:追加,encoding = 'utf-8-sig'
            except Exception as error:
                print('写入文件出错-------->>>' + str(error))
            else:
                print(dict_info['content'+ '---------->>>已经写入文件')
     最后爬完的数据1.34M,16732条数据。显示如下所示:
  • 相关阅读:
    bcftools 为 vcf 文件建索引及合并 vcf 文件
    Linux 替换^M字符方法
    shell 字符串分割方法简介
    shell 数组介绍及相关操作
    Annovar 信息注释
    C++ string与数值的转换
    C/C++ 删除文件 remove函数
    关于内核转储(core dump)的设置方法
    mac下nginx安装
    linux独有的sendfile系统调用--“零拷贝,高效”
  • 原文地址:https://www.cnblogs.com/068zhengda/p/10786833.html
Copyright © 2011-2022 走看看