zoukankan      html  css  js  c++  java
  • Scrapy实战篇(二)之爬取链家网成交房源数据(下)

    在上一小节中,我们已经提取到了房源的具体信息,这一节中,我们主要是对提取到的数据进行后续的处理,以及进行相关的设置。

    数据处理

    我们这里以把数据存储到mongo数据库为例。
    编写pipelines.py文件

    import pymongo
    
    
    class MongoPipeline(object):
    
        collection = 'lianjia_house'  #数据库collection名称
    
        def __init__(self, mongo_uri, mongo_db):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db
    
        @classmethod
        def from_crawler(cls,crawler):
            return cls(
                mongo_uri = crawler.settings.get('MONGO_URI'),
                mongo_db = crawler.settings.get('MONGO_DB')
            )
        def open_spider(self,spider):
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]
    
        def close(self, spider):
            self.client.close()
    
        def process_item(self, item, spider):
            table = self.db[self.collection]
            data = dict(item)
            table.insert_one(data)
            return item
    

    非常简单的几步,就实现了将数据保存到mongo数据库中,所以说mongo数据库还是非常好用的。
    由于之前的学习篇中已经学习过数据的存储相关的内容,在这里就不多赘述。

    设置随机User-Agent

    这个内容在之前的学习篇中也已经学习过了,这里就直接拿过来用。
    编写middlewares.py文件。

    import scrapy
    import random
    from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
    
    
    class MyUserAgentMiddleware(UserAgentMiddleware):
    
        def __init__(self, agents):
            self.agents = agents
    
        @classmethod
        def from_crawler(cls, crawler):
            return cls(
                agents=crawler.settings.get('USER_AGENTS')
            )
    
        def process_request(self, request, spider):
            agent = random.choice(self.agents)
            request.headers['User-Agent'] = agent
    

    设置(settings)

    最后一步就是在settings.py文件中,进行我们的设置和应用我们的相关的组件。
    内容如下:

    BOT_NAME = 'lianjia'
    
    SPIDER_MODULES = ['lianjia.spiders']
    NEWSPIDER_MODULE = 'lianjia.spiders'
    
    ROBOTSTXT_OBEY = False
    
    USER_AGENTS = [
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
        "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
        "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
        "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
        "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
        "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
        "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
        "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
        "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
        "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
        "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
        "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
        ]
    
    MONGO_URI = 'mongodb://localhost:27017'
    MONGO_DB = "lianjia"
    
    DOWNLOAD_DELAY = 2
    
    DEFAULT_REQUEST_HEADERS = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
        'Connection':'keep-alive'
    }
    
    DOWNLOADER_MIDDLEWARES = {
        'lianjia.middlewares.MyUserAgentMiddleware': 400,
    }
    
    ITEM_PIPELINES = {
       'lianjia.pipelines.MongoPipeline': 300,
    }
    

    总结

    由于我们爬取得数据量比较大,请求比较多,如果你直接运行的话,肯定是很快就会被封掉的,你可以选择设置ip代理,具体的设置方法你可以参照scrapy学习篇里面的设置ip代理,这里就不多演示,当然了,如果你想看一下效果的话,你可以选择只爬取某一个区的数据,比如鼓楼区。其效果如下面所示。

    另外,你可以在你的项目根目录下创建一个run.py文件,里面添加如下的内容:

    from scrapy import cmdline
    cmdline.execute("scrapy crawl lianjia".split())
    

    其中,lianjia是你spider里面定义的名字,这样,你只需要使用python run.py就可以运行这个项目了。

    这里提醒一下,如果你不是着急获取这个数据的话,可以将设置里面的下载延迟设置的稍微大一些,一方面防止我们爬虫被办,另一方面以减轻对方服务器的压力。

    github地址: https://github.com/cnkai/lianjia.git

    声明:本文仅供学习交流之用。

  • 相关阅读:
    node.js 安装后怎么打开 node.js 命令框
    thinkPHP5 where多条件查询
    网站title中的图标
    第一次写博客
    Solution to copy paste not working in Remote Desktop
    The operation could not be completed. (Microsoft.Dynamics.BusinessConnectorNet)
    The package failed to load due to error 0xC0011008
    VS2013常用快捷键
    微软Dynamics AX的三层架构
    怎样在TFS(Team Foundation Server)中链接团队项目
  • 原文地址:https://www.cnblogs.com/cnkai/p/7405349.html
Copyright © 2011-2022 走看看