  • Python: crawling JD (jingdong) job postings into a MySQL database with Scrapy

    1. Create the project

    scrapy startproject jd

    2. Generate the spider (genspider takes the spider name and the domain it will crawl)

    scrapy genspider jingdong zhaopin.jd.com
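
    After these two commands the project has the standard Scrapy layout (shown here for orientation; jingdong.py is the file that genspider creates):

    jd/
        scrapy.cfg
        jd/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py
                jingdong.py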

    3. Install pymysql

    pip install pymysql

    4. settings.py: mainly global settings, including the database connection info

    # -*- coding: utf-8 -*-
    
    # Scrapy settings for jd project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://doc.scrapy.org/en/latest/topics/settings.html
    #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'jd'
    
    SPIDER_MODULES = ['jd.spiders']
    NEWSPIDER_MODULE = 'jd.spiders'
    
    LOG_LEVEL = "WARNING"
    LOG_FILE = "./jingdong1.log"
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'jd (+http://www.yourdomain.com)'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = True
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}
    
    # Enable or disable spider middlewares
    # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'jd.middlewares.JdSpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'jd.middlewares.JdDownloaderMiddleware': 543,
    #}
    
    # Enable or disable extensions
    # See https://doc.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
       'jd.pipelines.JdPipeline': 300,
    }
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
    
    # MySQL connection settings
    # database host
    MYSQL_HOST = 'localhost'
    # database user
    MYSQL_USER = 'root'
    # database password
    MYSQL_PASSWORD = 'yang156122'
    # database port
    MYSQL_PORT = 3306
    # database name
    MYSQL_DBNAME = 'test'
    # database charset
    MYSQL_CHARSET = 'utf8'
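
    Before wiring these values into the pipeline, it can help to confirm that they actually connect. A minimal standalone sanity check with pymysql, using the same values as above (run it outside Scrapy; adjust if your local MySQL differs):

    import pymysql

    # connect with the same values defined in settings.py
    conn = pymysql.connect(
        host='localhost',
        user='root',
        password='yang156122',
        port=3306,
        database='test',
        charset='utf8',
    )
    try:
        with conn.cursor() as cursor:
            cursor.execute("SELECT VERSION()")
            print(cursor.fetchone())  # e.g. ('5.7.26',)
    finally:
        conn.close()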

    5. items.py: define the item fields (these become the database columns)

    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    class JdItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        appTime = scrapy.Field()
        applicantErp = scrapy.Field()
        formatPublishTime = scrapy.Field()
        jobType = scrapy.Field()
        positionName = scrapy.Field()
        positionNameOpen = scrapy.Field()
        publishTime = scrapy.Field()
        qualification = scrapy.Field()
        pass
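
    The pipeline in step 7 inserts into a MySQL table named jingdong whose columns mirror these fields. The post does not show the table definition, so here is one possible schema; the column types are assumptions (publishTime is DATETIME because the pipeline formats it as '%Y-%m-%d %H:%M:%S'):

    import pymysql

    # one possible schema for the target table; adjust the types to your data
    DDL = """
    CREATE TABLE IF NOT EXISTS jingdong (
        id INT AUTO_INCREMENT PRIMARY KEY,
        appTime VARCHAR(64),
        applicantErp VARCHAR(64),
        formatPublishTime VARCHAR(64),
        jobType VARCHAR(64),
        positionName VARCHAR(255),
        publishTime DATETIME,
        positionNameOpen VARCHAR(255),
        qualification TEXT
    ) DEFAULT CHARSET=utf8
    """

    conn = pymysql.connect(host='localhost', user='root', password='yang156122',
                           port=3306, database='test', charset='utf8')
    try:
        with conn.cursor() as cursor:
            cursor.execute(DDL)
        conn.commit()
    finally:
        conn.close()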

    6. jingdong.py: the spider that scrapes the job data

    # -*- coding: utf-8 -*-
    import scrapy
    
    import logging
    import json
    logger = logging.getLogger(__name__)
    class JingdongSpider(scrapy.Spider):
        name = 'jingdong'
        allowed_domains = ['zhaopin.jd.com']
        start_urls = ['http://zhaopin.jd.com/web/job/job_list?page=1']
        pageNum = 1
        def parse(self, response):
            content  = response.body.decode()
            content = json.loads(content)
            ########## optionally drop empty values from each dict in the list ##########
            for i in range(len(content)):
                # list(content[i].keys()) returns the keys of the current dict
                # for key in list(content[i].keys()):  # content[i] is a dict
                #     if not content[i].get(key):      # get(key) looks up the value for that key
                #         del content[i][key]          # delete entries whose values are empty
                # (a standalone version of this cleanup is sketched after this code block)
                yield content[i]
            # for i in range(len(content)):
            #     logging.warning(content[i])
    
            self.pageNum = self.pageNum+1
            if self.pageNum<=355:
                next_url = "http://zhaopin.jd.com/web/job/job_list?page="+str(self.pageNum)
                yield scrapy.Request(
                    next_url,
                    callback=self.parse
                )
            pass
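
    The commented-out lines inside parse() sketch how to drop empty values from each job dict before yielding it. Factored out as a standalone helper (the name drop_empty_fields is made up for illustration), the same logic looks like this:

    def drop_empty_fields(job):
        """Remove keys whose values are empty or None, modifying the dict in place."""
        # iterate over a copy of the keys because the dict shrinks while we delete
        for key in list(job.keys()):
            if not job.get(key):
                del job[key]
        return job

    # inside parse() it would be used as:
    #     yield drop_empty_fields(content[i])

    Note that the pipeline in step 7 expects all eight fields to be present, which may be why the cleanup is left commented out here.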

    7. pipelines.py: clean and process the scraped items and write them to the database

      Compared with the earlier Tencent example, the main addition here is the publish-time handling.

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    
    import logging
    from pymysql import cursors
    from twisted.enterprise import adbapi
    import time
    import copy
    class JdPipeline(object):
        # store the connection pool that from_settings() creates
        def __init__(self, db_pool):
            self.db_pool = db_pool
    
        @classmethod
        def from_settings(cls, settings):
            """类方法,只加载一次,数据库初始化"""
            db_params = dict(
                host=settings['MYSQL_HOST'],
                user=settings['MYSQL_USER'],
                password=settings['MYSQL_PASSWORD'],
                port=settings['MYSQL_PORT'],
                database=settings['MYSQL_DBNAME'],
                charset=settings['MYSQL_CHARSET'],
                use_unicode=True,
                # return query results as dicts instead of tuples
                cursorclass=cursors.DictCursor
            )
            # create the Twisted adbapi connection pool
            db_pool = adbapi.ConnectionPool('pymysql', **db_params)
            # return a pipeline instance
            return cls(db_pool)
    
        def process_item(self, item, spider):
            myItem = {}
            myItem["appTime"]=item["appTime"]
            myItem["applicantErp"] = item["applicantErp"]
            myItem["formatPublishTime"] = item["formatPublishTime"]
            myItem["jobType"] = item["jobType"]
            myItem["positionName"] = item["positionName"]
            # convert the publish time
            publishTime = item["publishTime"]
            publishTime = time.localtime(int(str(publishTime)[:10]))  # millisecond timestamp: keep the first 10 digits (seconds)
            myItem["publishTime"] = time.strftime("%Y-%m-%d %H:%M:%S", publishTime)
    
            myItem["positionNameOpen"]=item["positionNameOpen"]
            myItem["qualification"] = item["qualification"]
    
            logging.warning(item)
            # deep copy of the item --- this is what fixes the duplicated-data problem with the async inserts!!!
            asynItem = copy.deepcopy(myItem)
            # hand the SQL work off to the connection pool
            query = self.db_pool.runInteraction(self.insert_into, asynItem)
            # if the insert fails, handle_error() is called automatically via addErrback()
            query.addErrback(self.handle_error, myItem, spider)
            return myItem
    
        # build and execute the INSERT statement
        def insert_into(self, cursor, item):
            # parameterized query, so pymysql handles the quoting of values
            sql = ("INSERT INTO jingdong (appTime, applicantErp, formatPublishTime, jobType, "
                   "positionName, publishTime, positionNameOpen, qualification) "
                   "VALUES (%s, %s, %s, %s, %s, %s, %s, %s)")
            # execute the SQL statement
            cursor.execute(sql, (
                item['appTime'], item['applicantErp'], item['formatPublishTime'], item['jobType'],
                item['positionName'], item['publishTime'], item['positionNameOpen'], item['qualification']))

        # error callback
        def handle_error(self, failure, item, spider):
            # print the error information
            print("failure", failure)
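
    The publish-time handling is the piece that differs from the Tencent example: publishTime appears to arrive as a millisecond epoch timestamp, so the pipeline keeps only the first 10 digits (seconds) before formatting. A quick illustration of that conversion (the sample value is made up):

    import time

    publishTime = 1561445733000           # hypothetical millisecond timestamp from the item
    seconds = int(str(publishTime)[:10])  # keep the first 10 digits -> seconds since the epoch
    formatted = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(seconds))
    print(formatted)                      # 2019-06-25 14:55:33 in UTC+8 (result depends on local timezone)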

    And that's a wrap!!!
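
    To reproduce the whole thing end to end: create the jingdong table, check the MySQL settings, then start the crawl from the project root. Warnings go to ./jingdong1.log and the rows land in the jingdong table.

    scrapy crawl jingdong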

  • Original post: https://www.cnblogs.com/ywjfx/p/11102845.html