  • Scrapy in practice: scraping the latest US TV shows

    This post builds a small project that uses the Scrapy crawling framework to scrape the latest US TV shows.

    Preparation:

      Target URL: http://www.meijutt.com/new100.html

      Fields to scrape: show name, airing status, TV network, update time

    1. Create the project directory

    mkdir scrapyProject
    cd scrapyProject
    

      

    2. Create the Scrapy project

    scrapy startproject meiju100
    cd meiju100
    scrapy genspider meiju meijutt.com
    

    3. Inspect the directory structure
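      The original post showed the tree as a screenshot. For reference, this is roughly the layout that scrapy startproject and scrapy genspider produce (the exact set of files varies slightly between Scrapy versions):

    ```
    meiju100/
    ├── scrapy.cfg          # deploy configuration
    └── meiju100/           # the project's Python package
        ├── __init__.py
        ├── items.py        # item definitions (step 4)
        ├── pipelines.py    # item pipelines (step 6)
        ├── settings.py     # project settings (step 7)
        └── spiders/
            ├── __init__.py
            └── meiju.py    # spider generated by genspider (step 5)
    ```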

      

    4. Define the item fields (items.py)

    #
    # See documentation in:
    # http://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    class Meiju100Item(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        storyName = scrapy.Field()   # show name
        storyState = scrapy.Field()  # airing status
        tvStation = scrapy.Field()   # TV network
        updateTime = scrapy.Field()  # last update time
    

      

    5. Write the spider (meiju.py)

    # -*- coding: utf-8 -*-
    import scrapy
    from meiju100.items import Meiju100Item
    
    class MeijuSpider(scrapy.Spider):
        name = "meiju"
        allowed_domains = ["meijutt.com"]
        start_urls = ['http://www.meijutt.com/new100.html']
    
        def parse(self, response):
            items = []
            subSelector = response.xpath('//ul[@class="top-list  fn-clear"]/li')
            for sub in subSelector:
                item = Meiju100Item()
                item['storyName'] = sub.xpath('./h5/a/text()').extract()
                # Some rows wrap the status text in a <font> tag, others do not,
                # so fall back to the plain <span> text when the first XPath is empty.
                item['storyState'] = sub.xpath('./span[1]/font/text()').extract()
                if not item['storyState']:
                    item['storyState'] = sub.xpath('./span[1]/text()').extract()
                item['tvStation'] = sub.xpath('./span[2]/text()').extract()
                if not item['tvStation']:
                    item['tvStation'] = [u'unknown']
                item['updateTime'] = sub.xpath('./div[2]/text()').extract()
                if not item['updateTime']:
                    item['updateTime'] = sub.xpath('./div[2]/font/text()').extract()
                items.append(item)
            return items
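      The spider repeats one pattern three times: try a primary XPath, and fall back to an alternative when the first extract() call returns an empty list. A small helper (hypothetical, not part of the original project) makes the pattern explicit; plain lists stand in for Scrapy selector results here:

    ```python
    def first_nonempty(*candidates, default=None):
        # Return the first non-empty extraction result. Each candidate is a
        # list as returned by Selector.extract(); if all are empty, fall back
        # to [default] (or an empty list when no default is given).
        for result in candidates:
            if result:
                return result
        return [default] if default is not None else []

    # Simulated extract() results: the primary XPath matched nothing.
    state = first_nonempty([], [u'Updated to episode 13'])
    station = first_nonempty([], [], default=u'unknown')
    ```

      Inside parse(), each if/fallback pair would then collapse to a single call such as first_nonempty(primary.extract(), alternative.extract()).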
    

      

    6. Process the scraped items (pipelines.py)

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
    import time


    class Meiju100Pipeline(object):
        def process_item(self, item, spider):
            today = time.strftime('%Y%m%d', time.localtime())
            fileName = today + 'movie.txt'
            # Join the fields with tabs and encode the whole line once;
            # this avoids the reload(sys)/setdefaultencoding hack.
            line = '\t'.join([item['storyName'][0],
                              item['storyState'][0],
                              item['tvStation'][0],
                              item['updateTime'][0]])
            with open(fileName, 'a') as fp:
                fp.write(line.encode('utf8') + '\n')
            return item
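      The pipeline above targets Python 2. On Python 3, a roughly equivalent sketch (same tab-separated file format and item fields assumed) lets open() handle the encoding directly, so no sys-level encoding hack is needed:

    ```python
    # -*- coding: utf-8 -*-
    import time


    class Meiju100Pipeline(object):
        def process_item(self, item, spider):
            # Append one tab-separated line per show to a date-stamped file.
            today = time.strftime('%Y%m%d', time.localtime())
            file_name = today + 'movie.txt'
            fields = [item['storyName'][0], item['storyState'][0],
                      item['tvStation'][0], item['updateTime'][0]]
            with open(file_name, 'a', encoding='utf8') as fp:
                fp.write('\t'.join(fields) + '\n')
            return item
    ```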
    

      

    7. Enable the pipeline (settings.py)

    ……
    ITEM_PIPELINES = {'meiju100.pipelines.Meiju100Pipeline': 1}

      The value (here 1) is the pipeline's order: Scrapy runs enabled pipelines in ascending order of this number, which must be in the 0-1000 range.
    

      

    8. Run the spider

    scrapy crawl meiju
    

      

    9. Results

    10. Code download

     http://files.cnblogs.com/files/kongzhagen/meiju100.zip

  • Original post: https://www.cnblogs.com/kongzhagen/p/6402335.html