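  • Sina News Category Crawler

          Get the complete project from GitHub (https://github.com/daleyzou/sinainfo.git)

    1. Introduction

    Starting from the Sina navigation page, crawl every parent category, every subcategory, the article links found on each subcategory page, and the news content of each linked article.

    Demo screenshot: [sinaData]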


    2. Code

    items.py

      import scrapy


      class SinainfoItem(scrapy.Item):
          # Title and URL of the parent category
          parentTitle = scrapy.Field()
          parentUrls = scrapy.Field()

          # Title and URL of the subcategory
          subTitle = scrapy.Field()
          subUrls = scrapy.Field()

          # Directory path where the subcategory's files are stored
          subFilename = scrapy.Field()

          # Article link found under the subcategory
          sonUrls = scrapy.Field()

          # Article title and content
          head = scrapy.Field()
          content = scrapy.Field()

    spiders/sina.py (the spider)

      # -*- coding: utf-8 -*-
      import os
      import sys

      import scrapy

      # noinspection PyUnresolvedReferences
      from sinainfo.items import SinainfoItem

      # Python 2 only: make utf-8 the default encoding so that Chinese text
      # can be used in paths and written to files without explicit encoding
      reload(sys)
      sys.setdefaultencoding('utf-8')


      class SinaSpider(scrapy.Spider):
          name = 'sina'
          allowed_domains = ['sina.com.cn']
          start_urls = ['http://news.sina.com.cn/guide/']

          def parse(self, response):
              items = []
              # Titles and URLs of all parent categories
              parentTitle = response.xpath("//div[@id='tab01']/div/h3/a/text()").extract()
              parentUrls = response.xpath('//div[@id="tab01"]/div/h3/a/@href').extract()

              # URLs and titles of all subcategories
              subUrls = response.xpath('//div[@id="tab01"]/div/ul/li/a/@href').extract()
              subTitle = response.xpath('//div[@id="tab01"]/div/ul/li/a/text()').extract()

              # Walk through all parent categories
              for i in range(len(parentTitle)):
                  # Directory path and name for this parent category
                  parentFilename = "./Data/" + parentTitle[i]
                  # Create the directory if it does not exist
                  if not os.path.exists(parentFilename):
                      os.makedirs(parentFilename)

                  # Walk through all subcategories
                  for j in range(len(subUrls)):
                      item = SinainfoItem()
                      # Store the parent category's title and URL
                      item['parentTitle'] = parentTitle[i]
                      item['parentUrls'] = parentUrls[i]
                      # True if the subcategory URL starts with this parent category's URL
                      if_belong = subUrls[j].startswith(item['parentUrls'])
                      # If it belongs to this parent category, store it under the parent's directory
                      if if_belong:
                          subFilename = parentFilename + '/' + subTitle[j]
                          # Create the directory if it does not exist
                          if not os.path.exists(subFilename):
                              os.makedirs(subFilename)
                          # Store the subcategory's URL, title and directory path
                          item['subUrls'] = subUrls[j]
                          item['subTitle'] = subTitle[j]
                          item['subFilename'] = subFilename
                          items.append(item)

              # Request each subcategory URL; the response, together with the item
              # passed in meta, is handed to the second_parse() callback
              for item in items:
                  yield scrapy.Request(url=item['subUrls'],
                                       meta={'meta_1': item}, callback=self.second_parse)

          # For each subcategory page returned, issue further requests for its article links
          def second_parse(self, response):
              # Retrieve the item passed along in this response's meta
              meta_1 = response.meta['meta_1']
              # Collect every link on the subcategory page
              sonUrls = response.xpath('//a/@href').extract()

              items = []
              for i in range(len(sonUrls)):
                  # True if the link starts with the parent category's URL and ends with .shtml
                  if_belong = sonUrls[i].endswith('.shtml') and sonUrls[i].startswith(
                      meta_1['parentUrls'])
                  # If it belongs to this parent category, copy the fields into one item for passing on
                  if if_belong:
                      item = SinainfoItem()
                      item['parentTitle'] = meta_1['parentTitle']
                      item['parentUrls'] = meta_1['parentUrls']
                      item['subTitle'] = meta_1['subTitle']
                      item['subUrls'] = meta_1['subUrls']
                      item['subFilename'] = meta_1['subFilename']
                      item['sonUrls'] = sonUrls[i]
                      items.append(item)

              for item in items:
                  yield scrapy.Request(url=item['sonUrls'],
                                       meta={'meta_2': item}, callback=self.detail_parse)

          # Parse the article page and extract its title and content
          def detail_parse(self, response):
              item = response.meta['meta_2']
              # Article title as a string (None if the page has no such heading)
              head = response.xpath('//h1[@id="main_title"]/text()').extract_first()
              content_list = response.xpath('//div[@id="artibody"]/p/text()').extract()
              # Merge the text of all <p> tags into one string
              content = "".join(content_list)
              item['head'] = head
              item['content'] = content
              # Hand the finished item to the pipeline
              yield item
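
    The spider decides where a link belongs purely by string prefix: a subcategory is filed under a parent category when its URL starts with the parent's URL, and a link counts as an article when it also ends in .shtml. The sketch below isolates that logic so it can be tried without Scrapy. The function names and the sample URLs are hypothetical and exist only for illustration.

```python
def group_subcategories(parent_urls, sub_urls):
    """Map each parent URL to the subcategory URLs that start with it."""
    groups = {parent: [] for parent in parent_urls}
    for sub in sub_urls:
        for parent in parent_urls:
            # Same prefix test as in parse()
            if sub.startswith(parent):
                groups[parent].append(sub)
    return groups


def is_article_link(url, parent_url):
    """Same test as in second_parse(): under the parent URL and ending in .shtml."""
    return url.startswith(parent_url) and url.endswith('.shtml')


# Made-up URLs, for illustration only
parents = ['http://news.sina.com.cn/', 'http://sports.sina.com.cn/']
subs = [
    'http://news.sina.com.cn/china/',
    'http://sports.sina.com.cn/nba/',
    'http://news.sina.com.cn/world/',
]
groups = group_subcategories(parents, subs)
```

    One consequence of the prefix rule is that a subcategory hosted on a different domain from its parent category would not be matched and therefore would not be crawled.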

    pipelines.py

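      class SinainfoPipeline(object):
          def process_item(self, item, spider):
              sonUrls = item['sonUrls']

              # The file name is the middle part of the article URL (without the
              # leading "http://" and the trailing ".shtml"), with "/" replaced
              # by "_", saved as .txt
              filename = sonUrls[7:-6].replace('/', '_')
              filename += ".txt"

              # Write the article content into the subcategory's directory
              with open(item['subFilename'] + '/' + filename, 'w') as fp:
                  fp.write(item['content'])
              return item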
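
    The slice [7:-6] in the pipeline relies on fixed character counts: 7 for "http://" and 6 for ".shtml". The sketch below reproduces that rule next to a variant that does not depend on those counts. Both function names and the sample URLs are hypothetical, and the second function is one possible alternative rather than the project's code.

```python
def slice_filename(url):
    """The pipeline's rule: drop 'http://' (7 chars) and '.shtml' (6 chars)."""
    return url[7:-6].replace('/', '_') + '.txt'


def robust_filename(url):
    """Hypothetical variant that does not depend on fixed character counts."""
    path = url.split('://', 1)[-1]      # strip the scheme, whatever it is
    if path.endswith('.shtml'):
        path = path[:-len('.shtml')]    # strip the extension only if present
    return path.replace('/', '_') + '.txt'


# Made-up URLs, for illustration only
http_url = 'http://news.sina.com.cn/c/2018-01-21/doc-example.shtml'
https_url = 'https://news.sina.com.cn/c/2018-01-21/doc-example.shtml'
```

    For an http:// URL both functions give the same name. For an https:// URL the fixed slice leaves a stray leading "_" in the name, which the variant avoids.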

    settings.py

      BOT_NAME = 'sinainfo'

      SPIDER_MODULES = ['sinainfo.spiders']
      NEWSPIDER_MODULE = 'sinainfo.spiders'

      LOG_LEVEL = 'DEBUG'
      # Crawl responsibly by identifying yourself (and your website) on the user-agent
      USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'
      DOWNLOAD_DELAY = 3
      COOKIES_ENABLED = False

      ITEM_PIPELINES = {
         'sinainfo.pipelines.SinainfoPipeline': 300,
      }
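
    With DOWNLOAD_DELAY = 3, requests to the same site are spaced roughly three seconds apart. Scrapy's RANDOMIZE_DOWNLOAD_DELAY setting is enabled by default, in which case the actual wait is a random value between 0.5 and 1.5 times DOWNLOAD_DELAY. The sketch below only models that range; delay_bounds and sample_delay are hypothetical helper names and are not part of Scrapy.

```python
import random

DOWNLOAD_DELAY = 3  # seconds, as in settings.py


def delay_bounds(delay):
    """Lower and upper bound of the randomized wait, in seconds."""
    return 0.5 * delay, 1.5 * delay


def sample_delay(delay):
    """Draw one wait time from the randomized range (a model, not Scrapy's code)."""
    low, high = delay_bounds(delay)
    return random.uniform(low, high)


low, high = delay_bounds(DOWNLOAD_DELAY)
```

    Under this model each request waits between 1.5 and 4.5 seconds, so a crawl of a few thousand article pages can take several hours.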

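    3. Running

    Method 1:

    (1) Create a main.py file in the project root directory, for debugging
      from scrapy import cmdline
      cmdline.execute('scrapy crawl sina'.split())

    (2) Run the program
      py2 main.py

    Method 2:

    From the command line:

    (1) Change to the project directory /sinainfo/sinainfo/spiders

    (2) Run scrapy crawl sina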

  • Original article: https://www.cnblogs.com/daleyzou/p/8325990.html