zoukankan html css js c++ java

寒假大数据学习笔记十三

　　今天接着昨天，写出了一个crawlspider爬取山西省卫健委官网数据的小例子，当然依旧是json数据存储，并且也没有直接做成数据，只是字符串。

　　爬取的还算成功，但中间出了一点岔子：在最近两天的官网公布疫情感染人数上不再是写出来了，而是直接放图片！！！你说要是表格也就算了，山西省卫健委直接将一张图片上传上去了……这让我怎么分析数据？我上网找了找，倒是找见一个关于图像识别表格数据的博客，但明显对于我来说短时间之内是完不成的，这还真是够麻爪的……算了，一步步来，先把相关数据爬取下来，需要处理时再说，大不了换一个网站爬取嘛。代码如下：

spider.py:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SxgzbdUpgradeSpider(CrawlSpider):
    name = 'sxgzbd_upgrade'
    allowed_domains = ['wjw.shanxi.gov.cn']
    start_urls = ['http://wjw.shanxi.gov.cn/yqfbl05/index.hrh']

    rules = (
        Rule(
            LinkExtractor(
                allow=r'.+yqfbl05/index(_d)?.hrh'),
            follow=True),
        Rule(
            LinkExtractor(
                allow=r'.+wjywl02/(d){5}.hrh'),
            callback="parse_detail",
            follow=False),
    )

    def parse_detail(self, response):
        item = {}
        contexts = response.xpath(
            "//div[@class='ze-art']//text()").getall()
        item['date'] = response.xpath(
            "//div[@class='artxx']/text()").re(r"2020-dd-dd")[0]
        item['contexts'] = "".join(contexts).strip()
        yield item

Items.py:

import scrapy


class SpiderDemoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    date = scrapy.Field()
    contexts = scrapy.Field()

piplines.py:

from scrapy.exporters import JsonLinesItemExporter


class SpiderDemoPipeline(object):
    def __init__(self):
        self.fp = open("gzbddata.json", "wb")
        self.exporter = JsonLinesItemExporter(
            self.fp, ensure_ascii=False, encoding="utf-8")

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.fp.close()

记录我踩过的坑：

1、Setting中一定要开启ITEM_PIPELINES，把它的注释号删掉，不然是不会运行PIPELINE文件的，我快折腾了一个小时了，才猛然发现没开开PIPELINE，怪不得没有文件生成。

2、一定要测试好正则表达式有没有问题，不然从一开始就会什么数据都没有。

3、同样在Setting中，ROBOTSTXT_OBEY要False，既然想要学习爬虫，就一定不要安分守己（笑）。

查看全文

相关阅读:
3123123weq
123132w
Unable to execute dex: Multiple dex files define 解决方法
 Error: leaving XXXX; does not track upstream.
repo sync error: .repo/manifests/: contains uncommitted changes
Can't OPEN Eclipse
ERROR: Current shell is not /bin/bash, please check. Stop.
GIT_ERROR: gpg: Can't check signature: public key not found error: could not verify the tag 'v1.12.4'
GIT_Error: Agent admitted failure to sign —— Permission denied (publickey).
HTTP 与 SOAP 介绍与关系

原文地址：https://www.cnblogs.com/YXSZ/p/12307123.html

热门文章
常问java面试题
 Tomcat_7.x压缩版_环境变量配置(亲测有效)
11
123132w
123132w
123132w
123132w
123132w
123132w
123132w