  • Using the Scrapy crawler framework to read administrative divisions from the Statistics Bureau website (notes for future reference)

    Without noticing it, I've developed a habit: after finishing a piece of work or learning something new, write it up promptly, otherwise it's forgotten after a while.

    Below are my notes on reading the administrative divisions with the Scrapy crawler framework.

    1. SelectorGadget is a handy tool; instructions for downloading and installing it are easy to find online.

    After installation, an icon appears in the top-right corner of the Chrome browser.

    [Screenshot: the SelectorGadget icon in the Chrome toolbar, and the selection view on the provinces page]
    Click that icon in the top-right corner to enter CSS selection mode. (1) Click some page content: the selected content is highlighted in yellow, and the matching CSS selector appears in the control panel. (2) Click a yellow element again and it turns red, which means that element is excluded from the selection.

    As the screenshot shows: the provinces are selected, while the "京ICP备..." filing notice in the footer is excluded.
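
    Once SelectorGadget gives you a selector, you can sanity-check it before writing the spider. Here is a minimal sketch using requests plus Scrapy's Selector (it assumes the 2019 index page is still online and GB-encoded):

    import requests
    from scrapy import Selector

    url = 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2019/index.html'
    resp = requests.get(url)
    resp.encoding = 'gb2312'  # assumption: the page is GB-encoded; adjust if output looks garbled

    sel = Selector(text=resp.text)
    # '.provincetr a' is the selector SelectorGadget highlighted
    for a in sel.css('.provincetr a'):
        print(a.css('::text').get(), a.css('::attr(href)').get())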

    2. Crawling with the Scrapy framework


    # -*- coding: utf-8 -*-
    import scrapy
    from ..items import Class1_Item


    class JgSpider(scrapy.Spider):
        name = 'jgspider'
        allowed_domains = ['stats.gov.cn']
        start_urls = ['http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2019/index.html']

        def parse(self, response):
            print('begin-----------')
            # province level
            for node in response.css('.provincetr a'):
                item = Class1_Item()
                item['name'] = node.css('a::text').get()
                next_page = node.css('a::attr(href)').get()
                item['code'] = next_page.split('.')[0]
                yield item

                if next_page:
                    next_page = response.urljoin(next_page)
                    yield scrapy.Request(next_page, callback=self.parse2)

        def parse2(self, response):
            # city (prefecture) level
            for node in response.css('.citytr'):
                item = Class1_Item()
                item['code'] = node.css('a::text').getall()[0]
                item['name'] = node.css('a::text').getall()[1]
                next_page = node.css('a::attr(href)')[0].get()

                yield item
                if next_page:
                    next_page = response.urljoin(next_page)
                    yield scrapy.Request(next_page, callback=self.parse3)

        def parse3(self, response):
            # county/district level
            for node in response.css('tr.countytr'):
                item = Class1_Item()
                if node.css('td::text').get() is not None:
                    # rows whose cells are bare text (no link); the descendant
                    # ' ::text' drills down to the text nodes at any depth
                    item['code'] = node.css('td ::text').getall()[0]
                    item['name'] = node.css('td ::text').getall()[1]
                else:
                    # rows whose cells are wrapped in <a> links
                    item['code'] = node.css('td a::text').getall()[0]
                    item['name'] = node.css('td a::text').getall()[1]
                yield item

                next_page = node.css('a::attr(href)').get()
                if next_page:
                    # stopping at the county level; not descending to towns/villages
                    pass
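
    The spider imports Class1_Item from the project's items.py. A minimal definition matching how the spider uses it might look like this (a sketch with only the two fields the spider fills in):

    import scrapy

    class Class1_Item(scrapy.Item):
        # administrative division code and its name
        code = scrapy.Field()
        name = scrapy.Field()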

    I only crawled down to the county/district level and did not go on to read towns, townships, and communities. All the important content was downloaded and annotated.

    3. Grasp the essentials: quick understanding leads to a quick start

    Reference: https://djangoadventures.com/crawling-pages-with-scrapy/

    1. Request & Response

    import scrapy
    
    def parse(response):
        # do something with the response
        pass
    
    # making a request with a callback that we've defined above
    scrapy.Request('http://example.com', callback=parse)

    (1) Pass a URL to Request and issue the request.

    (2) A Response object comes back and is passed as the argument to the callback function.

    (3) A callback is a function which will be invoked by the Request object once the request has finished its job. The result of the request’s job is a Response object which is passed to the callback function. A callback is sometimes called a handler.

    Response is basically a string with the HTML code of the page we've requested.

    A Response is essentially the HTML string; the real work is parsing that string and extracting the content we need.
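
    Inside a callback, the Response object gives you that HTML string plus metadata about the finished request. A quick sketch of the commonly used attributes (standard Scrapy Response API):

    def parse(response):
        html = response.text                   # the raw HTML string mentioned above
        print(response.url, response.status)   # which page, and the HTTP status code
        # plus selector shortcuts that parse the HTML for you
        titles = response.css('title::text').getall()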

    2. Selectors

    This is the key part, so I covered it in a separate post on CSS selectors: https://www.cnblogs.com/lxgbky/p/12697801.html
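
    As a quick recap, these are the selector calls this post relies on, demonstrated on a tiny inline document (a sketch using scrapy.Selector directly; a div stands in for the td used on the real pages):

    from scrapy import Selector

    sel = Selector(text='<div><a href="11.html">Beijing</a></div>')
    print(sel.css('a::text').get())        # 'Beijing'   - first matching text node
    print(sel.css('a::attr(href)').get())  # '11.html'   - attribute value
    print(sel.css('div ::text').getall())  # ['Beijing'] - all text nodes at any depth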

    3. Spider

    The spider holds the main crawling logic: this is where you issue requests and parse responses. The main method in a Spider class is start_requests.

    (1) It is the spider's entry point: when the spider is invoked, it launches the start_requests function, which usually contains your initial Request invocations.

    (2) Inside callback functions you can extract further URLs and issue follow-up requests.

    The requests form a tree with its root at start_requests:

    start_requests
        sub requests
            requests
            requests
        sub requests
            requests
            requests

    class ExampleSpider(scrapy.Spider):
    
        def start_requests(self):
            # invoking initial request
            yield scrapy.Request('http://example.com', self.parse)
    
        def parse(self, response):
            # parsing response from the initial request
            # collecting links
            links = response.css('a.title::attr(href)').extract()
            for link in links:
                # make a request for each link we've collected
                # the handler is different from the one we had
                # in initial request
                yield scrapy.Request(link, self.parse_page)
    
        def parse_page(self, response):
            # parsing response from each link
            title = response.css('h1.title::text').extract()
            content = response.css('div.content p').extract()
    
            # returning structured data for further processing
            yield {'title': title, 'content': content}

    This parser flow is so common that the framework provides a shortcut to reduce the boilerplate. Here's a reduced example:

    class SimpleSpider(scrapy.Spider):
        # those are the initial urls that you used to write
        # in a start_requests method. A request will be invoked
        # for each url in this list
        start_urls = ['http://example.com']
    
        # this method will be called automatically for each
        # finished request from the start_urls.
        def parse(self, response):
            # you can either parse the response and return the data
            # or you can collect further urls and create additional
            # handlers for them, like we did with parse_page previously
            pass

    start_requests is omitted; the default start_requests automatically issues a request for each URL in start_urls and passes each response to the parse callback.
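
    In other words, the inherited default behaves roughly like the sketch below (the real implementation also marks the initial requests with dont_filter):

    import scrapy

    class SimpleSpider(scrapy.Spider):
        name = 'simple'
        start_urls = ['http://example.com']

        # roughly what scrapy.Spider does for you when start_requests is omitted
        def start_requests(self):
            for url in self.start_urls:
                yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            pass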

    4. Pipeline

    The pipeline is where the data gets its final cleanup and is saved. It has three methods:

    (1) process_item, which is invoked for each piece of data returned from the spider. In this function we usually clean the data or save it to a database.

    (2) open_spider and close_spider.

    See the code comments for how each method is used.

    from markparser.storage import get_session, Place


    class MarkparserPipeline(object):

        def open_spider(self, spider):
            # this method is invoked once the spider
            # is initialized. No requests have been
            # made at this point yet
            self.session = get_session()

        def close_spider(self, spider):
            # this method is invoked when the spider
            # is about to exit. All requests have been
            # made already.
            self.session.close()

        def process_item(self, item, spider):
            # here we place our item processing logic
            # we can either modify our data and pass it on
            # for further processing or we can save this
            # item to a database and finish the execution
            record = Place(**item)
            self.session.add(record)
            self.session.commit()
            return item

    Below is the pipeline for the administrative-division download. It is integrated with Django and uses a Django model to save the generated items.

    from jgapp.models import Jg


    class ScrapyProjectPipeline(object):
        def process_item(self, item, spider):
            # skip divisions that are already in the database
            try:
                jg = Jg.objects.get(p_code=item['code'])
                print("jg already exists")
                return item
            except Jg.DoesNotExist:
                pass

            # save the new division through the Django ORM
            jg = Jg()
            jg.p_code = item["code"]
            jg.name = item["name"]
            jg.save()
            return item
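
    The pipeline above assumes a Django model roughly like the sketch below (field lengths are guesses; unique=True mirrors the duplicate check in process_item):

    from django.db import models

    class Jg(models.Model):
        # division code and name, matching the item fields the spider yields
        p_code = models.CharField(max_length=20, unique=True)
        name = models.CharField(max_length=100)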

    The key to combining Scrapy and Django is adding code to Scrapy's settings.py that boots Django:

    import os
    import sys
    import django

    # make the django project importable from the scrapy side,
    # then point django at its settings module and boot it
    sys.path.append(os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), ".."))
    sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
    os.environ['DJANGO_SETTINGS_MODULE'] = 'django_project.settings'

    django.setup()
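
    The pipeline itself also has to be enabled in the same settings.py. The module path below is a guess based on the class name above, so adjust it to your project layout:

    # hypothetical module path; point it at wherever ScrapyProjectPipeline lives
    ITEM_PIPELINES = {
        'scrapy_project.pipelines.ScrapyProjectPipeline': 300,
    }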

    Done!
