  • Winter Break Study 14

    After several days of work, I finally finished crawling the 首都之窗 (Capital Window, Beijing's government portal) data today. A summary:

    I split the 首都之窗 crawl into two steps:

    Step 1: use Selenium to drive a browser through the list pages and scrape each entry, mainly to collect the detail-page URLs (see the previous post for a full walkthrough; a sketch of the Selenium middleware follows the spider code below).

    spider.py

    # -*- coding: utf-8 -*-
    import scrapy
    from bjnew.items import BjnewItem

    class SdzcSpider(scrapy.Spider):
        name = 'sdzc'

        def start_requests(self):
            urls = [
                'http://www.beijing.gov.cn/hudong/hdjl/com.web.search.mailList.flow'
            ]
            for url in urls:
                # meta['type'] == '0' tells the Selenium middleware to load the first list page
                yield scrapy.Request(url=url, meta={'type': '0'}, callback=self.parse_page, dont_filter=True)

        def parse_page(self, response):
            # the middleware reports via flags whether any pages remain; stop when it says 0
            print(response.flags)
            if int(response.flags[0]) == 0:
                return
            items = response.xpath('//*[@id="mailul"]/div')
            # log the current page indicator as a progress marker
            print(response.xpath('//*[@id="page_span2"]/span[2]/text()').extract_first().strip())
            for it in items:
                item = BjnewItem()
                title = it.xpath('./div[1]/a/span/text()').extract_first().strip()
                print(title)
                # letter type: 咨询 / 建议 / 投诉 (strip the surrounding ·【 decoration)
                zl = it.xpath('./div[1]/a/font/text()').extract_first().strip('·【')
                # the onclick attribute embeds the detail page's originalId; its
                # position depends on the attribute's length
                info0 = it.xpath('./div[1]/a/@onclick').extract_first()
                if len(info0) == 34:
                    info = info0[19:32]
                else:
                    info = info0[19:27]
                state = it.xpath('./div/div/div/div/text()').extract_first().strip()
                # build the detail-page url for the matching letter type
                if zl == "咨询":
                    url = "http://www.beijing.gov.cn/hudong/hdjl/com.web.consult.consultDetail.flow?originalId=" + info
                elif zl == "建议":
                    url = "http://www.beijing.gov.cn/hudong/hdjl/com.web.suggest.suggesDetail.flow?originalId=" + info
                elif zl == "投诉":
                    url = "http://www.beijing.gov.cn/hudong/hdjl/com.web.complain.complainDetail.flow?originalId=" + info
                else:
                    continue  # unknown letter type: skip instead of reusing a stale url
                print(url)
                item['url'] = url
                item['title'] = title
                item['zl'] = zl
                item['info'] = info
                item['state'] = state
                yield item
            # same url again; with meta['type'] == '1' the middleware clicks "next page"
            yield scrapy.Request(url=response.url, meta={'type': '1'}, callback=self.parse_page, dont_filter=True)
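    The spider above leans on a Selenium downloader middleware (covered in the previous post) that drives a real browser through the pagination and uses response.flags to tell the spider whether any pages remain. The original middleware isn't reproduced here; what follows is a minimal sketch of the idea, assuming ChromeDriver, a "下一页" (next page) link, and the meta['type'] / response.flags conventions used above. The class name and the pagination selector are my assumptions.

    middlewares.py (sketch)

    # Sketch only: not the original project's middleware.
    from scrapy.http import HtmlResponse
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.common.exceptions import NoSuchElementException

    class SeleniumMiddleware:
        def __init__(self):
            self.driver = webdriver.Chrome()

        def process_request(self, request, spider):
            if request.meta.get('type') == '0':
                # first request: load the list page in the browser
                self.driver.get(request.url)
            else:
                # later requests: click "next page" in the already-open browser;
                # a real implementation would add an explicit wait after the click
                try:
                    self.driver.find_element(By.LINK_TEXT, '下一页').click()
                except NoSuchElementException:
                    # no "next page" link left: tell the spider to stop via flags
                    return HtmlResponse(url=request.url, body=self.driver.page_source,
                                        encoding='utf-8', request=request, flags=['0'])
            # hand the rendered page back to the spider; flags[0] == '1' means keep going
            return HtmlResponse(url=request.url, body=self.driver.page_source,
                                encoding='utf-8', request=request, flags=['1'])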

    Step 2: again with the Scrapy framework, read the detail-page URLs from the JSON file produced yesterday and crawl each one.

    A few notes:

    1. Yesterday's run covered 5,600+ list pages and 30,000+ records, but checking the site today showed a few dozen pages missing (presumably deleted by the site), so the code adds the following check:

    # use the page title to check whether the page still exists
    if pan == "-首都之窗-北京市政务门户网站":
        return  # a bare `yield` would emit a None item; return skips the page

    2. During the crawl, an extraction error came up on one detail page: the page had a question but no answer, so pulling the answer fields failed. I considered adding a check so it could still be scraped, but the page no longer appears on any list page; the site had deleted only the list-page link while leaving the page itself, so in the end it was simply not crawled.
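    For reference, pages like that could also be skipped defensively rather than avoided. This is my own sketch, not the original code: extract_first() returns None when a node is missing, and calling .strip() on None raises AttributeError, which is the likely failure mode here; asking for a default sidesteps the crash.

    # sketch only: skip detail pages that have a question but no answer yet
    # (xpath taken from the 建议 branch of the spider below)
    answer_sel = response.xpath('/html/body/div[2]/div/div[2]/div[2]/div/div[1]/div[2]')
    answer_text = answer_sel.xpath('string(.)').extract_first(default='').strip()
    if not answer_text:
        return  # no answer published yet: skip the page instead of crashing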

    The final, complete code:

    # -*- coding: utf-8 -*-
    import scrapy
    import json
    from beijing4.items import Beijing4Item

    class BjSpider(scrapy.Spider):

        name = 'bj'
        allowed_domains = ['beijing.gov.cn']

        # read the detail-page urls from the json file produced by the first spider
        def start_requests(self):
            with open('F:/Users/Lenovo/json/info.json', 'r', encoding='utf8') as fp:
                json_data = json.load(fp)
                for i in range(len(json_data)):
                    yield scrapy.Request(url=json_data[i]['url'])

        # detail-page parser
        def parse(self, response):
            item = Beijing4Item()
            # use the page title to check whether the page still exists
            pan = response.xpath('/html/head/title/text()').extract_first()
            if pan == "-首都之窗-北京市政务门户网站":
                return  # deleted page; a bare `yield` would emit a None item
            # the three page types have different layouts, so branch on the url
            if response.url.find("com.web.suggest.suggesDetail.flow") >= 0:
                title = response.xpath("/html/body/div[2]/div/div[2]/div[1]/div[1]/div[2]/strong/text()").extract_first().strip()
                question = response.xpath('/html/body/div[2]/div/div[2]/div[1]/div[3]/text()').extract_first().strip()
                # note: strip('来信人: ') removes that *set* of characters from both
                # ends, not the literal prefix; it works here but is fragile
                questioner = response.xpath('/html/body/div[2]/div/div[2]/div[1]/div[2]/div[1]/text()').extract_first().strip('来信人: ').strip()
                q_time = response.xpath('/html/body/div[2]/div/div[2]/div[1]/div[2]/div[2]/text()').extract_first().strip('时间:')
                s_q_num = response.xpath('/html/body/div[2]/div/div[2]/div[1]/div[2]/div[3]/label/text()').extract_first().strip()
                # the answerer may be split across several text nodes
                answerer0 = response.xpath('/html/body/div[2]/div/div[2]/div[2]/div/div[1]/div[1]/div[2]/text()').extract()
                answerer = ''.join([r.strip() for r in answerer0])
                a_time = response.xpath('/html/body/div[2]/div/div[2]/div[2]/div/div[1]/div[1]/div[3]/text()').extract_first().strip('答复时间:')
                # string(.) flattens all text under the answer node; split/join collapses whitespace
                answer = ''.join(response.xpath('/html/body/div[2]/div/div[2]/div[2]/div/div[1]/div[2]').xpath('string(.)').extract_first().strip().split())
                item['kind'] = '建议'

            elif response.url.find("com.web.consult.consultDetail.flow") >= 0:
                title = response.xpath('//*[@id="f_baner"]/div[1]/div[1]/div[2]/strong/text()').extract_first().strip()
                question = response.xpath('//*[@id="f_baner"]/div[1]/div[3]/text()').extract_first().strip()
                questioner = response.xpath('//*[@id="f_baner"]/div[1]/div[2]/div[1]/text()').extract_first().strip('来信人: ').strip()
                q_time = response.xpath('//*[@id="f_baner"]/div[1]/div[2]/div[2]/text()').extract_first().strip('时间:')
                s_q_num = response.xpath('//*[@id="f_baner"]/div[1]/div[2]/div[3]/label/text()').extract_first().strip()
                answerer0 = response.xpath('//*[@id="f_baner"]/div[2]/div/div[1]/div[1]/div[2]/text()').extract()
                answerer = ''.join([r.strip() for r in answerer0])
                a_time = response.xpath('//*[@id="f_baner"]/div[2]/div/div[1]/div[1]/div[3]/text()').extract_first().strip('答复时间:')
                answer = ''.join(response.xpath('//*[@id="f_baner"]/div[2]/div/div[1]/div[2]').xpath('string(.)').extract_first().strip().split())
                item['kind'] = '咨询'

            elif response.url.find("com.web.complain.complainDetail.flow") >= 0:
                title = response.xpath('//*[@id="Dbanner"]/div[1]/div[1]/div[2]/strong/text()').extract_first().strip()
                question = response.xpath('//*[@id="Dbanner"]/div[1]/div[3]/text()').extract_first().strip()
                questioner = response.xpath('//*[@id="Dbanner"]/div[1]/div[2]/div[1]/text()').extract_first().strip('来信人: ').strip()
                q_time = response.xpath('//*[@id="Dbanner"]/div[1]/div[2]/div[2]/text()').extract_first().strip('时间:')
                s_q_num = response.xpath('//*[@id="Dbanner"]/div[1]/div[2]/div[3]/label/text()').extract_first().strip()
                answerer0 = response.xpath('//*[@id="Dbanner"]/div[2]/div[1]/div[1]/div[1]/div[2]/span/text()').extract()
                answerer = ''.join([r.strip() for r in answerer0])
                a_time = response.xpath('//*[@id="Dbanner"]/div[2]/div[1]/div[1]/div[1]/div[3]/text()').extract_first().strip('答复时间:')
                answer = ''.join(response.xpath('//*[@id="Dbanner"]/div[2]/div[1]/div[1]/div[2]').xpath('string(.)').extract_first().strip().split())
                item['kind'] = '投诉'

            else:
                return  # unrecognized url pattern

            # the three branches fill the same fields, so the item is assembled once here
            item['title'] = title
            item['question'] = question
            item['questioner'] = questioner
            item['q_time'] = q_time
            item['s_q_num'] = s_q_num
            item['answerer'] = answerer
            item['a_time'] = a_time
            item['answer'] = answer
            yield item
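    The post doesn't show items.py; inferred from the fields assigned in parse() above, a minimal Beijing4Item would look like the sketch below. The per-field comments are my reading of the names, not from the original.

    import scrapy

    class Beijing4Item(scrapy.Item):
        kind = scrapy.Field()        # letter type: 咨询 / 建议 / 投诉
        title = scrapy.Field()
        question = scrapy.Field()
        questioner = scrapy.Field()  # sender of the letter
        q_time = scrapy.Field()      # submission time
        s_q_num = scrapy.Field()     # count shown next to the letter
        answerer = scrapy.Field()    # responding agency
        a_time = scrapy.Field()      # reply time
        answer = scrapy.Field()

    The spider is then run in the usual way with scrapy crawl bj.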

    That completes the crawling stage. Because the data format was already normalized when the data was written to the JSON file, the cleaning step is much simpler; each record carries the fields populated above (kind, title, question, questioner, q_time, s_q_num, answerer, a_time, answer).
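    The post doesn't show how the items were written out; one simple approach (an assumption on my part, not necessarily what this project used) is a small pipeline that writes one JSON record per line:

    # pipelines.py (sketch): write each item as one JSON line
    import json

    class JsonWriterPipeline:
        def open_spider(self, spider):
            self.fp = open('result.json', 'w', encoding='utf8')

        def process_item(self, item, spider):
            # ensure_ascii=False keeps Chinese characters unescaped in the file
            self.fp.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
            return item

        def close_spider(self, spider):
            self.fp.close()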

    One more observation: yesterday's crawl of 5,600-odd pages took over four hours, while today's 30,000+ pages took about an hour; presumably the difference is the overhead of driving a browser with Selenium.

    Next comes the visualization.
