  • Scraping Hangzhou Kindergarten Information with Python


    I. Preparation Before Scraping

    1. Use PyCharm as the IDE

    2. Install the required libraries: requests, xlsxwriter and beautifulsoup4 (re is part of the Python standard library); they can be installed with pip install requests xlsxwriter beautifulsoup4

    3. Analyze the page structure of the Hangzhou education map site

    As shown in the figure, each listing page consists of three important parts: the district selector at the top, the school list in the middle, and the pagination at the bottom. All three parts appear directly in the page source, so no browser headers or cookie validation are needed.
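
    A quick sanity check of that observation (a minimal sketch; the status code and content depend on the live site) is that the listing page answers a plain GET with no extra headers or cookies:

    import requests

    resp = requests.get("http://hzjiaoyufb.hangzhou.com.cn/school_list.php", params={'grade_type': '1'})
    print(resp.status_code)  # expect 200 if the page is reachable
    print(len(resp.text))    # the raw HTML should already contain the district tabs and school list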


    II. Scraping the Information

    1. Import the required libraries

    import requests
    import re
    import xlsxwriter
    from bs4 import BeautifulSoup

    2. Send the request

    def get_soup(url, param):
        response = requests.get(url, params=param)
        soup = BeautifulSoup(response.text, 'html.parser')
        return soup

        param is the query string appended to the URL; on this site the base URL never changes, and different districts and schools are reached purely by varying the parameters.
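
    To see how the params dict maps onto the final URL, requests exposes the assembled address via response.url (the district value below is only an illustrative example; the real values are taken from the page later on):

    import requests

    url = "http://hzjiaoyufb.hangzhou.com.cn/school_list.php"
    response = requests.get(url, params={'grade_type': '1', 'area_type': '上城区', 'page': '1'})
    print(response.url)  # .../school_list.php?grade_type=1&area_type=...&page=1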

    3. Get the list of districts and store it in a list

        Inspecting the district section shows that every district name sits in an <a> tag inside an <li> whose role attribute is "presentation".

        The parameter s is the BeautifulSoup object for the page; we loop over the result set returned by find_all and store every district except '全部' (All) in the list res_areas.

    def get_area(s):
        res_areas = []
        areas = s.find_all(name='li', attrs={"role": "presentation"})
        for area in areas:
            t = area.find('a').string
            if t != '全部':
                res_areas.append(t)
        return res_areas
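
    For illustration, the function can be combined with get_soup like this (a sketch; the district names themselves come from the live page):

    url = "http://hzjiaoyufb.hangzhou.com.cn/school_list.php"
    districts = get_area(get_soup(url, {'grade_type': '1'}))
    print(districts)  # every tab label except '全部' (All)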

    4. Get the total number of result pages for a district

    def get_page_num(s):
        r = s.find_all(name="div", attrs={"class": re.compile(r'page')})[0]
        if r.find("strong") is None:
            return 0
        else:
            n = r.find("strong").find_next_siblings()[0].get_text()
            return int(n)
    

        The current page number is wrapped in a <strong> tag and the total page count follows it, both inside an outer div, so they are easy to locate. The None check guards against districts that have no schools at all; without it the program would raise an error.
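
    As a self-contained check of the parsing logic, get_page_num can be exercised on a hand-written snippet (the markup below is only an assumption about the real pagination HTML, which may differ in detail):

    from bs4 import BeautifulSoup

    sample = '<div class="page_div"><strong>1</strong><a>5</a></div>'
    print(get_page_num(BeautifulSoup(sample, 'html.parser')))  # -> 5

    empty = '<div class="page_div"></div>'
    print(get_page_num(BeautifulSoup(empty, 'html.parser')))   # -> 0 (district with no schools)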

    5. Write the main function and export the results to xlsx
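
    The full main function is attached at the end of the post; its xlsxwriter part boils down to the pattern below, a minimal sketch with placeholder data:

    import xlsxwriter

    workbook = xlsxwriter.Workbook('demo.xlsx')
    worksheet = workbook.add_worksheet()
    bold = workbook.add_format({'bold': True})
    worksheet.write('A1', '学校名称', bold)        # bold header row, addressed by cell name
    worksheet.write(1, 0, 'Example Kindergarten')  # data rows, addressed by (row, col) index
    workbook.close()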

    III. Summary

    1. The script extracts information on every kindergarten in Hangzhou quickly, saving a great deal of manual effort.

    2. The page structure is simple, so extraction is relatively easy.

    Full source code:

    import requests
    import re
    import xlsxwriter
    from bs4 import BeautifulSoup
    
    
    # Send the request and return the parsed page as a BeautifulSoup object
    def get_soup(url, param):
        response = requests.get(url, params=param)
        soup = BeautifulSoup(response.text, 'html.parser')
        return soup
    
    
    # Get the number of result pages for a district
    def get_page_num(s):
        r = s.find_all(name="div", attrs={"class": re.compile(r'page')})[0]
        if r.find("strong") is None:
            return 0
        else:
            n = r.find("strong").find_next_siblings()[0].get_text()
            return int(n)
    
    
    # Build the URL parameters for a given grade, district and page
    def get_param(grade, area, page):
        para = {'grade_type': grade, 'area_type': area, "page": page}
        return para
    
    
    # Get the list of districts (every tab except '全部')
    def get_area(s):
        res_areas = []
        areas = s.find_all(name='li', attrs={"role": "presentation"})
        for area in areas:
            t = area.find('a').string
            if t != '全部':
                res_areas.append(t)
        return res_areas
    
    
    def main():
        url = "http://hzjiaoyufb.hangzhou.com.cn/school_list.php"
        soup = get_soup(url, {'grade_type': '1'})
        # Initialise the xlsx workbook
        print('Initialising the xlsx file...')
        workbook = xlsxwriter.Workbook('school.xlsx')
        worksheet = workbook.add_worksheet()
        bold = workbook.add_format({'bold': True})
        worksheet.write('A1', '学校名称', bold)  # school name
        worksheet.write('B1', '学校地址', bold)  # address
        worksheet.write('C1', '学校网址', bold)  # website
        worksheet.write('D1', '学校电话', bold)  # phone
        worksheet.write('E1', '学校微信', bold)  # WeChat
        worksheet.write('F1', '学校微博', bold)  # Weibo
        worksheet.write('G1', '班级数目', bold)  # number of classes
        worksheet.write('H1', '学校类型', bold)  # school type
        worksheet.write('I1', '学校层次', bold)  # school level
        worksheet.write('J1', '地区', bold)      # district
        # Iterate over districts and result pages to collect every detail-page link into arr
        arr = []   # detail-page URLs
        area = []  # district of each school
        school_name = []
        school_location = []
        school_website = []
        school_tel = []
        school_wx = []
        school_nature = []
        school_class = []
        school_wb = []
        school_type = []
        school_level = []
        print('Collecting all districts...')
        for res_area in get_area(soup):
            soup = get_soup(url, get_param('1', res_area, '1'))
            for num in range(get_page_num(soup)):
                # request each result page in turn (page numbers are assumed to start at 1)
                soup = get_soup(url, get_param('1', res_area, num + 1))
                schools = soup.find_all('div', class_="pInfo")
                for school in schools:
                    arr.append('http://hzjiaoyufb.hangzhou.com.cn/' + school.find('a').attrs['href'])
                    area.append(res_area)
        # Visit each detail-page URL in arr and extract the school's fields
        print('Fetching data for every school...')
        for link in arr:
            response = requests.get(link)
            soup = BeautifulSoup(response.text, 'html.parser')
            panel1 = soup.find('h2').text
            panel2 = soup.find_all(name='div', attrs='panel-body')
            school_name.append(panel1)
            array = []
            for panel in panel2:
                if panel.find('h6') is not None:
                    array.append(panel.find('h6').text.strip())
            school_location.append(array[1])
            school_website.append(array[4])
            school_tel.append(array[5])
            school_wx.append(array[6])
            school_wb.append(array[7])
            school_nature.append(array[8])
            school_type.append(array[9])
            school_level.append(array[10])
            school_class.append(array[11])
        row = 1
        print('Writing the xlsx file...')
        for i in range(len(school_name)):
            worksheet.write(row, 0, school_name[i])
            worksheet.write(row, 1, school_location[i])
            worksheet.write(row, 2, school_website[i])
            worksheet.write(row, 3, school_tel[i])
            worksheet.write(row, 4, school_wx[i])
            worksheet.write(row, 5, school_wb[i])
            worksheet.write(row, 6, school_class[i])
            worksheet.write(row, 7, school_type[i])
            worksheet.write(row, 8, school_level[i])
            worksheet.write(row, 9, area[i])
            row += 1
        workbook.close()
    
    
    if __name__ == '__main__':
        main()
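
    Running the script produces school.xlsx in the working directory, with one row per kindergarten and the column headers set up in main.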
    
  • Original post: https://www.cnblogs.com/asdlijian/p/13514192.html