Scraping recruitment-site data with Python and visualizing it

    Basic development environment

    · Python 3.6

    · PyCharm

    Modules used

    Scraping modules

    import requests
    import re
    import parsel
    import csv

    Word-cloud modules

    import jieba
    
    import wordcloud
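
    All of the third-party packages used in this article, including pandas, pyecharts, and imageio which appear further down, can be installed with pip:

    pip install requests parsel jieba wordcloud pandas pyecharts imageio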

    Target page analysis

    In the browser's developer tools you can see that the search results returned by the server are embedded in window.__SEARCH_RESULT__, so the data can be extracted with a regular expression, as shown below:

    https://jobs.51job.com/beijing/120995776.html?s=01&t=0

    Every job posting's detail page has a corresponding ID. Extract the IDs with a regular expression, splice each one into the detail-page URL, and then scrape the posting data from that page:

    response = requests.get(url=url, headers=headers)    # url: the 51job search-results page
    lis = re.findall(r'"jobid":"(\d+)"', response.text)    # extract every job ID from the embedded data
    for li in lis:
        page_url = 'https://jobs.51job.com/beijing-hdq/{}.html?s=01&t=0'.format(li)

    Although the site serves static pages, the response text comes back garbled, so the encoding has to be converted during scraping: requests guesses the encoding from the HTTP headers, while response.apparent_encoding re-detects it from the page content itself.

    f = open('招聘.csv', mode='a', encoding='utf-8', newline='')
    csv_writer = csv.DictWriter(f, fieldnames=['标题', '地区', '工作经验', '学历', '薪资', '福利', '招聘人数', '发布日期'])
    csv_writer.writeheader()
    response = requests.get(url=page_url, headers=headers)
    response.encoding = response.apparent_encoding    # fix the garbled encoding
    selector = parsel.Selector(response.text)
    title = selector.css('.cn h1::text').get()      # job title
    salary = selector.css('div.cn strong::text').get()       # salary
    welfare = selector.css('.jtag div.t1 span::text').getall()       # benefits
    welfare_info = '|'.join(welfare)
    data_info = selector.css('.cn p.msg.ltype::attr(title)').get().split('  |  ')
    area = data_info[0]         # location
    work_experience = data_info[1]      # work experience
    educational_background = data_info[2]       # education
    number_of_people = data_info[3]     # number of openings
    release_date = data_info[-1].replace('发布', '')     # posting date
    all_info_list = selector.css('div.tCompany_main > div:nth-child(1) > div p span::text').getall()
    all_info = '\n'.join(all_info_list)    # job-description text, one line per bullet
    dit = {
        '标题': title,
        '地区': area,
        '工作经验': work_experience,
        '学历': educational_background,
        '薪资': salary,
        '福利': welfare_info,
        '招聘人数': number_of_people,
        '发布日期': release_date,
    }
    csv_writer.writerow(dit)
    with open('招聘信息.txt', mode='a', encoding='utf-8') as f:
        f.write(all_info)

    The steps above complete the scraping of the job-posting data.
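
    Since the snippets above are fragments, here is a minimal sketch of how they fit together. The search URL is a placeholder (it is not given in the original), the User-Agent is an assumption, and the detail-page parsing is the code already shown:

    import csv
    import re
    import time

    import requests

    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}   # assumed browser-like UA

    with open('招聘.csv', mode='a', encoding='utf-8', newline='') as f:
        csv_writer = csv.DictWriter(f, fieldnames=['标题', '地区', '工作经验', '学历', '薪资', '福利', '招聘人数', '发布日期'])
        csv_writer.writeheader()
        search_url = 'https://search.51job.com/'   # placeholder -- the real search-results URL is not given above
        response = requests.get(url=search_url, headers=headers)
        for job_id in set(re.findall(r'"jobid":"(\d+)"', response.text)):   # set() drops duplicate IDs
            page_url = 'https://jobs.51job.com/beijing-hdq/{}.html?s=01&t=0'.format(job_id)
            # fetch page_url and write one row, exactly as in the detail-page code above
            time.sleep(1)   # small pause between requests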

    Quick-and-dirty data cleaning

    Salary:

    import pandas as pd

    content = pd.read_csv(r'D:\python\demo\数据分析\招聘\招聘.csv', encoding='utf-8')
    salary = content['薪资']
    salary_1 = salary[salary.notnull()]    # drop rows with no salary
    salary_count = salary_1.value_counts()    # count each distinct salary string
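
    The counts above treat every distinct salary string as its own category. If numeric values are needed, a rough parser helps; this is an added sketch, assuming the usual 51job format such as '1-1.5万/月' or '8-15万/年':

    import re

    def parse_salary(text):
        """Roughly parse a 51job salary string such as '1-1.5万/月' into a
        (low, high) monthly range in thousands of RMB; None if unrecognised."""
        m = re.match(r'([\d.]+)-([\d.]+)(万|千)/(月|年)', str(text))
        if not m:
            return None
        low, high, unit, period = m.groups()
        factor = 10 if unit == '万' else 1           # 1万 = 10 thousand, 1千 = 1 thousand
        low, high = float(low) * factor, float(high) * factor
        if period == '年':                            # convert a yearly range to monthly
            low, high = low / 12, high / 12
        return low, high

    print(parse_salary('1-1.5万/月'))    # (10.0, 15.0), i.e. 10k-15k RMB per month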

    Education requirements:

    import pandas as pd
    from pyecharts.charts import Bar

    content = pd.read_csv(r'D:\python\demo\数据分析\招聘\招聘.csv', encoding='utf-8')
    educational_background = content['学历']
    educational_background_1 = educational_background[educational_background.notnull()]
    educational_background_count = educational_background_1.value_counts().head()
    print(educational_background_count)
    bar = Bar()
    bar.add_xaxis(educational_background_count.index.tolist())
    bar.add_yaxis("学历", educational_background_count.values.tolist())
    bar.render('bar.html')    # writes an interactive chart; open it in a browser
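
    The education distribution also reads naturally as a pie chart; a small added variation on the code above, assuming pyecharts v1's Pie:

    from pyecharts.charts import Pie

    # (label, count) pairs built from the value counts computed above
    data_pair = [(k, int(v)) for k, v in educational_background_count.items()]
    pie = Pie()
    pie.add("学历", data_pair)
    pie.render('pie.html')    # open the generated HTML file in a browser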

    Work experience:

    content = pd.read_csv(r'D:\python\demo\数据分析\招聘\招聘.csv', encoding='utf-8')
    work_experience = content['工作经验']
    work_experience_count = work_experience.value_counts()
    print(work_experience_count)
    bar = Bar()
    bar.add_xaxis(work_experience_count.index.tolist())
    bar.add_yaxis("经验要求", work_experience_count.values.tolist())
    bar.render('bar.html')    # note: this overwrites the education chart; pass a different filename to keep both

    Word-cloud analysis of the required technical skills

    import re

    import imageio
    import jieba
    import wordcloud

    py = imageio.imread("python.png")    # mask image that shapes the cloud
    f = open('python招聘信息.txt', encoding='utf-8')    # keyword file written during the crawl

    re_txt = f.read()
    result = re.findall(r'[a-zA-Z]+', re_txt)    # keep only the English words (the tech keywords)
    txt = ' '.join(result)

    # jieba word segmentation
    txt_list = jieba.lcut(txt)
    string = ' '.join(txt_list)
    # word-cloud settings
    wc = wordcloud.WordCloud(
            width=1000,         # image width
            height=700,         # image height
            background_color='white',   # background colour
            font_path='msyh.ttc',    # font (Microsoft YaHei), needed to render the text
            mask=py,     # mask image for the cloud shape
            scale=15,
            stopwords={' '},
            # contour_width=5,
            # contour_color='red'  # outline colour
    )
    # feed the text into the word cloud
    wc.generate(string)
    # save the word-cloud image
    wc.to_file(r'python招聘信息.png')
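
    If you also want the raw numbers behind the cloud, a quick added tally of the extracted keywords, using the same regex as above:

    import re
    from collections import Counter

    with open('python招聘信息.txt', encoding='utf-8') as f:
        words = re.findall(r'[a-zA-Z]+', f.read())

    # lower-case so 'Python' and 'python' count as one keyword
    counts = Counter(w.lower() for w in words)
    for word, n in counts.most_common(20):
        print(word, n)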