  • Scraping GIS Developer Job Listings from Zhaopin (智联招聘) with Python (Basic Version)

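    1. I've been bored with nothing to do, every day is just as dull, and I feel like I'm going to waste. Trying to get out and find an internship has been a real pain.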

    2. Here is the source code directly (it's rough, so please go easy on me).

    import re  # regular expressions
    import time
    import requests
    import pandas as pd
    # Send a browser-like User-Agent and language preference with each request
    header = {
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36',
    }
    url = 'https://sou.zhaopin.com/jobs/searchresult.ashx'
    for n in range(1, 6):  # result pages 1 to 5
        dat = {'jl': '北京+上海+广州+深圳+西安',  # cities: Beijing, Shanghai, Guangzhou, Shenzhen, Xi'an
               'kw': 'GIS开发',  # search keyword: GIS development
               'p': n,  # page number
               'sm': '0',
               'sg': 'a20b0e245eac4aa1b4844dca099fc75a'}
        print(n)
        html = requests.get(url, params=dat, headers=header, timeout=10).text
        # Parse the page with regular expressions
        pattern1 = re.findall(r'href="http://jobs\.zhaopin\.com/.*?\.htm" target="_blank">(.*?)</a>', html)
        # Strip the <b>...</b> keyword highlighting from the job titles
        pattern1 = [re.sub(r'</?b>', '', x) for x in pattern1]
        pattern2 = re.findall(r'class="gsmc"><a href="http://company\.zhaopin\.com/.*?\.htm" target="_blank">(.*?)</a>', html)
        pattern3 = re.findall(r'class="zwyx">(.*?)</td>', html)
        pattern4 = re.findall(r'class="newlist_deatil_two"><span>(.*?)</span>', html)
        print(pattern1)
        print('***********************')
        print(len(pattern1))
        # zip() stops at the shortest list, so a page with fewer than 20 results is handled
        data = list(zip(pattern1, pattern2, pattern3, pattern4))
        data2 = pd.DataFrame(data)
        # Raw string so the backslashes in the Windows path are not treated as escapes
        data2.to_csv(r'C:\Users\你若成风618\Desktop\aa\2.csv', header=False, index=False, mode='a')
        time.sleep(1)  # pause briefly between pages
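
The parsing step can be tried without touching the network. Below is a minimal sketch that applies the same four regular expressions to a hand-written HTML fragment. The function name `parse_listings` and the `SAMPLE_HTML` fragment are made up for illustration, and the markup is only inferred from the regular expressions in the script, not checked against the live site.

```python
import re

# Patterns taken from the script above; the markup they expect is an
# assumption inferred from those expressions.
TITLE_RE = re.compile(r'href="http://jobs\.zhaopin\.com/.*?\.htm" target="_blank">(.*?)</a>')
COMPANY_RE = re.compile(r'class="gsmc"><a href="http://company\.zhaopin\.com/.*?\.htm" target="_blank">(.*?)</a>')
SALARY_RE = re.compile(r'class="zwyx">(.*?)</td>')
DETAIL_RE = re.compile(r'class="newlist_deatil_two"><span>(.*?)</span>')


def parse_listings(html):
    """Return (title, company, salary, detail) tuples found in one result page."""
    # Remove the <b>...</b> keyword highlighting from the titles
    titles = [re.sub(r'</?b>', '', t) for t in TITLE_RE.findall(html)]
    companies = COMPANY_RE.findall(html)
    salaries = SALARY_RE.findall(html)
    details = DETAIL_RE.findall(html)
    # zip() truncates to the shortest list instead of raising IndexError
    return list(zip(titles, companies, salaries, details))


# Hypothetical fragment shaped to match the patterns above
SAMPLE_HTML = (
    '<td class="zwmc"><a href="http://jobs.zhaopin.com/123.htm" target="_blank"><b>GIS</b>开发工程师</a></td>'
    '<td class="gsmc"><a href="http://company.zhaopin.com/456.htm" target="_blank">Example Co.</a></td>'
    '<td class="zwyx">8001-10000</td>'
    '<li class="newlist_deatil_two"><span>地点:北京</span></li>'
)

rows = parse_listings(SAMPLE_HTML)
print(rows)
```

Keeping the parsing in a function of its own means the patterns can be checked against a saved page before running the full five-page loop.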

    3. It feels pretty rough. I plan to optimize it later, add some charts, and do some analysis of the data.
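
As one possible starting point for that analysis, here is a rough sketch that turns salary strings into numbers. It assumes the salary column holds either a range such as `8001-10000` or a non-numeric value such as `面议` (negotiable); that format is a guess, and the helper `salary_midpoint` and the sample values are made up for illustration rather than taken from scraped data.

```python
import re
from statistics import mean

# Matches strings of the form "low-high", e.g. "8001-10000"
RANGE_RE = re.compile(r'^\s*(\d+)\s*-\s*(\d+)\s*$')


def salary_midpoint(text):
    """Return the midpoint of a 'low-high' salary string, or None if it is not a range."""
    m = RANGE_RE.match(text)
    if m is None:
        return None
    low, high = int(m.group(1)), int(m.group(2))
    return (low + high) / 2


# Hypothetical values standing in for the salary column of the CSV
sample_salaries = ['8001-10000', '10001-15000', '面议']
midpoints = [salary_midpoint(s) for s in sample_salaries]
known = [m for m in midpoints if m is not None]
average = mean(known)
print(midpoints, average)
```

Rows without a numeric range are kept as `None` rather than dropped, so they can still be counted separately when charting.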

    4. I forgot to include a screenshot.

  • Original post: https://www.cnblogs.com/tuboshu/p/10752375.html