  • Scraping specific fields from a job-listing site

    Example: scraping listings for "数据分析" (data analysis) positions.

    Prerequisites: Python, plus the following packages:

    pip install requests
    pip install lxml
    pip install pandas

    Step 1: imports

    import time                    # to pause between requests
    import requests                # HTTP client
    from lxml import etree         # HTML parsing with XPath
    import pandas as pd
    from pandas import DataFrame

    Step 2: find a URL that actually opens

    url = 'http://search.51job.com/list/080200,000000,0000,00,9,99,%25E6%2595%25B0%25E6%258D%25AE%25E5%2588%2586%25E6%259E%2590,2,1.html?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare='
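
    The keyword buried in that query string is percent-encoded twice, so it survives one round of server-side decoding. A quick sanity check with the standard library (my addition, not in the original post) confirms what it searches for:

    from urllib.parse import unquote

    keyword = '%25E6%2595%25B0%25E6%258D%25AE%25E5%2588%2586%25E6%259E%2590'
    print(unquote(unquote(keyword)))   # -> 数据分析 ("data analysis")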

    Step 3: request the URL and set the text encoding so Chinese characters decode correctly

    head = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0'}
    s = requests.Session()
    res = s.get(url, headers=head)
    res.encoding = 'gbk'           # the page is served as GBK; decode it accordingly
    root = etree.HTML(res.text)
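
    Hard-coding 'gbk' works for this page, but if you are unsure of a page's encoding, requests can guess it from the response body. A one-line alternative (a sketch, my addition):

    res.encoding = res.apparent_encoding   # charset detection on the raw bytes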

    Step 4: inspect the page source to locate the fields you need

    Extract the data:

    position = root.xpath('//div[@class="el"]/p/span/a/@title')
    company = root.xpath('//div[@class="el"]/span[@class="t2"]/a/@title')
    place = root.xpath('//div[@class="el"]/span[@class="t3"]/text()')
    salary = root.xpath('//div[@class="el"]/span[@class="t4"]/text()')
    date = root.xpath('//div[@class="el"]/span[@class="t5"]/text()')
    job = DataFrame([position,company,place,salary,date]).T
    job.columns = ['position','company','place','salary','date']
    job['page'] = 1
    job.head()
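
    One caveat (my addition): those five flat lists only line up if every listing contains every field; a missing salary, say, silently shifts the columns. Iterating over each row's <div class="el"> keeps the fields paired, assuming the same page structure implied by the XPaths above:

    rows = []
    for el in root.xpath('//div[@class="el"]'):
        title = el.xpath('./p/span/a/@title')
        comp = el.xpath('./span[@class="t2"]/a/@title')
        plc = el.xpath('./span[@class="t3"]/text()')
        sal = el.xpath('./span[@class="t4"]/text()')
        dt = el.xpath('./span[@class="t5"]/text()')
        if title:                  # header rows have no job link; skip them
            rows.append([title[0],
                         comp[0] if comp else None,
                         plc[0] if plc else None,
                         sal[0] if sal else None,
                         dt[0] if dt else None])
    job_safe = DataFrame(rows, columns=['position','company','place','salary','date'])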

    Test run

    Wrap it in a function (the trick is spotting how the page number increments in the URL)

    def Crawler_51job(df_origin,n):
        head = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0'}
        s = requests.Session()
        for i in range(1,n+1):
            url = 'http://search.51job.com/jobsearch/search_result.php?fromJs=1&jobarea=000000%2C00&district=000000&funtype=0000&industrytype=00&issuedate=9&providesalary=99&keyword=%E4%BF%A1%E7%94%A8%E7%AE%A1%E7%90%86&keywordtype=2&curr_page=' + str(i) + '&lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&list_type=0&fromType=14&dibiaoid=0&confirmdate=9'
            res = s.get(url,headers = head)
            res.encoding = 'gbk'
            root = etree.HTML(res.text)
            position = root.xpath('//div[@class="el"]/p/span/a/@title')
            company = root.xpath('//div[@class="el"]/span[@class="t2"]/a/@title')
            place = root.xpath('//div[@class="el"]/span[@class="t3"]/text()')
            salary = root.xpath('//div[@class="el"]/span[@class="t4"]/text()')
            date = root.xpath('//div[@class="el"]/span[@class="t5"]/text()')
            df = DataFrame([position,company,place,salary,date]).T
            df.columns = ['position','company','place','salary','date']
            df['page'] = i
            time.sleep(2)
            df_origin = pd.concat([df_origin,df])
        return df_origin
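
    The loop above has no error handling, so one failed request aborts the whole crawl. A minimal hardening sketch (my addition; the helper name and error policy are assumptions, not part of the original):

    import requests

    def fetch_page(session, url, headers, timeout=10):
        """Fetch one results page; return its text, or None on failure."""
        try:
            res = session.get(url, headers=headers, timeout=timeout)
            res.raise_for_status()      # turn HTTP errors into exceptions
            res.encoding = 'gbk'        # same page encoding as above
            return res.text
        except requests.RequestException as e:
            print('page fetch failed:', e)
            return None

    Inside the loop, call fetch_page(s, url, head) and continue when it returns None, so a bad page is skipped instead of crashing the run.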

    Call the function and write out the data

    job = Crawler_51job(job, 17)
    job.to_csv('51job.csv', index=False)

    Result (Excel is broken here, so view the file in Notepad)
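
    A likely reason Excel chokes on the file (my guess, not stated in the original): to_csv writes UTF-8 without a byte-order mark, and Excel then misdecodes the Chinese text. Writing with a BOM usually fixes it:

    job.to_csv('51job.csv', index=False, encoding='utf-8-sig')   # BOM lets Excel detect UTF-8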

    That completes the little scraper.

  • Original post: https://www.cnblogs.com/keepgoingon/p/7110756.html