  • Scraping specific fields from a job-listing site

    Example: scraping listings for "数据分析" (data analysis) positions.

    Prerequisites: Python, plus the following packages:

    pip install requests
    pip install lxml
    pip install pandas

    Step 1: imports

    import time                    # to pause between requests
    import requests                # HTTP client
    from lxml import etree         # HTML parsing with XPath
    import pandas as pd
    from pandas import DataFrame

    Step 2: find a URL that actually opens

    url = 'http://search.51job.com/list/080200,000000,0000,00,9,99,%25E6%2595%25B0%25E6%258D%25AE%25E5%2588%2586%25E6%259E%2590,2,1.html?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare='
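
    The keyword buried in that query string is percent-encoded twice, so it survives one round of server-side decoding. A quick sanity check with the standard library (my addition, not in the original post) confirms what it searches for:

    from urllib.parse import unquote

    keyword = '%25E6%2595%25B0%25E6%258D%25AE%25E5%2588%2586%25E6%259E%2590'
    print(unquote(unquote(keyword)))   # -> 数据分析 ("data analysis")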

    Step 3: request the URL and set the text encoding so Chinese characters decode correctly

    head = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0'}
    s = requests.Session()
    res = s.get(url, headers=head)
    res.encoding = 'gbk'           # the page is served as GBK; decode it accordingly
    root = etree.HTML(res.text)
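
    Hard-coding 'gbk' works for this page, but if you are unsure of a page's encoding, requests can guess it from the response body. A one-line alternative (a sketch, my addition):

    res.encoding = res.apparent_encoding   # charset detection on the raw bytes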

    Step 4: inspect the page source to locate the fields you need

    Extract the data:

    position = root.xpath('//div[@class="el"]/p/span/a/@title')
    company = root.xpath('//div[@class="el"]/span[@class="t2"]/a/@title')
    place = root.xpath('//div[@class="el"]/span[@class="t3"]/text()')
    salary = root.xpath('//div[@class="el"]/span[@class="t4"]/text()')
    date = root.xpath('//div[@class="el"]/span[@class="t5"]/text()')
    job = DataFrame([position,company,place,salary,date]).T
    job.columns = ['position','company','place','salary','date']
    job['page'] = 1
    job.head()
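
    One caveat (my addition): those five flat lists only line up if every listing contains every field; a missing salary, say, silently shifts the columns. Iterating over each row's <div class="el"> keeps the fields paired, assuming the same page structure implied by the XPaths above:

    rows = []
    for el in root.xpath('//div[@class="el"]'):
        title = el.xpath('./p/span/a/@title')
        comp = el.xpath('./span[@class="t2"]/a/@title')
        plc = el.xpath('./span[@class="t3"]/text()')
        sal = el.xpath('./span[@class="t4"]/text()')
        dt = el.xpath('./span[@class="t5"]/text()')
        if title:                  # header rows have no job link; skip them
            rows.append([title[0],
                         comp[0] if comp else None,
                         plc[0] if plc else None,
                         sal[0] if sal else None,
                         dt[0] if dt else None])
    job_safe = DataFrame(rows, columns=['position','company','place','salary','date'])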

    Test run

    Wrap it in a function (the trick is spotting how the page number increments in the URL)

    def Crawler_51job(df_origin,n):
        head = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0'}
        s = requests.Session()
        for i in range(1,n+1):
            url = 'http://search.51job.com/jobsearch/search_result.php?fromJs=1&jobarea=000000%2C00&district=000000&funtype=0000&industrytype=00&issuedate=9&providesalary=99&keyword=%E4%BF%A1%E7%94%A8%E7%AE%A1%E7%90%86&keywordtype=2&curr_page=' + str(i) + '&lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&list_type=0&fromType=14&dibiaoid=0&confirmdate=9'
            res = s.get(url,headers = head)
            res.encoding = 'gbk'
            root = etree.HTML(res.text)
            position = root.xpath('//div[@class="el"]/p/span/a/@title')
            company = root.xpath('//div[@class="el"]/span[@class="t2"]/a/@title')
            place = root.xpath('//div[@class="el"]/span[@class="t3"]/text()')
            salary = root.xpath('//div[@class="el"]/span[@class="t4"]/text()')
            date = root.xpath('//div[@class="el"]/span[@class="t5"]/text()')
            df = DataFrame([position,company,place,salary,date]).T
            df.columns = ['position','company','place','salary','date']
            df['page'] = i
            time.sleep(2)
            df_origin = pd.concat([df_origin,df])
        return df_origin
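
    The loop above has no error handling, so one failed request aborts the whole crawl. A minimal hardening sketch (my addition; the helper name and error policy are assumptions, not part of the original):

    import requests

    def fetch_page(session, url, headers, timeout=10):
        """Fetch one results page; return its text, or None on failure."""
        try:
            res = session.get(url, headers=headers, timeout=timeout)
            res.raise_for_status()      # turn HTTP errors into exceptions
            res.encoding = 'gbk'        # same page encoding as above
            return res.text
        except requests.RequestException as e:
            print('page fetch failed:', e)
            return None

    Inside the loop, call fetch_page(s, url, head) and continue when it returns None, so a bad page is skipped instead of crashing the run.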

    Call the function and write out the data

    job = Crawler_51job(job, 17)
    job.to_csv('51job.csv', index=False)

    Result (Excel is broken here, so view the file in Notepad)
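
    A likely reason Excel chokes on the file (my guess, not stated in the original): to_csv writes UTF-8 without a byte-order mark, and Excel then misdecodes the Chinese text. Writing with a BOM usually fixes it:

    job.to_csv('51job.csv', index=False, encoding='utf-8-sig')   # BOM lets Excel detect UTF-8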

    That completes the little scraper.

  • Original post: https://www.cnblogs.com/keepgoingon/p/7110756.html