  • Scraping specific fields from a job-listing site

    Example: scraping listings for jobs matching the keyword "数据分析" (data analysis)

    Requirements: Python, plus the following packages:

    pip install requests
    pip install lxml
    pip install pandas

    Step 1: imports

    import time
    import requests
    from lxml import etree
    import pandas as pd
    from pandas import DataFrame
    from pandas import Series

    Step 2: find a URL that opens correctly

    url = 'http://search.51job.com/list/080200,000000,0000,00,9,99,%25E6%2595%25B0%25E6%258D%25AE%25E5%2588%2586%25E6%259E%2590,2,1.html?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare='
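The long `%25E6…` run in the URL is the search keyword, percent-encoded twice (`%25` is itself an encoded `%`). A quick sketch to confirm what it decodes to:

```python
from urllib.parse import unquote

# The keyword fragment copied from the URL above; %25 is a percent sign
# that has itself been percent-encoded, so we must decode twice.
token = '%25E6%2595%25B0%25E6%258D%25AE%25E5%2588%2586%25E6%259E%2590'
once = unquote(token)   # -> '%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90'
print(unquote(once))    # -> '数据分析'
```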

    Step 3: send a request to this URL and set the text encoding so Chinese characters decode correctly

    head = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0'}
    s = requests.Session()
    res = s.get(url,headers = head)
    res.encoding = 'gbk'
    root = etree.HTML(res.text)
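51job served its pages as GBK-encoded text at the time, and `requests` may guess a different encoding from the response headers, so `res.encoding = 'gbk'` must be set before reading `res.text`. A minimal offline illustration of what the wrong codec does:

```python
# The search keyword, as the GBK-encoded bytes the server would send.
raw = '数据分析'.encode('gbk')

print(raw.decode('gbk'))      # correct decoding: 数据分析
print(raw.decode('latin-1'))  # wrong codec: mojibake, not Chinese text
```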

    Step 4: inspect the original page to locate the required fields

    Extract the data

    position = root.xpath('//div[@class="el"]/p/span/a/@title')
    company = root.xpath('//div[@class="el"]/span[@class="t2"]/a/@title')
    place = root.xpath('//div[@class="el"]/span[@class="t3"]/text()')
    salary = root.xpath('//div[@class="el"]/span[@class="t4"]/text()')
    date = root.xpath('//div[@class="el"]/span[@class="t5"]/text()')
    job = DataFrame([position,company,place,salary,date]).T
    job.columns = ['position','company','place','salary','date']
    job['page'] = 1
    job.head()
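Each XPath call above returns one list per field, so `DataFrame([...])` would make each field a row; the `.T` transpose turns the lists into columns, giving one row per job listing. A toy sketch with hypothetical values standing in for the XPath results:

```python
from pandas import DataFrame

# Hypothetical stand-ins for the lists returned by root.xpath(...):
position = ['数据分析师', '数据专员']
company = ['公司A', '公司B']

df = DataFrame([position, company]).T   # lists become columns after .T
df.columns = ['position', 'company']
print(df.shape)   # (2, 2): one row per listing, one column per field
```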

    Test run

    Wrap it in a function (the key is spotting how the page number increments in the URL)

    def Crawler_51job(df_origin,n):
        head = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0'}
        s = requests.Session()
        for i in range(1,n+1):
            url = 'http://search.51job.com/jobsearch/search_result.php?fromJs=1&jobarea=000000%2C00&district=000000&funtype=0000&industrytype=00&issuedate=9&providesalary=99&keyword=%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90&keywordtype=2&curr_page=' + str(i) + '&lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&list_type=0&fromType=14&dibiaoid=0&confirmdate=9'
            res = s.get(url,headers = head)
            res.encoding = 'gbk'
            root = etree.HTML(res.text)
            position = root.xpath('//div[@class="el"]/p/span/a/@title')
            company = root.xpath('//div[@class="el"]/span[@class="t2"]/a/@title')
            place = root.xpath('//div[@class="el"]/span[@class="t3"]/text()')
            salary = root.xpath('//div[@class="el"]/span[@class="t4"]/text()')
            date = root.xpath('//div[@class="el"]/span[@class="t5"]/text()')
            df = DataFrame([position,company,place,salary,date]).T
            df.columns = ['position','company','place','salary','date']
            df['page'] = i
            time.sleep(2)
            df_origin = pd.concat([df_origin,df])
        return df_origin
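The "increment pattern" the function relies on is just the `curr_page` query parameter: the URL for page `i` is a fixed template with `str(i)` spliced in. A simplified sketch (query string shortened for readability, not the full real one):

```python
# Shortened, hypothetical version of the real 51job query string.
base = ('http://search.51job.com/jobsearch/search_result.php'
        '?keyword=%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90&curr_page=')

# Build the URLs for pages 1 through 3 by splicing in the page number.
urls = [base + str(i) for i in range(1, 4)]
print(urls[0])   # ends with curr_page=1
print(urls[-1])  # ends with curr_page=3
```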

    Call the function and output the data

    job = Crawler_51job(job,17)

    job.to_csv('51job.csv',index = False)

    Result (Excel garbled the file, so check it in Notepad instead)
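The Excel problem above is typically an encoding mismatch: Excel tends to assume a local ANSI codepage unless the CSV starts with a UTF-8 byte-order mark. A sketch of one common workaround, writing with the `utf-8-sig` codec (filename here is hypothetical demo data, not the scraped set):

```python
import os
import tempfile

import pandas as pd
from pandas import DataFrame

# Hypothetical one-row stand-in for the scraped data.
df = DataFrame({'position': ['数据分析师'], 'salary': ['1-1.5万/月']})

# utf-8-sig prepends a byte-order mark, which lets Excel detect UTF-8.
path = os.path.join(tempfile.gettempdir(), '51job_demo.csv')
df.to_csv(path, index=False, encoding='utf-8-sig')

back = pd.read_csv(path, encoding='utf-8-sig')
print(back.shape)   # (1, 2)
```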

     And that completes a small crawler program.

  • Original post: https://www.cnblogs.com/keepgoingon/p/7110756.html