  • Scraping free résumé templates

    # Scrape the free résumé templates and save them locally.
    # Page 1: http://sc.chinaz.com/jianli/free.html
    # Page 2: http://sc.chinaz.com/jianli/free_2.html
    
    import requests
    from lxml import etree
    import os
    
    dirName = './resumeLibs'
    # Create the output directory if it does not exist yet.
    os.makedirs(dirName, exist_ok=True)
    
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36'
    }
    url = 'http://sc.chinaz.com/jianli/free_%d.html'
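    # range(1, 2) fetches only the first page; raise the upper bound to crawl more pages.
    for page in range(1,2):
        if page == 1:
            # The first listing page has no numeric suffix.
            new_url = 'http://sc.chinaz.com/jianli/free.html'
        else:
            new_url = url % page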
        page_text = requests.get(url=new_url,headers=headers).text
        tree = etree.HTML(page_text)
        a_list = tree.xpath('//div[@id="container"]/div/p/a')
        
        for a in a_list:
            a_src = a.xpath('./@href')[0]
            a_title = a.xpath('./text()')[0]
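            # The page text was decoded as ISO-8859-1, so re-encode and decode as UTF-8 to recover the Chinese title.
            a_title = a_title.encode('iso-8859-1').decode('utf-8')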
            # Fetch the download page for this template.
            detail_text = requests.get(url=a_src,headers=headers).text
            detail_tree = etree.HTML(detail_text)
            dl_src = detail_tree.xpath('//div[@id="down"]/div[2]/ul/li[8]/a/@href')[0]
            
            resume_data = requests.get(url=dl_src,headers=headers).content
            resume_name = a_title
            resume_path = dirName + '/' + resume_name + '.rar'
            with open(resume_path,'wb') as fp:
                fp.write(resume_data)
                print(resume_name, 'downloaded successfully!')
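
The script above builds the page URL, repairs the title encoding, and assembles the file name inline. As a rough sketch only, those three steps could be pulled into small pure functions that can be checked without touching the network. The names `build_page_url`, `fix_mojibake`, and `safe_filename` are hypothetical and not part of the original post, and the set of characters replaced in file names is an assumption about what the target file system rejects.

```python
import re

# URLs taken from the script above.
LIST_URL_FIRST = 'http://sc.chinaz.com/jianli/free.html'
LIST_URL_TEMPLATE = 'http://sc.chinaz.com/jianli/free_%d.html'


def build_page_url(page):
    """Return the listing URL for a 1-based page number (hypothetical helper)."""
    if page < 1:
        raise ValueError('page numbers start at 1')
    # The first page has no numeric suffix; later pages use free_<n>.html.
    return LIST_URL_FIRST if page == 1 else LIST_URL_TEMPLATE % page


def fix_mojibake(text):
    """Undo UTF-8 text that was wrongly decoded as ISO-8859-1 (hypothetical helper).

    If the text cannot be round-tripped, it is returned unchanged.
    """
    try:
        return text.encode('iso-8859-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def safe_filename(title, fallback='resume'):
    """Replace characters that are commonly invalid in file names (assumed set)."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]+', '_', title).strip(' ._')
    return cleaned or fallback
```

In the loop, the path could then be built as `os.path.join(dirName, safe_filename(fix_mojibake(a_title)) + '.rar')`, which avoids a failed `open()` when a title happens to contain a slash or colon.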
    
  • Original article: https://www.cnblogs.com/straightup/p/13664724.html