
Installing and using selenium: scraping Python job counts and salaries from 51job with a simulated browser

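Today I'm tidying up the code I studied yesterday, which was mainly about using selenium to simulate a browser and scrape 51job. First, a few points to note when using selenium: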

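1. selenium can be installed with pip install selenium.

2. After installing selenium, download the matching chromedriver from https://sites.google.com/a/chromium.org/chromedriver/home. Pick the build that corresponds to your Chrome version number.

3. Put the downloaded chromedriver.exe in Chrome's installation directory. To find that directory, right-click the Chrome shortcut on the desktop and choose "Open file location".

4. Add that directory to the PATH environment variable: Advanced system settings > Environment Variables > Path > Edit > New, then add the folder that holds chromedriver.exe, for example C:\Program Files (x86)\Google\Chrome\Application (PATH entries are folders, so the entry should end at the folder rather than at chromedriver.exe).

5. Once this is configured, selenium is ready to use.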
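
The setup steps above can be sanity-checked from Python before any scraping is attempted. The following is a minimal sketch rather than part of the original post: the helper names find_chromedriver, same_major_version, and make_driver are made up for illustration, and make_driver assumes Selenium 4, where the driver location is passed through a Service object.

```python
import shutil


def find_chromedriver(name="chromedriver"):
    """Return the full path of the driver if it is on PATH, else None."""
    return shutil.which(name)


def same_major_version(chrome_version, driver_version):
    """Compare the leading number of two dotted version strings."""
    return chrome_version.split(".")[0] == driver_version.split(".")[0]


def make_driver(driver_path=None):
    """Start Chrome the Selenium 4 way; selenium is imported only when called."""
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    # With no explicit path, Selenium tries to locate the driver itself
    # (for example through PATH).
    service = Service(executable_path=driver_path) if driver_path else Service()
    return webdriver.Chrome(service=service)
```

Calling find_chromedriver() after step 4 should return the path of chromedriver.exe; None suggests the PATH entry did not take effect (a new terminal is usually needed after editing environment variables). same_major_version can be fed the version shown on Chrome's "About" page and the version of the downloaded driver.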

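Below are two pieces of code that use selenium to drive a browser and scrape 51job: the first gets the number of Python-related positions, and the second gets the salaries of all Python positions.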

(1) Scraping the number of matching positions on 51job:

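import re
import selenium.webdriver  # drives the browser

def getnumberbyname(searchname):
    url="https://search.51job.com/list/020000,000000,0000,00,9,99,"+searchname+",2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare="
    # The raw string keeps the backslashes in the Windows path literal.
    # executable_path is the Selenium 3 style; Selenium 4 takes a Service object instead.
    driver=selenium.webdriver.Chrome(executable_path=r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver")  # launch Chrome
    driver.get(url)  # open the URL
    pagesource=driver.page_source  # grab the page source
    driver.quit()  # close the browser and stop chromedriver
    #print(pagesource)  # print the page source
    # Regular expression notes:
    # \s matches any whitespace character (space, tab, form feed, etc.); equivalent to [ \f\n\r\t\v]
    # \S matches any non-whitespace character; equivalent to [^ \f\n\r\t\v]
    # [] matches any one of the enclosed characters, so [\s\S] matches anything, newlines included
    restr=r"""<div class="rt">([\s\S]*?)</div>"""
    regex=re.compile(restr,re.IGNORECASE)
    mylist=regex.findall(pagesource)
    #print(mylist)
    if len(mylist)==0:
        print("failed")
    else:
        #print(mylist[0])
        newstr=mylist[0].strip()  # .strip() removes leading and trailing whitespace
        print(searchname, newstr)  # use the argument rather than the loop's global variable
    return mylist

# Search keywords: plain python, plus the ops, testing, data, web, and crawler variants
pythonlist=["python","python 运维","python 测试","python 数据","python web","python 爬虫"]
for pystr in pythonlist:
    getnumberbyname(pystr)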
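
The URL building and the regular-expression step can be tried without opening a browser. The sketch below is an illustration under assumptions, not the original author's code: BASE is a shortened form of the search URL with the long query string left out, build_search_url and extract_count are hypothetical helper names, and whether 51job expects exactly this percent-encoding of the keyword has not been verified here.

```python
import re
from urllib.parse import quote

# Shortened form of the article's URL; the long query string is omitted (assumption)
BASE = "https://search.51job.com/list/020000,000000,0000,00,9,99,{kw},2,{page}.html"

# Same pattern as in the article: [\s\S] also matches newlines
COUNT_RE = re.compile(r'<div class="rt">([\s\S]*?)</div>', re.IGNORECASE)


def build_search_url(keyword, page=1):
    """Percent-encode the keyword so spaces and Chinese characters are URL-safe."""
    return BASE.format(kw=quote(keyword), page=page)


def extract_count(page_source):
    """Return the stripped text of the first matching div, or None if absent."""
    match = COUNT_RE.search(page_source)
    return match.group(1).strip() if match else None


# Made-up HTML fragment, used only to exercise the pattern
sample = '<div class="rt">\n   1234 positions\n</div>'
print(build_search_url("python web", 3))
print(extract_count(sample))
```

Returning None instead of printing a failure message leaves the decision about what to do with a missing match to the caller.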

(2) Scraping the salaries of all Python positions:

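import re
import selenium.webdriver  # drives the browser

def getnumberbyname(searchname,i):
    # i is the page number as a string; it is spliced into the URL
    url="https://search.51job.com/list/020000,000000,0000,00,9,99,"+searchname+",2,"+i+".html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare="
    # The raw string keeps the backslashes in the Windows path literal.
    # executable_path is the Selenium 3 style; Selenium 4 takes a Service object instead.
    driver=selenium.webdriver.Chrome(executable_path=r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver")  # launch Chrome
    driver.get(url)  # open the URL
    pagesource=driver.page_source  # grab the page source
    driver.quit()  # close the browser and stop chromedriver
    #print(pagesource)  # print the page source
    # Regular expression notes:
    # \s matches any whitespace character (space, tab, form feed, etc.); equivalent to [ \f\n\r\t\v]
    # \S matches any non-whitespace character; equivalent to [^ \f\n\r\t\v]
    # [] matches any one of the enclosed characters
    restr=r"""<span class="t4">(.*?)</span>"""
    regex=re.compile(restr,re.IGNORECASE)
    mylist=regex.findall(pagesource)
    #print(mylist)
    if len(mylist)==0:
        print("failed")
    else:
        print(mylist)  # every salary string found on this page
    return mylist

#pythonlist=["python","python 运维","python 测试","python 数据","python web","python 爬虫"]
#for pystr in pythonlist:
for num in range(1,86):  # result pages 1 through 85
    i=str(num)
    try:
        getnumberbyname("python",i)
    except Exception as exc:
        print("page", i, "failed:", exc)  # report the error instead of silently ignoring it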

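
The loop above starts a fresh Chrome for each of the 85 pages. One possible variation is to share a single browser across all pages, sketched below under assumptions: scrape_salaries and FakeDriver are hypothetical names, and the function accepts any object that offers get(url) and page_source, so the loop can be tried with a stand-in before pointing it at a real Selenium driver.

```python
import re

# Same pattern as in the article's second script
SALARY_RE = re.compile(r'<span class="t4">(.*?)</span>', re.IGNORECASE)


def scrape_salaries(driver, make_url, pages):
    """Visit each page with one shared driver and collect every salary string.

    driver is any object with get(url) and page_source, such as a Selenium
    WebDriver; make_url maps a page number to a URL.
    """
    salaries = []
    for page in pages:
        try:
            driver.get(make_url(page))
            salaries.extend(SALARY_RE.findall(driver.page_source))
        except Exception as exc:  # keep going, but report which page failed
            print(f"page {page} failed: {exc}")
    return salaries


class FakeDriver:
    """Stand-in for a browser so the loop can be tried without Chrome."""

    def __init__(self, pages):
        self.pages = pages  # maps URL -> HTML text
        self.page_source = ""

    def get(self, url):
        self.page_source = self.pages[url]  # raises KeyError for unknown URLs
```

With a real browser, the driver would be created once before the call and released with driver.quit() in a finally block afterwards, so that chromedriver does not linger if a page fails.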

Original article: https://www.cnblogs.com/my-global/p/12432903.html