  • Using etree and Beautiful Soup

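    1. lxml is a Python library built on the C libraries libxml2 and libxslt. It parses XML and HTML quickly and flexibly and supports XPath (XML Path Language). Here we use lxml's etree module to scrape information from a website.

    2. Beautiful Soup is a Python library for extracting data from HTML or XML files. It works with the HTML parser in the Python standard library and with third-party parsers such as lxml. Beautiful Soup itself does not evaluate XPath; elements are located with CSS selectors (select()) or with find()/find_all().

    Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8.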
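As a small illustration of the Unicode handling described above, the sketch below feeds Beautiful Soup GBK-encoded bytes, gets a Python str back, and then serializes the document as UTF-8 bytes. The sample markup is made up, and from_encoding is passed explicitly only to keep the example deterministic; without it Beautiful Soup tries to detect the encoding itself.

```python
from bs4 import BeautifulSoup

# Made-up sample: a tiny document encoded as GBK bytes.
raw = "<p>深圳</p>".encode("gbk")

# from_encoding is given explicitly here (an assumption for determinism).
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.get_text()            # decoded Unicode str
utf8_bytes = soup.encode("utf-8")   # whole document serialized as UTF-8 bytes

print(type(text).__name__, text)
print(utf8_bytes.decode("utf-8"))
```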

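    The page we scrape is on the Tencent recruitment site: https://hr.tencent.com/position.php?&start=10#a

    For each posting we want the job title, job category, number of openings, work location, and publish date.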

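    I. Scraping with etree

    1. Import the libraries

    from lxml import etree
    from urllib import request  # TODO: look further into how urllib differs from requests
    import json

    In Python 3 we use the request module from the urllib package, and we save the output to a JSON file.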
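The script in this post calls urlopen directly on the URL. A slightly more defensive variant, sketched here as an assumption rather than what the original script does, builds a Request with a User-Agent header and sets a timeout. build_request and fetch are hypothetical helper names, and the header value is only a placeholder.

```python
from urllib import request


def build_request(url, user_agent="Mozilla/5.0 (example-scraper)"):
    """Return a Request carrying a User-Agent header (placeholder value)."""
    return request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Download the page and return its raw bytes."""
    # The with-block closes the HTTP response even if read() raises.
    with request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


# Building the request does not touch the network.
req = build_request("https://example.com/")
print(req.full_url, req.get_header("User-agent"))
```

Calling fetch(url) would then take the place of the urlopen(...).read() pair used in the script.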

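    2. Fetch the page and write it to a JSON file

    response = request.urlopen('https://hr.tencent.com/position.php?&start=10#a')  # fetch the page
    resHtml = response.read()  # raw page content as bytes
    output = open('tencent1.json', 'wb+')  # binary mode, because we write encoded bytes

    Opening the file in text mode with 'w' and then writing the encoded bytes raises:

    TypeError: write() argument must be str, not bytes

    Each line is encoded to UTF-8 bytes before it is written, so the file has to be opened in binary mode ('wb+'). Alternatively, open it in text mode with encoding='utf-8' and write the str directly.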
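The difference between the two file modes can be reproduced without touching the network. The sketch below uses a made-up record and a temporary directory: it triggers the error from above, then writes the same line once in text mode and once in binary mode and checks that both files hold identical bytes.

```python
import json
import os
import tempfile

# Made-up record, serialized the same way as in the script.
item = {"name": "示例职位", "catalog": "技术类"}
line = json.dumps(item, ensure_ascii=False) + "\n"

tmpdir = tempfile.mkdtemp()
text_path = os.path.join(tmpdir, "text_mode.json")
binary_path = os.path.join(tmpdir, "binary_mode.json")

# Writing bytes to a text-mode file raises TypeError.
try:
    with open(text_path, "w", encoding="utf-8") as f:
        f.write(line.encode("utf-8"))
except TypeError as exc:
    error_message = str(exc)

# Option 1: text mode, let the file object do the encoding.
# newline="\n" stops Windows from translating the line ending.
with open(text_path, "w", encoding="utf-8", newline="\n") as f:
    f.write(line)

# Option 2: binary mode, encode by hand (what the script does).
with open(binary_path, "wb") as f:
    f.write(line.encode("utf-8"))

with open(text_path, "rb") as a, open(binary_path, "rb") as b:
    same_bytes = a.read() == b.read()

print(error_message)
print(same_bytes)
```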

    3. Select the tags we need

    html = etree.HTML(resHtml)
    # every <tr> whose class attribute is exactly "odd" or "even"; | is the XPath union operator
    result = html.xpath('//tr[@class="odd"] | //tr[@class="even"]')
    for site in result:
        item = {}

    Each record is stored as a dict, so every iteration starts with an empty one.

        name = site.xpath('./td[1]/a')[0].text
        detailLink = site.xpath('./td[1]/a')[0].attrib['href']
        catalog = site.xpath('./td[2]')[0].text
        recruitNumber = site.xpath('./td[3]')[0].text
        workLocation = site.xpath('./td[4]')[0].text
        publishTime = site.xpath('./td[5]')[0].text

    These are the fields we need.
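The XPath expressions above can be tried offline. The sketch below runs them against a made-up table that only imitates the row layout the expressions imply (a link in the first cell, plain text in the next four); it is not the real page. One detail worth noticing is that an XPath union returns nodes in document order, whichever side of the | they matched.

```python
from lxml import etree

# Made-up markup that mimics the assumed row layout of the listing page.
SAMPLE_HTML = """
<table>
  <tr class="h"><td>Title</td><td>Category</td></tr>
  <tr class="even"><td><a href="position_detail.php?id=1">Engineer</a></td><td>Tech</td><td>2</td><td>Shenzhen</td><td>2018-10-16</td></tr>
  <tr class="odd"><td><a href="position_detail.php?id=2">Designer</a></td><td>Design</td><td>1</td><td>Shanghai</td><td>2018-10-16</td></tr>
</table>
"""

root = etree.HTML(SAMPLE_HTML)

# The header row (class "h") is skipped; matches come back in page order.
rows = root.xpath('//tr[@class="odd"] | //tr[@class="even"]')

records = []
for row in rows:
    link = row.xpath('./td[1]/a')[0]  # XPath positions are 1-based
    records.append({
        "name": link.text,
        "detailLink": link.get("href"),
        "catalog": row.xpath('./td[2]')[0].text,
        "recruitNumber": row.xpath('./td[3]')[0].text,
        "workLocation": row.xpath('./td[4]')[0].text,
        "publishTime": row.xpath('./td[5]')[0].text,
    })

for record in records:
    print(record)
```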

    4. Format the output

        print(type(name))
        print(name, detailLink, catalog, recruitNumber, workLocation, publishTime)
        item['name'] = name
        item['detailLink'] = detailLink
        item['catalog'] = catalog
        item['recruitNumber'] = recruitNumber
        # workLocation is printed above but never stored in item, so it is missing from the JSON output
        item['publishTime'] = publishTime

        line = json.dumps(item, ensure_ascii=False) + '\n'  # one JSON object per line
        print(line)
        output.write(line.encode('utf-8'))  # encode to UTF-8 bytes before writing

    output.close()

    Running the script prints:

    <class 'str'>
    23677-互娱服务采购经理 position_detail.php?id=44802&keywords=&tid=0&lid=0 职能类 1 深圳 2018-10-16
    {"catalog": "职能类", "name": "23677-互娱服务采购经理", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44802&keywords=&tid=0&lid=0"}
    
    <class 'str'>
    22989-腾讯云块存储底层开发工程师(深圳) position_detail.php?id=44803&keywords=&tid=0&lid=0 技术类 2 深圳 2018-10-16
    {"catalog": "技术类", "name": "22989-腾讯云块存储底层开发工程师(深圳)", "recruitNumber": "2", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44803&keywords=&tid=0&lid=0"}
    
    <class 'str'>
    24549-渠道管理经理(政策管理方向-上海) position_detail.php?id=44804&keywords=&tid=0&lid=0 市场类 1 上海 2018-10-16
    {"catalog": "市场类", "name": "24549-渠道管理经理(政策管理方向-上海)", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44804&keywords=&tid=0&lid=0"}
    
    <class 'str'>
    24549-渠道管理经理(ROC管理方向-上海) position_detail.php?id=44805&keywords=&tid=0&lid=0 市场类 1 上海 2018-10-16
    {"catalog": "市场类", "name": "24549-渠道管理经理(ROC管理方向-上海)", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44805&keywords=&tid=0&lid=0"}
    
    <class 'str'>
    24549-广告营销业务分析师(上海) position_detail.php?id=44806&keywords=&tid=0&lid=0 市场类 1 上海 2018-10-16
    {"catalog": "市场类", "name": "24549-广告营销业务分析师(上海)", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44806&keywords=&tid=0&lid=0"}
    
    <class 'str'>
    28297-RPG手游—市场和平台渠道推广(深圳) position_detail.php?id=44809&keywords=&tid=0&lid=0 产品/项目类 1 深圳 2018-10-16
    {"catalog": "产品/项目类", "name": "28297-RPG手游—市场和平台渠道推广(深圳)", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44809&keywords=&tid=0&lid=0"}
    
    <class 'str'>
    21309-在线教育-运营视觉设计师(深圳) position_detail.php?id=44800&keywords=&tid=0&lid=0 设计类 2 深圳 2018-10-16
    {"catalog": "设计类", "name": "21309-在线教育-运营视觉设计师(深圳)", "recruitNumber": "2", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44800&keywords=&tid=0&lid=0"}
    
    <class 'str'>
    21309-在线教育-UI设计师(深圳) position_detail.php?id=44801&keywords=&tid=0&lid=0 设计类 2 深圳 2018-10-16
    {"catalog": "设计类", "name": "21309-在线教育-UI设计师(深圳)", "recruitNumber": "2", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44801&keywords=&tid=0&lid=0"}
    
    <class 'str'>
    22989-数据库高级产品运营经理 position_detail.php?id=44795&keywords=&tid=0&lid=0 产品/项目类 1 北京 2018-10-16
    {"catalog": "产品/项目类", "name": "22989-数据库高级产品运营经理", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44795&keywords=&tid=0&lid=0"}
    
    <class 'str'>
    27087-海外区域中心空间运营经理(深圳) position_detail.php?id=44797&keywords=&tid=0&lid=0 市场类 1 深圳 2018-10-16
    {"catalog": "市场类", "name": "27087-海外区域中心空间运营经理(深圳)", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44797&keywords=&tid=0&lid=0"}
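The detailLink values printed above are relative to the page they were scraped from. If absolute URLs are wanted, urljoin from the standard library can resolve them against the page URL. This is an optional extra step, not part of the original script.

```python
from urllib.parse import urljoin

PAGE_URL = "https://hr.tencent.com/position.php?&start=10#a"

# One of the relative links from the output above.
detail_link = "position_detail.php?id=44802&keywords=&tid=0&lid=0"

# The last path segment of the page URL is replaced by the relative link.
absolute_link = urljoin(PAGE_URL, detail_link)
print(absolute_link)
```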

    The exported JSON file looks like this:

    {"catalog": "职能类", "name": "23677-互娱服务采购经理", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44802&keywords=&tid=0&lid=0"}
    {"catalog": "技术类", "name": "22989-腾讯云块存储底层开发工程师(深圳)", "recruitNumber": "2", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44803&keywords=&tid=0&lid=0"}
    {"catalog": "市场类", "name": "24549-渠道管理经理(政策管理方向-上海)", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44804&keywords=&tid=0&lid=0"}
    {"catalog": "市场类", "name": "24549-渠道管理经理(ROC管理方向-上海)", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44805&keywords=&tid=0&lid=0"}
    {"catalog": "市场类", "name": "24549-广告营销业务分析师(上海)", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44806&keywords=&tid=0&lid=0"}
    {"catalog": "产品/项目类", "name": "28297-RPG手游—市场和平台渠道推广(深圳)", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44809&keywords=&tid=0&lid=0"}
    {"catalog": "设计类", "name": "21309-在线教育-运营视觉设计师(深圳)", "recruitNumber": "2", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44800&keywords=&tid=0&lid=0"}
    {"catalog": "设计类", "name": "21309-在线教育-UI设计师(深圳)", "recruitNumber": "2", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44801&keywords=&tid=0&lid=0"}
    {"catalog": "产品/项目类", "name": "22989-数据库高级产品运营经理", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44795&keywords=&tid=0&lid=0"}
    {"catalog": "市场类", "name": "27087-海外区域中心空间运营经理(深圳)", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44797&keywords=&tid=0&lid=0"}
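Note that the exported file holds one JSON object per line, so it is not a single JSON document and json.load on the whole file would fail. A minimal sketch for reading it back line by line follows; read_json_lines is a hypothetical helper name, and the in-memory sample stands in for the real file.

```python
import io
import json

# In-memory stand-in for the exported file (made-up records).
sample_file = io.StringIO(
    '{"catalog": "技术类", "name": "sample-1", "recruitNumber": "2"}\n'
    '{"catalog": "设计类", "name": "sample-2", "recruitNumber": "1"}\n'
)


def read_json_lines(handle):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]


records = read_json_lines(sample_file)

# recruitNumber is stored as a string, so convert before summing.
total_openings = sum(int(r["recruitNumber"]) for r in records)
print(len(records), total_openings)
```

For the real file, the handle would come from open('tencent1.json', encoding='utf-8').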

    II. Scraping with Beautiful Soup

    1. Import the libraries

    from bs4 import BeautifulSoup
    from urllib import request
    import json

    2. Fetch the page and write it to a JSON file

    response = request.urlopen('https://hr.tencent.com/position.php?&start=10#a')
    resHtml = response.read()
    output = open('tencent2.json', 'wb+')

    3. Select the tags we need

    html = BeautifulSoup(resHtml, 'lxml')
    result = html.select('tr[class="even"]')
    result2 = html.select('tr[class="odd"]')
    result += result2  # all "even" rows first, then all "odd" rows
    print(len(result))

    for site in result:
        item = {}

        name = site.select('td a')[0].get_text()
        # a Tag is one HTML element; its two main attributes are name and attrs
        detailLink = site.select('td a')[0].attrs['href']
        catalog = site.select('td')[1].get_text()
        recruitNumber = site.select('td')[2].get_text()
        workLocation = site.select('td')[3].get_text()
        publishTime = site.select('td')[4].get_text()
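Because the script concatenates two separate select() results, the rows come out grouped by class rather than in page order. A possible alternative, sketched below on made-up markup, is a single comma-separated CSS selector, which returns matches in document order. The sample uses the built-in html.parser only so that it needs no extra parser; the script itself uses 'lxml'.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the assumed row layout of the listing page.
SAMPLE_HTML = """
<table>
  <tr class="h"><td>Title</td></tr>
  <tr class="even"><td><a href="position_detail.php?id=1">Engineer</a></td><td>Tech</td></tr>
  <tr class="odd"><td><a href="position_detail.php?id=2">Designer</a></td><td>Design</td></tr>
  <tr class="even"><td><a href="position_detail.php?id=3">Analyst</a></td><td>Market</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# One selector list instead of two select() calls.
rows = soup.select("tr.even, tr.odd")

records = []
for row in rows:
    cells = row.select("td")              # Python lists are 0-based
    link = cells[0].select_one("a")
    records.append({
        "name": link.get_text(strip=True),
        "detailLink": link["href"],
        "catalog": cells[1].get_text(strip=True),
    })

names = [record["name"] for record in records]
print(names)
```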

    4. Format the output

        item['name'] = name
        item['detailLink'] = detailLink
        item['catalog'] = catalog
        item['recruitNumber'] = recruitNumber
        item['workLocation'] = workLocation
        item['publishTime'] = publishTime

        line = json.dumps(item, ensure_ascii=False)
        print(line)

        # append a newline so each record sits on its own line in the file
        output.write((line + '\n').encode('utf-8'))

    output.close()

    The output is:

    10
    {"detailLink": "position_detail.php?id=44802&keywords=&tid=0&lid=0", "catalog": "职能类", "publishTime": "2018-10-16", "recruitNumber": "1", "name": "23677-互娱服务采购经理", "workLocation": "深圳"}
    {"detailLink": "position_detail.php?id=44804&keywords=&tid=0&lid=0", "catalog": "市场类", "publishTime": "2018-10-16", "recruitNumber": "1", "name": "24549-渠道管理经理(政策管理方向-上海)", "workLocation": "上海"}
    {"detailLink": "position_detail.php?id=44806&keywords=&tid=0&lid=0", "catalog": "市场类", "publishTime": "2018-10-16", "recruitNumber": "1", "name": "24549-广告营销业务分析师(上海)", "workLocation": "上海"}
    {"detailLink": "position_detail.php?id=44800&keywords=&tid=0&lid=0", "catalog": "设计类", "publishTime": "2018-10-16", "recruitNumber": "2", "name": "21309-在线教育-运营视觉设计师(深圳)", "workLocation": "深圳"}
    {"detailLink": "position_detail.php?id=44795&keywords=&tid=0&lid=0", "catalog": "产品/项目类", "publishTime": "2018-10-16", "recruitNumber": "1", "name": "22989-数据库高级产品运营经理", "workLocation": "北京"}
    {"detailLink": "position_detail.php?id=44803&keywords=&tid=0&lid=0", "catalog": "技术类", "publishTime": "2018-10-16", "recruitNumber": "2", "name": "22989-腾讯云块存储底层开发工程师(深圳)", "workLocation": "深圳"}
    {"detailLink": "position_detail.php?id=44805&keywords=&tid=0&lid=0", "catalog": "市场类", "publishTime": "2018-10-16", "recruitNumber": "1", "name": "24549-渠道管理经理(ROC管理方向-上海)", "workLocation": "上海"}
    {"detailLink": "position_detail.php?id=44809&keywords=&tid=0&lid=0", "catalog": "产品/项目类", "publishTime": "2018-10-16", "recruitNumber": "1", "name": "28297-RPG手游—市场和平台渠道推广(深圳)", "workLocation": "深圳"}
    {"detailLink": "position_detail.php?id=44801&keywords=&tid=0&lid=0", "catalog": "设计类", "publishTime": "2018-10-16", "recruitNumber": "2", "name": "21309-在线教育-UI设计师(深圳)", "workLocation": "深圳"}
    {"detailLink": "position_detail.php?id=44797&keywords=&tid=0&lid=0", "catalog": "市场类", "publishTime": "2018-10-16", "recruitNumber": "1", "name": "27087-海外区域中心空间运营经理(深圳)", "workLocation": "深圳"}

    Those are the two ways to scrape the site. Personally, I find Beautiful Soup the more convenient of the two.

  • Original post: https://www.cnblogs.com/Estate-47/p/9790305.html