
Python Web Scraper Development [Part 1] [The BeautifulSoup4 Parser]

CSS Selectors: BeautifulSoup4

Beautiful Soup is another HTML/XML parser. Its main job is likewise to parse HTML/XML documents and extract data from them.

Install with pip: pip install beautifulsoup4

Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0
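As a quick illustration of "parse and extract", here is a minimal sketch. The HTML fragment and its contents are made up for this example, and it assumes Python 3 with the built-in "html.parser" backend so that nothing beyond beautifulsoup4 has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, used only for illustration.
sample_html = """
<html><body>
  <h1 id="title">Job list</h1>
  <ul>
    <li><a href="/job/1">Backend engineer</a></li>
    <li><a href="/job/2">Data analyst</a></li>
  </ul>
</body></html>
"""

# "html.parser" ships with Python; "lxml" can be passed instead if it is installed.
soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag, find_all() returns every match.
title = soup.find("h1").get_text()
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]

print(title)
print(links)
```

The same soup object also accepts CSS selectors through select(), which is the approach the rest of this article uses.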

Scraping tool	Speed	Difficulty of use	Installation difficulty
Regular expressions	Fastest	Hard	None (built in)
BeautifulSoup	Slow	Easiest	Easy
lxml	Fast	Easy	Moderate
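To make the "difficulty of use" column more concrete, the sketch below runs a regular expression and BeautifulSoup over the same input. The two-paragraph snippet is invented for this example, and "html.parser" is assumed as the backend. It says nothing about speed; it only suggests why a hand-written pattern tends to be harder to get right.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical snippet: the same attribute written with two quoting styles.
sample_html = "<p class=\"city\">Shenzhen</p><p class='city'>Beijing</p>"

# The pattern only anticipates double quotes, so it misses the second tag.
regex_cities = re.findall(r'<p class="city">(.*?)</p>', sample_html)

# The parser normalizes the markup, so one CSS selector matches both tags.
soup = BeautifulSoup(sample_html, "html.parser")
bs4_cities = [p.get_text() for p in soup.select("p.city")]

print(regex_cities)
print(bs4_cities)
```

A more tolerant pattern could of course be written, but each markup variation has to be anticipated by hand, whereas the parser absorbs them.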

Scraping the Tencent experienced-hire recruitment page with BeautifulSoup4

URL: http://hr.tencent.com/position.php?&start=10#a

# bs4_tencent.py (Python 3)

import json    # results are stored in JSON format
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def tencent():
    url = 'http://hr.tencent.com/'
    request = Request(url + 'position.php?&start=10#a')
    response = urlopen(request)
    resHtml = response.read()

    html = BeautifulSoup(resHtml, 'lxml')

    # Build the CSS selectors: job rows alternate between class "even" and "odd"
    result = html.select('tr[class="even"]')
    result2 = html.select('tr[class="odd"]')
    result += result2

    items = []
    for site in result:
        item = {}

        # The first cell holds the job title link; the others are plain text
        name = site.select('td a')[0].get_text()
        detailLink = site.select('td a')[0].attrs['href']
        catalog = site.select('td')[1].get_text()
        recruitNumber = site.select('td')[2].get_text()
        workLocation = site.select('td')[3].get_text()
        publishTime = site.select('td')[4].get_text()

        item['name'] = name
        item['detailLink'] = url + detailLink
        item['catalog'] = catalog
        item['recruitNumber'] = recruitNumber
        item['workLocation'] = workLocation
        item['publishTime'] = publishTime

        items.append(item)

    # Disable ASCII escaping so non-ASCII text is kept as-is
    line = json.dumps(items, ensure_ascii=False)

    # Write the result as UTF-8; the with block closes the file
    with open('tencent.json', 'w', encoding='utf-8') as output:
        output.write(line)


if __name__ == "__main__":
    tencent()
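The selector and JSON steps of the script can be tried without network access. The sketch below is not the real Tencent markup: the table, its rows, and the field values are hypothetical and only imitate the five-column layout the script expects. It also assumes the "html.parser" backend and a reasonably recent beautifulsoup4, where select() accepts a comma-separated selector group.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical markup imitating the row layout the scraper expects.
sample_html = """
<table>
  <tr class="h"><td>Position</td><td>Category</td><td>Openings</td><td>Location</td><td>Published</td></tr>
  <tr class="even"><td><a href="position_detail.php?id=1">后台开发工程师</a></td><td>技术类</td><td>2</td><td>深圳</td><td>2018-08-11</td></tr>
  <tr class="odd"><td><a href="position_detail.php?id=2">产品经理</a></td><td>产品类</td><td>1</td><td>北京</td><td>2018-08-10</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# A selector group matches both classes in one call, in document order.
rows = soup.select("tr.even, tr.odd")

items = []
for row in rows:
    cells = row.select("td")
    link = cells[0].select("a")[0]
    items.append({
        "name": link.get_text(),
        "detailLink": link.attrs["href"],
        "catalog": cells[1].get_text(),
        "recruitNumber": cells[2].get_text(),
        "workLocation": cells[3].get_text(),
        "publishTime": cells[4].get_text(),
    })

# Default: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(items)
# ensure_ascii=False keeps the original characters in the output string.
readable = json.dumps(items, ensure_ascii=False)

print(len(rows))
print(escaped)
```

Both strings decode to the same data; ensure_ascii=False only makes the saved file readable by eye, which is why the script pairs it with UTF-8 output.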


Original article: https://www.cnblogs.com/loser1949/p/9460821.html