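CSS Selectors: BeautifulSoup4
Beautiful Soup is also an HTML/XML parser; its main job is likewise to parse HTML/XML documents and extract data from them.
Install with pip: pip install beautifulsoup4
Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0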
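As a minimal sketch of what parsing and extracting with CSS selectors looks like, the snippet below runs `select()` and `select_one()` against a small inline document. The sample markup and variable names are made up for illustration, and it assumes the standard-library `html.parser` backend so that nothing beyond `beautifulsoup4` is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
SAMPLE_HTML = """
<html><body>
  <div id="main">
    <p class="title">First paragraph</p>
    <p class="body">Second paragraph with a <a href="/docs">link</a></p>
  </div>
</body></html>
"""

# "html.parser" ships with Python; "lxml" can be passed instead if it is installed.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Class selector: every <p> whose class is "title".
titles = [p.get_text() for p in soup.select("p.title")]

# ID plus descendant selector: every <a> inside the element with id="main".
links = [a["href"] for a in soup.select("#main a")]

# Attribute selector; select_one() returns the first match or None.
first_body = soup.select_one('p[class="body"]')

print(titles)
print(links)
print(first_body.get_text())
```

`select()` always returns a list (possibly empty), while `select_one()` returns a single element or `None`, so the latter is convenient when only one match is expected.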
Scraping tool | Speed | Difficulty of use | Installation difficulty |
---|---|---|---|
Regular expressions | Fastest | Hard | None (built in) |
BeautifulSoup | Slow | Easiest | Easy |
lxml | Fast | Easy | Moderate |
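To make the ease-of-use column concrete, here is a small sketch that extracts the same links once with a regular expression and once with BeautifulSoup. The sample markup is hypothetical, and the snippet does not measure speed; it only contrasts how the two approaches are written.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
SAMPLE_HTML = (
    '<ul>'
    '<li><a href="/a">Alpha</a></li>'
    '<li><a class="x" href="/b">Beta</a></li>'
    '</ul>'
)

# Regular expression: no dependencies, but the pattern is tied to the exact
# markup (attribute quoting, attribute order, no nested tags in the link text).
regex_links = re.findall(r'<a[^>]*href="([^"]*)"[^>]*>(.*?)</a>', SAMPLE_HTML)

# BeautifulSoup: the document is parsed first, then queried with a CSS selector.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
soup_links = [(a["href"], a.get_text()) for a in soup.select("li a")]

print(regex_links)
print(soup_links)
```

Both approaches find the same two links here, but the selector version keeps working if attributes are reordered or quoted differently, which is a large part of why it is rated the easiest to use.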
Scraping the Tencent job listings page with BeautifulSoup4
URL: http://hr.tencent.com/position.php?&start=10#a
    # bs4_tencent.py

    import json  # results are stored as JSON
    # urllib.request is the Python 3 replacement for urllib2
    from urllib.request import Request, urlopen

    from bs4 import BeautifulSoup


    def tencent():
        url = 'http://hr.tencent.com/'
        request = Request(url + 'position.php?&start=10#a')
        response = urlopen(request)
        resHtml = response.read()

        html = BeautifulSoup(resHtml, 'lxml')

        # Select the listing rows with CSS selectors
        result = html.select('tr[class="even"]')
        result2 = html.select('tr[class="odd"]')
        result += result2

        items = []
        for site in result:
            item = {}

            # The first cell holds the job title and the link to its detail page
            name = site.select('td a')[0].get_text()
            detailLink = site.select('td a')[0].attrs['href']
            catalog = site.select('td')[1].get_text()
            recruitNumber = site.select('td')[2].get_text()
            workLocation = site.select('td')[3].get_text()
            publishTime = site.select('td')[4].get_text()

            item['name'] = name
            item['detailLink'] = url + detailLink
            item['catalog'] = catalog
            item['recruitNumber'] = recruitNumber
            item['workLocation'] = workLocation
            item['publishTime'] = publishTime

            items.append(item)

        # Disable ASCII escaping so non-ASCII text is written as UTF-8
        line = json.dumps(items, ensure_ascii=False)

        with open('tencent.json', 'w', encoding='utf-8') as output:
            output.write(line)


    if __name__ == "__main__":
        tencent()
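Because the script above depends on a live page, the row-extraction step can be easier to follow when tried offline. The sketch below applies the same idea to a small inline table. The sample rows, the `parse_rows` helper, and the `http://example.com/` base URL are all hypothetical, and it assumes the page lays out each row as five `td` cells in the order the script reads them.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical sample rows that mimic the layout the script expects:
# title link, category, head count, location, publish date.
SAMPLE_HTML = """
<table>
  <tr class="even">
    <td><a href="position_detail.php?id=1">Backend Engineer</a></td>
    <td>Technology</td><td>2</td><td>Shenzhen</td><td>2017-01-01</td>
  </tr>
  <tr class="odd">
    <td><a href="position_detail.php?id=2">Product Manager</a></td>
    <td>Product</td><td>1</td><td>Beijing</td><td>2017-01-02</td>
  </tr>
  <tr class="footer"><td>not a listing row</td></tr>
</table>
"""


def parse_rows(html, base_url):
    """Return one dict per listing row (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    # A selector group matches both row classes in one pass, in document order.
    for row in soup.select("tr.even, tr.odd"):
        cells = row.select("td")
        link = row.select_one("td a")
        # Skip rows that do not have the expected shape instead of raising.
        if link is None or len(cells) < 5:
            continue
        items.append({
            "name": link.get_text(strip=True),
            "detailLink": urljoin(base_url, link["href"]),
            "catalog": cells[1].get_text(strip=True),
            "recruitNumber": cells[2].get_text(strip=True),
            "workLocation": cells[3].get_text(strip=True),
            "publishTime": cells[4].get_text(strip=True),
        })
    return items


items = parse_rows(SAMPLE_HTML, "http://example.com/")
for item in items:
    print(item)
```

Unlike the two separate `select()` calls in the script, the grouped selector keeps rows in page order rather than listing all `even` rows before the `odd` ones, and `urljoin` handles relative links without manual string concatenation.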