  • Python Crawler Development [Part 1] [The BeautifulSoup4 Parser]

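    CSS selectors: BeautifulSoup4

    Beautiful Soup is also an HTML/XML parser; its main job is likewise to parse HTML/XML documents and extract data from them.

    Install with pip: pip install beautifulsoup4

    Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0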
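
As a minimal sketch of what parsing and extracting looks like in practice, the snippet below runs a few CSS selectors against a small inline document. The HTML is a hypothetical sample made up for illustration, and "html.parser" (the parser bundled with Python) is used so that nothing beyond beautifulsoup4 is assumed to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet used only for illustration.
SAMPLE_HTML = """
<html><head><title>Demo page</title></head>
<body>
  <div id="main">
    <p class="intro">Hello, <b>soup</b>!</p>
    <a class="link" href="/a.html">First</a>
    <a class="link external" href="http://example.com/b.html">Second</a>
  </div>
</body></html>
"""

# "html.parser" ships with Python; "lxml" can be passed instead if it is installed.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

title = soup.title.get_text()                        # text of the <title> tag
intro = soup.select("p.intro")[0].get_text()         # select by tag name and class
hrefs = [a["href"] for a in soup.select("#main a")]  # <a> descendants of id="main"
external = soup.select('a[href^="http"]')            # href attribute starts with "http"

print(title)
print(intro)
print(hrefs)
print([a.get_text() for a in external])
```

select() always returns a list of matching tags, which is why the first match is taken with [0] when a single element is expected.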

    Tool             Speed          Ease of use    Installation
    Regex            Fastest        Hard           None (built in)
    BeautifulSoup    (not given)    Easiest        Easy
    lxml             (not given)    Easy           Moderate
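
To make the ease-of-use column more concrete, here is a rough sketch that pulls the same links out of one snippet twice, once with a regular expression and once with BeautifulSoup. The markup and the HREF_RE pattern are assumptions invented for this comparison, not taken from the original post.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration; note the differing attribute order and quoting.
SAMPLE_HTML = """
<ul>
  <li><a href="/jobs/1" class="job">Backend Engineer</a></li>
  <li><a class="job" href='/jobs/2'>Data Analyst</a></li>
</ul>
"""

# Regex approach: no extra dependency, but the pattern has to anticipate
# every variation in quoting and attribute order by hand.
HREF_RE = re.compile(r"""<a\b[^>]*?\bhref=["']([^"']+)["']""")
regex_links = HREF_RE.findall(SAMPLE_HTML)

# BeautifulSoup approach: the parser absorbs the markup variations,
# so the selector only describes what is wanted.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
soup_links = [a["href"] for a in soup.select("a.job")]

print(regex_links)
print(soup_links)
```

Both approaches find the same two links here; the difference is how much of the markup's shape has to be encoded in the pattern.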

     

    Scraping the Tencent careers page with BeautifulSoup4

    URL: http://hr.tencent.com/position.php?&start=10#a

    # bs4_tencent.py
    # Python 3 version: urllib2 from Python 2 is now urllib.request.

    import json    # results are stored as JSON
    import urllib.request

    from bs4 import BeautifulSoup


    def tencent():
        url = 'http://hr.tencent.com/'
        request = urllib.request.Request(url + 'position.php?&start=10#a')
        response = urllib.request.urlopen(request)
        resHtml = response.read()

        html = BeautifulSoup(resHtml, 'lxml')

        # CSS selectors: the job rows alternate between class "even" and "odd"
        result = html.select('tr[class="even"]')
        result2 = html.select('tr[class="odd"]')
        result += result2

        items = []
        for site in result:
            item = {}

            # The first cell holds the job title as a link; the others are plain text
            name = site.select('td a')[0].get_text()
            detailLink = site.select('td a')[0].attrs['href']
            catalog = site.select('td')[1].get_text()
            recruitNumber = site.select('td')[2].get_text()
            workLocation = site.select('td')[3].get_text()
            publishTime = site.select('td')[4].get_text()

            item['name'] = name
            item['detailLink'] = url + detailLink
            item['catalog'] = catalog
            item['recruitNumber'] = recruitNumber
            item['workLocation'] = workLocation
            item['publishTime'] = publishTime

            items.append(item)

        # ensure_ascii=False keeps non-ASCII text readable; the file is written as UTF-8
        line = json.dumps(items, ensure_ascii=False)

        with open('tencent.json', 'w', encoding='utf-8') as output:
            output.write(line)


    if __name__ == "__main__":
        tencent()
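
The row-extraction logic can be tried without any network access. The sketch below is an assumption-laden variant, not the author's code: SAMPLE_HTML is hypothetical markup that imitates the even/odd row layout the script expects, and parse_rows and FIELDS are names introduced here for illustration. It also swaps the two separate select() calls for a single comma-separated selector list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the even/odd row layout the script expects;
# the real page may differ.
SAMPLE_HTML = """
<table>
  <tr class="even">
    <td><a href="position_detail.php?id=1">Backend Engineer</a></td>
    <td>Technology</td><td>2</td><td>Shenzhen</td><td>2018-08-10</td>
  </tr>
  <tr class="odd">
    <td><a href="position_detail.php?id=2">Product Manager</a></td>
    <td>Product</td><td>1</td><td>Beijing</td><td>2018-08-09</td>
  </tr>
</table>
"""

# Field names for cells 1..4, in column order (cell 0 is the title link).
FIELDS = ("catalog", "recruitNumber", "workLocation", "publishTime")


def parse_rows(html, base_url="http://hr.tencent.com/"):
    """Return one dict per table row matching tr.even or tr.odd."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    # A comma-separated selector list matches rows of either class.
    for row in soup.select("tr.even, tr.odd"):
        cells = row.select("td")
        link = cells[0].select_one("a")
        item = {
            "name": link.get_text(strip=True),
            "detailLink": base_url + link["href"],
        }
        # Map the remaining cells onto the field names in order.
        for field, cell in zip(FIELDS, cells[1:]):
            item[field] = cell.get_text(strip=True)
        items.append(item)
    return items


rows = parse_rows(SAMPLE_HTML)
print(rows)
```

Passing strip=True to get_text() trims the whitespace that table cells often carry, which keeps the stored JSON values clean.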

     

     

  • Original article: https://www.cnblogs.com/loser1949/p/9460821.html