  • Scraping Zhihu Explore with Beautiful Soup [method selector: find_all] [CSS selector: select]

    Using Beautiful Soup

    Beautiful Soup relies on a parser to do its actual parsing. Besides the HTML parser in the Python standard library, it also supports a number of third-party parsers (such as lxml).

    Parser                  | Usage                                | Advantages                                                                             | Disadvantages
    Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; good error tolerance                                | Poor error tolerance in versions before Python 2.7.3 and 3.2.2
    lxml HTML parser        | BeautifulSoup(markup, "lxml")        | Fast; good error tolerance                                                             | Requires the C library to be installed
    lxml XML parser         | BeautifulSoup(markup, "xml")         | Fast; the only parser that supports XML                                                | Requires the C library to be installed
    html5lib                | BeautifulSoup(markup, "html5lib")    | Best error tolerance; parses documents the way a browser does; generates HTML5 output  | Slow; does not rely on external extensions

    1. The lxml parser can handle both HTML and XML, it is fast, and its error tolerance is good, so we use it first.
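
    As a quick sanity check of the lxml parser, here is a minimal sketch (the HTML fragment is made up for illustration):

    from bs4 import BeautifulSoup

    # A fragment with unclosed <li> tags, to show lxml's error tolerance.
    html = '<ul><li>first<li>second</ul>'
    soup = BeautifulSoup(html, 'lxml')   # requires lxml: pip install lxml
    print([li.string for li in soup.find_all('li')])   # ['first', 'second']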

    The username of a feed item can appear in two different places: as a linked author (username pattern 1, class author-link) or as a plain name (username pattern 2, class name), so check which one exists:

    if item.find_all(class_='author-link'):
        author = item.find_all(class_='author-link')[0].string
    else:
        author = item.find_all(class_='name')[0].string
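
    A minimal sketch of the two patterns this fallback assumes (hypothetical fragments, not the real Zhihu markup):

    from bs4 import BeautifulSoup

    # Hypothetical item fragments: a linked author vs. an anonymous name.
    linked = BeautifulSoup('<div><a class="author-link">Alice</a></div>', 'lxml')
    anonymous = BeautifulSoup('<div><span class="name">anonymous user</span></div>', 'lxml')

    for item in (linked, anonymous):
        if item.find_all(class_='author-link'):
            author = item.find_all(class_='author-link')[0].string
        else:
            author = item.find_all(class_='name')[0].string
        print(author)   # Alice, then: anonymous user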


    In addition, there are a number of other query methods whose usage is exactly the same as the find_all() and find() methods described above; only the query scope differs. A brief rundown (a short demonstration follows the list):

    find_parents() and find_parent(): the former returns all ancestor nodes; the latter returns the direct parent node.

    find_next_siblings() and find_next_sibling(): the former returns all following sibling nodes; the latter returns the first following sibling.

    find_previous_siblings() and find_previous_sibling(): the former returns all preceding sibling nodes; the latter returns the first preceding sibling.

    find_all_next() and find_next(): the former returns all matching nodes after the current node; the latter returns the first matching node.

    find_all_previous() and find_previous(): the former returns all matching nodes before the current node; the latter returns the first matching node.
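
    A minimal sketch of these relative queries on a small, made-up document:

    from bs4 import BeautifulSoup

    html = '''
    <div class="post">
        <p class="first">one</p>
        <p class="second">two</p>
        <p class="third">three</p>
    </div>
    '''
    soup = BeautifulSoup(html, 'lxml')
    second = soup.find(class_='second')

    print(second.find_parent('div')['class'])                  # ['post']
    print(second.find_next_sibling('p').string)                # three
    print(second.find_previous_sibling('p').string)            # one
    print([t.string for t in second.find_all_next('p')])       # ['three']
    print([t.string for t in second.find_all_previous('p')])   # ['one']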

     

     

    The value can be taken either from the tag's text or from one of its attributes; for the bio field, both work:

    q = item.find_all(class_='bio')[0].string           # the tag's text
    q = item.find_all(class_='bio')[0].attrs['title']   # the title attribute

    import requests
    import json
    from bs4 import BeautifulSoup

    url = 'https://www.zhihu.com/explore'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    explore = {}
    items = soup.find_all(class_='explore-feed feed-item')
    for item in items:
        question = item.find_all('h2')[0].string
        #print(question)
        if item.find_all(class_='author-link'):
            author = item.find_all(class_='author-link')[0].string
        else:
            author = item.find_all(class_='name')[0].string
        #print(author)
        answer = item.find_all(class_='content')[0].string
        #print(answer)
        #q = item.find_all(class_='bio')[0].string
        q = item.find_all(class_='bio')[0].attrs['title']
        #print(q)

        explore = {
            "question": question,
            "author": author,
            "answer": answer,
            "q": q,
        }

        with open("explore.json", "a") as f:
            #f.write(json.dumps(items, ensure_ascii=False).encode("utf-8") + "\n")
            f.write(json.dumps(explore, ensure_ascii=False) + "\n")
    An alternative is to loop with get('title'), which returns None when the attribute is missing instead of raising an error:

    for t in item.find_all(class_='bio'):
        q = t.get('title')
    import requests
    import json
    from bs4 import BeautifulSoup

    url = 'https://www.zhihu.com/explore'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    explore = {}
    items = soup.find_all(class_='explore-feed feed-item')
    for item in items:
        question = item.find_all('h2')[0].string
        #print(question)
        if item.find_all(class_='author-link'):
            author = item.find_all(class_='author-link')[0].string
        else:
            author = item.find_all(class_='name')[0].string
        #print(author)
        answer = item.find_all(class_='content')[0].string
        #print(answer)
        #q = item.find_all(class_='bio')[0].string
        #q = item.find_all(class_='bio')[0].attrs['title']
        for t in item.find_all(class_='bio'):
            q = t.get('title')
        print(q)

        explore = {
            "question": question,
            "author": author,
            "answer": answer,
            "q": q,
        }

        with open("explore.json", "a") as f:
            #f.write(json.dumps(items, ensure_ascii=False).encode("utf-8") + "\n")
            f.write(json.dumps(explore, ensure_ascii=False) + "\n")

    2. Using the HTML parser from the Python standard library

    soup = BeautifulSoup(r.text, 'html.parser')
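
    Only the parser argument changes; the rest of the script stays the same, and html.parser needs no extra installation. A minimal sketch:

    from bs4 import BeautifulSoup

    # html.parser ships with Python, so nothing extra has to be installed.
    soup = BeautifulSoup('<h2>title</h2>', 'html.parser')
    print(soup.h2.string)   # title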

    3. Beautiful Soup also provides another kind of selector: CSS selectors.

    To use CSS selectors, simply call the select() method and pass in the corresponding CSS selector.
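
    A minimal sketch of select() on a made-up fragment (the id and class names here are for illustration only):

    from bs4 import BeautifulSoup

    html = '<div id="feed"><div class="item"><h2>q1</h2></div><div class="item"><h2>q2</h2></div></div>'
    soup = BeautifulSoup(html, 'lxml')

    print(len(soup.select('#feed .item')))                # 2 -- descendants of the id
    print([h.string for h in soup.select('.item h2')])    # ['q1', 'q2']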

    import requests
    from bs4 import BeautifulSoup
    import json

    url = 'https://www.zhihu.com/explore'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    #print(soup)
    explore = {}
    items = soup.select('.explore-tab .feed-item')
    #items = soup.select('#js-explore-tab .explore-feed feed-item')
    #print(items)
    for item in items:

        question = item.select('h2')[0].string
        if item.select('.author-link'):
            author = item.select('.author-link')[0].string
        else:
            author = item.select('.name')[0].string
        answer = item.select('.content')[0].string
        if item.select('.bio'):
            q = item.select('.bio')[0].string
        # note: if '.bio' is missing, q keeps its value from the previous item
        # (and is undefined on the first iteration); the next version handles this.
        #print(question)
        #print(author)
        #print(answer)
        #print(q)
        explore = {
            "question": question,
            "author": author,
            "answer": answer,
            "q": q,
        }

        with open("explore.json", "a") as f:
            #f.write(json.dumps(items, ensure_ascii=False).encode("utf-8") + "\n")
            f.write(json.dumps(explore, ensure_ascii=False) + "\n")

    To get the text, besides the string attribute there is also a method: get_text().
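
    The difference matters when a tag has more than one child: string returns None, while get_text() concatenates all the text. A minimal sketch:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<h2>How do <em>parsers</em> differ?</h2>', 'lxml')
    print(soup.h2.string)      # None -- the h2 has several children
    print(soup.h2.get_text())  # How do parsers differ?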

    import requests
    from bs4 import BeautifulSoup
    import json

    url = 'https://www.zhihu.com/explore'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    #print(soup)
    explore = {}
    items = soup.select('.explore-tab .feed-item')
    #items = soup.select('#js-explore-tab .explore-feed feed-item')
    #print(items)
    for item in items:

        question = item.select('h2')[0].get_text()
        if item.select('.author-link'):
            author = item.select('.author-link')[0].get_text()
        else:
            author = item.select('.name')[0].get_text()
        answer = item.select('.content')[0].get_text()
        if item.select('.bio'):
            #q = item.select('.bio')[0].string
            q = item.select('.bio')[0].attrs['title']
        else:
            q = None
        #print(question)
        #print(author)
        #print(answer)
        #print(q)
        explore = {
            "question": question,
            "author": author,
            "answer": answer,
            "q": q,
        }

        with open("explore.json", "a") as f:
            #f.write(json.dumps(items, ensure_ascii=False).encode("utf-8") + "\n")
            f.write(json.dumps(explore, ensure_ascii=False) + "\n")
  • Original post: https://www.cnblogs.com/wanglinjie/p/9249230.html