
Scraping a Zhihu Question Collection with Python: A Beginner's Crawler

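Introduction

Zhihu is a fairly easy site to scrape: it has no elaborate anti-scraping measures, which makes it good practice for people who are new to web crawlers.
Since I have only just started with Python, for now I simply save the main information about popular Zhihu questions to a database so that it can be used for data analysis later. The pages crawled come from a collection of answers with more than 1,000 upvotes.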

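Page analysis

1. Analyze the page structure

The goal is to extract each popular question's title, answer author, upvote count, comment count, and similar fields.

[Figure: page layout analysis]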

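2. Analyze the page elements

Select the elements that hold the content to scrape and note their distinguishing features (class, id, etc.); BeautifulSoup will use these later to extract the content.

[Figure: HTML analysis]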

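3. Parse the fetched pages with BeautifulSoup

The page number in these URLs increases by one from page to page, so each page's link can be built by string concatenation.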

url_part = "https://www.zhihu.com/collection/19928423?page="  # collection of answers with more than 1,000 upvotes
url = url_part + str(i)  # i is the page number; append it to build the page URL
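The concatenation above can be wrapped in a small generator so that the whole page range is produced in one place. This is only a minimal sketch: the helper name page_urls and the page range 1 to 3 are assumptions made for illustration, and only the base URL comes from the article.

```python
URL_PART = "https://www.zhihu.com/collection/19928423?page="  # base URL from the article


def page_urls(first_page, last_page):
    """Yield the URL of every page from first_page to last_page, inclusive.

    page_urls is a hypothetical helper, not part of the original code.
    """
    for page in range(first_page, last_page + 1):
        yield URL_PART + str(page)


urls = list(page_urls(1, 3))
for url in urls:
    print(url)
```

Using last_page + 1 as the upper bound of range makes the last page inclusive, which is easy to get wrong when the range is written inline.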

The part of the code that parses the page with BeautifulSoup:

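def find_answers(url, collection):
    get_html = requests.get(url, headers=Web.headers)  # fetch the page with requests
    soup = BeautifulSoup(get_html.text, 'lxml')  # parse the page with BeautifulSoup
    items = soup.find_all('div', class_="zm-item")  # every answer entry on the page
    success = 0
    error = 0
    for item in items:
        try:
            data = store_answer(item)
            collection.insert_one(data)  # insert the record (Collection.insert was removed in PyMongo 4)
        except AttributeError:
            error += 1  # an expected element was missing from this entry
        else:
            success += 1
    print("Error: %d" % error, end=' ')
    return success  # number of answers stored from this page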

def store_answer(answer):
    # find() returns None when an element is missing, so .text then raises
    # AttributeError, which the caller counts as an error.
    data = {
        "title": answer.find("h2", class_="zm-item-title").text,  # question title
        "like_num": answer.find("div", class_="zm-item-vote").text,  # upvote count
        "answer_user_name": answer.find("div", class_="answer-head").find("span", class_="author-link-line").text,  # author name
        "answer_user_sign": answer.find("div", class_="answer-head").find("span", class_="bio").text,  # author bio
        "answer": answer.find("div", class_="zh-summary summary clearfix").text,  # answer summary
        "time": answer.find("p", class_="visible-expanded").find("a", class_="answer-date-link meta-item").text,  # edit time
        "comment": answer.find("div", class_="zm-meta-panel").find(
            "a", class_="meta-item toggle-comment js-toggleCommentBox").text,  # comment count
        "link": answer.find("link").get("href")  # link to the answer
    }
    return data
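store_answer saves like_num and comment as the raw text of their elements, so the later data analysis needs them converted to numbers first. The sketch below shows one way to do that. The text formats it handles ("1234", "12K", "35 条评论") are assumptions about what the page renders, not something the article confirms, and parse_count is a hypothetical helper.

```python
import re


def parse_count(text):
    """Convert scraped count text to an int, or return None if it has no number.

    Assumed input formats (not confirmed by the article):
    "1234", "12K" or "1.5k" for thousands, and "35 条评论" for comments.
    """
    match = re.search(r"(\d+(?:\.\d+)?)\s*([kK]?)", text)
    if match is None:
        return None
    value = float(match.group(1))
    if match.group(2):  # a trailing K/k means thousands
        value *= 1000
    return int(value)


print(parse_count("\n1234\n"))
print(parse_count("12K"))
print(parse_count("35 条评论"))
print(parse_count("添加评论"))
```

Returning None for text without digits keeps "no comments yet" entries distinguishable from a real count of zero.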

4. Full code

import time  # used to time the run
import requests  # used to fetch pages
from bs4 import BeautifulSoup  # used to extract content from pages
from pymongo import MongoClient  # used to connect to the database


class Web:
    headers = {'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/"
                             "56.0.2924.87 Safari/537.36"}  # request headers
    url_part = "https://www.zhihu.com/collection/19928423?page="  # collection of answers with more than 1,000 upvotes


def get_collection():
    client = MongoClient('mongodb://localhost:27017/')  # connect to MongoDB
    db = client.data  # open the database "data" (rename as needed)
    collection = db.zhihu  # open the collection "zhihu" (rename as needed)
    return collection


def store_answer(answer):
    # find() returns None when an element is missing, so .text then raises
    # AttributeError, which find_answers counts as an error.
    data = {
        "title": answer.find("h2", class_="zm-item-title").text,  # question title
        "like_num": answer.find("div", class_="zm-item-vote").text,  # upvote count
        "answer_user_name": answer.find("div", class_="answer-head").find("span", class_="author-link-line").text,  # author name
        "answer_user_sign": answer.find("div", class_="answer-head").find("span", class_="bio").text,  # author bio
        "answer": answer.find("div", class_="zh-summary summary clearfix").text,  # answer summary
        "time": answer.find("p", class_="visible-expanded").find("a", class_="answer-date-link meta-item").text,  # edit time
        "comment": answer.find("div", class_="zm-meta-panel").find(
            "a", class_="meta-item toggle-comment js-toggleCommentBox").text,  # comment count
        "link": answer.find("link").get("href")  # link to the answer
    }
    return data


def find_answers(url, collection):
    get_html = requests.get(url, headers=Web.headers)  # fetch the page with requests
    soup = BeautifulSoup(get_html.text, 'lxml')  # parse the page with BeautifulSoup
    items = soup.find_all('div', class_="zm-item")  # every answer entry on the page
    success = 0
    error = 0
    for item in items:
        try:
            data = store_answer(item)
            collection.insert_one(data)  # insert the record (Collection.insert was removed in PyMongo 4)
        except AttributeError:
            error += 1  # an expected element was missing from this entry
        else:
            success += 1
    print("Error: %d" % error, end=' ')
    return success  # number of answers stored from this page


def get_zhihu():
    collection = get_collection()
    start_time = time.time()  # record the start time
    answer_num = 0  # number of answers stored so far
    start_page = 1  # first page to crawl
    last_page = 6319  # last page of the collection; adjust to the current page count
    page_list = list(range(start_page, last_page + 1))  # + 1 so the last page is included
    for page in page_list:
        print("Page: %d" % page)
        url = Web.url_part + str(page)
        try:
            answer_num += find_answers(url, collection)
            print("Used: %.1fs Total: %d" % (time.time() - start_time, answer_num))
        except Exception as e:
            print(e)
            # Re-queue the failed page at the end of the list. Note that a page
            # that keeps failing is retried indefinitely.
            page_list.append(page)


if __name__ == "__main__":
    get_zhihu()  # run the crawler
    print("Done")
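The crawler re-queues a failed page by appending it to page_list, which retries without any limit. One possible variation is to cap the number of attempts per page, sketched below. The function crawl_with_retries, its max_attempts parameter, and the fake_fetch stand-in are all hypothetical names introduced for illustration; in the real crawler, fetch would be a call to find_answers for that page.

```python
from collections import deque


def crawl_with_retries(pages, fetch, max_attempts=3):
    """Call fetch(page) for each page, retrying failures up to max_attempts times.

    Returns (total, failed): the sum of the fetch results and the list of
    pages that were given up on.
    """
    queue = deque((page, 1) for page in pages)  # (page number, attempt number)
    total = 0
    failed = []
    while queue:
        page, attempt = queue.popleft()
        try:
            total += fetch(page)
        except Exception as exc:
            if attempt < max_attempts:
                queue.append((page, attempt + 1))  # retry after the other pages
            else:
                print("Giving up on page %d: %s" % (page, exc))
                failed.append(page)
    return total, failed


# Stand-in for find_answers: page 2 fails once, page 3 always fails.
calls = {}


def fake_fetch(page):
    calls[page] = calls.get(page, 0) + 1
    if page == 3 or (page == 2 and calls[page] == 1):
        raise ConnectionError("simulated network error")
    return 10  # pretend ten answers were stored


total, failed = crawl_with_retries([1, 2, 3], fake_fetch)
print(total, failed)
```

A deque keeps the retry order explicit and avoids growing a list while iterating over it; the pages that were given up on are returned so they can be crawled again later.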

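The code targets Python 3.6 and needs dependencies such as pymongo and BeautifulSoup installed first. To install them, run pip3 install pymongo bs4 lxml -i https://pypi.douban.com/simple/ in a terminal (the code also imports requests, which must be installed as well if it is not already present). After the crawler runs, the database looks like the figure below.

[Figure: database contents]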

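In total, nearly 60,000 answers were scraped. To export the data to zhihu.csv (UTF-8 encoded), run mongoexport -d data -c zhihu --type=csv -o zhihu.csv -f title,answer_user_name,link,like_num,comment,time,answer,answer_user_sign in a terminal. The names after -f must match the keys that store_answer writes, and --type=csv replaces the deprecated --csv flag. Changing the command's parameters exports JSON instead.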
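Once the data is exported, the CSV file can be read back with Python's standard csv module for a first look. The sketch below is an illustration only: it assumes the CSV header uses the keys written by store_answer (title, answer_user_name, like_num), the sample rows are made up, and top_answers is a hypothetical helper.

```python
import csv
import io

# Made-up sample rows standing in for zhihu.csv; not real scraped data.
SAMPLE_CSV = """title,answer_user_name,like_num
Question A,Alice,1200
Question B,Bob,3400
Question C,Alice,2100
"""


def top_answers(csv_file, n):
    """Return the n rows with the highest like_num, highest first."""
    rows = list(csv.DictReader(csv_file))
    rows.sort(key=lambda row: int(row["like_num"]), reverse=True)
    return rows[:n]


top = top_answers(io.StringIO(SAMPLE_CSV), 2)
for row in top:
    print(row["like_num"], row["title"], row["answer_user_name"])
```

For the real export, pass open("zhihu.csv", encoding="utf-8", newline="") in place of the in-memory sample. Because like_num is stored as scraped text, values that are not plain integers would need cleaning before int() is applied.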

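0. Installation and database basics

Running this project requires Python 3 and the MongoDB database. A GUI tool such as Compass or Robomongo can be used to manage the database.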


Original article: https://www.cnblogs.com/ZKin/p/9471054.html
