  • Python Web Scraping Basics: Jianshu's Seven-Day Trending Articles

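    Preface

    As I understand it, writing a crawler breaks down into the following steps:

    1. Analyze the target site
    2. Request a single page URL and get its HTML source
    3. Extract the data
    4. Save the data
    5. Crawl the remaining pages

    Now on to the main topic.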

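    1. Analyze the target site

    1. The target site is Jianshu's seven-day trending articles page, http://www.jianshu.com/trending/weekly . The data to extract for each article are the author, title, view count, comment count, like count, and reward count.

       [Figure: the fields to extract]
    2. Inspecting the page with Chrome DevTools shows that the article list is loaded via AJAX. The requests follow a pattern: the URL is http://www.jianshu.com/trending/weekly?page=1 , with page running from 1 to 5.

       [Figure: URL pattern]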
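
The URL pattern above can be turned into a list of page addresses before any request is made. This is a minimal sketch; the names `BASE_URL` and `build_page_urls` are illustrative and do not appear in the original post.

```python
# Base address taken from the post; the helper name is an assumption.
BASE_URL = 'http://www.jianshu.com/trending/weekly'


def build_page_urls(first=1, last=5):
    """Return the listing URLs for pages first..last (inclusive)."""
    return ['{}?page={}'.format(BASE_URL, page)
            for page in range(first, last + 1)]


page_urls = build_page_urls()
for page_url in page_urls:
    print(page_url)
```

Keeping the page range in one place makes it easy to adjust if the site ever serves more or fewer pages.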

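    2. Request a single page URL and get its HTML source

    1. Set the URL
    url = 'http://www.jianshu.com/trending/weekly?page=1'
    
    2. Set the request headers (to disguise the request as coming from a browser; can be omitted in this example)
    import urllib2  # Python 2 module; the complete code at the end uses Python 3's urllib.request
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'}
    request = urllib2.Request(url=url, headers=headers)
    
    3. Send the request and receive the response
    html = urllib2.urlopen(request)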
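
For readers on Python 3, the same three steps might look roughly like the sketch below, which uses `urllib.request` as the complete code at the end does. The helper names `build_request` and `fetch_html` and the `timeout` value are assumptions made for illustration. Only `build_request` is called here, so the example runs without touching the network.

```python
from urllib.request import Request, urlopen

USER_AGENT = ('Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36')


def build_request(url):
    """Create a Request that carries a browser-like User-Agent header."""
    return Request(url=url, headers={'User-Agent': USER_AGENT})


def fetch_html(url, timeout=10):
    """Download the page and return its body as bytes (does network I/O)."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


request = build_request('http://www.jianshu.com/trending/weekly?page=1')
print(request.full_url)
```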
    

    3. Extract data from the source

    # Parse the response with BeautifulSoup first so it can be queried later
    import re
    from bs4 import BeautifulSoup
    bsObj = BeautifulSoup(html.read(), 'lxml')
    
    1. Grab the HTML of each article and extract the target fields (clumsily written, but it works)

       [Figure: HTML source of one article entry]
    items = bsObj.findAll("div", {"class": "content"})
    for item in items:
        author = item.find("a", {"class": "blue-link"}).get_text()
        title = item.find("a", {"class": "title"}).get_text()
        other = item.find("div", {"class": "meta"}).get_text()
        pattern = re.compile(r'(\d+)')  # one or more digits
        content = re.findall(pattern, other)
        view = content[0]
        comment = content[1]
        like = content[2]
        money = content[3] if (len(content) == 4) else 0  # very sloppy; good enough for now
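
The number extraction above can be made a little more forgiving. The sketch below maps whatever numbers are found onto the four counters and fills any missing trailing values with 0. The function name `parse_meta` and the sample strings are hypothetical. Unlike the original snippet, it converts the matches to integers.

```python
import re

NUMBER = re.compile(r'\d+')
FIELDS = ('view', 'comment', 'like', 'money')


def parse_meta(text):
    """Map the numbers found in the meta text onto FIELDS, padding with 0."""
    numbers = [int(n) for n in NUMBER.findall(text)]
    # Missing trailing counters (for example, no rewards) become 0.
    numbers += [0] * (len(FIELDS) - len(numbers))
    return dict(zip(FIELDS, numbers))


# Made-up meta strings, with and without a reward count.
print(parse_meta(' 1024  36  128  3 '))
print(parse_meta(' 1024  36  128 '))
```

This still assumes the counters always appear in the same order, as the original code does.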
    

    4. Save the data

    import csv
    # 'a' appends, so each run adds rows to the existing file
    with open('articlesOfSevenDays.csv', 'a') as resultFile:
        wr = csv.writer(resultFile, dialect='excel')
        wr.writerow([author, title, view, comment, like, money])
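
As a rough illustration of the saving step in Python 3, the sketch below writes a header line and then one line per article. The helper `write_rows` and the sample row are made up for the example, and an in-memory buffer stands in for a real file so the output can be inspected directly.

```python
import csv
import io

HEADER = ('author', 'title', 'view', 'comment', 'like', 'money')


def write_rows(file_obj, rows):
    """Write a header line followed by one CSV line per article row."""
    writer = csv.writer(file_obj, dialect='excel')
    writer.writerow(HEADER)
    writer.writerows(rows)


# With a real file, open it as open(path, 'w', newline='', encoding='utf-8').
buffer = io.StringIO(newline='')
write_rows(buffer, [['some_author', 'A title, with a comma', 1024, 36, 128, 0]])
print(buffer.getvalue())
```

The csv module quotes fields that contain commas, so titles with punctuation do not break the column layout.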
    

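    Because I ran into an encoding problem, I added the following code (a Python 2-only workaround; sys.setdefaultencoding does not exist in Python 3):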

    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')
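
In Python 3, which the complete code targets, one way to sidestep this kind of problem is to pass the encoding explicitly when opening the file. The sketch below is an assumption about how that could look, not the author's method. It uses the `utf-8-sig` codec, which writes a byte order mark that spreadsheet programs such as Excel typically use to recognize UTF-8. The helper names and the sample rows are illustrative.

```python
import csv
import os
import tempfile


def save_utf8_csv(path, rows):
    """Write rows as UTF-8 with a leading BOM."""
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        csv.writer(f).writerows(rows)


def load_utf8_csv(path):
    """Read the rows back; 'utf-8-sig' strips the BOM on input."""
    with open(path, newline='', encoding='utf-8-sig') as f:
        return list(csv.reader(f))


# A temporary directory keeps the demo from touching real output files.
path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
save_utf8_csv(path, [['作者', '标题'], ['简书用户', '七日热门']])
rows = load_utf8_csv(path)
print(rows)
```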
    

    5. Crawl the remaining pages

    for i in range(1, 6):
        print("Fetching page {}...".format(i))
        url = 'http://www.jianshu.com/trending/weekly?page={}'.format(i)
        # repeat the extraction and saving code from above
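
The loop above can be sketched as a small function that takes the fetching and parsing steps as arguments and pauses briefly between requests. The names `crawl_pages`, `fake_fetch`, and `fake_parse` and the `delay` parameter are assumptions. The stand-in callables exist only so the sketch runs without network access.

```python
import time


def crawl_pages(fetch_page, parse_page, pages=range(1, 6), delay=1.0):
    """Fetch and parse each page in turn, returning all rows in page order."""
    rows = []
    for index, page in enumerate(pages):
        if index and delay:
            time.sleep(delay)  # pause between requests, not before the first
        rows.extend(parse_page(fetch_page(page)))
    return rows


# Stand-ins for the real fetch and parse steps.
def fake_fetch(page):
    return 'page-{}'.format(page)


def fake_parse(html):
    return [[html, 'row']]


all_rows = crawl_pages(fake_fetch, fake_parse, delay=0)
print(len(all_rows))
```

Passing the two steps in as functions keeps the loop independent of how a page is downloaded or parsed.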
    

    Complete code

    #!/usr/bin/env python
    # coding=utf-8
    from urllib.request import Request,urlopen
    from bs4 import BeautifulSoup
    from urllib.error import HTTPError
    import re
    import csv
    import os
    
    
    def getHTML(i):
        url = 'http://www.jianshu.com/trending/weekly?page={}'.format(i)
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'}
        try:
            request = Request(url=url, headers=headers)
            html = urlopen(request)
            bsObj = BeautifulSoup(html.read(), 'lxml')
            items = bsObj.findAll("div", {"class": "content"})
        except HTTPError as e:
            print(e)
            exit()
        return items
    
    def getArticleInfo(items):
        articleInfo= []
        for item in items:
            author = item.find("a", {"class": "blue-link"}).get_text()
            title = item.find("a", {"class": "title"}).get_text()
            other = item.find("div", {"class": "meta"}).get_text()
            pattern = re.compile(r'(\d+)')  # one or more digits
            content = re.findall(pattern, other)
            view = content[0]
            comment = content[1]
            like = content[2]
            money = content[3] if (len(content) == 4) else 0  # not very rigorous
            articleInfo.append([author, title, view, comment, like, money])
        return articleInfo
    
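    out_dir = "../jianshu/"  # renamed from dir to avoid shadowing the built-in
    if not os.path.exists(out_dir):
        os.makedirs(out_dir)
    # newline='' keeps the csv module from writing blank lines between rows on Windows
    csvFile = open("../jianshu/jianshuSevenDaysArticles.csv", "wt", encoding='utf-8', newline='')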
    writer = csv.writer(csvFile)
    writer.writerow(("author", "title", "view", "comment", "like", "money"))
    try:
        for i in range(1, 6):
            items = getHTML(i)
            articleInfo = getArticleInfo(items)
            for item in articleInfo:
                writer.writerow(item)
    
    finally:
        csvFile.close()
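
One thing to note in the code above: `getHTML` calls `exit()` on the first `HTTPError`, which stops the whole crawl. A possible alternative, sketched here under the assumption that skipping a failed page is acceptable, is to return `None` and let the caller move on. The names `fetch_or_none`, `ok_fetch`, and `failing_fetch` are hypothetical, and the two fetchers only simulate success and failure offline.

```python
import io
from urllib.error import HTTPError, URLError


def fetch_or_none(fetch, url):
    """Call fetch(url); return None instead of exiting when the request fails."""
    try:
        return fetch(url)
    except URLError as error:  # HTTPError is a subclass of URLError
        print('skipping {}: {}'.format(url, error))
        return None


# Simulated fetchers used only to exercise both paths.
def ok_fetch(url):
    return b'<html></html>'


def failing_fetch(url):
    raise HTTPError(url, 404, 'Not Found', {}, io.BytesIO())


good = fetch_or_none(ok_fetch, 'http://www.jianshu.com/trending/weekly?page=1')
bad = fetch_or_none(failing_fetch, 'http://www.jianshu.com/trending/weekly?page=99')
```

The main loop would then check for `None` and continue with the next page.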
    

    Scraping results

    [Figure: screenshot of the scraped results]

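    Summary

    1. My page-parsing skills are weak; next I need to study regular expressions, BeautifulSoup, and lxml.
    2. The encoding problem I ran into still needs further study.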
     
     
  • Original article: https://www.cnblogs.com/jeff-ideas/p/10540351.html