  • Scrape all of the campus news

    Assignment requirements: https://edu.cnblogs.com/campus/gzcc/GZCC-16SE1/homework/3002

    0. Get the click count from a news URL, and wrap the logic in a function

    • newsUrl
    • newsId(re.search())
    • clickUrl(str.format())
    • requests.get(clickUrl)
    • re.search()/.split()
    • str.lstrip(),str.rstrip()
    • int
    • wrap it all in a function
    • also wrap getting the news publication time (with type conversion) in a function
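
    The string-processing steps above (re.search(), str.format(), int()) can be sketched on a canned response. The counter-API URL pattern is taken from the code further down; the shape of the response text is an assumption inferred from the parsing regex:

    ```python
    import re

    # Example article URL -- the news id is the last run of digits before ".html"
    newsUrl = 'http://news.gzcc.cn/html/2019/xiaoyuanxinwen_0404/11089.html'
    newsId = re.search(r'(\d+)\.html', newsUrl).group(1)

    # Build the click-counter API URL with str.format()
    clickUrl = ('http://oa.gzcc.cn/api.php?op=count&id={}&modelid=80'
                .format(newsId))

    # The counter endpoint answers with a JavaScript snippet such as
    #   $('#hits').html('1234');   (shape assumed from the parsing regex)
    fakeResponse = "$('#hits').html('1234');"
    clicks = int(re.search(r"hits'\)\.html\('(\d*)'\)", fakeResponse).group(1))
    ```

    In the real scraper the `fakeResponse` text would come from `requests.get(clickUrl).text`.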

    1. Get the news details from a news URL:

     a dictionary, anews

    2. Get the news URLs from a list-page URL:

     append each dictionary to a list: alist

    3. Generate the URLs of all list pages and fetch all the news:

     extend one list with another: allnews  (*each student scrapes the 10 list pages starting from the last digit of their student number)
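
    The "start from the last digit of your student number" rule can be sketched as follows. The list-page URL pattern comes from the code further down; mapping a trailing 0 to page 10 is an assumption for illustration, not part of the assignment text:

    ```python
    def listPageUrls(studentId, count=10):
        """Build `count` consecutive list-page URLs, starting at the page
        given by the last digit of the student number (0 is mapped to 10
        so the start page is never page 0 -- an assumption)."""
        start = int(studentId[-1]) or 10
        return ['http://news.gzcc.cn/html/xiaoyuanxinwen/{}.html'.format(i)
                for i in range(start, start + count)]

    # Hypothetical student number ending in 7 -> pages 7 through 16
    pages = listPageUrls('201706010127')
    ```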

    4. Set a reasonable crawl interval

      import time

      import random

      time.sleep(random.random()*3)

    5. Do simple data processing with pandas and save the result

      Save to a CSV or Excel file

      newsdf.to_csv(r'F:\duym\爬虫\gzccnews.csv')
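    A minimal save sketch with toy rows (the column names mirror the dictionary keys built in the code below; `index=False` and `encoding='utf_8_sig'` are added here as assumptions so the CSV has no index column and opens cleanly in Excel):

    ```python
    import pandas as pd

    rows = [
        {'newsTitle': 'Example title', 'newsClick': 1234},
        {'newsTitle': 'Another title', 'newsClick': 56},
    ]
    newsdf = pd.DataFrame(rows)

    # utf_8_sig writes a BOM so Excel auto-detects UTF-8 text;
    # index=False keeps the row numbers out of the file
    newsdf.to_csv('gzccnews.csv', index=False, encoding='utf_8_sig')
    ```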

    The code is as follows:

    import re
    from bs4 import BeautifulSoup
    from datetime import datetime
    import requests
    import pandas as pd
    import time
    import random

    """News click count"""
    def newsClick(newsUrl):
        newsId = re.findall(r'(\d+)', newsUrl)[-1]
        clickUrl = 'http://oa.gzcc.cn/api.php?op=count&id={}&modelid=80'.format(newsId)
        resClicks = requests.get(clickUrl).text
        resClick = int(re.search(r"hits'[)].html[(]'(\d*)'[)]", resClicks).group(1))
        return resClick

    """News publication time"""
    def newsDateTime(showinfo):
        newsDate = showinfo.split()[0].split(':')[1]
        newsTime = showinfo.split()[1]
        newsDateTime = newsDate + ' ' + newsTime
        dateTime = datetime.strptime(newsDateTime, '%Y-%m-%d %H:%M:%S')  # type conversion
        return dateTime

    """News dictionary"""
    def newsDicts(newsUrl):
        newsText = requests.get(newsUrl)
        newsText.encoding = 'utf-8'
        newsSoup = BeautifulSoup(newsText.text, 'html.parser')
        newsDict = {}
        newsDict['newsTitle'] = newsSoup.select('.show-title')[0].text
        showinfo = newsSoup.select('.show-info')[0].text
        newsDict['newsDateTime'] = newsDateTime(showinfo)
        newsDict['newsClick'] = newsClick(newsUrl)
        return newsDict

    """News list"""
    def newsList(newsUrl):
        newsText = requests.get(newsUrl)
        newsText.encoding = 'utf-8'
        newsSoup = BeautifulSoup(newsText.text, 'html.parser')
        newsList = []
        for news in newsSoup.select('li'):
            if len(news.select('.news-list-title')) > 0:
                url = news.select('a')[0]['href']
                newsDesc = news.select('.news-list-description')[0].text
                newsDict = newsDicts(url)
                newsDict['newsUrl'] = url
                newsDict['description'] = newsDesc
                newsList.append(newsDict)
        return newsList

    """News lists for pages 27-37"""
    def allNews():
        allnews = []
        for i in range(27, 38):
            newsUrl = 'http://news.gzcc.cn/html/xiaoyuanxinwen/{}.html'.format(i)
            allnews.extend(newsList(newsUrl))
            time.sleep(random.random() * 3)   # crawl delay
        return allnews

    newsDF = pd.DataFrame(allNews())
    newsDF.to_csv('gzccnews.csv')   # save as a CSV file
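
    Once saved, the "simple data processing" of step 5 can be done on the frame; sorting by click count is just one illustrative choice, shown here on toy rows with the same columns the scraper produces:

    ```python
    import pandas as pd

    # Illustrative rows standing in for the scraped data
    df = pd.DataFrame([
        {'newsTitle': 'A', 'newsClick': 300},
        {'newsTitle': 'B', 'newsClick': 900},
        {'newsTitle': 'C', 'newsClick': 120},
    ])

    # A simple "most-read" view: sort by clicks, highest first
    top = df.sort_values('newsClick', ascending=False).reset_index(drop=True)
    ```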

    A screenshot of the saved gzccnews.csv file follows:

  • Original post: https://www.cnblogs.com/leo0724/p/10702133.html