  • Crawling all campus news

    The requirements for this assignment come from: https://edu.cnblogs.com/campus/gzcc/GZCC-16SE1/homework/3002

    0. Get the click count from a news URL, and organize it into a function

    • newsUrl
    • newsId(re.search())
    • clickUrl(str.format())
    • requests.get(clickUrl)
    • re.search()/.split()
    • str.lstrip(),str.rstrip()
    • int
    • Organize into a function
    • Also organize getting the news publication time and its type conversion into a function (a sketch of these pieces follows this list)
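
    As a rough sketch of how the pieces above fit together (the sample URL is hypothetical; the api.php address and modelid=80 come from the full code in step 2):

    import re
    import requests

    # Hypothetical article URL in the usual .../mmdd/newsId.html form
    newsUrl = 'http://news.gzcc.cn/html/xiaoyuanxinwen/0404/11155.html'

    # newsId: the trailing digits, via re.search()
    newsId = re.search(r'/(\d+)\.html$', newsUrl).group(1)

    # clickUrl: built with str.format()
    clickUrl = 'http://oa.gzcc.cn/api.php?op=count&id={}&modelid=80'.format(newsId)

    # requests.get(clickUrl) returns a tiny script; peel the wrapper off with
    # split()/lstrip()/rstrip(), then cast to int
    resClick = requests.get(clickUrl)
    newsClick = int(resClick.text.split('.html')[-1].lstrip("('").rstrip("');"))
    print(newsClick)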

    1. Become proficient with re.search(), match(), and findall() (a short demo follows)
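
    A quick demo of the three functions on a hypothetical news URL (the printed values follow directly from the patterns):

    import re

    url = 'http://news.gzcc.cn/html/xiaoyuanxinwen/0404/11155.html'

    # re.search() scans the whole string for the first match
    print(re.search(r'\d+', url).group())   # '0404'

    # re.match() only matches at the beginning of the string
    print(re.match(r'http', url).group())   # 'http'
    print(re.match(r'\d+', url))            # None: the string does not start with digits

    # re.findall() returns every non-overlapping match as a list
    print(re.findall(r'\d+', url))          # ['0404', '11155']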


    2. Get the news details from a news URL: a dictionary, anews

    import requests
    from bs4 import BeautifulSoup
    from datetime import datetime
    import re

    def click(url):
        # The news id is the last run of digits in the article URL
        id = re.findall(r'(\d+)', url)[-1]
        clickurl = 'http://oa.gzcc.cn/api.php?op=count&id={}&modelid=80'.format(id)
        resclick = requests.get(clickurl)
        # The count API returns a small script; strip the wrapper to get the bare number
        newsclick = int(resclick.text.split('.html')[-1].lstrip("('").rstrip("');"))
        return newsclick

    def newsdt(showinfo):
        # showinfo looks like '发布时间:2019-04-01 11:57:00 作者:...'
        newsdate = showinfo.split()[0].split(':')[1]
        newstime = showinfo.split()[1]
        newsDT = newsdate + ' ' + newstime
        dt = datetime.strptime(newsDT, '%Y-%m-%d %H:%M:%S')
        return dt

    def anews(url):
        newsdetail = {}
        res = requests.get(url)
        res.encoding = 'utf-8'
        soup = BeautifulSoup(res.text, 'html.parser')
        newsdetail['newstitle'] = soup.select('.show-title')[0].text
        showinfo = soup.select('.show-info')[0].text
        newsdetail['newsDT'] = newsdt(showinfo)
        newsdetail['newsclick'] = click(url)
        return newsdetail

    def alist(listurl):
        res = requests.get(listurl)
        res.encoding = 'utf-8'
        soup = BeautifulSoup(res.text, 'html.parser')
        newslist = []
        for news in soup.select('li'):
            if len(news.select('.news-list-title')) > 0:
                newsurl = news.select('a')[0]['href']
                newsdesc = news.select('.news-list-description')[0].text
                newsdict = anews(newsurl)
                newsdict['newsurl'] = newsurl
                newsdict['description'] = newsdesc
                newslist.append(newsdict)
        return newslist

    listurl = 'http://news.gzcc.cn/html/xiaoyuanxinwen/'
    alist(listurl)
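
    To spot-check the scrape interactively, printing a few of the returned dictionaries works (a usage sketch; the keys match those set in anews/alist above):

    for n in alist(listurl)[:3]:
        print(n['newstitle'], n['newsDT'], n['newsclick'], n['newsurl'])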
    


    Screenshot:

    3. Get news URLs from a list-page URL: list.append(dict), alist

    4. Generate the URLs of all list pages and fetch all the news: list.extend(list), allnews

    *Each student crawls the 10 list pages starting from the last digit of their student ID

    res = requests.get('http://news.gzcc.cn/html/xiaoyuanxinwen/')
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    soup.select('#pages')[0].text

    # The total page count is the number just before '下一页' in the pager text
    int(re.search(r'..(\d+).下', soup.select('#pages')[0].text).group(1))

    listurl = 'http://news.gzcc.cn/html/xiaoyuanxinwen/'
    allnews = alist(listurl)

    # Student-ID tail digit is 5, so crawl list pages 5 through 14
    for i in range(5, 15):
        listurl = 'http://news.gzcc.cn/html/xiaoyuanxinwen/{}.html'.format(i)
        allnews.extend(alist(listurl))
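
    If you wanted every list page rather than just the assigned ten, the page count extracted above can drive the loop (a sketch, assuming page 1 is the bare listing URL and pages 2..n follow the {}.html pattern):

    n = int(re.search(r'..(\d+).下', soup.select('#pages')[0].text).group(1))
    allnews = alist('http://news.gzcc.cn/html/xiaoyuanxinwen/')
    for i in range(2, n + 1):
        allnews.extend(alist('http://news.gzcc.cn/html/xiaoyuanxinwen/{}.html'.format(i)))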
        
    


    Screenshot:

    Crawling one page:

    Crawling multiple pages:

    5. Set a reasonable crawl interval

    import time
    import random

    time.sleep(random.random() * 3)  # sleep a random 0-3 seconds

    for i in range(1, 3):
        print(i)
        time.sleep(random.random() * 3)  # random 0-3 second pause each iteration

    print(allnews)  # the news collected from the ten list pages above
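
    In practice the sleep belongs inside the crawl loop itself, so each list-page request is spaced out (a sketch reusing alist and the page range from step 4):

    for i in range(5, 15):
        listurl = 'http://news.gzcc.cn/html/xiaoyuanxinwen/{}.html'.format(i)
        allnews.extend(alist(listurl))
        time.sleep(random.random() * 3)  # pause 0-3 seconds between pages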
    


    6. Do simple data processing with pandas and save the result

    Save to a csv or excel file

    import pandas as pd

    newsdf = pd.DataFrame(allnews)
    newsdf.to_csv(r'E:\gzcc.csv')
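
    For the "simple data processing" part, sorting by click count is one easy demonstration, and to_excel covers the excel option (a sketch; writing .xlsx requires the openpyxl package to be installed):

    # Show the most-clicked stories first
    print(newsdf.sort_values(by='newsclick', ascending=False).head())

    # Or save to an Excel file instead of csv
    newsdf.to_excel(r'E:\gzcc.xlsx')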

    Screenshot: (showing the contents of pages 5-15)
