  • Scraping all campus news

    1. Fetch the details of one news item from its URL: a dict, anews

    2. Collect the news URLs from a list-page URL: list append(dict), alist

    3. Generate the URLs of all list pages and fetch all the news: list extend(list), allnews

    *Each student crawls the 10 list pages starting from the last digit of their student ID

    4. Set a reasonable crawl interval

    import time
    import random
    time.sleep(random.random()*3)

    5. Do simple data processing with pandas and save the result

    Save to a CSV or Excel file

    newsdf.to_csv(r'F:\duym\爬虫\gzccnews.csv')

    Save to a database

    import sqlite3
    with sqlite3.connect('gzccnewsdb.sqlite') as db:
        newsdf.to_sql('gzccnewsdb',db)
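
    Note: by default to_sql fails if the table already exists, so re-running the
    script raises an error on the second pass. A rerun-safe variant (a minimal
    sketch using pandas' if_exists parameter, with the same table name as above):

    import sqlite3
    with sqlite3.connect('gzccnewsdb.sqlite') as db:
        newsdf.to_sql('gzccnewsdb', db, if_exists='replace', index=False)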

    import requests
    from bs4 import BeautifulSoup
    from datetime import datetime
    import re
    import sqlite3
    import time
    import random
    import pandas as pd
     
    def click(url):
        # extract the numeric news id from the URL, then query the click-count API
        id = re.findall(r'\d{1,5}', url)[-1]
        clickUrl = 'http://oa.gzcc.cn/api.php?op=count&id={}&modelid=80'.format(id)
        resClick = requests.get(clickUrl)
        newsClick = int(resClick.text.split('.html')[-1].lstrip("('").rstrip("');"))
        return newsClick
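    # Example (the response shape is an assumption: the count API appears to
    # return a JS snippet containing "...('<count>');" after the '.html' marker,
    # which the string stripping above turns into an int):
    # click('http://news.gzcc.cn/html/2019/xiaoyuanxinwen_0404/11155.html')  # -> e.g. 512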
    
    def newsdt(showinfo):
        # showinfo is the .show-info text, e.g. '发布时间:2019-04-04 09:00:00 作者:...'
        newsDate = showinfo.split()[0].split(':')[1]
        newsTime = showinfo.split()[1]
        newsDT = newsDate + ' ' + newsTime
        dt = datetime.strptime(newsDT, '%Y-%m-%d %H:%M:%S')
        return dt
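    # Quick check with an assumed show-info string (format inferred from the
    # parsing above, not fetched live):
    # newsdt('发布时间:2019-04-04 09:00:00 作者:张三')  # -> datetime(2019, 4, 4, 9, 0)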
    
    def anews(url):
        # scrape one news page into a dict: title, publish time, click count
        newsDetail = {}
        res = requests.get(url)
        res.encoding = 'utf-8'
        soup = BeautifulSoup(res.text, 'html.parser')
        newsDetail['newsTitle'] = soup.select('.show-title')[0].text
        showinfo = soup.select('.show-info')[0].text
        newsDetail['newsDT'] = newsdt(showinfo)
        newsDetail['newsClick'] = click(url)
        return newsDetail
     
    def alist(url):
        # scrape one list page and return a list of news dicts
        res = requests.get(url)
        res.encoding = 'utf-8'
        soup = BeautifulSoup(res.text, 'html.parser')
        newsList = []
        for news in soup.select('li'):
            if len(news.select('.news-list-title')) > 0:
                newsUrl = news.select('a')[0]['href']
                newsDest = news.select('.news-list-description')[0].text
                newsDict = anews(newsUrl)
                newsDict['description'] = newsDest
                newsList.append(newsDict)
        return newsList
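    # e.g. alist('http://news.gzcc.cn/html/xiaoyuanxinwen/2.html') returns a
    # list of dicts, one per news item on that list page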
     
     
    #url= 'http://news.gzcc.cn/html/2019/xiaoyuanxinwen_0404/11155.html'
    #anews(url)
     
    # The index page has the same list structure as the numbered pages, so it
    # can be scraped the same way:
    # alist('http://news.gzcc.cn/html/xiaoyuanxinwen')

    allnews = []
    for i in range(2, 12):  # 10 list pages: 2.html through 11.html
        listUrl = 'http://news.gzcc.cn/html/xiaoyuanxinwen/{}.html'.format(i)
        allnews.extend(alist(listUrl))
        time.sleep(random.random() * 3)  # reasonable crawl interval (step 4)
    
    newsdf = pd.DataFrame(allnews)  # save the results
    newsdf = newsdf.sort_values(by='newsClick', ascending=False)
    newsdf.to_csv(r'D:\lyj.csv')
    with sqlite3.connect('gzccnewsdb.sqlite') as db:
        newsdf.to_sql('gzccnewsdb', db, if_exists='replace')
        df2 = pd.read_sql_query('SELECT * FROM gzccnewsdb', con=db)
    print(df2[df2['newsClick'] > 350])  # articles with more than 350 clicks
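
    To sanity-check the saved CSV, it can be read back with pandas (a minimal
    sketch; the path is the one used in to_csv above):

    check = pd.read_csv(r'D:\lyj.csv')
    print(check.shape)               # row count should match len(allnews)
    print(check['newsClick'].max())  # click count of the most-read article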
    
