
Structuring and Saving Data

1. Save the body text of each news article to a text file.

def writeNewsDetail(content):
    # Append the article body to content.txt; the with block closes the file even if the write fails.
    with open('content.txt', 'a', encoding='utf-8') as f:
        f.write(content)
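
Because the file is opened in append mode ('a'), every call adds to the end of content.txt, and consecutive articles run together with nothing between them. Below is a minimal sketch of one way to keep them apart, assuming a blank-line separator is acceptable. The name writeNewsDetailSep, its path and sep parameters, and the sample strings are illustrative and not part of the original code.

```python
import os
import tempfile

def writeNewsDetailSep(content, path, sep='\n\n'):
    # Append one article body followed by a separator so entries stay distinguishable.
    with open(path, 'a', encoding='utf-8') as f:
        f.write(content + sep)

# Demo on a temporary file with made-up article bodies.
demo_path = os.path.join(tempfile.mkdtemp(), 'content.txt')
writeNewsDetailSep('first article body', demo_path)
writeNewsDetailSep('second article body', demo_path)

# Splitting on the separator only works if the bodies contain no blank lines themselves.
with open(demo_path, encoding='utf-8') as f:
    articles = [a for a in f.read().split('\n\n') if a]
print(articles)
```

If the bodies may contain blank lines, a separator that cannot occur in the text would be a safer choice.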

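2. Structure the news data as a list of dictionaries:

Details of a single news item --> dictionary news
All news items on one list page --> list, built with newslist.append(news)

All news items from all list pages --> combined list, built with newsTotal.extend(newslist)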
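
The difference between append and extend is what keeps the combined result a flat list of dictionaries. A small sketch with made-up records (page1 and page2 stand in for the lists returned for two list pages):

```python
# Made-up records standing in for scraped news dicts.
page1 = [{'title': 'A', 'click': 10}, {'title': 'B', 'click': 20}]
page2 = [{'title': 'C', 'click': 30}]

newsTotal = []
for newslist in (page1, page2):
    # extend() adds each dict individually, so newsTotal stays a flat list of dicts.
    newsTotal.extend(newslist)

nested = []
nested.append(page1)  # append() would add the whole page list as a single element

print(len(newsTotal))  # 3
print(len(nested))     # 1
```

A flat list of dicts is the shape pandas.DataFrame expects when each dict should become one row.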

import re
from datetime import datetime

import requests
from bs4 import BeautifulSoup

def getInfoField(info, label):
    # Return the first token after label, or 'none' if the label is missing.
    # str.lstrip() strips a set of characters, not a prefix, so partition() is used here.
    _, found, rest = info.partition(label)
    tokens = rest.split()
    return tokens[0] if found and tokens else 'none'

def getNewDetail(newsUrl):
    # Fetch one article page and return its fields as a dict.
    resd = requests.get(newsUrl)
    resd.encoding = 'utf-8'
    soupd = BeautifulSoup(resd.text, 'html.parser')
    news = {}
    news['title'] = soupd.select('.show-title')[0].text
    info = soupd.select('.show-info')[0].text
    # Match the timestamp directly, whichever colon follows the 发布时间 label.
    dt = re.search(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}', info).group()
    news['dt'] = datetime.strptime(dt, '%Y-%m-%d %H:%M:%S')
    # Look for the author only in the text before 审核：.
    news['wr'] = getInfoField(info.split('审核：')[0], '作者：')
    news['ph'] = getInfoField(info, '摄影：')
    news['source'] = getInfoField(info, '来源：')
    content = soupd.select('.show-content')[0].text.strip()
    writeNewsDetail(content)
    # getClickCount is not defined in this post; it must already be defined before this runs.
    news['click'] = getClickCount(newsUrl)
    news['newsUrl'] = newsUrl
    return news
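
Two details of Python string handling matter when pulling fields out of the info string: str.lstrip removes any leading characters from a set rather than a fixed prefix, and str.find returns 0 for a match at the very start. Here is a small sketch on a made-up string. fieldAfter is a hypothetical helper, and the sample text is not real page content.

```python
def fieldAfter(info, label, default='none'):
    # Return the first whitespace-delimited token after label, or default if the label is absent.
    _, found, rest = info.partition(label)
    tokens = rest.split()
    return tokens[0] if found and tokens else default

info = '作者：张三  来源：源流学院'  # made-up sample

# find() returns 0 here, so a "> 0" test would wrongly treat the label as missing.
print(info.find('作者：') > 0)       # False
print('作者：' in info)              # True

# lstrip() treats its argument as a character set and also eats the leading 源 of the value.
token = info[info.find('来源：'):].split()[0]
print(token.lstrip('来源：'))        # 流学院

print(fieldAfter(info, '来源：'))    # 源流学院
print(fieldAfter(info, '摄影：'))    # none
```
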
newsurl = 'http://news.gzcc.cn/html/xiaoyuanxinwen/'

def getListPage(newsurl):
    # Return the details of every news item linked from one list page.
    res = requests.get(newsurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    newslist = []
    for new in soup.select('li'):
        # Only list items that carry a news title are news entries.
        if len(new.select('.news-list-title')) > 0:
            newsUrl = new.select('a')[0].attrs['href']
            newslist.append(getNewDetail(newsUrl))
    return newslist


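def getPageN():
    # Read the total item count (the number before 条) and convert it to a page count.
    res = requests.get(newsurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    n = int(soup.select('#pages')[0].select('a')[0].text.rstrip('条'))
    # Ceiling division at 10 items per page; n // 10 + 1 overcounts when n is a multiple of 10.
    return (n + 9) // 10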
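
Turning a total item count into a page count is a ceiling division. The sketch below compares the n // 10 + 1 shortcut with a true ceiling division, assuming 10 items per list page. pageCount is an illustrative name, and the totals are made up.

```python
def pageCount(totalItems, perPage=10):
    # Ceiling division: a partially filled last page still counts as a page.
    return (totalItems + perPage - 1) // perPage

for total in (95, 100, 101):
    # Columns: total items, shortcut result, ceiling-division result.
    print(total, total // 10 + 1, pageCount(total))
# 95 10 10
# 100 11 10
# 101 11 11
```

The two only disagree when the total is an exact multiple of the page size, where the shortcut reports one page that does not exist.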

newsTotal = []
n = getPageN()  # total number of list pages

# Only list pages 1 to 4 are fetched here to keep the run short; n holds the full page count.
for i in range(1, 5):
    listPageUrl = 'http://news.gzcc.cn/html/xiaoyuanxinwen/{}.html'.format(i)
    newsTotal.extend(getListPage(listPageUrl))
print(newsTotal)

3. Install pandas and create a DataFrame object df with pandas.DataFrame(newsTotal).

import pandas
df = pandas.DataFrame(newsTotal)

4. Use df to save the extracted data to a CSV or Excel file.

df.to_excel('gzccnews.xlsx')
df.to_csv('gzccNews.csv')

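5. Analyze the data with the functions and methods pandas provides:

Extract the first 6 rows, with the click count, title, and source columns.
Extract the news published by '学校综合办' with a click count above 3000.
Extract the news published by '国际学院' and '学生工作处'.
Extract the news from March 2018.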

print(df[['click','title','source']].head(6))
print(df[(df['click']>3000)&(df['source']=='学校综合办')])
print(df[(df['source']=='国际学院')|(df['source']=='学生工作处')])
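df1 = df.set_index('dt').sort_index()  # df1 is indexed by publication time
print(df1.loc['2018-03'])  # .loc selects rows by partial date string on the DatetimeIndex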
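
The same selections can be tried without running the scraper. This is a sketch on a made-up DataFrame whose columns mirror the news dicts. The rows and the variable names popular, twoSources, and march are illustrative. isin and an explicit date-range mask are shown as alternatives to chaining == with | and to indexing by date.

```python
from datetime import datetime

import pandas as pd

# Made-up rows with the same keys as the scraped news dicts.
rows = [
    {'title': 'A', 'source': '学校综合办', 'click': 3500, 'dt': datetime(2018, 3, 12, 9, 0, 0)},
    {'title': 'B', 'source': '国际学院', 'click': 1200, 'dt': datetime(2018, 3, 30, 15, 30, 0)},
    {'title': 'C', 'source': '学生工作处', 'click': 800, 'dt': datetime(2018, 4, 2, 10, 0, 0)},
    {'title': 'D', 'source': '学校综合办', 'click': 900, 'dt': datetime(2018, 4, 5, 8, 0, 0)},
]
df = pd.DataFrame(rows)

# Each condition needs its own parentheses because & and | bind tighter than comparisons.
popular = df[(df['click'] > 3000) & (df['source'] == '学校综合办')]

# isin() gives the same rows as (source == x) | (source == y).
twoSources = df[df['source'].isin(['国际学院', '学生工作处'])]

# A half-open date range selects March 2018 without needing a datetime index.
march = df[(df['dt'] >= pd.Timestamp('2018-03-01')) & (df['dt'] < pd.Timestamp('2018-04-01'))]

print(popular['title'].tolist())     # ['A']
print(twoSources['title'].tolist())  # ['B', 'C']
print(march['title'].tolist())       # ['A', 'B']
```
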

6. Save to an sqlite3 database

import sqlite3

with sqlite3.connect('gzccnewsdb.sqlite') as db:
    df.to_sql('gzccnewsdb', con=db, if_exists='replace')

7. Read the data back from sqlite3

with sqlite3.connect('gzccnewsdb.sqlite') as db:
    df2 = pandas.read_sql_query('SELECT * FROM gzccnewsdb',con=db)
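
To check what a table contains without going through pandas, the standard sqlite3 module can query it directly. The sketch below uses an in-memory database and made-up rows, so it runs without the scraped data. The three-column table layout is a simplifying assumption, not the schema pandas would write. For the real file, connect to 'gzccnewsdb.sqlite' instead of ':memory:'.

```python
import sqlite3

# In-memory database with a made-up table standing in for gzccnewsdb.
with sqlite3.connect(':memory:') as db:
    db.execute('CREATE TABLE gzccnewsdb (title TEXT, source TEXT, click INTEGER)')
    db.executemany(
        'INSERT INTO gzccnewsdb VALUES (?, ?, ?)',
        [('A', '学校综合办', 3500), ('B', '国际学院', 1200)],
    )
    # sqlite_master lists the tables that exist in the database.
    tables = [r[0] for r in db.execute("SELECT name FROM sqlite_master WHERE type='table'")]
    rowCount = db.execute('SELECT COUNT(*) FROM gzccnewsdb').fetchone()[0]
    # Parameters are passed separately from the SQL text instead of being formatted into it.
    top = db.execute(
        'SELECT title FROM gzccnewsdb WHERE click > ? ORDER BY click DESC', (1000,)
    ).fetchall()

print(tables)    # ['gzccnewsdb']
print(rowCount)  # 2
print(top)       # [('A',), ('B',)]
```
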

8. Save df to a MySQL database

import pymysql
from sqlalchemy import create_engine

# Replace the user, password, host, and database name with your own MySQL settings.
conn = create_engine('mysql+pymysql://root:123456@localhost:3306/gzccnews?charset=utf8')
df.to_sql('gzccnews', con=conn, if_exists='replace')


Original post: https://www.cnblogs.com/cgq520/p/8876774.html