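The requirements for this assignment come from: https://edu.cnblogs.com/campus/gzcc/GZCC-16SE2/homework/2894
Goal: given the link to a news article, newsUrl, retrieve all of the article's information (title, author, publishing unit, reviewer, source, publication time, click count, and so on), and wrap the whole process in a simple, clear function.
The source code is as follows: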
    import re

    import requests
    from bs4 import BeautifulSoup

    # Fetch one news page and print all of its information
    def newsInfo(url):
        news = requests.get(url)
        news.encoding = 'utf-8'
        newSoup = BeautifulSoup(news.text, 'html.parser')
        # Title
        title = newSoup.select('.show-title')[0].text
        print('Title: ' + title)
        # Publication info line; split on whitespace once and reuse the fields
        newInfo = newSoup.select('.show-info')[0].text
        fields = newInfo.split()
        # Publication time: the first field carries the '发布时间:' label plus the date,
        # the second field is the time of day. lstrip removes any leading characters
        # from the given set, which strips the label because the date starts with a digit.
        newDate = fields[0].lstrip('发布时间:')
        newTime = fields[1]
        newDateTime = newDate + ' ' + newTime
        print('Published: ' + newDateTime)
        # Author, reviewer and source are printed with the labels they carry on the page
        author = fields[2]
        print(author)
        examine = fields[3]
        print(examine)
        source = fields[4]
        print(source)

        # Build the click-count URL from the news id (the last run of digits in the page URL)
        newsId = re.findall(r'(\d{1,7})', url)[-1]
        clickUrl = 'http://oa.gzcc.cn/api.php?op=count&id={}&modelid=80'.format(newsId)
        click = requests.get(clickUrl)
        # The count is the number inside the last .html('...'); call of the response
        newClick = int(click.text.split('.html')[-1].lstrip("('").rstrip("');"))
        print('Clicks:')
        print(newClick)

        # Body text
        newContent = newSoup.select('.show-content')[0].text
        print('Content: ' + newContent)

    url = 'http://news.gzcc.cn/html/2019/xiaoyuanxinwen_0329/11095.html'
    newsInfo(url)
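The two string-handling steps in `newsInfo` (pulling the news id out of the page URL and reading the click count out of the API response) can be tried without any network access. The sketch below is only an illustration: the helper names `extract_news_id` and `parse_click_count` are hypothetical, and the sample response text is an assumption inferred from the way the code above strips `.html('` and `');` around the number, not a captured reply from the real API.

```python
import re


def extract_news_id(url):
    """Return the last run of digits in the URL (the id just before '.html')."""
    return re.findall(r'\d+', url)[-1]


def parse_click_count(response_text):
    """Return the integer inside the final .html('N'); call of the response text."""
    match = re.search(r"\.html\('(\d+)'\);?\s*$", response_text)
    if match is None:
        raise ValueError('unexpected click-count response: ' + repr(response_text))
    return int(match.group(1))


url = 'http://news.gzcc.cn/html/2019/xiaoyuanxinwen_0329/11095.html'
print(extract_news_id(url))  # 11095

# Assumed response shape, for illustration only
sample_response = "$('#hits').html('123');"
print(parse_click_count(sample_response))  # 123
```

Note that the backslash in `\d` matters: without it the pattern matches the literal letter `d` rather than digits, so no id would be found in this URL.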
The result of running it is shown in the figure:
![Screenshot of the program output]()