1. Environment
requests, pandas, beautifulsoup4, Python 3.7
2. Implementation Process
Inspect the source of the site's news list page and locate the required data through the ids and classes of the HTML elements. Define a function that wraps one page of the feed, downloading the roughly 20 article links it contains. Finally, use pandas to inspect the collected data and save it as a CSV file.
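As a minimal sketch of the "one page of links" step, the snippet below requests a single page of the rolling-news API and prints the article URLs it returns. It assumes the same API endpoint and response shape (result -> data -> url) that the full script in section 4 relies on.

import json
import requests

# Rolling-news API from the full script; page=1 asks for the first batch of ~20 links
api = 'https://feed.sina.com.cn/api/roll/get?pageid=121&lid=1356&num=20&versionNumber=1.2.4&page=1&encode=utf-8'
res = requests.get(api)
jd = json.loads(res.text)

# Each entry under result -> data describes one article; its 'url' field is what gets scraped next
links = [ent['url'] for ent in jd['result']['data']]
print(len(links), 'links on this page')
for link in links[:5]:
    print(link)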
3. Reference Links
https://blog.csdn.net/qq_42881421/article/details/84575316
4. Source Code
import json

import pandas
import requests
from bs4 import BeautifulSoup


def getNewsDetail(newsurl):
    """Download one article page and extract its title and body text."""
    result = {}
    res = requests.get(newsurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    # The headline sits in the element with class "main-title"
    result['title'] = soup.select('.main-title')[0].text
    # Body text: every <p> inside .article except the last one (editor's byline)
    result['article'] = ' '.join(p.text.strip() for p in soup.select('.article p')[:-1])
    return result


def parseListLinks(url):
    """Fetch one page of the rolling-news API (about 20 links) and scrape every article on it."""
    newsdetails = []
    res = requests.get(url)
    jd = json.loads(res.text)
    # Each entry under result -> data carries the URL of one article
    for ent in jd['result']['data']:
        newsdetails.append(getNewsDetail(ent['url']))
    return newsdetails


# Quick single-article test: fetch the title and body of one news page
print(getNewsDetail('https://news.sina.com.cn/c/2018-11-15/doc-ihnvukff4194550.shtml'))

# Rolling-news API; {} is filled with the page number
url = 'https://feed.sina.com.cn/api/roll/get?pageid=121&lid=1356&num=20&versionNumber=1.2.4&page={}&encode=utf-8'
news_total = []
for i in range(1, 3):
    newsurl = url.format(i)
    news_total.extend(parseListLinks(newsurl))  # use the page-specific URL
print(news_total)

df = pandas.DataFrame(news_total)
# Inspect the data (e.g. the last 5 rows with df.tail()), then save it as a CSV file
df.to_csv("ruanjianbei.csv")
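To check the result, the saved file can be read back with pandas. This is a minimal sketch assuming the ruanjianbei.csv produced by the script above sits in the working directory.

import pandas

# Reload the CSV written above; the first column is the index written by to_csv
df = pandas.read_csv("ruanjianbei.csv", index_col=0)
print(df.tail())           # last 5 rows
print(df['title'].head())  # spot-check a few titles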
5. Problems Encountered
None so far.