
A Small Crawler for Sina News AFCCL

1. Task goal:

Scrape the articles on the Sina News AFCCL page: each article's title, time, source, content, comment count, and similar information.
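
One way to keep the scraped fields together is a small record type. This is only a sketch: the class name `Article` and its field names are assumptions made for illustration and do not appear in the original script, which simply prints the values.

```python
from dataclasses import dataclass, asdict


@dataclass
class Article:
    """One scraped article (hypothetical container, not part of the original script)."""
    title: str
    time: str
    source: str
    content: str
    comment_count: int


def to_row(article):
    """Convert an Article to a plain dict, e.g. for writing to JSON or CSV."""
    return asdict(article)
```

A record like this can be filled inside the scraping loop and collected in a list instead of being printed.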

2. Target page:

http://sports.sina.com.cn/z/AFCCL/

3. Page analysis

4. Source code:

#!/usr/bin/env python
# coding:utf-8
import sys
import requests
from bs4 import BeautifulSoup
import json
import re
if __name__ == '__main__':
	url = 'http://sports.sina.com.cn/z/AFCCL/'
	res = requests.get(url)
	html_doc = res.content

	soup = BeautifulSoup(html_doc, 'html.parser')

	a_list=[]
	# Scrape each news item's time, title, and link
	for news in soup.select('.news-item'):
		if(len(news.select('h2'))>0):
			h2=news.select('h2')[0].text
			a=news.select('a')[0]['href']
			time=news.select('.time')[0].text
			# print(time,h2,a)
			a_list.append(a)
	# Fetch and scrape each article page
	for i in range(len(a_list)):
		url=a_list[i]
		res = requests.get(url)
		html_doc = res.content
		soup = BeautifulSoup(html_doc, 'html.parser')
		# Extract the article title, time, source, content, and comment count
		title=soup.select('#j_title')
		if title:
			title = soup.select('#j_title')[0].text.strip()
			time = soup.select('.article-a__time')[0].text.strip()
			source = soup.select('.article-a__source')[0].text.strip()
			content = soup.select('.article-a__content')[0].text.strip()
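			# Build the Ajax URL that serves the comments, e.g.
			# 'http://comment5.news.sina.com.cn/page/info?version=1&format=js&channel=ty&newsid=comos-fykiuaz1429964&group=&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=20&jsvar=loader_1504416797470_64712661'
			# print(url)
			# The news id is the 'fyk...' part of the article file name
			pattern_id=r'(fyk\w*)\.s?html'
			match=re.search(pattern_id,url)
			if not match:
				continue  # no recognizable news id in this URL
			news_id=match.group(1)
			url='http://comment5.news.sina.com.cn/page/info?version=1&format=js&channel=ty&newsid=comos-'+news_id+'&group=&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=20'
			comments = requests.get(url)
			# The response is JavaScript of the form 'var data={...}'. Remove that prefix
			# explicitly: str.strip('var data=') strips a set of characters, not a prefix.
			payload=comments.text.strip()
			if payload.startswith('var data='):
				payload=payload[len('var data='):]
			jd=json.loads(payload)
			commentCount = jd['result']['count']['total'] # comment count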
			print(time,title,source,content)
			print(commentCount)
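
The step that turns an article URL into a comment request can also be pulled out into two small functions, which makes it testable without touching the network. The sketch below assumes article file names contain an id starting with `fyk`, as the script's pattern does. The function names and the sample article URL are hypothetical; only the comment endpoint and its query parameters are taken from the script.

```python
import re
from urllib.parse import urlencode

COMMENT_API = 'http://comment5.news.sina.com.cn/page/info'
NEWS_ID_PATTERN = re.compile(r'(fyk\w*)\.s?html')


def extract_news_id(article_url):
    """Return the 'fyk...' id from an article URL, or None if there is none."""
    match = NEWS_ID_PATTERN.search(article_url)
    return match.group(1) if match else None


def build_comment_url(news_id, page=1, page_size=20):
    """Assemble the comment request URL with the same parameters the script uses."""
    params = {
        'version': 1, 'format': 'js', 'channel': 'ty',
        'newsid': 'comos-' + news_id,
        'group': '', 'compress': 0, 'ie': 'utf-8', 'oe': 'utf-8',
        'page': page, 'page_size': page_size,
    }
    return COMMENT_API + '?' + urlencode(params)


# Hypothetical article URL, used only to exercise the helpers.
sample_url = 'http://sports.sina.com.cn/china/afccl/2017-09-02/doc-ifykiuaz1429964.shtml'
news_id = extract_news_id(sample_url)
if news_id is not None:
    print(build_comment_url(news_id))
```

Returning None instead of raising lets the caller skip pages whose URL does not carry an id in this form.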

5. Run results:

6. Summary:

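Resources that arrive with the initial page request are fairly easy to scrape. Resources loaded by asynchronous requests take an extra step: open the browser's inspector, find the request that actually carries the data, and scrape that request specifically.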

E.g., scraping the comments and the comment count.
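
Once the asynchronous request has been located, its response still has to be unwrapped before it can be parsed as JSON. The sketch below assumes the response body has the form `var data={...}` with the count under `result.count.total`, which is what the script above relies on. The function name and the sample payload are made up for illustration and are not real output from the site.

```python
import json

JS_PREFIX = 'var data='


def parse_comment_count(payload):
    """Extract the total comment count from a 'var data={...}' response body."""
    text = payload.strip()
    if text.startswith(JS_PREFIX):
        text = text[len(JS_PREFIX):]
    # Tolerate a trailing semicolon in case the body is a full JavaScript statement.
    text = text.rstrip(';')
    data = json.loads(text)
    return data['result']['count']['total']


# Made-up payload in the assumed shape, not data returned by the real endpoint.
sample_payload = 'var data={"result": {"count": {"total": 12}}}'
print(parse_comment_count(sample_payload))
```

Removing the prefix by length is more predictable than `str.strip('var data=')`, which removes any of those characters from both ends of the string.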

Author: 海哥哥

GitHub: https://github.com/jasonhavenD


Original post: https://www.cnblogs.com/jasonhaven/p/7469519.html