Scraping a Novel from Biquge (笔趣阁)

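《修罗武神》 is a web novel serialized on 17K小说网, written by 善良的蜜蜂. It tells the story of a boy who rises from an outer-court disciple of a second-rate sect in the lower realm to a leading figure in the upper realm. The book was selected as one of the top 100 works of the third 橙瓜网络文学奖.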

Programming is only a tool for achieving a goal.

So the key is to analyze what we actually need.

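The starting point is the novel's catalog page. It holds the link and title of every chapter, which is exactly what we need.

With the chapter links in hand, we then visit each one to get that chapter's content.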
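The catalog step described above can be sketched as follows. This is only an illustration under stated assumptions: the sample markup, the `id="list"` container, and the second chapter URL are hypothetical and not taken from the real site, and the sketch uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra installs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ChapterLinkParser(HTMLParser):
    """Collect (title, url) pairs from <a> tags inside <div id="list">."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.in_list = False      # True while inside the chapter list div
        self.depth = 0            # nesting depth of <div> tags inside the list
        self.current_href = None  # href of the <a> tag being read, if any
        self.current_text = []
        self.chapters = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.in_list:
                self.depth += 1
            elif attrs.get("id") == "list":  # assumed container id
                self.in_list = True
                self.depth = 1
        elif tag == "a" and self.in_list and attrs.get("href"):
            self.current_href = attrs["href"]
            self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            title = "".join(self.current_text).strip()
            url = urljoin(self.base_url, self.current_href)
            self.chapters.append((title, url))
            self.current_href = None
        elif tag == "div" and self.in_list:
            self.depth -= 1
            if self.depth == 0:
                self.in_list = False


# Hypothetical catalog markup, for illustration only.
SAMPLE_HTML = """
<div id="list">
  <dl>
    <dd><a href="/0_638/1124120.html">Chapter 1</a></dd>
    <dd><a href="/0_638/1124121.html">Chapter 2</a></dd>
  </dl>
</div>
<div id="footer"><a href="/about.html">About</a></div>
"""

parser = ChapterLinkParser("https://www.xbiquge6.com")
parser.feed(SAMPLE_HTML)
for title, url in parser.chapters:
    print(title, url)
```

Links outside the list container, such as the footer link in the sample, are ignored, so only chapter entries end up in `parser.chapters`.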

1. First, fetch the page content

import sys

import requests


def get_content(url):
    """Fetch a page and return its HTML text, or None if the request fails."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36',
    }
    try:
        # A timeout keeps the crawler from hanging on a stalled connection
        r = requests.get(url=url, headers=headers, timeout=10)
        r.encoding = 'utf-8'
        return r.text
    except requests.RequestException:
        s = sys.exc_info()
        print("Error '%s' happened on line %d" % (s[1], s[2].tb_lineno))
        return None

2. Parse the content

def praseContent(content):
    soup = BeautifulSoup(content, 'html.parser')
    # Chapter title and chapter text
    chapter = soup.find(name='div', class_="bookname").h1.text
    content = soup.find(name='div', id="content").text
    save(chapter, content)
    # The third link in the "bottem1" div is used as the next-chapter link
    next1 = soup.find(name='div', class_="bottem1").find_all('a')[2].get('href')
    # If there is a next chapter (the link is not the book's own path '/0_638/'),
    # add its URL to the queue
    if next1 != '/0_638/':
        q.put(base_url + next1)
    print(next1)
Here is the complete code:

import queue
import sys
import time

import requests
from bs4 import BeautifulSoup

# A queue that holds the URLs still to be fetched
q = queue.Queue()


# First, the function that fetches a page
def get_content(url):
    """Fetch a page and return its HTML text, or None if the request fails."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36',
    }
    try:
        # A timeout keeps the crawler from hanging on a stalled connection
        r = requests.get(url=url, headers=headers, timeout=10)
        r.encoding = 'utf-8'
        return r.text
    except requests.RequestException:
        s = sys.exc_info()
        print("Error '%s' happened on line %d" % (s[1], s[2].tb_lineno))
        return None


# Parse the content
def praseContent(content):
    soup = BeautifulSoup(content, 'html.parser')
    # Chapter title and chapter text
    chapter = soup.find(name='div', class_="bookname").h1.text
    content = soup.find(name='div', id="content").text
    save(chapter, content)
    # The third link in the "bottem1" div is used as the next-chapter link
    next1 = soup.find(name='div', class_="bottem1").find_all('a')[2].get('href')
    # If there is a next chapter (the link is not the book's own path '/0_638/'),
    # add its URL to the queue
    if next1 != '/0_638/':
        q.put(base_url + next1)
    print(next1)


# Append the chapter to a txt file
def save(chapter, content):
    filename = "修罗武神.txt"
    # The with statement closes the file even if a write fails
    with open(filename, "a+", encoding='utf-8') as f:
        f.write(chapter + '\n')
        # Remove all whitespace from the chapter text before writing it
        f.write("".join(content.split()) + '\n')


# Main program
def main():
    start_time = time.time()
    q.put(first_url)
    # Keep going while the queue is not empty
    while not q.empty():
        content = get_content(q.get())
        if content is None:
            # The request failed, so there is no next link to follow
            break
        praseContent(content)
    end_time = time.time()
    project_time = end_time - start_time
    print('Elapsed time (s):', project_time)


# Site root and the URL of the first chapter
base_url = 'https://www.xbiquge6.com'
first_url = 'https://www.xbiquge6.com/0_638/1124120.html'

if __name__ == '__main__':
    main()
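One detail of `save()` worth seeing in isolation is the expression `"".join(content.split())`: it removes every whitespace character, so all paragraphs of a chapter end up on a single line. The sketch below compares it with one possible alternative that keeps one paragraph per line. The helper names and the sample string are made up for illustration, and the assumption that chapter text is indented with full-width or non-breaking spaces is a guess about the page, not something the original states.

```python
def squash(text):
    """Same operation as in save(): drop every whitespace character."""
    return "".join(text.split())


def keep_paragraphs(text):
    """Alternative: trim each line and keep one non-empty paragraph per line."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# Made-up sample: full-width spaces (\u3000) as indentation, a non-breaking
# space (\xa0) on an otherwise empty line.
sample = "\u3000\u3000Para-one.\n\xa0\n\u3000\u3000Para-two.\n"

print(squash(sample))
print(keep_paragraphs(sample))
```

Both helpers rely on Python treating `\u3000` and `\xa0` as whitespace in `split()` and `strip()`. Which output is preferable depends on whether the saved file is meant for reading or for further processing.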

Running the program produces the txt file.

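Learning to scrape a novel was quite hard, but the payoff from getting it to work made it worthwhile.

As I solved the problems along the way, a number of basic operations became clear to me as well. The best way to learn this kind of thing is to learn it in use and to pick up the knowledge by solving problems. Learning what you need when you need it keeps the effort focused.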


Original article: https://www.cnblogs.com/wt714/p/11963497.html